GSE107122 Processing Pipeline
Publication
Epistatic interactions between NMD and TRP53 control progenitor cell maintenance and brain size. Neuron (2024), PMID 38697111
Dataset
GSE107122: Developmental emergence of adult neural stem cells as revealed by single cell transcriptional profiling
Processing Steps
1
FASTQ sequencing reads were processed, aligned to the mouse genome (mm10), and converted to digital gene expression matrices using the Drop-seq tools (version 1.12, http://mccarrolllab.com/dropseq/) with settings as described in the Drop-seq Alignment Cookbook (version 1.2, Jan 2016, http://mccarrolllab.com/dropseq/).
$ Bash example
# Sketch only: all paths, the sample name, and the cell-count estimate are placeholders.
# Drop-seq tools (the source cites version 1.12), Picard, and STAR must already be installed.
DROPSEQ_ROOT="/path/to/Drop-seq_tools"
PICARD_JAR="/path/to/picard.jar"

# mm10 reference metadata: genome FASTA plus a STAR index built from it.
# The Drop-seq tools also need gene annotations (GTF/refFlat) for the same genome;
# the tools ship a ConvertToRefFlat program for producing a refFlat file from a GTF.
MM10_FASTA="/path/to/mm10/mm10.fasta"
MM10_STAR_DIR="/path/to/mm10/STAR"

# Input FASTQ files (Drop-seq layout: read 1 = 12 bp cell barcode + 8 bp UMI, read 2 = cDNA)
INPUT_FASTQ_R1="sample_R1.fastq.gz"
INPUT_FASTQ_R2="sample_R2.fastq.gz"

# Output and temporary directories
OUTPUT_DIR="dropseq_results"
TMP_DIR="${OUTPUT_DIR}/tmp"
mkdir -p "${OUTPUT_DIR}" "${TMP_DIR}"

# 1. Convert the paired FASTQ files to an unmapped, queryname-sorted BAM,
#    the input format expected by the Drop-seq alignment script.
java -jar "${PICARD_JAR}" FastqToSam \
    F1="${INPUT_FASTQ_R1}" \
    F2="${INPUT_FASTQ_R2}" \
    O="${OUTPUT_DIR}/unmapped.bam" \
    SM=sample

# 2. Tag, trim, align, and annotate reads.
#    The option letters below follow the script's usage message in the 1.x releases;
#    confirm them against the installed version before running.
"${DROPSEQ_ROOT}/Drop-seq_alignment.sh" \
    -g "${MM10_STAR_DIR}" \
    -r "${MM10_FASTA}" \
    -n 2000 \
    -d "${DROPSEQ_ROOT}" \
    -o "${OUTPUT_DIR}" \
    -t "${TMP_DIR}" \
    "${OUTPUT_DIR}/unmapped.bam"

# 3. Build the digital gene expression (DGE) matrix from the final tagged BAM.
#    The name of the final BAM depends on the tools version; set FINAL_BAM accordingly.
FINAL_BAM="${OUTPUT_DIR}/final.bam"
"${DROPSEQ_ROOT}/DigitalExpression" \
    I="${FINAL_BAM}" \
    O="${OUTPUT_DIR}/sample.dge.txt.gz" \
    SUMMARY="${OUTPUT_DIR}/sample.dge.summary.txt" \
    NUM_CORE_BARCODES=2000
2
The number of cell barcodes per embryonic age was identified by calculating the cumulative fraction of reads attributable to each individual cell barcode and arranging these in decreasing order.
$ Bash example
# Inferred Python script using pandas for this data manipulation.
# This script assumes an input file 'barcode_counts.tsv' with two columns: 'barcode' and 'read_count'.

# Install pandas if not already installed
# pip install pandas

# Example: create a dummy input file for demonstration purposes
# echo -e "barcode\tread_count\nCELL1\t10000\nCELL2\t8000\nCELL3\t5000\nCELL4\t2000\nCELL5\t1000\nCELL6\t500\nCELL7\t100" > barcode_counts.tsv

python3 -c "
import pandas as pd

# Load the barcode counts data
# (tab-separated, with headers 'barcode' and 'read_count')
df = pd.read_csv('barcode_counts.tsv', sep='\t')

# Sort by read_count in decreasing order
df_sorted = df.sort_values(by='read_count', ascending=False).reset_index(drop=True)

# Calculate cumulative sum of read counts
df_sorted['cumulative_reads'] = df_sorted['read_count'].cumsum()

# Calculate total reads
total_reads = df_sorted['read_count'].sum()

# Calculate cumulative fraction of reads
df_sorted['cumulative_fraction'] = df_sorted['cumulative_reads'] / total_reads

# Print the results to standard output (redirected to a file below)
print(df_sorted[['barcode', 'read_count', 'cumulative_fraction']].to_csv(index=False, sep='\t'))
" > barcode_counts_cumulative_fraction.tsv
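The source does not say how the number of cells was read off the cumulative curve. As one possible illustration, the sketch below ranks barcodes by read count, computes the cumulative fraction, and reports the rank where the curve lies farthest above the diagonal. The function names and this "knee" heuristic are assumptions made for the example, not the authors' stated method.

```python
from itertools import accumulate


def cumulative_fraction(read_counts):
    """Sort read counts in decreasing order and return them with their cumulative fraction."""
    ordered = sorted(read_counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("read counts sum to zero")
    return ordered, [running / total for running in accumulate(ordered)]


def estimate_knee(read_counts):
    """Hypothetical heuristic: rank at which the cumulative curve is farthest above the diagonal."""
    ordered, cum_frac = cumulative_fraction(read_counts)
    n = len(ordered)
    best_rank, best_gap = 0, float("-inf")
    for rank, frac in enumerate(cum_frac, start=1):
        gap = frac - rank / n  # vertical distance above the line from (0, 0) to (1, 1)
        if gap > best_gap:
            best_rank, best_gap = rank, gap
    return best_rank


if __name__ == "__main__":
    # Toy data: three barcodes with many reads, seven with background-level reads.
    counts = [1000, 900, 800, 10, 5, 5, 5, 5, 5, 5]
    print("Estimated number of cell barcodes:", estimate_knee(counts))
```

In practice the curve is usually also inspected by eye, and a round number near the inflection is chosen.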
3
2,000 barcodes were selected at each of E11.5, E13.5, and E17.5, and 5,000 barcodes at E15.5.
Custom Python script (inferred with models/gemini-2.5-flash; version N/A)
$ Bash example
# Select a fixed number of cell barcodes for each embryonic stage.
# Assumption: within each stage, the barcodes with the most reads are kept,
# consistent with the decreasing-order ranking used in the previous step.
#
# Example input_barcodes_with_metadata.tsv (tab-separated, with header):
# barcode_id  stage  read_count
# BC0001      E11.5  10000
# BC0002      E13.5  8000
# BC0003      E11.5  5000
# ...
#
# Expected output: a file (e.g., selected_barcodes.tsv) with one selected barcode ID per line.

# Define input and output files
INPUT_BARCODES_FILE="input_barcodes_with_metadata.tsv"
OUTPUT_SELECTED_BARCODES_FILE="selected_barcodes.tsv"

# Ensure pandas is installed
# pip install pandas

python3 -c "
import sys
import pandas as pd

input_file = '$INPUT_BARCODES_FILE'
output_file = '$OUTPUT_SELECTED_BARCODES_FILE'
selection_counts = {'E11.5': 2000, 'E13.5': 2000, 'E15.5': 5000, 'E17.5': 2000}

try:
    df = pd.read_csv(input_file, sep='\t')
except Exception as e:
    print(f'Error reading input file {input_file}: {e}', file=sys.stderr)
    sys.exit(1)

missing = {'barcode_id', 'stage', 'read_count'} - set(df.columns)
if missing:
    print(f'Error: input file is missing columns: {sorted(missing)}', file=sys.stderr)
    sys.exit(1)

selected = []
for stage, count in selection_counts.items():
    stage_df = df[df['stage'] == stage]
    if len(stage_df) < count:
        print(f'Warning: stage {stage} has {len(stage_df)} barcodes, fewer than the {count} requested; keeping all.', file=sys.stderr)
    # nlargest keeps the top barcodes by read count (all of them if fewer exist)
    selected.append(stage_df.nlargest(count, 'read_count')['barcode_id'])

pd.concat(selected).to_csv(output_file, index=False, header=False)
print(f'Selected barcodes written to {output_file}')
"
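Once a barcode list has been chosen, the expression matrix can be restricted to those cells. The following is a minimal sketch, assuming the matrix has already been parsed into a header row (a gene column followed by cell barcodes) and one row per gene. The function name `subset_dge` and the toy values are hypothetical.

```python
def subset_dge(header, rows, keep):
    """Keep the gene column plus every column whose barcode is in `keep`."""
    keep = set(keep)
    # Column 0 holds gene names and is always retained.
    idx = [0] + [i for i, barcode in enumerate(header) if i > 0 and barcode in keep]
    new_header = [header[i] for i in idx]
    new_rows = [[row[i] for i in idx] for row in rows]
    return new_header, new_rows


if __name__ == "__main__":
    header = ["GENE", "AAAC", "GGGT", "TTTA"]
    rows = [["Sox2", "3", "0", "7"], ["Pax6", "1", "2", "0"]]
    new_header, new_rows = subset_dge(header, rows, {"AAAC", "TTTA"})
    print(new_header)
    for row in new_rows:
        print(row)
```

Column order from the original matrix is preserved, so the output does not depend on the order of the selection list.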
4
Two batches of data collected from E15.5 embryos (Batch1 and Batch2) were combined.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assumes 'Batch1' and 'Batch2' refer to aligned BAM files from two different batches
# that need to be combined into a single BAM file.
samtools merge -o combined_batches.bam batch1.bam batch2.bam
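The source does not state whether the two E15.5 batches were combined at the read level or at the matrix level. If they were combined as per-batch expression matrices, a sketch might look like the one below. It assumes tab-delimited tables with a gene column followed by one column per cell barcode. It prefixes each barcode with its batch label so that identical barcode sequences from different batches stay distinct, and it fills genes absent from a batch with zeros. The function names, the `GENE` header, and the toy values are illustrative assumptions.

```python
import csv
import io


def read_dge(handle):
    """Parse a tab-delimited DGE table: a header of gene column + barcodes, then one row per gene."""
    reader = csv.reader(handle, delimiter="\t")
    barcodes = next(reader)[1:]
    counts = {row[0]: [int(value) for value in row[1:]] for row in reader if row}
    return barcodes, counts


def combine_dge(batches):
    """Combine {label: (barcodes, counts)} into one matrix over the union of genes."""
    all_barcodes = []
    genes = set()
    for label, (barcodes, counts) in batches.items():
        all_barcodes.extend(f"{label}_{barcode}" for barcode in barcodes)
        genes.update(counts)
    combined = {}
    for gene in sorted(genes):
        row = []
        for label, (barcodes, counts) in batches.items():
            # A gene missing from a batch contributes zero counts for all of its cells.
            row.extend(counts.get(gene, [0] * len(barcodes)))
        combined[gene] = row
    return all_barcodes, combined


if __name__ == "__main__":
    batch1 = io.StringIO("GENE\tAAAC\tGGGT\nSox2\t3\t0\nPax6\t1\t2\n")
    batch2 = io.StringIO("GENE\tAAAC\nSox2\t5\nNes\t4\n")
    barcodes, combined = combine_dge(
        {"Batch1": read_dge(batch1), "Batch2": read_dge(batch2)}
    )
    print("\t".join(["GENE"] + barcodes))
    for gene, row in combined.items():
        print("\t".join([gene] + [str(value) for value in row]))
```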
Tools Used
Raw Source Text
FASTQ sequencing reads were processed, aligned to the mouse genome (mm10) and converted to digital gene expression matrices using the Drop-seq tools (version 1.12, http://mccarrolllab.com/dropseq/) with settings as described in the Drop-seq Alignment Cookbook (version 1.2 Jan2016, http://mccarrolllab.com/dropseq/)
The number of cell barcodes per embryonic age was identified by calculating the cumulative fraction of reads attributable to each individual cell barcode and arranging these in decreasing order.
2000 Barcodes were selected at E11.5, E13.5 and E17.5 and 5000 Barcodes at E15.5
Two batches data collected from E15.5 embryos (Batch1 and Batch2) were combined together.
Genome_build: mm10
Supplementary_files_format_and_content: tab-delimited text files containing raw digital gene expression for each embryonic age. Two processed data files are included per age. One containing all data for all cells and one containing only cells predicted as cortical in origin.