GSE107122 Processing Pipeline

RNA-Seq, 4 steps

Publication

Epistatic interactions between NMD and TRP53 control progenitor cell maintenance and brain size.

Neuron (2024) — PMID 38697111

Dataset

GSE107122

Developmental emergence of adult neural stem cells as revealed by single cell transcriptional profiling

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    FASTQ sequencing reads were processed, aligned to the mouse genome (mm10) and converted to digital gene expression matrices using the Drop-seq tools (version 1.12, http://mccarrolllab.com/dropseq/) with settings as described in the Drop-seq Alignment Cookbook (version 1.2 Jan2016, http://mccarrolllab.com/dropseq/)

    $ Bash example
    # Install Drop-seq tools (example using git clone and build, or direct download)
    # git clone https://github.com/broadinstitute/Drop-seq.git
    # cd Drop-seq
    # ./gradlew build
    # export PATH=$PATH:$(pwd)/bin # Add Drop-seq bin directory to PATH
    
    # Define reference genome and annotation paths for mouse (mm10)
    # Replace with actual paths to your mm10 files.
    # These files are typically downloaded from Ensembl, UCSC, or NCBI.
    # For mm10 (GRCm38), common sources are Ensembl or GRC.
    # Example paths for GRCm38 (mm10)
    MM10_FASTA="/path/to/mm10/GRCm38.primary_assembly.genome.fa"
    MM10_GTF="/path/to/mm10/gencode.vM25.annotation.gtf" # Or Ensembl GTF for GRCm38
    # Drop-seq tools expect a refFlat file and interval lists (including rRNA intervals)
    # stored next to the reference FASTA, plus a sequence dictionary.
    # The sequence dictionary comes from Picard CreateSequenceDictionary; the refFlat and
    # interval files are built with helpers shipped in the Drop-seq distribution
    # (such as ConvertToRefFlat, ReduceGTF and CreateIntervalsFiles).
    # Check the Drop-seq Alignment Cookbook for the exact arguments of each helper.
    
    MM10_REFFLAT="/path/to/mm10/gencode.vM25.refFlat"
    MM10_RIBOSOMAL_INTERVALS="/path/to/mm10/gencode.vM25.ribosomal.interval_list"
    
    # Define input FASTQ files (assuming R1 for barcodes/UMI, R2 for cDNA)
    INPUT_FASTQ_R1="sample_R1.fastq.gz"
    INPUT_FASTQ_R2="sample_R2.fastq.gz"
    
    # Define output directory and prefix
    OUTPUT_DIR="dropseq_results"
    OUTPUT_PREFIX="sample"
    mkdir -p "${OUTPUT_DIR}"
    TMP_DIR="${OUTPUT_DIR}/tmp"
    mkdir -p "${TMP_DIR}"
    
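    # Drop-seq layout per the Drop-seq Alignment Cookbook (v1.2, Jan 2016):
    # Read 1 bases 1-12 are the cell barcode and bases 13-20 are the UMI.
    # Drop-seq_alignment.sh applies this layout itself. Adjust if your library prep differs.
    
    # Assumed, site-specific inputs (not given in the source text):
    STAR_GENOME_DIR="/path/to/mm10/STAR_index"    # STAR index built from MM10_FASTA
    DROPSEQ_ROOT="/path/to/Drop-seq_tools-1.12"   # directory holding the Drop-seq executables
    ESTIMATED_NUM_CELLS=2000                      # rough estimate of cells in the run
    UNMAPPED_BAM="${OUTPUT_DIR}/${OUTPUT_PREFIX}.unmapped.bam"
    
    # 1) Convert the paired FASTQ files to an unmapped, queryname-sorted BAM (Picard FastqToSam).
    java -jar picard.jar FastqToSam \
        F1="${INPUT_FASTQ_R1}" \
        F2="${INPUT_FASTQ_R2}" \
        O="${UNMAPPED_BAM}" \
        SM="${OUTPUT_PREFIX}" \
        SORT_ORDER=queryname \
        TMP_DIR="${TMP_DIR}"
    
    # 2) Tag, trim, align with STAR and annotate the reads.
    # In the 1.x releases, Drop-seq_alignment.sh takes short options and the unmapped BAM
    # as its only positional argument; confirm against the script's usage message.
    "${DROPSEQ_ROOT}/Drop-seq_alignment.sh" \
        -g "${STAR_GENOME_DIR}" \
        -r "${MM10_FASTA}" \
        -n "${ESTIMATED_NUM_CELLS}" \
        -d "${DROPSEQ_ROOT}" \
        -o "${OUTPUT_DIR}" \
        -t "${TMP_DIR}" \
        "${UNMAPPED_BAM}"
    
    # 3) Build the digital gene expression (DGE) matrix from the tagged BAM.
    # star_gene_exon_tagged.bam is the assumed output name of step 2; check OUTPUT_DIR.
    "${DROPSEQ_ROOT}/DigitalExpression" \
        I="${OUTPUT_DIR}/star_gene_exon_tagged.bam" \
        O="${OUTPUT_DIR}/${OUTPUT_PREFIX}.dge.txt.gz" \
        SUMMARY="${OUTPUT_DIR}/${OUTPUT_PREFIX}.dge.summary.txt" \
        NUM_CORE_BARCODES="${ESTIMATED_NUM_CELLS}"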
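
To make the Read 1 layout concrete, here is a minimal Python sketch of how a cell barcode and UMI are cut out of a read. It assumes the standard Drop-seq layout (12-base cell barcode followed by an 8-base UMI), which the source text does not state explicitly. `split_read1` is a hypothetical helper for illustration, not part of Drop-seq tools, which perform this tagging on BAM records.

```python
def split_read1(seq, barcode_len=12, umi_len=8):
    """Return (cell_barcode, umi) from a Drop-seq Read 1 sequence.

    Assumes the barcode occupies the first `barcode_len` bases and the UMI
    the next `umi_len` bases (standard Drop-seq values: 12 and 8).
    """
    needed = barcode_len + umi_len
    if len(seq) < needed:
        raise ValueError(f"Read 1 has {len(seq)} bases; need at least {needed}")
    return seq[:barcode_len], seq[barcode_len:needed]


cell_barcode, umi = split_read1("ACGTACGTACGTTTGGCCAA")
print(cell_barcode, umi)  # ACGTACGTACGT TTGGCCAA
```

Reads shorter than 20 bases raise an error here; a real pipeline would more likely discard or flag them.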
  2.

    The number of cell barcodes per embryonic age was identified by calculating the cumulative fraction of reads attributable to each individual cell barcode and arranging these in decreasing order.

    Python (with pandas) (Inferred with models/gemini-2.5-flash) v3.x
    $ Bash example
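    # Inferred: a pandas script that ranks cell barcodes by read count and computes
    # the cumulative fraction of reads along that ranking.
    # It assumes a tab-separated input file 'barcode_counts.tsv' with a header row and
    # two columns, 'barcode' and 'read_count' (for example, reshaped from the per-barcode
    # read counts reported by the Drop-seq BAMTagHistogram tool).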
    
    # Install pandas if not already installed
    # pip install pandas
    
    # Example: Create a dummy input file for demonstration purposes
    # echo -e "barcode\tread_count\nCELL1\t10000\nCELL2\t8000\nCELL3\t5000\nCELL4\t2000\nCELL5\t1000\nCELL6\t500\nCELL7\t100" > barcode_counts.tsv
    
    python3 -c "
    import pandas as pd
    
    # Load the barcode counts data
    # Assuming the input file is tab-separated with headers 'barcode' and 'read_count'
    df = pd.read_csv('barcode_counts.tsv', sep='\t')
    
    # Sort by read_count in decreasing order
    df_sorted = df.sort_values(by='read_count', ascending=False).reset_index(drop=True)
    
    # Calculate cumulative sum of read counts
    df_sorted['cumulative_reads'] = df_sorted['read_count'].cumsum()
    
    # Calculate total reads
    total_reads = df_sorted['read_count'].sum()
    
    # Calculate cumulative fraction of reads
    df_sorted['cumulative_fraction'] = df_sorted['cumulative_reads'] / total_reads
    
    # Write the results to standard output (redirected to a file below);
    # end='' avoids a trailing blank line because to_csv already ends with a newline
    print(df_sorted[['barcode', 'read_count', 'cumulative_fraction']].to_csv(index=False, sep='\t'), end='')
    " > barcode_counts_cumulative_fraction.tsv
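
The cumulative curve is normally inspected for a "knee", the point after which additional barcodes add few reads. The source does not say how the cutoff was chosen, so the sketch below uses one simple heuristic as an assumption: take the rank at which the cumulative curve lies farthest above the diagonal. `cumulative_fraction` and `estimate_knee` are hypothetical names, and the read counts are synthetic.

```python
def cumulative_fraction(read_counts):
    """Cumulative fraction of reads, with barcodes ranked by decreasing read count."""
    counts = sorted(read_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("read_counts must contain at least one non-zero count")
    running, fractions = 0, []
    for count in counts:
        running += count
        fractions.append(running / total)
    return fractions


def estimate_knee(read_counts):
    """Heuristic knee: number of top barcodes at which the cumulative curve
    is farthest above the diagonal (an assumption, not the authors' rule)."""
    fractions = cumulative_fraction(read_counts)
    n = len(fractions)
    gaps = [frac - (i + 1) / n for i, frac in enumerate(fractions)]
    return gaps.index(max(gaps)) + 1


# Synthetic example: 5 real cells with many reads, 95 near-empty barcodes
counts = [1000] * 5 + [10] * 95
print(estimate_knee(counts))  # 5
```

On real data the knee is often less sharp, so it is worth plotting the curve and comparing the estimate with the expected number of cells.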
  3.

    2,000 barcodes were selected at each of E11.5, E13.5 and E17.5, and 5,000 barcodes at E15.5.

    Custom Python script (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # This script selects the top-ranked barcodes for each embryonic stage.
    # Assumption: "selected" means the N barcodes with the most reads per stage, following
    # the read-count ranking from step 2; the source text does not spell this out.
    # The input is a TSV with one row per barcode, giving its stage and read count.
    #
    # Example input_barcodes_with_metadata.tsv:
    # barcode_id    stage   read_count
    # BC0001        E11.5   10000
    # BC0002        E13.5   8000
    # BC0003        E11.5   5000
    # BC0004        E15.5   2000
    # ... (ensure enough barcodes for each stage or handle warnings)
    #
    # Expected output: A file (e.g., selected_barcodes.tsv) containing one selected barcode ID per line.
    
    # Define input and output files
    INPUT_BARCODES_FILE="input_barcodes_with_metadata.tsv"
    OUTPUT_SELECTED_BARCODES_FILE="selected_barcodes.tsv"
    
    # Ensure pandas is installed
    # conda install -c anaconda pandas
    # pip install pandas
    
    # Execute the Python script for barcode selection
    python3 -c "
    import pandas as pd
    import sys
    import os
    
    input_file = '$INPUT_BARCODES_FILE'
    output_file = '$OUTPUT_SELECTED_BARCODES_FILE'
    
    # Number of barcodes to keep per embryonic stage (from the source text)
    selection_counts = {
        'E11.5': 2000,
        'E13.5': 2000,
        'E17.5': 2000,
        'E15.5': 5000
    }
    
    if not os.path.exists(input_file):
        print(f\"Error: Input file '{input_file}' not found.\", file=sys.stderr)
        sys.exit(1)
    
    try:
        df = pd.read_csv(input_file, sep='\t')
    except Exception as e:
        print(f\"Error reading input file '{input_file}': {e}\", file=sys.stderr)
        sys.exit(1)
    
    required_columns = {'stage', 'barcode_id', 'read_count'}
    if not required_columns.issubset(df.columns):
        print(\"Error: Input file must contain 'stage', 'barcode_id' and 'read_count' columns.\", file=sys.stderr)
        sys.exit(1)
    
    selected_barcodes_list = []
    
    for stage, count in selection_counts.items():
        stage_df = df[df['stage'] == stage]
        if len(stage_df) < count:
            print(f\"Warning: Not enough barcodes for stage {stage}. Requested {count}, found {len(stage_df)}. Selecting all available.\", file=sys.stderr)
        # nlargest keeps the barcodes with the most reads and returns at most 'count' rows
        selected_barcodes_list.append(stage_df.nlargest(count, 'read_count')['barcode_id'])
    
    if selected_barcodes_list:
        final_selection = pd.concat(selected_barcodes_list).reset_index(drop=True)
        final_selection.to_csv(output_file, index=False, header=False)
        print(f\"Selected barcodes written to '{output_file}'\")
    else:
        print(\"No barcodes were selected.\", file=sys.stderr)
    "
    
  4.

    Data from two batches of E15.5 embryos (Batch1 and Batch2) were combined.

    samtools merge (Inferred with models/gemini-2.5-flash) v1.19
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Assuming 'Batch1' and 'Batch2' refer to aligned BAM files from two different batches
    # and need to be combined into a single BAM file.
    samtools merge -o combined_batches.bam batch1.bam batch2.bam
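
The source does not say whether the batches were combined at the BAM level or at the expression-matrix level. As an alternative to merging BAM files, the sketch below combines two digital gene expression tables in Python. It prefixes each cell barcode with its batch name, since the same barcode sequence can occur in both batches, and fills in zeros for genes missing from a batch. `combine_batches`, the nested-dict layout and the toy counts are assumptions made for illustration.

```python
def combine_batches(dge_by_batch):
    """Combine per-batch DGE tables into one table.

    dge_by_batch: {batch_name: {gene: {barcode: count}}}
    Returns {gene: {"<batch>_<barcode>": count}}, using 0 where a gene or
    barcode has no recorded count in a batch.
    """
    genes = sorted({gene for dge in dge_by_batch.values() for gene in dge})
    cells = []  # (batch, barcode) pairs in batch order
    for batch, dge in dge_by_batch.items():
        barcodes = sorted({bc for row in dge.values() for bc in row})
        cells.extend((batch, bc) for bc in barcodes)
    return {
        gene: {
            f"{batch}_{bc}": dge_by_batch[batch].get(gene, {}).get(bc, 0)
            for batch, bc in cells
        }
        for gene in genes
    }


# Toy counts; barcode AAAC appears in both batches but refers to different cells
batch1 = {"Sox2": {"AAAC": 3}, "Pax6": {"AAAC": 1}}
batch2 = {"Sox2": {"AAAC": 2, "GGGT": 5}}
combined = combine_batches({"Batch1": batch1, "Batch2": batch2})
print(combined["Sox2"])  # {'Batch1_AAAC': 3, 'Batch2_AAAC': 2, 'Batch2_GGGT': 5}
```

Prefixing keeps cells from different batches distinct, which also applies to the BAM route: merged BAM files need batch-specific barcode tags or read groups if barcodes can collide.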

Raw Source Text
FASTQ sequencing reads were processed, aligned to the mouse genome (mm10) and converted to digital gene expression matrices using the Drop-seq tools (version 1.12, http://mccarrolllab.com/dropseq/) with settings as described in the Drop-seq Alignment Cookbook (version 1.2 Jan2016, http://mccarrolllab.com/dropseq/)
The number of cell barcodes per embryonic age was identified by calculating the cumulative fraction of reads attributable to each individual cell barcode and arranging these in decreasing order.
2000 Barcodes were selected at E11.5, E13.5 and E17.5 and 5000 Barcodes at E15.5
Two batches data collected from E15.5 embryos (Batch1 and Batch2) were combined together.
Genome_build: mm10
Supplementary_files_format_and_content: tab-delimited text files containing raw digital gene expression for each embryonic age. Two processed data files are included per age. One containing all data for all cells and one containing only cells predicted as cortical in origin.