GSE71536 Processing Pipeline

RNA-Seq, 4 steps

Publication

Biallelic mutations in the 3' exonuclease TOE1 cause pontocerebellar hypoplasia and uncover a role in snRNA processing.

Nature Genetics (2017), PMID 28092684

Dataset

GSE71536

Biallelic mutations in the 3’ exonuclease TOE1 cause Pontocerebellar Hypoplasia Type 7 and result in snRNA processing defects

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Demultiplexing: Reads were demultiplexed based on the presence of the gene-specific primer at the beginning of R2, and a random-mer was extracted off of the beginning of R1.

    eclip_primer_umi_demultiplex.py (hypothetical script name; inferred with models/gemini-2.5-flash)
    $ Bash example
    # Sketch of the demultiplexing step. The script name and its flags below are hypothetical
    # placeholders; the source does not name the tool that was used.
    # Logic: keep read pairs whose R2 begins with the gene-specific primer, then move the
    # random-mer (UMI) from the beginning of R1 into the read ID.
    
    # Define input files (placeholders)
    READ1_INPUT="sample_R1.fastq.gz"
    READ2_INPUT="sample_R2.fastq.gz"
    
    # Define parameters (assumed values; the source gives neither the primer sequence nor the random-mer length)
    GENE_SPECIFIC_PRIMER_SEQ="YOUR_GENE_SPECIFIC_PRIMER_SEQUENCE" # Replace with the actual gene-specific primer
    RANDOM_MER_LENGTH=10 # Assumed length; replace with the actual random-mer length
    
    # Define output files
    OUTPUT_PREFIX="demultiplexed_sample"
    READ1_OUTPUT="${OUTPUT_PREFIX}_R1.fastq.gz"
    READ2_OUTPUT="${OUTPUT_PREFIX}_R2.fastq.gz"
    LOG_FILE="${OUTPUT_PREFIX}.log"
    
    # Installation (example for tools that such a script might use internally)
    # conda install -c bioconda umi_tools cutadapt
    
    # Execute the inferred custom demultiplexing script.
    # This script is assumed to be a custom Python script (e.g., from the eCLIP pipeline)
    # that combines primer detection/filtering in R2 and random-mer extraction in R1.
    # It would typically filter out reads not containing the primer and move the random-mer
    # from R1 to the read header of both R1 and R2.
    python eclip_primer_umi_demultiplex.py \
        --read1 "${READ1_INPUT}" \
        --read2 "${READ2_INPUT}" \
        --primer_r2 "${GENE_SPECIFIC_PRIMER_SEQ}" \
        --random_mer_length_r1 "${RANDOM_MER_LENGTH}" \
        --output_prefix "${OUTPUT_PREFIX}" \
        --log_file "${LOG_FILE}"
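
The demultiplexing rule described in this step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' script: the function name `demultiplex_pair`, the exact-match primer test, the `_` separator, and the toy primer and 4-nt random-mer are all hypothetical.

```python
from typing import Optional, Tuple

Read = Tuple[str, str, str]  # (read ID, sequence, quality string)


def demultiplex_pair(
    r1: Read, r2: Read, primer: str, umi_len: int
) -> Optional[Tuple[Read, Read]]:
    """Keep a pair only if R2 starts with the primer; move the random-mer from R1 to the IDs."""
    r1_id, r1_seq, r1_qual = r1
    r2_id, r2_seq, r2_qual = r2

    # Assumption: exact prefix match; a real script may tolerate mismatches.
    if not r2_seq.startswith(primer):
        return None
    # Reads too short to hold a random-mer plus some insert cannot be used.
    if len(r1_seq) <= umi_len:
        return None

    umi = r1_seq[:umi_len]
    new_r1 = (f"{r1_id}_{umi}", r1_seq[umi_len:], r1_qual[umi_len:])
    new_r2 = (f"{r2_id}_{umi}", r2_seq, r2_qual)
    return new_r1, new_r2


# Hypothetical read pair: random-mer "ACGT" on R1, primer "GGCC" on R2.
pair = demultiplex_pair(
    ("read1", "ACGTTTTTAAAA", "IIIIIIIIIIII"),
    ("read1", "GGCCTTTTAAAA", "IIIIIIIIIIII"),
    primer="GGCC",
    umi_len=4,
)
```

Pairs whose R2 lacks the primer return `None` and would be dropped; kept pairs carry the random-mer in both read IDs so that later steps can count on it.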
  2.

    Read trimming: Reads were trimmed at the 3' end using the cutadapt program, with the options "--discard-untrimmed" and "-a Gene_Specific_Primer".

    cutadapt v3.4 (version inferred)
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=3.4
    
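    # Trim the gene-specific primer from the 3' end and discard reads that do not contain it.
    # GENE_SPECIFIC_PRIMER_SEQ is a placeholder; cutadapt expects the nucleotide sequence, not a name.
    GENE_SPECIFIC_PRIMER_SEQ="YOUR_GENE_SPECIFIC_PRIMER_SEQUENCE"
    cutadapt --discard-untrimmed -a "${GENE_SPECIFIC_PRIMER_SEQ}" -o trimmed_reads.fastq.gz raw_reads.fastq.gz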
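
To make the effect of these two options concrete, here is a deliberately simplified Python sketch. It is an assumption-laden illustration rather than a reimplementation of cutadapt: it only handles an exact, full-length adapter match, whereas cutadapt by default also tolerates some mismatches and partial adapter matches at the read end. The function name `trim_3prime` and the toy sequences are hypothetical.

```python
from typing import Optional


def trim_3prime(seq: str, adapter: str) -> Optional[str]:
    """Return the sequence upstream of the first exact adapter match, or None if absent."""
    pos = seq.find(adapter)
    if pos == -1:
        # Mirrors --discard-untrimmed: reads without the adapter are dropped.
        return None
    # Mirrors -a: the adapter and everything after it are removed.
    return seq[:pos]


# Hypothetical reads; "GGCC" stands in for the gene-specific primer.
reads = ["TTTTAAAAGGCC", "TTTTAAAACCCC", "TTTTGGCCAAAA"]
trimmed = [t for t in (trim_3prime(r, "GGCC") for r in reads) if t is not None]
```

Here the second read is discarded because it never contains the adapter, and the third is cut at the adapter even though sequence follows it.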
  3.

    Tail counting: Non-PCR duplicate reads were counted by summing reads with the same sequence that did not have the same randommer sequence as any other read.

    umi_tools v1.1.2 (inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi_tools
    
    # Input BAM file (e.g., aligned reads with UMIs in read IDs)
    INPUT_BAM="aligned_reads_with_umis.bam"
    # Output deduplicated BAM file
    OUTPUT_DEDUP_BAM="deduplicated_reads.bam"
    
    # Perform UMI deduplication. umi_tools dedup groups reads by mapping position and UMI (randommer),
    # so it needs an aligned, coordinate-sorted, indexed BAM. The source states that the analysis was
    # done on raw reads without genome mapping, so this command only approximates the described step,
    # which counts distinct randommers among reads that share the same trimmed sequence.
    # --umi-separator must match how the UMI was appended to the read ID (umi_tools extract uses "_").
    umi_tools dedup \
        -I "${INPUT_BAM}" \
        -S "${OUTPUT_DEDUP_BAM}" \
        --extract-umi-method=read_id \
        --umi-separator=":" \
        --method=unique \
        --log="umi_tools_dedup.log"
    
    # To get the total number of non-PCR-duplicate reads after deduplication (a single total,
    # not the per-sequence counts reported in the processed files):
    # samtools view -c "${OUTPUT_DEDUP_BAM}"
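
Because the source reports that the analysis was done on raw reads without genome mapping, the tail-counting rule can also be sketched directly on (trimmed sequence, randommer) pairs, with no BAM file involved. The following is a minimal sketch of one reading of the description, not the authors' code; the function name `count_tails` and the example data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_tails(reads: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[int, int]]:
    """Map each trimmed sequence to (total reads, number of distinct randommers)."""
    totals: Dict[str, int] = defaultdict(int)
    umis: Dict[str, Set[str]] = defaultdict(set)
    for seq, umi in reads:
        totals[seq] += 1
        # Reads with the same sequence and the same randommer collapse into one entry.
        umis[seq].add(umi)
    return {seq: (totals[seq], len(umis[seq])) for seq in totals}


# Hypothetical (trimmed sequence, randommer) pairs.
example = [
    ("TTTTAAAA", "ACGT"),
    ("TTTTAAAA", "ACGT"),  # same sequence and randommer: treated as a PCR duplicate
    ("TTTTAAAA", "ACGA"),
    ("TTTT", "GGGG"),
]
counts = count_tails(example)
```

In this toy input the sequence `TTTTAAAA` is observed three times but carries only two distinct randommers, so its non-PCR-duplicate count is 2.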
  4.

    Counts were performed for unique only (hamming distance > 0), as well as randommers at least 1 nt apart (hamming distance >1)

    umi_tools v1.1.2 (inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install umi_tools
    # conda install -c bioconda umi_tools=1.1.2
    
    # Example usage:
    # This command assumes that UMIs have already been extracted from the reads
    # and appended to the read names (e.g., using `umi_tools extract`).
    # The input BAM file must be coordinate-sorted and indexed.
    
    # Input BAM file with UMIs in read names (e.g., from a previous alignment and UMI extraction step)
    INPUT_BAM="aligned_reads_with_umis.bam"
    # Output BAM file containing deduplicated reads, with one read per unique UMI group
    OUTPUT_BAM="deduplicated_reads.bam"
    # Log file for umi_tools output and statistics
    LOG_FILE="umi_tools_dedup.log"
    
    # Group UMIs that differ by at most one mismatch and count each group once.
    # --method=cluster: merges UMIs into connected groups linked by edits within the threshold.
    # --edit-distance-threshold=1: maximum edit distance at which two UMIs are linked (the umi_tools default).
    # How this maps onto the description:
    #   - "unique only (hamming distance > 0)": every distinct randommer is counted; this corresponds to
    #     --method=unique rather than to this command.
    #   - "hamming distance >1": randommers are counted as distinct only if they differ by more than one
    #     mismatch, which this command approximates.
    # As with any umi_tools dedup run, reads are also grouped by mapping position, whereas the source
    # reports that no genome mapping was performed.
    umi_tools dedup \
        -I "${INPUT_BAM}" \
        -S "${OUTPUT_BAM}" \
        --method=cluster \
        --edit-distance-threshold=1 \
        --log="${LOG_FILE}"
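
The two counting thresholds can be sketched without alignment as well. The source does not say how randommers within one mismatch of each other were resolved, so the greedy, most-abundant-first rule below is an assumption, as are the function names `hamming` and `count_distinct_umis` and the example randommers.

```python
from collections import Counter
from typing import Iterable, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("randommers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def count_distinct_umis(umis: Iterable[str], min_distance: int) -> int:
    """Greedily count randommers at least `min_distance` mismatches from every one already kept."""
    kept: List[str] = []
    # Visit the most frequent randommers first so likely error copies are absorbed by them.
    for umi, _ in sorted(Counter(umis).items(), key=lambda kv: (-kv[1], kv[0])):
        if all(hamming(umi, other) >= min_distance for other in kept):
            kept.append(umi)
    return len(kept)


# Hypothetical randommers observed for a single trimmed sequence.
umis = ["ACGT", "ACGT", "ACGA", "TTTT"]
unique_count = count_distinct_umis(umis, min_distance=1)  # Hamming distance > 0
spaced_count = count_distinct_umis(umis, min_distance=2)  # Hamming distance > 1
```

With this input, `ACGA` differs from `ACGT` by a single mismatch, so it counts toward the first total but not the second.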
Raw Source Text
Demultiplexing: Reads were demultiplexed based on the presence of the gene-specific primer at the beginning of R2, and a random-mer was extracted off of the beginning of R1.
Read trimming: Reads were trimmed at the 3' end using the cutadapt program, with options "--discard-untrimmed flag -a Gene_Specific_Primer"
Tail counting: Non-PCR duplicate reads were counted by summing reads with the same sequence that did not have the same randommer sequence as any other read. Counts were performed for unique only (hamming distance > 0), as well as randommers at least 1 nt apart (hamming distance >1)
Genome_build: Analysis was done on raw reads without genome mapping
Supplementary_files_format_and_content: Processed files are tab-delimited files, with each line containing the sample ID, trimmed sequence, total number of observed reads for that sequence, number of non-PCR duplicate reads for that sequence (hamming distance > 0), and number of non-PCR duplicate reads for that sequence (hamming distance > 1)