GSE210437 Processing Pipeline

RNA-Seq code_examples 9 steps

Publication

Epistatic interactions between NMD and TRP53 control progenitor cell maintenance and brain size.

Neuron (2024) — PMID 38697111

Dataset

GSE210437

Enhanced CLIP (eCLIP) identified differential UPF1 binding sites in mouse neural progenitor cells (NPCs) samples

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip

    eCLIP vNot explicitly versioned, inferred from repository activity (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # The eCLIP pipeline is a CWL workflow. This script demonstrates how to run it using cwltool.
    # It assumes the eCLIP repository has been cloned and cwltool is installed.
    
    # Clone the eCLIP pipeline repository (if not already present)
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    
    # Install cwltool if not already installed
    # pip install cwltool
    
    # Define placeholder input files and reference data.
    # In a real scenario, these paths would point to actual data.
    # Reference datasets (e.g., hg38 genome, GENCODE annotation) are not specified in the description,
    # so common human reference data (hg38, GENCODE v38) are used as placeholders.
    
    # Create a dummy inputs.yaml file for the eCLIP CWL workflow.
    # Replace '/path/to/...' with actual file paths.
    cat << EOF > eclip_inputs.yaml
    fastq_replicate1_r1:
      class: File
      path: /path/to/replicate1_R1.fastq.gz
    fastq_replicate1_r2: # Optional, if paired-end data is available
      class: File
      path: /path/to/replicate1_R2.fastq.gz
    fastq_replicate2_r1:
      class: File
      path: /path/to/replicate2_R1.fastq.gz
    fastq_replicate2_r2: # Optional, if paired-end data is available
      class: File
      path: /path/to/replicate2_R2.fastq.gz
    fastq_input_r1:
      class: File
      path: /path/to/input_R1.fastq.gz
    fastq_input_r2: # Optional, if paired-end data is available
      class: File
      path: /path/to/input_R2.fastq.gz
    genome_fasta:
      class: File
      path: /path/to/reference/hg38.fa # Placeholder: Human genome assembly hg38
    genome_gtf:
      class: File
      path: /path/to/reference/gencode.v38.annotation.gtf # Placeholder: GENCODE v38 annotation
    star_index_tar:
      class: File
      path: /path/to/reference/star_index_hg38.tar # Placeholder: STAR index for hg38
    chrom_sizes:
      class: File
      path: /path/to/reference/hg38.chrom.sizes # Placeholder: Chromosome sizes for hg38
    blacklist_bed:
      class: File
      path: /path/to/reference/hg38_blacklist.bed # Placeholder: ENCODE blacklist for hg38
    output_prefix: "my_eclip_experiment"
    EOF
    
    # Execute the eCLIP CWL workflow using cwltool.
    # This command assumes you are in the 'eclip' directory or have adjusted paths to 'cwl/eclip_workflow.cwl'.
    # Key tools used within this pipeline (inferred from CWL files in the repository):
    # - STAR (e.g., v2.7.10a for alignment)
    # - Picard MarkDuplicates (e.g., v2.27.4 for deduplication)
    # - CLIPper (e.g., v1.0 for peak calling)
    # - merge_peaks (e.g., v1.0 for IDR)
    cwltool cwl/eclip_workflow.cwl eclip_inputs.yaml
  2. 2

    Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract

    UMI-tools vInferred with models/gemini-2.5-flash
    $ Bash example
    # Install umi-tools if not already installed
    # conda install -c bioconda umi-tools
    
    # UMIs were extracted from raw sequencing reads.
    # This example assumes a 10bp UMI is located at the very beginning of the sequencing read.
    # Replace 'raw_reads.fastq.gz' with your actual input file and 'umi_extracted_reads.fastq.gz' with your desired output file.
    
    umi_tools extract \
        --extract-method=regex \
        --bc-pattern="^(?P<umi_1>.{10})" \
        -I raw_reads.fastq.gz \
        -S umi_extracted_reads.fastq.gz
  3. 3

    Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda create -n cutadapt_env cutadapt=4.0
    # conda activate cutadapt_env
    
    # Define input/output files
    INPUT_FASTQ="post_umi_extracted_reads.fastq.gz"
    OUTPUT_FASTQ="trimmed_reads.fastq.gz"
    
    # Define adapters based on common eCLIP practices (e.g., from Yeo lab pipelines)
    # 3' Illumina adapter sequence
    ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    # 5' adapter/barcode sequence (example from Yeo lab's skipper pipeline, adjust if specific barcode is known)
    ADAPTER_5PRIME="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA"
    
    # Minimum read length after trimming
    MIN_LENGTH=18
    
    # Number of threads to use
    THREADS=8
    
    cutadapt \
      -a "${ADAPTER_3PRIME}" \
      -g "${ADAPTER_5PRIME}" \
      -o "${OUTPUT_FASTQ}" \
      --minimum-length "${MIN_LENGTH}" \
      --cores "${THREADS}" \
      --discard-untrimmed \
      "${INPUT_FASTQ}"
  4. 4

    Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

    STAR v2.7.10a
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star=2.7.10a
    
    # Create STAR index for RepBase (if not already done)
    # This step requires a RepBase FASTA file and will generate the necessary index files.
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/repbase_star_index \
    #      --genomeFastaFiles /path/to/RepBase.fasta \
    #      --runThreadN <threads>
    
    # Define variables
    THREADS=8 # Adjust as needed based on available resources
    REPBASE_INDEX="/path/to/repbase_star_index" # Path to your pre-built RepBase STAR index
    TRIMMED_READS_R1="trimmed_reads_R1.fastq.gz" # Path to your trimmed R1 reads
    TRIMMED_READS_R2="trimmed_reads_R2.fastq.gz" # Path to your trimmed R2 reads (remove if single-end)
    OUTPUT_PREFIX="star_repbase_filtered" # Prefix for all output files
    OUTPUT_DIR="./star_repbase_output" # Directory for STAR output files
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run STAR to map against RepBase and output unmapped reads in Fastx format
    STAR \
        --runThreadN "${THREADS}" \
        --genomeDir "${REPBASE_INDEX}" \
        --readFilesIn "${TRIMMED_READS_R1}" "${TRIMMED_READS_R2}" \
        --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}_" \
        --outFilterMultimapNmax 30 \
        --alignEndsType EndToEnd \
        --outFilterMultimapScoreRange 1 \
        --outSAMmode Full \
        --outFilterType BySJout \
        --outSAMtype BAM Unsorted \
        --outFilterScoreMin 10 \
        --outReadsUnmapped Fastx \
        --outSAMattributes All
    
    # After execution, the unmapped reads will be in files like:
    # "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Unmapped.out.mate1" (for R1)
    # "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Unmapped.out.mate2" (for R2)
    # The mapped reads (BAM) will be in "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Aligned.out.bam"
  5. 5

    Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    NUM_THREADS="8" # Adjust as needed
    STAR_INDEX_DIR="/path/to/STAR_index/mm10" # Replace with the actual path to your mm10 STAR genome index
    INPUT_READS_FILE="remaining_reads.fastq.gz" # Replace with your input FASTQ file (e.g., filtered reads)
    OUTPUT_PREFIX="aligned_reads_" # Prefix for output files (e.g., aligned_reads_Aligned.out.bam)
    
    # Create STAR genome index if it doesn't exist (example command, typically done once)
    # STAR --runThreadN "${NUM_THREADS}" --runMode genomeGenerate \
    #      --genomeDir "${STAR_INDEX_DIR}" \
    #      --genomeFastaFiles /path/to/mm10.fa \
    #      --sjdbGTFfile /path/to/mm10.gtf \
    #      --sjdbOverhang 100 # Adjust sjdbOverhang based on read length
    
    # Map reads using STAR
    STAR --runThreadN "${NUM_THREADS}" \
         --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${INPUT_READS_FILE}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outFilterMultimapNmax 1 \
         --alignEndsType EndToEnd \
         --outFilterMultimapScoreRange 1 \
         --outSAMmode Full \
         --outFilterType BySJout \
         --outSAMtype BAM Unsorted \
         --outFilterScoreMin 10 \
         --outReadsUnmapped Fastx \
         --outSAMattributes All
    
    # Note: The output BAM file (e.g., aligned_reads_Aligned.out.bam) will be unsorted.
    # You might want to sort and index it afterwards:
    # samtools sort -o "${OUTPUT_PREFIX}Aligned.sorted.bam" "${OUTPUT_PREFIX}Aligned.out.bam"
    # samtools index "${OUTPUT_PREFIX}Aligned.sorted.bam"
  6. 6

    Uniquely mapped reads were removed of PCR duplicates with umi_tools

    UMI-tools v1.1.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install UMI-tools (e.g., using conda)
    # conda install -c bioconda umi-tools
    
    # Define input and output file paths
    # INPUT_BAM should be a coordinate-sorted BAM file of uniquely mapped reads
    # where UMIs have been extracted and appended to read names (e.g., by umi_tools extract).
    INPUT_BAM="uniquely_mapped_reads.bam"
    OUTPUT_BAM="uniquely_mapped_reads.dedup.bam"
    STATS_FILE="uniquely_mapped_reads.dedup.stats"
    LOG_FILE="uniquely_mapped_reads.dedup.log"
    
    # Perform PCR duplicate removal using umi_tools dedup
    # --extract-umis-from-read-names: Assumes UMIs are in the read names (e.g., from a prior umi_tools extract step)
    # --paired: Specifies that the input reads are paired-end
    # --method unique: Uses the 'unique' method for deduplication (default and common for most applications)
    # -I: Input BAM file
    # -S: Output deduplicated BAM file
    # --output-stats: Generates a file with deduplication statistics
    # --log: Writes log messages to a file
    umi_tools dedup \
        --extract-umis-from-read-names \
        --paired \
        --method unique \
        -I "${INPUT_BAM}" \
        -S "${OUTPUT_BAM}" \
        --output-stats "${STATS_FILE}" \
        --log "${LOG_FILE}"
  7. 7

    Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper

    CLIPper vNot specified GitHub
    $ Bash example
    # Install dependencies (if not already installed)
    # pip install numpy scipy pysam pybedtools
    
    # Clone the CLIPper repository if not already available
    # git clone https://github.com/YeoLab/clipper.git
    # cd clipper
    
    # Placeholder variables - replace with actual file paths and parameters
    # INPUT_BAM: Path to the input BAM file (e.g., IP sample aligned reads)
    # CONTROL_BAM: Path to the control BAM file (e.g., input/mock IP sample aligned reads) - highly recommended for eCLIP
    # OUTPUT_PREFIX: Prefix for output files (e.g., peak BED file, log file)
    # GENOME_FASTA: Path to the reference genome FASTA file (e.g., hg38.fa)
    # ANNOTATION_GTF: Path to the gene annotation GTF/GFF file (e.g., hg38.gtf)
    # P_VALUE_THRESHOLD: P-value threshold for peak calling (e.g., 0.01)
    # FDR_THRESHOLD: False Discovery Rate threshold for peak calling (e.g., 0.05)
    # SIZE_FACTOR: Normalization factor for library size differences between IP and control (e.g., 1.0 if already normalized)
    
    INPUT_BAM="path/to/your/input.bam"
    CONTROL_BAM="path/to/your/control.bam"
    OUTPUT_PREFIX="clipper_peaks"
    GENOME_FASTA="path/to/hg38.fa" # Placeholder: Use latest human assembly like hg38
    ANNOTATION_GTF="path/to/hg38.gtf" # Placeholder: Use latest human assembly like hg38
    P_VALUE_THRESHOLD="0.01"
    FDR_THRESHOLD="0.05"
    SIZE_FACTOR="1.0"
    
    # Execute CLIPper for peak calling
    # Ensure clipper.py is executable and in your PATH, or provide the full path to clipper.py
    python clipper.py \
        -b "${INPUT_BAM}" \
        -c "${CONTROL_BAM}" \
        -s "${SIZE_FACTOR}" \
        -o "${OUTPUT_PREFIX}" \
        -p "${P_VALUE_THRESHOLD}" \
        -f "${FDR_THRESHOLD}" \
        -g "${GENOME_FASTA}" \
        -a "${ANNOTATION_GTF}"
  8. 8

    Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl

    eCLIP vv1.0.0
    $ Bash example
    # Install Perl if not available
    # sudo apt-get update && sudo apt-get install -y perl
    # Or via conda:
    # conda install -c bioconda perl
    
    # Download the script from the eCLIP repository
    # wget https://raw.githubusercontent.com/yeolab/eclip/v1.0.0/scripts/overlap_peakfi_with_bam.pl
    # chmod +x overlap_peakfi_with_bam.pl
    
    # Define input and output files (placeholders)
    # ECLIP_PEAKS: Path to the eCLIP peak file (clusters) in BED format
    ECLIP_PEAKS="eCLIP_peaks.bed"
    # SMINPUT_BAM: Path to the size-matched input BAM file
    SMINPUT_BAM="SMInput.bam"
    # OUTPUT_FILE: Path for the output file containing enriched clusters
    OUTPUT_FILE="enriched_clusters.bed"
    
    # Execute the script to identify enriched clusters
    # The description mentions "Clusters enriched", implying specific enrichment thresholds were applied.
    # However, no specific parameters (e.g., --min_fold_enrichment, --min_log2_fold_change, --min_fdr)
    # were provided in the description. The command below uses the script's default filtering.
    # If specific thresholds were used, they would be added here, e.g.:
    # --min_log2_fold_change 1 --min_fdr 0.05
    perl overlap_peakfi_with_bam.pl \
        --peak_file "${ECLIP_PEAKS}" \
        --bam_file "${SMINPUT_BAM}" \
        --output_file "${OUTPUT_FILE}"
  9. 9

    Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl

    eCLIP vPart of eCLIP pipeline (yeolab/eclip repository) GitHub
    $ Bash example
    # Clone the eCLIP repository to obtain the script
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    
    # Example usage: Merge two replicate peak files (replace with actual input files)
    # Assuming peak_replicate1.bed and peak_replicate2.bed are the enriched clusters from replicates
    perl bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl merged_replicate_peaks.bed peak_replicate1.bed peak_replicate2.bed

Tools Used

Raw Source Text
Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Uniquely mapped reads were removed of PCR duplicates with umi_tools
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
Assembly: mm10
Supplementary files format and content: bigwigs contain RPM-normalized read densities of uniquely-mapped reads
Supplementary files format and content: BED files contain CLIPper peak clusters. Columns 4 and 5 describe the -log10(p-value) and log2(fold) enrichment IP over corresponding SMInput.
← Back to Analysis