GSE210437 Processing Pipeline
Publication
Epistatic interactions between NMD and TRP53 control progenitor cell maintenance and brain size.Neuron (2024) — PMID 38697111
Dataset
GSE210437Enhanced CLIP (eCLIP) identified differential UPF1 binding sites in mouse neural progenitor cells (NPCs) samples
Processing Steps
Generate Jupyter Notebook-
1
Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip
eCLIP vNot explicitly versioned, inferred from repository activity (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# The eCLIP pipeline is a CWL workflow. This script demonstrates how to run it using cwltool. # It assumes the eCLIP repository has been cloned and cwltool is installed. # Clone the eCLIP pipeline repository (if not already present) # git clone https://github.com/yeolab/eclip.git # cd eclip # Install cwltool if not already installed # pip install cwltool # Define placeholder input files and reference data. # In a real scenario, these paths would point to actual data. # Reference datasets (e.g., hg38 genome, GENCODE annotation) are not specified in the description, # so common human reference data (hg38, GENCODE v38) are used as placeholders. # Create a dummy inputs.yaml file for the eCLIP CWL workflow. # Replace '/path/to/...' with actual file paths. cat << EOF > eclip_inputs.yaml fastq_replicate1_r1: class: File path: /path/to/replicate1_R1.fastq.gz fastq_replicate1_r2: # Optional, if paired-end data is available class: File path: /path/to/replicate1_R2.fastq.gz fastq_replicate2_r1: class: File path: /path/to/replicate2_R1.fastq.gz fastq_replicate2_r2: # Optional, if paired-end data is available class: File path: /path/to/replicate2_R2.fastq.gz fastq_input_r1: class: File path: /path/to/input_R1.fastq.gz fastq_input_r2: # Optional, if paired-end data is available class: File path: /path/to/input_R2.fastq.gz genome_fasta: class: File path: /path/to/reference/hg38.fa # Placeholder: Human genome assembly hg38 genome_gtf: class: File path: /path/to/reference/gencode.v38.annotation.gtf # Placeholder: GENCODE v38 annotation star_index_tar: class: File path: /path/to/reference/star_index_hg38.tar # Placeholder: STAR index for hg38 chrom_sizes: class: File path: /path/to/reference/hg38.chrom.sizes # Placeholder: Chromosome sizes for hg38 blacklist_bed: class: File path: /path/to/reference/hg38_blacklist.bed # Placeholder: ENCODE blacklist for hg38 output_prefix: "my_eclip_experiment" EOF # Execute the eCLIP CWL workflow using cwltool. # This command assumes you are in the 'eclip' directory or have adjusted paths to 'cwl/eclip_workflow.cwl'. # Key tools used within this pipeline (inferred from CWL files in the repository): # - STAR (e.g., v2.7.10a for alignment) # - Picard MarkDuplicates (e.g., v2.27.4 for deduplication) # - CLIPper (e.g., v1.0 for peak calling) # - merge_peaks (e.g., v1.0 for IDR) cwltool cwl/eclip_workflow.cwl eclip_inputs.yaml
-
2
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
UMI-tools vInferred with models/gemini-2.5-flash$ Bash example
# Install umi-tools if not already installed # conda install -c bioconda umi-tools # UMIs were extracted from raw sequencing reads. # This example assumes a 10bp UMI is located at the very beginning of the sequencing read. # Replace 'raw_reads.fastq.gz' with your actual input file and 'umi_extracted_reads.fastq.gz' with your desired output file. umi_tools extract \ --extract-method=regex \ --bc-pattern="^(?P<umi_1>.{10})" \ -I raw_reads.fastq.gz \ -S umi_extracted_reads.fastq.gz -
3
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
$ Bash example
# Install cutadapt (example using conda) # conda create -n cutadapt_env cutadapt=4.0 # conda activate cutadapt_env # Define input/output files INPUT_FASTQ="post_umi_extracted_reads.fastq.gz" OUTPUT_FASTQ="trimmed_reads.fastq.gz" # Define adapters based on common eCLIP practices (e.g., from Yeo lab pipelines) # 3' Illumina adapter sequence ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # 5' adapter/barcode sequence (example from Yeo lab's skipper pipeline, adjust if specific barcode is known) ADAPTER_5PRIME="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA" # Minimum read length after trimming MIN_LENGTH=18 # Number of threads to use THREADS=8 cutadapt \ -a "${ADAPTER_3PRIME}" \ -g "${ADAPTER_5PRIME}" \ -o "${OUTPUT_FASTQ}" \ --minimum-length "${MIN_LENGTH}" \ --cores "${THREADS}" \ --discard-untrimmed \ "${INPUT_FASTQ}" -
4
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
STAR v2.7.10a$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star=2.7.10a # Create STAR index for RepBase (if not already done) # This step requires a RepBase FASTA file and will generate the necessary index files. # STAR --runMode genomeGenerate \ # --genomeDir /path/to/repbase_star_index \ # --genomeFastaFiles /path/to/RepBase.fasta \ # --runThreadN <threads> # Define variables THREADS=8 # Adjust as needed based on available resources REPBASE_INDEX="/path/to/repbase_star_index" # Path to your pre-built RepBase STAR index TRIMMED_READS_R1="trimmed_reads_R1.fastq.gz" # Path to your trimmed R1 reads TRIMMED_READS_R2="trimmed_reads_R2.fastq.gz" # Path to your trimmed R2 reads (remove if single-end) OUTPUT_PREFIX="star_repbase_filtered" # Prefix for all output files OUTPUT_DIR="./star_repbase_output" # Directory for STAR output files # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run STAR to map against RepBase and output unmapped reads in Fastx format STAR \ --runThreadN "${THREADS}" \ --genomeDir "${REPBASE_INDEX}" \ --readFilesIn "${TRIMMED_READS_R1}" "${TRIMMED_READS_R2}" \ --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}_" \ --outFilterMultimapNmax 30 \ --alignEndsType EndToEnd \ --outFilterMultimapScoreRange 1 \ --outSAMmode Full \ --outFilterType BySJout \ --outSAMtype BAM Unsorted \ --outFilterScoreMin 10 \ --outReadsUnmapped Fastx \ --outSAMattributes All # After execution, the unmapped reads will be in files like: # "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Unmapped.out.mate1" (for R1) # "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Unmapped.out.mate2" (for R2) # The mapped reads (BAM) will be in "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Aligned.out.bam" -
5
Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables NUM_THREADS="8" # Adjust as needed STAR_INDEX_DIR="/path/to/STAR_index/mm10" # Replace with the actual path to your mm10 STAR genome index INPUT_READS_FILE="remaining_reads.fastq.gz" # Replace with your input FASTQ file (e.g., filtered reads) OUTPUT_PREFIX="aligned_reads_" # Prefix for output files (e.g., aligned_reads_Aligned.out.bam) # Create STAR genome index if it doesn't exist (example command, typically done once) # STAR --runThreadN "${NUM_THREADS}" --runMode genomeGenerate \ # --genomeDir "${STAR_INDEX_DIR}" \ # --genomeFastaFiles /path/to/mm10.fa \ # --sjdbGTFfile /path/to/mm10.gtf \ # --sjdbOverhang 100 # Adjust sjdbOverhang based on read length # Map reads using STAR STAR --runThreadN "${NUM_THREADS}" \ --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${INPUT_READS_FILE}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outFilterMultimapNmax 1 \ --alignEndsType EndToEnd \ --outFilterMultimapScoreRange 1 \ --outSAMmode Full \ --outFilterType BySJout \ --outSAMtype BAM Unsorted \ --outFilterScoreMin 10 \ --outReadsUnmapped Fastx \ --outSAMattributes All # Note: The output BAM file (e.g., aligned_reads_Aligned.out.bam) will be unsorted. # You might want to sort and index it afterwards: # samtools sort -o "${OUTPUT_PREFIX}Aligned.sorted.bam" "${OUTPUT_PREFIX}Aligned.out.bam" # samtools index "${OUTPUT_PREFIX}Aligned.sorted.bam" -
6
Uniquely mapped reads were removed of PCR duplicates with umi_tools
$ Bash example
# Install UMI-tools (e.g., using conda) # conda install -c bioconda umi-tools # Define input and output file paths # INPUT_BAM should be a coordinate-sorted BAM file of uniquely mapped reads # where UMIs have been extracted and appended to read names (e.g., by umi_tools extract). INPUT_BAM="uniquely_mapped_reads.bam" OUTPUT_BAM="uniquely_mapped_reads.dedup.bam" STATS_FILE="uniquely_mapped_reads.dedup.stats" LOG_FILE="uniquely_mapped_reads.dedup.log" # Perform PCR duplicate removal using umi_tools dedup # --extract-umis-from-read-names: Assumes UMIs are in the read names (e.g., from a prior umi_tools extract step) # --paired: Specifies that the input reads are paired-end # --method unique: Uses the 'unique' method for deduplication (default and common for most applications) # -I: Input BAM file # -S: Output deduplicated BAM file # --output-stats: Generates a file with deduplication statistics # --log: Writes log messages to a file umi_tools dedup \ --extract-umis-from-read-names \ --paired \ --method unique \ -I "${INPUT_BAM}" \ -S "${OUTPUT_BAM}" \ --output-stats "${STATS_FILE}" \ --log "${LOG_FILE}" -
7
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
$ Bash example
# Install dependencies (if not already installed) # pip install numpy scipy pysam pybedtools # Clone the CLIPper repository if not already available # git clone https://github.com/YeoLab/clipper.git # cd clipper # Placeholder variables - replace with actual file paths and parameters # INPUT_BAM: Path to the input BAM file (e.g., IP sample aligned reads) # CONTROL_BAM: Path to the control BAM file (e.g., input/mock IP sample aligned reads) - highly recommended for eCLIP # OUTPUT_PREFIX: Prefix for output files (e.g., peak BED file, log file) # GENOME_FASTA: Path to the reference genome FASTA file (e.g., hg38.fa) # ANNOTATION_GTF: Path to the gene annotation GTF/GFF file (e.g., hg38.gtf) # P_VALUE_THRESHOLD: P-value threshold for peak calling (e.g., 0.01) # FDR_THRESHOLD: False Discovery Rate threshold for peak calling (e.g., 0.05) # SIZE_FACTOR: Normalization factor for library size differences between IP and control (e.g., 1.0 if already normalized) INPUT_BAM="path/to/your/input.bam" CONTROL_BAM="path/to/your/control.bam" OUTPUT_PREFIX="clipper_peaks" GENOME_FASTA="path/to/hg38.fa" # Placeholder: Use latest human assembly like hg38 ANNOTATION_GTF="path/to/hg38.gtf" # Placeholder: Use latest human assembly like hg38 P_VALUE_THRESHOLD="0.01" FDR_THRESHOLD="0.05" SIZE_FACTOR="1.0" # Execute CLIPper for peak calling # Ensure clipper.py is executable and in your PATH, or provide the full path to clipper.py python clipper.py \ -b "${INPUT_BAM}" \ -c "${CONTROL_BAM}" \ -s "${SIZE_FACTOR}" \ -o "${OUTPUT_PREFIX}" \ -p "${P_VALUE_THRESHOLD}" \ -f "${FDR_THRESHOLD}" \ -g "${GENOME_FASTA}" \ -a "${ANNOTATION_GTF}" -
8
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
eCLIP vv1.0.0$ Bash example
# Install Perl if not available # sudo apt-get update && sudo apt-get install -y perl # Or via conda: # conda install -c bioconda perl # Download the script from the eCLIP repository # wget https://raw.githubusercontent.com/yeolab/eclip/v1.0.0/scripts/overlap_peakfi_with_bam.pl # chmod +x overlap_peakfi_with_bam.pl # Define input and output files (placeholders) # ECLIP_PEAKS: Path to the eCLIP peak file (clusters) in BED format ECLIP_PEAKS="eCLIP_peaks.bed" # SMINPUT_BAM: Path to the size-matched input BAM file SMINPUT_BAM="SMInput.bam" # OUTPUT_FILE: Path for the output file containing enriched clusters OUTPUT_FILE="enriched_clusters.bed" # Execute the script to identify enriched clusters # The description mentions "Clusters enriched", implying specific enrichment thresholds were applied. # However, no specific parameters (e.g., --min_fold_enrichment, --min_log2_fold_change, --min_fdr) # were provided in the description. The command below uses the script's default filtering. # If specific thresholds were used, they would be added here, e.g.: # --min_log2_fold_change 1 --min_fdr 0.05 perl overlap_peakfi_with_bam.pl \ --peak_file "${ECLIP_PEAKS}" \ --bam_file "${SMINPUT_BAM}" \ --output_file "${OUTPUT_FILE}" -
9
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
$ Bash example
# Clone the eCLIP repository to obtain the script # git clone https://github.com/yeolab/eclip.git # cd eclip # Example usage: Merge two replicate peak files (replace with actual input files) # Assuming peak_replicate1.bed and peak_replicate2.bed are the enriched clusters from replicates perl bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl merged_replicate_peaks.bed peak_replicate1.bed peak_replicate2.bed
Raw Source Text
Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All) Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All) Uniquely mapped reads were removed of PCR duplicates with umi_tools Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl Assembly: mm10 Supplementary files format and content: bigwigs contain RPM-normalized read densities of uniquely-mapped reads Supplementary files format and content: BED files contain CLIPper peak clusters. Columns 4 and 5 describe the -log10(p-value) and log2(fold) enrichment IP over corresponding SMInput.