GSE210437 Processing Pipeline

RNA-Seq code_examples 9 steps

Publication

Epistatic interactions between NMD and TRP53 control progenitor cell maintenance and brain size.

Neuron (2024) — PMID 38697111

Dataset

Enhanced CLIP (eCLIP) identified differential UPF1 binding sites in mouse neural progenitor cells (NPCs) samples

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip

eCLIP vNot explicitly versioned, inferred from repository activity (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# The eCLIP pipeline is a CWL workflow. This script demonstrates how to run it using cwltool.
# It assumes the eCLIP repository has been cloned and cwltool is installed.

# Clone the eCLIP pipeline repository (if not already present)
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Install cwltool if not already installed
# pip install cwltool

# Define placeholder input files and reference data.
# In a real scenario, these paths would point to actual data.
# Reference datasets (e.g., hg38 genome, GENCODE annotation) are not specified in the description,
# so common human reference data (hg38, GENCODE v38) are used as placeholders.

# Create a dummy inputs.yaml file for the eCLIP CWL workflow.
# Replace '/path/to/...' with actual file paths.
cat << EOF > eclip_inputs.yaml
fastq_replicate1_r1:
  class: File
  path: /path/to/replicate1_R1.fastq.gz
fastq_replicate1_r2: # Optional, if paired-end data is available
  class: File
  path: /path/to/replicate1_R2.fastq.gz
fastq_replicate2_r1:
  class: File
  path: /path/to/replicate2_R1.fastq.gz
fastq_replicate2_r2: # Optional, if paired-end data is available
  class: File
  path: /path/to/replicate2_R2.fastq.gz
fastq_input_r1:
  class: File
  path: /path/to/input_R1.fastq.gz
fastq_input_r2: # Optional, if paired-end data is available
  class: File
  path: /path/to/input_R2.fastq.gz
genome_fasta:
  class: File
  path: /path/to/reference/hg38.fa # Placeholder: Human genome assembly hg38
genome_gtf:
  class: File
  path: /path/to/reference/gencode.v38.annotation.gtf # Placeholder: GENCODE v38 annotation
star_index_tar:
  class: File
  path: /path/to/reference/star_index_hg38.tar # Placeholder: STAR index for hg38
chrom_sizes:
  class: File
  path: /path/to/reference/hg38.chrom.sizes # Placeholder: Chromosome sizes for hg38
blacklist_bed:
  class: File
  path: /path/to/reference/hg38_blacklist.bed # Placeholder: ENCODE blacklist for hg38
output_prefix: "my_eclip_experiment"
EOF

# Execute the eCLIP CWL workflow using cwltool.
# This command assumes you are in the 'eclip' directory or have adjusted paths to 'cwl/eclip_workflow.cwl'.
# Key tools used within this pipeline (inferred from CWL files in the repository):
# - STAR (e.g., v2.7.10a for alignment)
# - Picard MarkDuplicates (e.g., v2.27.4 for deduplication)
# - CLIPper (e.g., v1.0 for peak calling)
# - merge_peaks (e.g., v1.0 for IDR)
cwltool cwl/eclip_workflow.cwl eclip_inputs.yaml

View on GitHub

Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract

UMI-tools vInferred with models/gemini-2.5-flash

$ Bash example

# Install umi-tools if not already installed
# conda install -c bioconda umi-tools

# UMIs were extracted from raw sequencing reads.
# This example assumes a 10bp UMI is located at the very beginning of the sequencing read.
# Replace 'raw_reads.fastq.gz' with your actual input file and 'umi_extracted_reads.fastq.gz' with your desired output file.

umi_tools extract \
    --extract-method=regex \
    --bc-pattern="^(?P<umi_1>.{10})" \
    -I raw_reads.fastq.gz \
    -S umi_extracted_reads.fastq.gz

Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

cutadapt v4.0 GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda create -n cutadapt_env cutadapt=4.0
# conda activate cutadapt_env

# Define input/output files
INPUT_FASTQ="post_umi_extracted_reads.fastq.gz"
OUTPUT_FASTQ="trimmed_reads.fastq.gz"

# Define adapters based on common eCLIP practices (e.g., from Yeo lab pipelines)
# 3' Illumina adapter sequence
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# 5' adapter/barcode sequence (example from Yeo lab's skipper pipeline, adjust if specific barcode is known)
ADAPTER_5PRIME="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA"

# Minimum read length after trimming
MIN_LENGTH=18

# Number of threads to use
THREADS=8

cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME}" \
  -o "${OUTPUT_FASTQ}" \
  --minimum-length "${MIN_LENGTH}" \
  --cores "${THREADS}" \
  --discard-untrimmed \
  "${INPUT_FASTQ}"

View on GitHub

Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

STAR v2.7.10a

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star=2.7.10a

# Create STAR index for RepBase (if not already done)
# This step requires a RepBase FASTA file and will generate the necessary index files.
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/repbase_star_index \
#      --genomeFastaFiles /path/to/RepBase.fasta \
#      --runThreadN <threads>

# Define variables
THREADS=8 # Adjust as needed based on available resources
REPBASE_INDEX="/path/to/repbase_star_index" # Path to your pre-built RepBase STAR index
TRIMMED_READS_R1="trimmed_reads_R1.fastq.gz" # Path to your trimmed R1 reads
TRIMMED_READS_R2="trimmed_reads_R2.fastq.gz" # Path to your trimmed R2 reads (remove if single-end)
OUTPUT_PREFIX="star_repbase_filtered" # Prefix for all output files
OUTPUT_DIR="./star_repbase_output" # Directory for STAR output files

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run STAR to map against RepBase and output unmapped reads in Fastx format
STAR \
    --runThreadN "${THREADS}" \
    --genomeDir "${REPBASE_INDEX}" \
    --readFilesIn "${TRIMMED_READS_R1}" "${TRIMMED_READS_R2}" \
    --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}_" \
    --outFilterMultimapNmax 30 \
    --alignEndsType EndToEnd \
    --outFilterMultimapScoreRange 1 \
    --outSAMmode Full \
    --outFilterType BySJout \
    --outSAMtype BAM Unsorted \
    --outFilterScoreMin 10 \
    --outReadsUnmapped Fastx \
    --outSAMattributes All

# After execution, the unmapped reads will be in files like:
# "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Unmapped.out.mate1" (for R1)
# "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Unmapped.out.mate2" (for R2)
# The mapped reads (BAM) will be in "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Aligned.out.bam"

Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

STAR v2.5.2b GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
NUM_THREADS="8" # Adjust as needed
STAR_INDEX_DIR="/path/to/STAR_index/mm10" # Replace with the actual path to your mm10 STAR genome index
INPUT_READS_FILE="remaining_reads.fastq.gz" # Replace with your input FASTQ file (e.g., filtered reads)
OUTPUT_PREFIX="aligned_reads_" # Prefix for output files (e.g., aligned_reads_Aligned.out.bam)

# Create STAR genome index if it doesn't exist (example command, typically done once)
# STAR --runThreadN "${NUM_THREADS}" --runMode genomeGenerate \
#      --genomeDir "${STAR_INDEX_DIR}" \
#      --genomeFastaFiles /path/to/mm10.fa \
#      --sjdbGTFfile /path/to/mm10.gtf \
#      --sjdbOverhang 100 # Adjust sjdbOverhang based on read length

# Map reads using STAR
STAR --runThreadN "${NUM_THREADS}" \
     --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${INPUT_READS_FILE}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outFilterMultimapNmax 1 \
     --alignEndsType EndToEnd \
     --outFilterMultimapScoreRange 1 \
     --outSAMmode Full \
     --outFilterType BySJout \
     --outSAMtype BAM Unsorted \
     --outFilterScoreMin 10 \
     --outReadsUnmapped Fastx \
     --outSAMattributes All

# Note: The output BAM file (e.g., aligned_reads_Aligned.out.bam) will be unsorted.
# You might want to sort and index it afterwards:
# samtools sort -o "${OUTPUT_PREFIX}Aligned.sorted.bam" "${OUTPUT_PREFIX}Aligned.out.bam"
# samtools index "${OUTPUT_PREFIX}Aligned.sorted.bam"

View on GitHub

Uniquely mapped reads were removed of PCR duplicates with umi_tools

UMI-tools v1.1.2 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install UMI-tools (e.g., using conda)
# conda install -c bioconda umi-tools

# Define input and output file paths
# INPUT_BAM should be a coordinate-sorted BAM file of uniquely mapped reads
# where UMIs have been extracted and appended to read names (e.g., by umi_tools extract).
INPUT_BAM="uniquely_mapped_reads.bam"
OUTPUT_BAM="uniquely_mapped_reads.dedup.bam"
STATS_FILE="uniquely_mapped_reads.dedup.stats"
LOG_FILE="uniquely_mapped_reads.dedup.log"

# Perform PCR duplicate removal using umi_tools dedup
# --extract-umis-from-read-names: Assumes UMIs are in the read names (e.g., from a prior umi_tools extract step)
# --paired: Specifies that the input reads are paired-end
# --method unique: Uses the 'unique' method for deduplication (default and common for most applications)
# -I: Input BAM file
# -S: Output deduplicated BAM file
# --output-stats: Generates a file with deduplication statistics
# --log: Writes log messages to a file
umi_tools dedup \
    --extract-umis-from-read-names \
    --paired \
    --method unique \
    -I "${INPUT_BAM}" \
    -S "${OUTPUT_BAM}" \
    --output-stats "${STATS_FILE}" \
    --log "${LOG_FILE}"

View on GitHub

Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper

CLIPper vNot specified GitHub

$ Bash example

# Install dependencies (if not already installed)
# pip install numpy scipy pysam pybedtools

# Clone the CLIPper repository if not already available
# git clone https://github.com/YeoLab/clipper.git
# cd clipper

# Placeholder variables - replace with actual file paths and parameters
# INPUT_BAM: Path to the input BAM file (e.g., IP sample aligned reads)
# CONTROL_BAM: Path to the control BAM file (e.g., input/mock IP sample aligned reads) - highly recommended for eCLIP
# OUTPUT_PREFIX: Prefix for output files (e.g., peak BED file, log file)
# GENOME_FASTA: Path to the reference genome FASTA file (e.g., hg38.fa)
# ANNOTATION_GTF: Path to the gene annotation GTF/GFF file (e.g., hg38.gtf)
# P_VALUE_THRESHOLD: P-value threshold for peak calling (e.g., 0.01)
# FDR_THRESHOLD: False Discovery Rate threshold for peak calling (e.g., 0.05)
# SIZE_FACTOR: Normalization factor for library size differences between IP and control (e.g., 1.0 if already normalized)

INPUT_BAM="path/to/your/input.bam"
CONTROL_BAM="path/to/your/control.bam"
OUTPUT_PREFIX="clipper_peaks"
GENOME_FASTA="path/to/hg38.fa" # Placeholder: Use latest human assembly like hg38
ANNOTATION_GTF="path/to/hg38.gtf" # Placeholder: Use latest human assembly like hg38
P_VALUE_THRESHOLD="0.01"
FDR_THRESHOLD="0.05"
SIZE_FACTOR="1.0"

# Execute CLIPper for peak calling
# Ensure clipper.py is executable and in your PATH, or provide the full path to clipper.py
python clipper.py \
    -b "${INPUT_BAM}" \
    -c "${CONTROL_BAM}" \
    -s "${SIZE_FACTOR}" \
    -o "${OUTPUT_PREFIX}" \
    -p "${P_VALUE_THRESHOLD}" \
    -f "${FDR_THRESHOLD}" \
    -g "${GENOME_FASTA}" \
    -a "${ANNOTATION_GTF}"

View on GitHub

Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl

eCLIP vv1.0.0

$ Bash example

# Install Perl if not available
# sudo apt-get update && sudo apt-get install -y perl
# Or via conda:
# conda install -c bioconda perl

# Download the script from the eCLIP repository
# wget https://raw.githubusercontent.com/yeolab/eclip/v1.0.0/scripts/overlap_peakfi_with_bam.pl
# chmod +x overlap_peakfi_with_bam.pl

# Define input and output files (placeholders)
# ECLIP_PEAKS: Path to the eCLIP peak file (clusters) in BED format
ECLIP_PEAKS="eCLIP_peaks.bed"
# SMINPUT_BAM: Path to the size-matched input BAM file
SMINPUT_BAM="SMInput.bam"
# OUTPUT_FILE: Path for the output file containing enriched clusters
OUTPUT_FILE="enriched_clusters.bed"

# Execute the script to identify enriched clusters
# The description mentions "Clusters enriched", implying specific enrichment thresholds were applied.
# However, no specific parameters (e.g., --min_fold_enrichment, --min_log2_fold_change, --min_fdr)
# were provided in the description. The command below uses the script's default filtering.
# If specific thresholds were used, they would be added here, e.g.:
# --min_log2_fold_change 1 --min_fdr 0.05
perl overlap_peakfi_with_bam.pl \
    --peak_file "${ECLIP_PEAKS}" \
    --bam_file "${SMINPUT_BAM}" \
    --output_file "${OUTPUT_FILE}"

Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl

eCLIP vPart of eCLIP pipeline (yeolab/eclip repository) GitHub

$ Bash example

# Clone the eCLIP repository to obtain the script
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Example usage: Merge two replicate peak files (replace with actual input files)
# Assuming peak_replicate1.bed and peak_replicate2.bed are the enriched clusters from replicates
perl bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl merged_replicate_peaks.bed peak_replicate1.bed peak_replicate2.bed

View on GitHub

Tools Used

eCLIP STAR

Raw Source Text

Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Uniquely mapped reads were removed of PCR duplicates with umi_tools
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
Assembly: mm10
Supplementary files format and content: bigwigs contain RPM-normalized read densities of uniquely-mapped reads
Supplementary files format and content: BED files contain CLIPper peak clusters. Columns 4 and 5 describe the -log10(p-value) and log2(fold) enrichment IP over corresponding SMInput.

← Back to Analysis