GSE240521 Processing Pipeline

RIP-Seq, 9 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature Communications (2024), PMID 39152130

Dataset

GSE240521

An in situ method for identification of transcriptome-wide protein-RNA interactions in cells [eCLIP-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip

    $ Bash example
    # Install cwltool if not already installed
    # pip install cwltool
    
    # Clone the eCLIP pipeline repository
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    
    # Placeholder for input data and reference genome.
    # Replace with actual paths to your FASTQ files and reference genome (mm10 for this dataset).
    # Example:
    # READ1_FASTQ="/path/to/sample_R1.fastq.gz"
    # READ2_FASTQ="/path/to/sample_R2.fastq.gz"
    # REFERENCE_GENOME_FASTA="/path/to/mm10.fa"
    # REFERENCE_GENOME_GTF="/path/to/mm10.gtf" # Or appropriate annotation file
    
    # Create an inputs.yaml file matching the inputs declared by the workflow you run.
    # The key names below are illustrative placeholders, not the pipeline's confirmed input names;
    # check the example job files shipped in the repository for the real schema.
    # Example simplified inputs.yaml structure:
    # cat << EOF > inputs.yaml
    # read1:
    #   class: File
    #   path: ${READ1_FASTQ}
    # read2:
    #   class: File
    #   path: ${READ2_FASTQ}
    # genome_fasta:
    #   class: File
    #   path: ${REFERENCE_GENOME_FASTA}
    # genome_gtf:
    #   class: File
    #   path: ${REFERENCE_GENOME_GTF}
    # output_prefix: "my_eclip_output"
    # EOF
    
    # Execute the eCLIP CWL pipeline.
    # 'eclip.cwl' is a placeholder for the workflow file in the cloned repository; substitute the
    # actual workflow path and make sure 'inputs.yaml' is in the current directory.
    cwltool eclip.cwl inputs.yaml
  2.

    Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract

    UMI-tools (version not stated in the source) GitHub
    $ Bash example
    # Install UMI-tools
    # conda install -c bioconda umi-tools
    
    # Example: Extract UMIs (assuming 12bp at the start of Read 1) from raw sequencing reads
    # and append them to the read header.
    # Input: raw_reads_R1.fastq.gz
    # Output: umi_extracted_reads_R1.fastq.gz
    umi_tools extract \
        --bc-pattern="^(?P<umi_1>.{12})" \
        --extract-method=regex \
        -I raw_reads_R1.fastq.gz \
        -S umi_extracted_reads_R1.fastq.gz \
        --log=umi_tools_extract.log
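
To make the effect of UMI extraction concrete, the sketch below reproduces it on a single toy FASTQ record using only awk. It is an illustration, not a replacement for umi_tools: the function name `extract_umi`, the toy read, and the 12-nt UMI length are assumptions carried over from the example above, not values stated in the source.

```shell
#!/usr/bin/env bash
# Toy illustration of UMI extraction on FASTQ records.
# UMI_LEN=12 mirrors the assumption used above; the source does not state it.
UMI_LEN=12

# extract_umi: read FASTQ on stdin, move the first UMI_LEN bases of each read
# into the read name (joined with "_") and trim sequence and qualities to match.
extract_umi() {
    awk -v n="$UMI_LEN" '
        NR % 4 == 1 {                      # header: "@name [comment]"
            name = $1
            rest = substr($0, length($1) + 1)
            next
        }
        NR % 4 == 2 {                      # sequence
            print name "_" substr($0, 1, n) rest
            print substr($0, n + 1)
            next
        }
        NR % 4 == 3 { print; next }        # "+" separator
        { print substr($0, n + 1) }        # qualities
    '
}

# One hypothetical 20-nt read: 12-nt UMI followed by 8 nt of insert.
printf '@read1 1:N:0\nACGTACGTACGTTTGGCCAA\n+\nIIIIIIIIIIIIFFFFFFFF\n' | extract_umi
```

The UMI ends up in the read name, which is where the later deduplication step looks for it when run with `--extract-umi-method=read_id`.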
  3.

    Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files
    INPUT_READS="sample_umi_extracted.fastq.gz"
    OUTPUT_TRIMMED_READS="sample_trimmed.fastq.gz"
    OUTPUT_UNTRIMMED_READS="sample_untrimmed.fastq.gz" # Optional, for reads where no adapter was found
    
    # Define adapter sequences (assumed Illumina TruSeq adapters; the source does not list the
    # exact sequences, so confirm them against the library preparation protocol)
    # TruSeq Indexed Adapter sequence, typically read through at the 3' end of Read 1
    ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    # TruSeq Universal Adapter (contains the P5 sequence), searched here at the 5' end
    ADAPTER_5PRIME="AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
    
    # Define trimming parameters (illustrative values; not stated in the source)
    MIN_LENGTH=18 # Minimum read length after trimming
    QUALITY_CUTOFF=20 # Quality cutoff for 3' end trimming (Phred score)
    THREADS=8 # Number of CPU threads to use
    
    # Execute cutadapt for adapter and barcode trimming
    cutadapt \
        -a "${ADAPTER_3PRIME}" \
        -g "${ADAPTER_5PRIME}" \
        -o "${OUTPUT_TRIMMED_READS}" \
        --untrimmed-output "${OUTPUT_UNTRIMMED_READS}" \
        --minimum-length "${MIN_LENGTH}" \
        --quality-cutoff "${QUALITY_CUTOFF}" \
        --cores "${THREADS}" \
        "${INPUT_READS}"
  4.

    Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

    $ Bash example
    # Install STAR (example)
    # conda install -c bioconda star
    
    # Define variables
    # Placeholder for input trimmed reads. Adjust for paired-end if necessary.
    # The file is gzip-compressed, so STAR must be told how to read it (--readFilesCommand zcat).
    READS_R1="trimmed_reads_R1.fastq.gz"
    # READS_R2="trimmed_reads_R2.fastq.gz" # Uncomment for paired-end
    
    # Placeholder for RepBase FASTA file. RepBase is distributed by the Genetic Information
    # Research Institute (girinst.org) under a license; the source gives no download URL.
    REPBASE_FASTA="RepBase.fasta"
    
    # Directory for the STAR genome index
    STAR_INDEX_DIR="RepBase_STAR_index"
    
    # Prefix for output files
    OUTPUT_PREFIX="repbase_filtered"
    
    # Number of threads to use
    THREADS=8 # Adjust as needed
    
    # --- Step 1: Build STAR index for RepBase (if not already built) ---
    # This step is typically performed once. If the index already exists, skip this.
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${STAR_INDEX_DIR} \
    #      --genomeFastaFiles ${REPBASE_FASTA} \
    #      --runThreadN ${THREADS} \
    #      --genomeSAindexNbases 10 # Adjust based on genome size, 10 is suitable for small decoy genomes
    
    # --- Step 2: Map trimmed reads against RepBase and extract unmapped reads ---
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${READS_R1}" \
         --readFilesCommand zcat \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outFilterMultimapNmax 30 \
         --alignEndsType EndToEnd \
         --outFilterMultimapScoreRange 1 \
         --outSAMmode Full \
         --outFilterType BySJout \
         --outSAMtype BAM Unsorted \
         --outFilterScoreMin 10 \
         --outReadsUnmapped Fastx \
         --outSAMattributes All
    
    # The unmapped reads will be in files like: ${OUTPUT_PREFIX}Unmapped.out.mate1 (and .mate2 for paired-end)
    
  5.

    Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables for clarity
    # Replace /path/to/STAR_index/mm10 with the actual path to your mm10 STAR genome index
    GENOME_DIR="/path/to/STAR_index/mm10"
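    # Replace remaining_reads.fastq.gz with your actual input FASTQ file. A gzip-compressed file
    # needs --readFilesCommand zcat; the Unmapped.out.mate1 file written by the RepBase step is
    # uncompressed and must be passed without that option.
    INPUT_FASTQ="remaining_reads.fastq.gz"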
    # Prefix for output files (e.g., aligned_mm10_Aligned.out.bam, aligned_mm10_Unmapped.out.mate1)
    OUTPUT_PREFIX="aligned_mm10_"
    
    # Run STAR alignment
    STAR \
      --runThreadN 8 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${INPUT_FASTQ}" \
      --readFilesCommand zcat \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outFilterMultimapNmax 1 \
      --alignEndsType EndToEnd \
      --outFilterMultimapScoreRange 1 \
      --outSAMmode Full \
      --outFilterType BySJout \
      --outSAMtype BAM Unsorted \
      --outFilterScoreMin 10 \
      --outReadsUnmapped Fastx \
      --outSAMattributes All
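
The RepBase and mm10 alignments share every quoted STAR option except `--outFilterMultimapNmax` (30 versus 1). One way to keep the two invocations in sync is a small helper that assembles the command line from the parts that vary. The sketch below only prints the command so it can be inspected before running; the function name `star_cmd` and all paths are hypothetical.

```shell
#!/usr/bin/env bash
# star_cmd INDEX_DIR READS PREFIX MULTIMAP_MAX
# Prints (does not run) a STAR command line carrying the options quoted in the
# two mapping steps. All four arguments are hypothetical placeholders.
star_cmd() {
    local index_dir=$1 reads=$2 prefix=$3 multimap_max=$4
    printf 'STAR --genomeDir %s --readFilesIn %s --outFileNamePrefix %s' \
        "$index_dir" "$reads" "$prefix"
    printf ' --outFilterMultimapNmax %s' "$multimap_max"
    # Options shared verbatim by the RepBase and mm10 alignments.
    printf ' %s' \
        --alignEndsType EndToEnd \
        --outFilterMultimapScoreRange 1 \
        --outSAMmode Full \
        --outFilterType BySJout \
        --outSAMtype BAM Unsorted \
        --outFilterScoreMin 10 \
        --outReadsUnmapped Fastx \
        --outSAMattributes All
    printf '\n'
}

# Repeat filter: up to 30 multimapping loci tolerated.
star_cmd RepBase_STAR_index trimmed_reads_R1.fastq repbase_filtered_ 30
# Genome alignment: unique mappers only, fed with the reads STAR left unmapped above.
star_cmd /path/to/STAR_index/mm10 repbase_filtered_Unmapped.out.mate1 aligned_mm10_ 1
```

Printing rather than executing keeps the sketch runnable without STAR installed; the printed line can be pasted into a shell once the paths are real.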
  6.

    PCR duplicates were removed from uniquely mapped reads with umi_tools

    UMI-tools v1.1.2 (version inferred, not stated in the source) GitHub
    $ Bash example
    # Install UMI-tools (example using conda)
    # conda install -c bioconda umi-tools
    
    # Check UMI-tools version
    # umi_tools --version
    
    # Remove PCR duplicates from uniquely mapped reads.
    # umi_tools dedup needs a coordinate-sorted, indexed BAM; STAR wrote an unsorted BAM above,
    # so sort and index first (requires samtools).
    samtools sort -o mapped_reads.sorted.bam mapped_reads.bam
    samtools index mapped_reads.sorted.bam
    
    # umi_tools extract appends the UMI to the read name after an underscore, which is also the
    # default --umi-separator; change it only if the UMIs were attached differently.
    # Placeholder filenames are used for input and output.
    umi_tools dedup \
        -I mapped_reads.sorted.bam \
        -S deduplicated_reads.bam \
        --extract-umi-method=read_id \
        --umi-separator='_' \
        --output-stats=umi_tools_dedup_stats \
        --log=umi_tools_dedup.log
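
The idea behind UMI-based deduplication can be shown on a toy table: reads that share chromosome, position, strand, and UMI are treated as PCR copies of one molecule, and only the first is kept. The sketch below is a deliberate simplification. It collapses exact UMI matches only, whereas umi_tools by default also groups UMIs that differ by sequencing errors. The function name `dedup_exact` and the four-column input layout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# dedup_exact: stdin is tab-separated "read_name<TAB>chrom<TAB>pos<TAB>strand",
# where read_name ends in "_<UMI>". Keeps the first read seen for each
# (chrom, pos, strand, UMI) combination and drops the rest.
dedup_exact() {
    awk -F'\t' '
        {
            n = split($1, parts, "_")      # UMI is the text after the last "_"
            key = $2 FS $3 FS $4 FS parts[n]
            if (!(key in seen)) {
                seen[key] = 1
                print
            }
        }
    '
}

# r2 duplicates r1 (same position and UMI); r3 has a different UMI; r4 a different position.
printf 'r1_AAAA\tchr1\t100\t+\nr2_AAAA\tchr1\t100\t+\nr3_CCCC\tchr1\t100\t+\nr4_AAAA\tchr1\t250\t-\n' \
    | dedup_exact
```

Reads r1, r3, and r4 survive, which is why the UMI must still be recoverable from the read name at this stage.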
  7.

    Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper

    CLIPper (version unspecified) GitHub
    $ Bash example
    # Clone the CLIPper repository
    # git clone https://github.com/YeoLab/clipper.git
    # cd clipper
    
    # Install CLIPper following the repository README; it provides a 'clipper' command.
    
    # Example usage for peak calling on the deduplicated IP BAM.
    # The assembly for this dataset is mm10; confirm that your CLIPper installation ships
    # annotations for the species name you pass.
    # The options below (--bam, --species, --outfile) should be checked with 'clipper --help'
    # for your version. CLIPper calls peaks on the IP sample alone; the comparison with
    # SMInput happens in the next step.
    
    clipper \
        --bam deduplicated_reads.bam \
        --species mm10 \
        --outfile clipper_peaks.bed
  8.

    Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl

    $ Bash example
    # --- Installation (commented out) ---
    # Ensure Perl is installed
    # sudo apt-get update && sudo apt-get install -y perl
    
    # Clone the eCLIP repository to access the script
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # git checkout master # Or a specific commit/tag if available and desired
    # cd ..
    
    # --- Define Input Files ---
    # Placeholders: deduplicated IP BAM, size-matched input BAM, and CLIPper peaks
    IP_BAM="path/to/your/ip.bam"
    SM_INPUT_BAM="path/to/your/sm_input.bam"
    PEAK_FILE="path/to/your/input_peaks.bed"
    OUTPUT_FILE="enriched_clusters.normed.bed"
    
    # Mapped-read counts used for normalization (requires samtools)
    samtools view -c -F 4 "${IP_BAM}" > ip.readnum.txt
    samtools view -c -F 4 "${SM_INPUT_BAM}" > sm_input.readnum.txt
    
    # --- Execute the Script ---
    # ASSUMPTION: the script takes positional arguments rather than named options, and the
    # order shown (IP BAM, input BAM, peaks, IP read count, input read count, output) may
    # differ between versions. Check the usage message in the script itself, and its
    # location in the cloned repository, before running.
    perl eclip/bin/overlap_peakfi_with_bam.pl \
        "${IP_BAM}" \
        "${SM_INPUT_BAM}" \
        "${PEAK_FILE}" \
        ip.readnum.txt \
        sm_input.readnum.txt \
        "${OUTPUT_FILE}"
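
Enrichment over SMInput amounts to comparing library-size-normalized read counts within each cluster. The sketch below computes a log2 fold enrichment for one cluster from four numbers. It is not the Perl script's implementation: the function name `log2_fold`, the pseudocount of 1, and the example counts are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# log2_fold IP_IN_PEAK INPUT_IN_PEAK IP_TOTAL INPUT_TOTAL
# Prints log2 of (IP fraction in peak) / (input fraction in peak).
# A pseudocount of 1 (an assumption) keeps the ratio defined when a count is zero.
log2_fold() {
    awk -v ip="$1" -v inp="$2" -v ip_tot="$3" -v inp_tot="$4" 'BEGIN {
        pseudo = 1
        ratio = ((ip + pseudo) / ip_tot) / ((inp + pseudo) / inp_tot)
        printf "%.3f\n", log(ratio) / log(2)
    }'
}

# Hypothetical cluster: 79 IP reads of 1,000,000 versus 9 input reads of 2,000,000.
log2_fold 79 9 1000000 2000000   # prints 4.000 (a 16-fold enrichment)
```

Normalizing by total mapped reads is what makes IP and SMInput libraries of different depth comparable.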
  9.

    Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl

    $ Bash example
    # Install Perl if not already available
    # sudo apt-get update && sudo apt-get install perl # For Debian/Ubuntu
    # yum install perl # For CentOS/RHEL
    
    # Download the script (if not part of a larger pipeline installation, e.g., cloning the eCLIP repository).
    # The URL below assumes the script sits under bin/ on the master branch; adjust if the layout differs.
    # wget https://raw.githubusercontent.com/yeolab/eclip/master/bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
    # chmod +x compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
    
    # Example usage:
    # ASSUMPTION: the script reads one input-normalized peak file (the output of the previous
    # step) and writes one file in which overlapping peaks have been collapsed. Check the
    # usage message in the script before running.
    perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
        enriched_clusters.normed.bed \
        enriched_clusters.normed.compressed.bed
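
One plausible reading of "merging overlapping enriched clusters" is to group peaks that overlap on the same chromosome and strand and keep a single representative per group. The sketch below keeps the peak with the highest log2 fold enrichment (column 5). The source does not describe how the Perl script chooses, so the selection rule, the function name `merge_overlapping`, and the toy peaks are all assumptions.

```shell
#!/usr/bin/env bash
# merge_overlapping: stdin is tab-separated BED-like text with columns
# chrom, start, end, -log10(p), log2(fold), strand. For each run of overlapping
# peaks on the same chrom and strand, print only the peak with the largest column 5.
merge_overlapping() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k6,6 -k2,2n | awk -F'\t' '
        function emit() { if (best != "") print best }
        {
            if ($1 == chrom && $6 == strand && $2 + 0 < cend) {
                # Overlaps the current group: extend it and update the best peak.
                if ($3 + 0 > cend) cend = $3 + 0
                if ($5 + 0 > best_fold) { best = $0; best_fold = $5 + 0 }
            } else {
                # Starts a new group: flush the previous one.
                emit()
                chrom = $1; strand = $6; cend = $3 + 0
                best = $0; best_fold = $5 + 0
            }
        }
        END { emit() }
    '
}

printf 'chr1\t100\t200\t5.0\t3.2\t+\nchr1\t150\t260\t8.0\t4.5\t+\nchr1\t150\t260\t2.0\t1.0\t-\nchr2\t10\t50\t4.0\t3.0\t+\n' \
    | merge_overlapping
```

The two overlapping plus-strand peaks on chr1 collapse to the stronger one, while the minus-strand peak at the same coordinates is kept because strands are handled separately.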

Raw Source Text
Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Uniquely mapped reads were removed of PCR duplicates with umi_tools
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
Assembly: mm10
Supplementary files format and content: bigwigs contain RPM-normalized read densities of uniquely-mapped reads
Supplementary files format and content: BED files contain CLIPper peak clusters. Columns 4 and 5 describe the -log10(p-value) and log2(fold) enrichment IP over corresponding SMInput.
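
Because columns 4 and 5 of the BED files hold -log10(p-value) and log2(fold) enrichment, significant peaks can be selected with a one-line filter. The sketch below is a minimal example; the function name `filter_peaks` and the default thresholds of 3 and 3 (p ≤ 0.001 and at least 8-fold) are assumptions, not cutoffs stated in the source.

```shell
#!/usr/bin/env bash
# filter_peaks [MIN_NEG_LOG10_P] [MIN_LOG2_FOLD]
# Keeps BED lines whose column 4 and column 5 both meet the thresholds.
# Defaults of 3 and 3 are illustrative assumptions.
filter_peaks() {
    local min_logp=${1:-3} min_l2fc=${2:-3}
    awk -F'\t' -v p="$min_logp" -v f="$min_l2fc" \
        '$4 + 0 >= p + 0 && $5 + 0 >= f + 0'
}

# Two hypothetical peaks: only the first passes both thresholds.
printf 'chr1\t100\t180\t5.2\t3.1\t+\nchr1\t400\t460\t1.0\t4.0\t+\n' | filter_peaks
```

Passing explicit values, for example `filter_peaks 5 2`, applies a different pair of cutoffs without editing the function.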