GSE230717 Processing Pipeline

RIP-Seq code_examples 9 steps

Publication

Large-scale map of RNA-binding protein interactomes across the mRNA life cycle.

Molecular cell (2024) — PMID 39303721

Dataset

GSE230717

Large-scale map of RNA binding protein interactomes across the mRNA life-cycle

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper

    $ Bash example
    git clone https://github.com/YeoLab/skipper.git
    cd skipper
    
    # Configure the workflow by editing config/config.yaml and config/samples.yaml
    # Example reference genome (hg38 inferred as common default for human eCLIP):
    # In config/config.yaml, set:
    #   species: human
    #   assembly: hg38
    #   genome_fasta: /path/to/your/hg38.fasta
    #   gtf: /path/to/your/hg38.gtf
    #   star_index: /path/to/your/star_index_hg38
    #   ... other reference paths ...
    
    # In config/samples.yaml, define your samples and their fastq files.
    
    # (Optional) Create and activate the Snakemake environment
    # conda env create -f envs/snakemake.yaml
    # conda activate snakemake
    
    # Execute the Skipper eCLIP processing workflow
    # The --use-conda flag will create environments for individual rules as needed.
    # Adjust --cores based on available resources.
    snakemake --use-conda --cores 8
  2. 2

    Unique Molecular Identifiers (UMIs) were extracted using fastp 0.11.5 (https://github.com/OpenGene/fastp)

    fastp v0.11.5 GitHub
    $ Bash example
    # Install fastp (if not already installed)
    # conda install -c bioconda fastp
    
    # Example command for UMI extraction using fastp.
    # This command assumes UMIs are at the beginning of Read 1 and are 10 bases long.
    # Adjust input/output filenames, UMI location (--umi_loc), and UMI length (--umi_len) as per your library preparation.
    # fastp will prepend the UMI sequence to the read ID in the output FASTQ files.
    fastp \
      -i input_read1.fastq.gz \
      -o output_read1_umi_extracted.fastq.gz \
      -I input_read2.fastq.gz \
      -O output_read2_umi_extracted.fastq.gz \
      --umi \
      --umi_loc per_read \
      --umi_len 10 \
      --thread 8 \
      --json fastp.json \
      --html fastp.html
  3. 3

    Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples)

    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt
    
    # Define input and output files
    INPUT_FASTQ="input_umi_extracted_reads.fastq"
    OUTPUT_FASTQ="trimmed_reads.fastq"
    
    # Define adapter sequences and minimum read length
    # These are placeholder sequences. Actual sequences should be obtained from the library preparation protocol
    # or the specific eCLIP protocol used (e.g., from the Yeo lab).
    ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example: Illumina Universal Adapter
    ADAPTER_5PRIME="GTTCAGAGTTCTACAGTCCGACGATC" # Example: A common linker or 5' barcode sequence to be trimmed after UMI extraction
    MIN_READ_LENGTH=18
    
    # Execute cutadapt to trim adapter and barcode sequences
    cutadapt \
      -a "${ADAPTER_3PRIME}" \
      -g "${ADAPTER_5PRIME}" \
      -m ${MIN_READ_LENGTH} \
      -o "${OUTPUT_FASTQ}" \
      "${INPUT_FASTQ}"
  4. 4

    Mapping was then performed against the full human genome (hg38) including a database of splice junctions with STAR (v 2.7.6) allowing up to 100 multimapped regions.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.7.6
    
    # Define variables
    READ1_FASTQ="path/to/your/read1.fastq.gz" # Replace with your R1 FASTQ file
    READ2_FASTQ="path/to/your/read2.fastq.gz" # Replace with your R2 FASTQ file (remove if single-end)
    STAR_GENOME_DIR="path/to/your/star_hg38_index" # Replace with path to pre-built STAR hg38 genome index
    OUTPUT_PREFIX="sample_name" # Prefix for output files
    NUM_THREADS=8 # Number of threads to use
    
    # Note: The STAR genome index (STAR_GENOME_DIR) must be pre-built using hg38 FASTA and a GTF/GFF file for splice junctions.
    # Example command for building index (run once, adjust sjdbOverhang based on read length - 1):
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${STAR_GENOME_DIR} \
    #      --genomeFastaFiles path/to/hg38.fa \
    #      --sjdbGTFfile path/to/gencode.vXX.annotation.gtf \
    #      --sjdbOverhang 100 \
    #      --runThreadN ${NUM_THREADS}
    
    # Perform mapping with STAR
    STAR --genomeDir ${STAR_GENOME_DIR} \
         --readFilesIn ${READ1_FASTQ} ${READ2_FASTQ} \
         --readFilesCommand zcat \
         --outFileNamePrefix ${OUTPUT_PREFIX}_ \
         --outFilterMultimapNmax 100 \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --runThreadN ${NUM_THREADS}
    
  5. 5

    Reads were then PCR deduplicated using UMIcollaps (https://github.com/Daniel-Liu-c0deb0t/UMICollapse

    UMIcollaps vlatest GitHub
    $ Bash example
    # Clone the repository if not already available
    # git clone https://github.com/Daniel-Liu-c0deb0t/UMICollapse.git
    # cd UMICollapse
    
    # Execute UMIcollaps for PCR deduplication
    # Assuming input_reads.fastq is the input file and deduplicated_reads.fastq is the desired output.
    # The UMI length (-l) is an example (default or common value); adjust if the actual UMI length is known from experimental design.
    python UMICollapse.py -i input_reads.fastq -o deduplicated_reads.fastq -l 10
  6. 6

    Enriched “windows” (IP versus SM-Input) was called on deduplicated reads using a GC-bias aware beta-binomial model.

    clipper (Inferred with models/gemini-2.5-flash) vPython 3.8 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install clipper (if not already installed)
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # # Ensure dependencies are met, e.g., pysam, numpy, scipy
    # pip install pysam numpy scipy
    
    # Define input and output files
    IP_BAM="path/to/deduplicated_IP.bam" # Immunoprecipitated reads, deduplicated
    SM_INPUT_BAM="path/to/deduplicated_SM_Input.bam" # Size-matched input reads, deduplicated
    GENOME_ASSEMBLY="hg38" # Placeholder: Replace with the actual genome assembly (e.g., hg19, mm10)
    OUTPUT_PREFIX="eCLIP_peaks"
    
    # Run clipper to call enriched windows using a GC-bias aware beta-binomial model
    python clipper.py \
        -s ${GENOME_ASSEMBLY} \
        -u ${IP_BAM} \
        -c ${SM_INPUT_BAM} \
        -o ${OUTPUT_PREFIX}
  7. 7

    Each window (~100 b.p.) were partitioned from Gencode v38 and associated with a specific type of genomic region (CDS, UTR, proximal introns near splice site… etc).

    GENCODE vv38 GitHub
    $ Bash example
    # Download Gencode v38 annotation GTF
    wget -nc https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
    gunzip -f gencode.v38.annotation.gtf.gz
    
    # Download hg38 chrom.sizes for bedtools makewindows
    wget -nc http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
    
    # # Installation for bedtools:
    # # conda install -c bioconda bedtools
    
    # Extract specific genomic regions from Gencode v38 GTF into BED format
    # Note: This is a simplified representation. A robust pipeline would use dedicated GTF parsers
    # or more complex scripting to accurately define all region types, especially introns.
    
    # Extract CDS regions
    awk '$3 == "CDS"' gencode.v38.annotation.gtf | \
    awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \
    sed 's/;//g; s/"//g' | \
    sort -k1,1 -k2,2n > gencode.v38.cds.bed
    
    # Extract UTR regions
    awk '$3 == "UTR"' gencode.v38.annotation.gtf | \
    awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \
    sed 's/;//g; s/"//g' | \
    sort -k1,1 -k2,2n > gencode.v38.utr.bed
    
    # Extract Exon regions (often used to derive introns)
    awk '$3 == "exon"' gencode.v38.annotation.gtf | \
    awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \
    sed 's/;//g; s/"//g' | \
    sort -k1,1 -k2,2n > gencode.v38.exons.bed
    
    # Conceptual step: Derive intron regions and proximal introns near splice sites.
    # This typically involves custom scripting based on exon coordinates (e.g., using bedtools complement
    # on exons within transcripts, then defining proximal regions +/- X bp around splice sites).
    # For this example, we assume these files (e.g., gencode.v38.introns.bed, gencode.v38.proximal_introns.bed)
    # would be generated by a preceding step or custom script.
    
    # Create genome-wide windows of ~100 b.p.
    bedtools makewindows -g hg38.chrom.sizes -w 100 > genome.100bp_windows.bed
    
    # Associate each window with a specific type of genomic region
    # This step involves intersecting the windows with the derived region BED files
    # and assigning a primary region type based on overlap hierarchy (e.g., CDS > UTR > Intron).
    # This is often done with a custom script after multiple bedtools intersect/map calls.
    
    # Example of intersecting windows with CDS regions:
    bedtools intersect -a genome.100bp_windows.bed -b gencode.v38.cds.bed -wa -wb > windows_overlapping_cds.bed
    
    # Further processing (e.g., with a Python/R script) would then assign a single region type
    # to each window based on the overlaps with CDS, UTR, Intron, etc., following a defined hierarchy.
    # For instance:
    # python assign_region_type.py genome.100bp_windows.bed windows_overlapping_cds.bed windows_overlapping_utr.bed ... > final_annotated_windows.bed
  8. 8

    Windows are filtered using FDR < 0.2 and only reproducible windows between two replicates were used.

    merge_replicates.py (Inferred with models/gemini-2.5-flash) vlatest GitHub
    $ Bash example
    # Clone the merge_peaks repository
    # git clone https://github.com/yeolab/merge_peaks.git
    # cd merge_peaks
    
    # Example usage of merge_replicates.py
    # This command filters windows (peaks) from two replicate BED files
    # using an FDR threshold of 0.2 and identifies reproducible windows.
    # Replace 'replicate1_peaks.bed' and 'replicate2_peaks.bed' with your actual input peak files.
    # The output will be prefixed with 'reproducible_windows'.
    python merge_replicates.py --fdr_threshold 0.2 replicate1_peaks.bed replicate2_peaks.bed --output_prefix reproducible_windows
  9. 9

    Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper

    Skipper vLatest commit GitHub
    $ Bash example
    # Clone the Skipper workflow repository
    git clone https://github.com/YeoLab/skipper.git
    cd skipper
    
    # --- Configuration Setup (User needs to adapt these files) ---
    # The workflow requires a configuration file (e.g., config/config.yaml)
    # and a samplesheet (e.g., config/samples.tsv).
    #
    # Example config/config.yaml (adjust genome_dir and other parameters as needed):
    # genome_assembly: "hg38"
    # genome_dir: "/path/to/your/genome/data/hg38"
    #
    # Example config/samples.tsv (tab-separated, adjust paths to your FASTQ files):
    # sample_id    fastq_r1    fastq_r2    replicate
    # sample1_rep1 /path/to/sample1_rep1_R1.fastq.gz /path/to/sample1_rep1_R2.fastq.gz 1
    # sample1_rep2 /path/to/sample1_rep2_R1.fastq.gz /path/to/sample1_rep2_R2.fastq.gz 2
    #
    # --- Environment Setup (Optional, but recommended) ---
    # # Install mamba (if not already installed)
    # # conda install -c conda-forge mamba
    #
    # # Create and activate the Snakemake environment
    # # mamba env create -f envs/snakemake.yaml
    # # conda activate snakemake
    
    # --- Run the Skipper eCLIP workflow ---
    # Adjust --cores based on available resources.
    # Ensure config/config.yaml and config/samples.tsv are properly configured.
    snakemake --use-conda --cores 8 --configfile config/config.yaml

Tools Used

Raw Source Text
Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper
Unique Molecular Identifiers (UMIs) were extracted using fastp 0.11.5 (https://github.com/OpenGene/fastp)
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples)
Mapping was then performed against the full human genome (hg38) including a database of splice junctions with STAR (v 2.7.6) allowing up to 100 multimapped regions.
Reads were then PCR deduplicated using UMIcollaps (https://github.com/Daniel-Liu-c0deb0t/UMICollapse
Enriched “windows” (IP versus SM-Input) was called on deduplicated reads using a GC-bias aware beta-binomial model. Each window (~100 b.p.) were partitioned from Gencode v38 and associated with a specific type of genomic region (CDS, UTR, proximal introns near splice site… etc). Windows are filtered using FDR < 0.2 and only reproducible windows between two replicates were used.
Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper
Assembly: hg38
Supplementary files format and content: bed files contain Skipper windows.
← Back to Analysis