GSE125809 Processing Pipeline

RIP-Seq code_examples 6 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell stem cell (2019) — PMID 31588046

Dataset

GSE125809

The RNA helicase DDX6 regulation in human skeletal myoblasts [eCLIP-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    library strategy: CLIP-seq

    CLIP-seq vInfer from description GitHub
    $ Bash example
    # --- Installation Instructions (commented out) ---
    # For STAR:
    # conda install -c bioconda star
    
    # For clipper:
    # pip install clipper
    
    # For merge_peaks (custom pipeline):
    # git clone https://github.com/yeolab/merge_peaks.git
    # # Follow installation instructions within the cloned repository, e.g.:
    # # cd merge_peaks
    # # python setup.py install
    
    # --- Placeholder Variables ---
    # Reference genome directory for STAR (e.g., human hg38)
    # Replace with your actual path to the STAR index
    GENOME_DIR="/path/to/STAR_index/hg38"
    # Reference genome fasta file (e.g., human hg38)
    GENOME_FASTA="/path/to/hg38.fa"
    # GTF annotation file (e.g., Gencode for hg38)
    GTF_FILE="/path/to/gencode.vXX.annotation.gtf"
    
    # Input CLIP-seq raw reads (e.g., FastQ files)
    # Replace with your actual input file paths
    CLIP_READS_R1="sample_clip_R1.fastq.gz"
    CLIP_READS_R2="sample_clip_R2.fastq.gz" # Omit if single-end
    # Input control raw reads (e.g., FastQ files from input/mock IP)
    CONTROL_READS_R1="sample_control_R1.fastq.gz"
    CONTROL_READS_R2="sample_control_R2.fastq.gz" # Omit if single-end
    
    # Output prefixes
    CLIP_OUTPUT_PREFIX="sample_clip"
    CONTROL_OUTPUT_PREFIX="sample_control"
    
    # --- Step 1: Alignment of CLIP-seq and Control Reads with STAR ---
    echo "--- Step 1: Aligning CLIP-seq reads with STAR ---"
    # Align CLIP-seq reads
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${CLIP_READS_R1}" "${CLIP_READS_R2}" \
         --runThreadN 8 \
         --outFileNamePrefix "${CLIP_OUTPUT_PREFIX}_star_" \
         --outSAMtype BAM SortedByCoordinate \
         --quantMode GeneCounts \
         --outFilterMultimapNmax 10 \
         --outFilterMismatchNmax 3 # Example parameters, adjust as needed
    
    # Align Control reads
    echo "--- Step 1: Aligning Control reads with STAR ---"
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${CONTROL_READS_R1}" "${CONTROL_READS_R2}" \
         --runThreadN 8 \
         --outFileNamePrefix "${CONTROL_OUTPUT_PREFIX}_star_" \
         --outSAMtype BAM SortedByCoordinate \
         --quantMode GeneCounts \
         --outFilterMultimapNmax 10 \
         --outFilterMismatchNmax 3
    
    # Define aligned BAM files
    CLIP_BAM="${CLIP_OUTPUT_PREFIX}_star_Aligned.sortedByCoord.out.bam"
    CONTROL_BAM="${CONTROL_OUTPUT_PREFIX}_star_Aligned.sortedByCoord.out.bam"
    
    # --- Step 2: Peak Calling with clipper ---
    echo "--- Step 2: Calling peaks with clipper ---"
    # clipper requires aligned BAM files and a genome size.
    # Use 'hg38' as a placeholder for genome size, or provide a custom .sizes file.
    clipper -b "${CLIP_BAM}" \
            -c "${CONTROL_BAM}" \
            -s hg38 \
            -o "${CLIP_OUTPUT_PREFIX}_peaks.bed" \
            -p 0.01 \
            --bonferroni # Example parameters, adjust as needed
    
    # Define peak file from clipper
    CLIP_PEAKS="${CLIP_OUTPUT_PREFIX}_peaks.bed"
    
    # --- Step 3: IDR (Identifying Reproducible Peaks) with merge_peaks pipeline ---
    # This step typically requires multiple biological replicates.
    # Assuming two replicates for demonstration:
    # Replace with actual peak files from replicate 1 and replicate 2
    REP1_PEAKS="replicate1_clip_peaks.bed"
    REP2_PEAKS="replicate2_clip_peaks.bed"
    IDR_OUTPUT_PREFIX="combined_idr_peaks"
    
    echo "--- Step 3: Performing IDR with merge_peaks pipeline (requires replicates) ---"
    # The merge_peaks repository contains scripts for IDR analysis.
    # This is a conceptual command representing the execution of an IDR script from that pipeline.
    # You would typically navigate to the cloned merge_peaks directory and run a script like 'run_idr.py'
    python /path/to/merge_peaks/scripts/run_idr.py \
           --peak_files "${REP1_PEAKS}" "${REP2_PEAKS}" \
           --output_prefix "${IDR_OUTPUT_PREFIX}" \
           --idr_threshold 0.05 \
           --peak_caller clipper # Specify the peak caller used
  2. 2

    Reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    INPUT_READS="trimmed_reads.fastq.gz" # Placeholder for adapter-trimmed reads
    REPBASE_INDEX_DIR="/path/to/repbase_18_04_star_index" # Placeholder for STAR index built from RepBase 18.04 human-specific repetitive elements
    OUTPUT_PREFIX="repbase_mapping_output"
    NUM_THREADS=8 # Adjust as needed
    
    # Note: The STAR index for RepBase 18.04 human-specific repetitive elements needs to be pre-built.
    # This typically involves creating a FASTA file of the desired repetitive elements
    # and then running STAR --runMode genomeGenerate ...
    
    # Map adapter-trimmed reads to human-specific repetitive elements from RepBase
    STAR --runThreadN ${NUM_THREADS} \
         --genomeDir ${REPBASE_INDEX_DIR} \
         --readFilesIn ${INPUT_READS} \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 100 \
         --outFilterMismatchNmax 10 \
         --alignIntronMax 1 # Assuming repetitive elements do not contain large introns, or are treated as single exons
    
  3. 3

    Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR.

    $ Bash example
    # conda install -c bioconda star bowtie2 samtools
    
    # Placeholder for rRNA/repeat genome index (e.g., from UCSC hg19 rRNA or a custom repeat library).
    # This step assumes 'repeat-mapping reads' refers to reads mapping to a dedicated repeat/rRNA genome,
    # a common pre-processing step in eCLIP pipelines.
    # bowtie2-build /path/to/rRNA.fa /path/to/rRNA_index
    
    # Step 1: Remove repeat-mapping reads (e.g., rRNA reads) using Bowtie2.
    # Reads that *do not* map to the repeat genome are passed to STAR.
    bowtie2 -x /path/to/rRNA_index \
            -U reads.fastq.gz \
            --un-gz filtered_reads.fastq.gz \
            -S /dev/null # Discard mapped reads
    
    # Step 2: Map remaining reads to human genome hg19 with STAR.
    # Create STAR index (if not already present) for hg19.
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/STAR_index/hg19 \
    #      --genomeFastaFiles /path/to/hg19.fa \
    #      --sjdbGTFfile /path/to/gencode.v19.annotation.gtf \
    #      --runThreadN 8
    
    STAR --genomeDir /path/to/STAR_index/hg19 \
         --readFilesIn filtered_reads.fastq.gz \
         --readFilesCommand zcat \
         --outFileNamePrefix aligned_ \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN 8
    
    # Rename output file for clarity
    mv aligned_Aligned.sortedByCoord.out.bam aligned.bam
  4. 4

    PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5’ adapter and remaining reads retained as ‘usable reads’.

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install umi-tools
    # conda install -c bioconda umi-tools=1.1.1
    
    # Example command for UMI-based deduplication.
    # This command assumes that the UMIs have already been extracted from the 5' adapter
    # and appended to the read names (e.g., by a prior 'umi_tools extract' step).
    # Replace 'input_aligned_with_umis.bam' with your actual input BAM file
    # and 'output_deduplicated.bam' with your desired output file name.
    # The --paired-end flag should be used if the input reads are paired-end.
    # The --method unique is a common deduplication strategy.
    umi_tools dedup \
        --extract-umis-from-read-names \
        --method unique \
        --paired-end \
        -I input_aligned_with_umis.bam \
        -S output_deduplicated.bam \
        --output-stats deduplication_stats.tsv
  5. 5

    Peaks were called on the usable reads by CLIPper ,and assigned to gene regions annotated in Gencode (v19)

    CLIPper v0.0.1 GitHub
    $ Bash example
    # Install CLIPper (example, adjust based on actual installation method)
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # python setup.py install
    
    # Define input and output files
    INPUT_BAM="input_reads.bam"
    OUTPUT_BED="peaks.bed"
    
    # Define reference files for Gencode v19 (corresponding to hg19/GRCh37)
    # Download Gencode v19 GTF
    # wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    # gunzip gencode.v19.annotation.gtf.gz
    GENCODE_GTF="gencode.v19.annotation.gtf"
    
    # Download hg19/GRCh37 primary assembly FASTA (Ensembl release 75 corresponds to Gencode v19)
    # wget -O Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
    # gunzip Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
    GENOME_FASTA="Homo_sapiens.GRCh37.dna.primary_assembly.fa"
    
    # Generate chromosome sizes file from FASTA
    # samtools faidx "${GENOME_FASTA}"
    # cut -f1,2 "${GENOME_FASTA}".fai > hg19.chrom.sizes
    GENOME_SIZES="hg19.chrom.sizes"
    
    # Execute CLIPper to call peaks and assign to gene regions
    python clipper.py -b "${INPUT_BAM}" \
                      -s "${GENOME_SIZES}" \
                      -o "${OUTPUT_BED}" \
                      -g "${GENOME_FASTA}" \
                      -a "${GENCODE_GTF}"
    
  6. 6

    Each peak was normalized to the size-matched input (SMInput)

    normalize_peaks.py (from yeolab/skipper) (Inferred with models/gemini-2.5-flash) vFrom yeolab/skipper pipeline (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # --- Setup for normalize_peaks.py ---
    # The normalize_peaks.py script is part of the yeolab/skipper pipeline.
    # It can be found at: https://github.com/yeolab/skipper/blob/master/scripts/normalize_peaks.py
    # Ensure this script is available in your PATH or specified by its full path.
    # For example, you might clone the repository and use the script:
    # git clone https://github.com/yeolab/skipper.git
    # NORMALIZE_SCRIPT="./skipper/scripts/normalize_peaks.py" # Adjust path as needed
    
    # Define variables for clarity
    IP_PEAKS="ip_peaks.bed" # Placeholder: Path to your called IP peaks (e.g., from clipper output)
    IP_BAM="ip_aligned.bam" # Placeholder: Path to your IP sample's aligned BAM file
    SM_INPUT_BAM="sm_input_aligned.bam" # Placeholder: Path to your size-matched input sample's aligned BAM file
    OUTPUT_NORMALIZED_PEAKS="normalized_peaks.bed" # Desired output file name for normalized peaks
    GENOME_SIZE="2.7e9" # Placeholder: Effective genome size (e.g., 2.7e9 for human hg38, 2.9e9 for hg19)
    
    # Install dependencies (if not already installed)
    # conda install -c bioconda python bedtools samtools numpy scipy
    
    # Execute the normalization script
    # Ensure normalize_peaks.py is executable and its Python dependencies (e.g., numpy, scipy) are met.
    python normalize_peaks.py \
        -p "${IP_PEAKS}" \
        -i "${IP_BAM}" \
        -c "${SM_INPUT_BAM}" \
        -o "${OUTPUT_NORMALIZED_PEAKS}" \
        -g "${GENOME_SIZE}"

Tools Used

Raw Source Text
library strategy: CLIP-seq
Reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR. Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR. PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5’ adapter and remaining reads retained as ‘usable reads’.
Peaks were called on the usable reads by CLIPper ,and assigned to gene regions annotated in Gencode (v19)
Each peak was normalized to the size-matched input (SMInput)
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: Each bed file was gernated by read normalization between IP over SMInput. The columns in the bed files represent (chr, start, stop,-log10(pvalue),log2(fold change), strand). The CSV files contain the log2 fold change of reads upon IP over SMInput in annotated regions of each transcripts. Each bigwig file contains read distribution of each RBP, and bigbed contains clusters of predicted RBP binding.
← Back to Analysis