GSE155728 Processing Pipeline

RIP-Seq code_examples 21 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

GSE155728

Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [tia1_eclip]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).

    eclipdemux v0.0.1 GitHub
    $ Bash example
    # Clone the eclip repository to get the eclipdemux script
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip/workflow/scripts
    
    # Example usage of eclipdemux.py
    # This command assumes a 6bp randomer followed by a 4bp barcode at the 5' end of the read.
    # The randomer will be moved to the read header, and the barcode will be trimmed.
    python eclipdemux.py \
        --input reads.fastq.gz \
        --output demuxed_reads.fastq.gz \
        --barcode-length 4 \
        --randomer-length 6
  2. 2

    Args: --length 10

    Generic Tool (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # This is a placeholder command as the specific tool could not be inferred from "Args: --length 10".
    # The --length argument typically specifies a minimum length for reads or a specific feature.
    generic_tool --length 10
  3. 3

    Reads were then trimmed with cutadapt (1.14).

    cutadapt v1.14 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=1.14
    
    # Trim reads for quality and minimum length (common default parameters)
    # Replace 'raw_reads.fastq.gz' with your input file and 'trimmed_reads.fastq.gz' with your desired output file.
    # If you have specific adapter sequences, add them using -a (3' adapter) and/or -g (5' adapter).
    cutadapt -q 20 -m 20 -o trimmed_reads.fastq.gz raw_reads.fastq.gz
  4. 4

    Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt
    
    # Define input and output files (placeholders)
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="trimmed.fastq.gz"
    
    # Adapter sequences generated by parsebarcodes.sh
    G_ADAPTERS="g_adapters.fasta"
    A_ADAPTERS="A_adapters.fasta"
    a_ADAPTERS="a_adapters.fasta"
    
    # Execute cutadapt with parameters from the description
    cutadapt \
      --match-read-wildcards \
      -O 1 \
      --times 1 \
      -e 0.1 \
      --quality-cutoff 6 \
      -m 18 \
      -g "${G_ADAPTERS}" \
      -A "${A_ADAPTERS}" \
      -a "${a_ADAPTERS}" \
      -o "${OUTPUT_FASTQ}" \
      "${INPUT_FASTQ}"
  5. 5

    Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.

    cutadapt v1.9.1 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=1.9.1
    
    # Trim reads to remove 3' adapter sequences, addressing potential double-ligation events.
    # Replace 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' with the actual 3' adapter sequence used in the experiment.
    # Replace 'input_reads.fastq.gz' and 'output_trimmed_reads.fastq.gz' with your actual input and output file names.
    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
             -o output_trimmed_reads.fastq.gz \
             input_reads.fastq.gz
  6. 6

    Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=1.18
    
    # Define input and output files
    INPUT_FASTQ="input_reads.fastq.gz" # Placeholder for input FASTQ file
    OUTPUT_FASTQ="output_trimmed_reads.fastq.gz" # Placeholder for output trimmed FASTQ file
    ADAPTER_FILE="A_adapters.fasta" # Fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline
    
    # Execute cutadapt command for adapter trimming and quality filtering
    cutadapt \
      --match-read-wildcards \
      -O 5 \
      --times 1 \
      -e 0.1 \
      --quality-cutoff 6 \
      -m 18 \
      -A "${ADAPTER_FILE}" \
      -o "${OUTPUT_FASTQ}" \
      "${INPUT_FASTQ}"
  7. 7

    Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05).

    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star
    
    # Define paths for input and output
    # Replace with actual paths to your trimmed reads and STAR index
    TRIMMED_READS="path/to/trimmed_reads.fastq.gz" # Assuming single-end reads. For paired-end, use "read1.fastq.gz read2.fastq.gz"
    STAR_INDEX_DIR="path/to/repbase_18.05_star_index" # This directory should contain the STAR index built from RepBase 18.05
    OUTPUT_PREFIX="mapped_repbase_18.05" # Prefix for output files
    NUM_THREADS=8 # Number of threads to use, adjust as needed
    
    # Create output directory if it doesn't exist
    mkdir -p "$(dirname ${OUTPUT_PREFIX})"
    
    # Map trimmed reads to the human-specific repeat element database (RepBase 18.05) using STAR
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${TRIMMED_READS}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 100 # Allow reads to map to up to 100 locations, common for repeat mapping
    
  8. 8

    Args: --runThreadN 16 '--genomeDir human_repbase '--readFilesIn path/to/read1 path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 30 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables (replace with actual paths and values)
    # GENOME_DIR refers to the directory containing the STAR genome index files.
    # 'human_repbase' is specified in the description, implying a pre-built index.
    # A common human reference assembly for STAR indices is GRCh38.
    GENOME_DIR="human_repbase"
    READ1="path/to/read1.fastq.gz"
    READ2="path/to/read2.fastq.gz"
    OUT_PREFIX="out_prefix"
    
    # Run STAR alignment
    STAR \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READ1}" "${READ2}" \
      --outFileNamePrefix "${OUT_PREFIX}" \
      --outReadsUnmapped Fastx \
      --outSAMtype BAM Unsorted \
      --outSAMattributes All \
      --outSAMunmapped Within \
      --outSAMattrRGline ID:foo \
      --outFilterType BySJout \
      --outFilterMultimapNmax 30 \
      --outFilterMultimapScoreRange 1 \
      --outFilterScoreMin 10 \
      --alignEndsType EndToEnd
  9. 9

    Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19).

    $ Bash example
    # Install STAR (example using conda)
    # conda create -n star_env star=2.4.0i -c bioconda -y
    # conda activate star_env
    
    # Define variables
    STAR_INDEX_DIR="path/to/star_hg19_index" # Directory containing pre-built hg19 STAR index
    READS_FASTQ="unmapped_filtered_reads.fastq.gz" # Input FASTQ file of unmapped reads
    OUTPUT_PREFIX="aligned_to_hg19"
    NUM_THREADS=8 # Adjust based on available resources
    
    # --- Genome Index Generation (Run once if index is not available) ---
    # To build the hg19 index, you would need hg19.fa and hg19.gtf files.
    # Example:
    # mkdir -p "${STAR_INDEX_DIR}"
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${STAR_INDEX_DIR}" \
    #      --genomeFastaFiles "path/to/hg19.fa" \
    #      --sjdbGTFfile "path/to/hg19.gtf" \
    #      --runThreadN "${NUM_THREADS}"
    
    # --- Alignment Step ---
    # Maps unmapped reads (filtered of repeat elements) to the human genome (hg19)
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${READS_FASTQ}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 20 # A common parameter to limit multi-mapping reads
    
  10. 10

    Args: --runThreadN 16 '--genomeDir genomedir '--readFilesIn /path/to/read1 /path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 1 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables for input and output
    # Replace /path/to/STAR_genome_index with the actual path to your STAR genome index
    # This index should be built using a reference genome (e.g., GRCh38/hg38) and GTF annotation.
    GENOME_DIR="/path/to/STAR_genome_index"
    READ1="/path/to/read1"
    READ2="/path/to/read2"
    OUT_PREFIX="out_prefix"
    
    # Execute STAR alignment
    STAR \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READ1}" "${READ2}" \
      --outFileNamePrefix "${OUT_PREFIX}" \
      --outReadsUnmapped Fastx \
      --outSAMtype BAM Unsorted \
      --outSAMattributes All \
      --outSAMunmapped Within \
      --outSAMattrRGline ID:foo \
      --outFilterType BySJout \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFilterScoreMin 10 \
      --alignEndsType EndToEnd
  11. 11

    Aligned reads were sorted with samtools (1.4.1)

    samtools v1.4.1 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.4.1
    
    # Sort aligned reads by coordinate
    samtools sort -o output.sorted.bam input.bam
  12. 12

    Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.

    $ Bash example
    # Install eclip (if not already installed)
    # pip install eclip # or via conda
    
    # Define input and output files
    INPUT_R1="sorted_reads_R1.fastq.gz"
    INPUT_R2="sorted_reads_R2.fastq.gz"
    OUTPUT_PREFIX="collapsed_reads"
    BARCODE_LENGTH=6 # Example barcode length, adjust as needed based on experimental design
    MIN_READS_PER_BARCODE=1 # Example minimum reads per unique barcode, adjust as needed
    
    # Execute barcodecollapsepe.py
    barcodecollapsepe.py -i "${INPUT_R1}" -j "${INPUT_R2}" -o "${OUTPUT_PREFIX}" -b "${BARCODE_LENGTH}" -m "${MIN_READS_PER_BARCODE}"
  13. 13

    Args: -o preRmDup.bam -m metrics_file -b rmDup.bam

    picard MarkDuplicates (Inferred with models/gemini-2.5-flash) v2.27.4 GitHub
    $ Bash example
    # Install Picard (if not already installed)
    # conda install -c bioconda picard
    
    # Execute Picard MarkDuplicates
    # Assuming picard.jar is in the PATH or specified with its full path
    java -jar $(which picard.jar) MarkDuplicates \
        I=rmDup.bam \
        O=preRmDup.bam \
        M=metrics_file \
        REMOVE_DUPLICATES=false \
        VALIDATION_STRINGENCY=SILENT \
        ASSUME_SORTED=true
  14. 14

    PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)

    samtools v1.4.1 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools=1.4.1
    
    # Merge PCR de-duped reads from each inline barcode
    # Replace input1.bam, input2.bam with the actual paths to your de-duped BAM files for each barcode
    samtools merge merged.bam input1.bam input2.bam
  15. 15

    Merged alignments were split to keep just read2 using samtools (1.4.1) view.

    samtools v1.4.1 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install samtools=1.4.1
    
    # Split merged alignments to keep just read2
    # Input: merged_alignments.bam (merged alignments)
    # Output: read2_alignments.bam (alignments containing only read2)
    samtools view -b -f 0x80 merged_alignments.bam > read2_alignments.bam
  16. 16

    Args: -h -b -f 128

    Generic Command (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # The specific bioinformatics tool could not be inferred from the generic arguments: -h -b -f 128.
    # This is a placeholder command. A specific tool and context (e.g., assay type, task) would be needed to generate a meaningful command.
    
    # Example of how the arguments might be used with a hypothetical tool:
    # placeholder_tool -h -b -f 128
  17. 17

    Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).

    CLIPper v1.2.2 GitHub
    $ Bash example
    # Install CLIPper if not already installed
    # pip install clipper # Or clone from GitHub and install
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # python setup.py install
    
    # Define input and output files
    INPUT_BAM="read2.bam" # Placeholder for Read2 BAM file
    OUTPUT_PREFIX="read2_clipper_peaks"
    GENOME_SIZE="hg38" # Placeholder for genome size (e.g., hg38, mm10, or a custom value)
    
    # Run CLIPper to identify peak clusters
    # Using common parameters: -b for input BAM, -s for genome size, -o for output prefix
    clipper.py -b "${INPUT_BAM}" -s "${GENOME_SIZE}" -o "${OUTPUT_PREFIX}"
  18. 18

    Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

    eCLIP_BAM_Processor.py (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install dependencies for a custom Python script (example, adjust as needed)
    # conda create -n eclip_env python=3.8
    # conda activate eclip_env
    # conda install -c bioconda pysam numpy scipy pandas # Common dependencies for BAM processing and pickle
    #
    # Reference genome: hg19 (Human Genome build 19)
    # Source: UCSC Genome Browser (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz) or NCBI.
    
    # Execute the inferred eCLIP BAM processing script
    python eCLIP_BAM_Processor.py \
        --species hg19 \
        --bam path/to/input.bam \
        --timeout 3600000 \
        --maxgenes 1000000 \
        --save-pickle \
        --outfile path/to/output.bam
  19. 19

    Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

    $ Bash example
    # Install eCLIP tools (example, actual installation might vary)
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # # Follow installation instructions, e.g., setting up Perl dependencies and adding eclip/bin to PATH
    
    # The description mentions "overlap_peakfi_with_bam_PE.pl" which is often used
    # to filter peak files based on BAM coverage prior to normalization.
    # Assuming 'pre_normalized_peaks.narrowPeak' is the output of such a filtering step
    # or the initial peak file to be normalized.
    
    # Placeholder variables
    IP_BAM="ip_sample_read2.bam"
    INPUT_BAM="input_sample_read2.bam"
    PEAK_FILE="pre_normalized_peaks.narrowPeak" # Input peak file for normalization
    OUTPUT_PREFIX="normalized_peaks" # Prefix for output files (e.g., normalized_peaks.bed, normalized_peaks.log)
    
    # Execute the normalization using peaksnormalize.pl
    # Usage: peaksnormalize.pl <IP_BAM> <INPUT_BAM> <PEAK_FILE> <OUTPUT_PREFIX>
    peaksnormalize.pl "${IP_BAM}" "${INPUT_BAM}" "${PEAK_FILE}" "${OUTPUT_PREFIX}"
  20. 20

    Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

    $ Bash example
    # Install Perl (if not already available)
    # sudo apt-get update && sudo apt-get install -y perl
    #
    # Clone the merge_peaks repository to get the script
    # git clone https://github.com/yeolab/merge_peaks.git
    #
    # Add the script's directory to PATH or call it directly
    # export PATH=$PATH:$(pwd)/merge_peaks/bin
    #
    # Example usage:
    # Assuming 'rep1_normalized_peaks.bed' and 'rep2_normalized_peaks.bed' are the input normalized peak regions.
    # The script merges overlapping regions from multiple replicate BED files based on L2 fold enrichment.
    perl merge_peaks/bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
        rep1_normalized_peaks.bed \
        rep2_normalized_peaks.bed \
        > merged_replicate_peaks.bed
  21. 21

    Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

    IDR v2.0.2 GitHub
    $ Bash example
    # Install IDR (if not already installed)
    # conda install -c bioconda idr
    
    # Install merge_peaks pipeline (if not already installed)
    # git clone https://github.com/yeolab/merge_peaks.git
    # export PATH=$PATH:/path/to/merge_peaks/scripts
    
    # Define input files (placeholders)
    INPUT_REP1_BED="replicate1.compressed.bed"
    INPUT_REP2_BED="replicate2.compressed.bed"
    GENOME_FASTA="/path/to/genome/hg38.fa" # Placeholder for reference genome fasta (e.g., hg38)
    OUTPUT_PREFIX="idr_output"
    
    # Step 1: Rank peaks by entropy score using make_informationcontent_from_peaks.pl
    # This script is part of the merge_peaks pipeline (https://github.com/yeolab/merge_peaks).
    # It takes a BED file and a genome fasta, and outputs a new BED file with an information content score (entropy) in the 5th column.
    RANKED_REP1_BED="${INPUT_REP1_BED%.bed}.ranked.bed"
    RANKED_REP2_BED="${INPUT_REP2_BED%.bed}.ranked.bed"
    
    make_informationcontent_from_peaks.pl "${INPUT_REP1_BED}" "${GENOME_FASTA}" "${RANKED_REP1_BED}"
    make_informationcontent_from_peaks.pl "${INPUT_REP2_BED}" "${GENOME_FASTA}" "${RANKED_REP2_BED}"
    
    # Step 2: Run IDR (Irreproducible Discovery Rate) with the ranked peak files
    # The --rank-by signal.value parameter tells IDR to use the 5th column (score) of the input BED files,
    # which now contains the entropy score generated by make_informationcontent_from_peaks.pl.
    idr --samples "${RANKED_REP1_BED}" "${RANKED_REP2_BED}" \
        --output-file "${OUTPUT_PREFIX}.idr" \
        --rank-by signal.value \
        --plot \
        --log-output-file "${OUTPUT_PREFIX}.log"
    

Tools Used

Raw Source Text
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10
Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 '--genomeDir human_repbase '--readFilesIn path/to/read1 path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 30 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19). Args: --runThreadN 16 '--genomeDir genomedir '--readFilesIn /path/to/read1 /path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM   Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 1 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.4.1)
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Genome_build: hg19
Supplementary_files_format_and_content: BigWig, BED
← Back to Analysis