GSE155728 Processing Pipeline

RIP-Seq code_examples 21 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [tia1_eclip]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).

eclipdemux v0.0.1 GitHub

$ Bash example

# Clone the eclip repository to get the eclipdemux script
# git clone https://github.com/yeolab/eclip.git
# cd eclip/workflow/scripts

# Example usage of eclipdemux.py
# This command assumes a 6bp randomer followed by a 4bp barcode at the 5' end of the read.
# The randomer will be moved to the read header, and the barcode will be trimmed.
python eclipdemux.py \
    --input reads.fastq.gz \
    --output demuxed_reads.fastq.gz \
    --barcode-length 4 \
    --randomer-length 6

View on GitHub

Args: --length 10

Generic Tool (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This is a placeholder command as the specific tool could not be inferred from "Args: --length 10".
# The --length argument typically specifies a minimum length for reads or a specific feature.
generic_tool --length 10

Reads were then trimmed with cutadapt (1.14).

cutadapt v1.14 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.14

# Trim reads for quality and minimum length (common default parameters)
# Replace 'raw_reads.fastq.gz' with your input file and 'trimmed_reads.fastq.gz' with your desired output file.
# If you have specific adapter sequences, add them using -a (3' adapter) and/or -g (5' adapter).
cutadapt -q 20 -m 20 -o trimmed_reads.fastq.gz raw_reads.fastq.gz

View on GitHub

Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

eCLIP v0.1.5 GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt

# Define input and output files (placeholders)
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed.fastq.gz"

# Adapter sequences generated by parsebarcodes.sh
G_ADAPTERS="g_adapters.fasta"
A_ADAPTERS="A_adapters.fasta"
a_ADAPTERS="a_adapters.fasta"

# Execute cutadapt with parameters from the description
cutadapt \
  --match-read-wildcards \
  -O 1 \
  --times 1 \
  -e 0.1 \
  --quality-cutoff 6 \
  -m 18 \
  -g "${G_ADAPTERS}" \
  -A "${A_ADAPTERS}" \
  -a "${a_ADAPTERS}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

View on GitHub

Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.

cutadapt v1.9.1 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.9.1

# Trim reads to remove 3' adapter sequences, addressing potential double-ligation events.
# Replace 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' with the actual 3' adapter sequence used in the experiment.
# Replace 'input_reads.fastq.gz' and 'output_trimmed_reads.fastq.gz' with your actual input and output file names.
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
         -o output_trimmed_reads.fastq.gz \
         input_reads.fastq.gz

View on GitHub

Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

eCLIP v0.1.5 GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=1.18

# Define input and output files
INPUT_FASTQ="input_reads.fastq.gz" # Placeholder for input FASTQ file
OUTPUT_FASTQ="output_trimmed_reads.fastq.gz" # Placeholder for output trimmed FASTQ file
ADAPTER_FILE="A_adapters.fasta" # Fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline

# Execute cutadapt command for adapter trimming and quality filtering
cutadapt \
  --match-read-wildcards \
  -O 5 \
  --times 1 \
  -e 0.1 \
  --quality-cutoff 6 \
  -m 18 \
  -A "${ADAPTER_FILE}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

View on GitHub

Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

# Define paths for input and output
# Replace with actual paths to your trimmed reads and STAR index
TRIMMED_READS="path/to/trimmed_reads.fastq.gz" # Assuming single-end reads. For paired-end, use "read1.fastq.gz read2.fastq.gz"
STAR_INDEX_DIR="path/to/repbase_18.05_star_index" # This directory should contain the STAR index built from RepBase 18.05
OUTPUT_PREFIX="mapped_repbase_18.05" # Prefix for output files
NUM_THREADS=8 # Number of threads to use, adjust as needed

# Create output directory if it doesn't exist
mkdir -p "$(dirname ${OUTPUT_PREFIX})"

# Map trimmed reads to the human-specific repeat element database (RepBase 18.05) using STAR
STAR --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${TRIMMED_READS}" \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 100 # Allow reads to map to up to 100 locations, common for repeat mapping

View on GitHub

Args: --runThreadN 16 '--genomeDir human_repbase '--readFilesIn path/to/read1 path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 30 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables (replace with actual paths and values)
# GENOME_DIR refers to the directory containing the STAR genome index files.
# 'human_repbase' is specified in the description, implying a pre-built index.
# A common human reference assembly for STAR indices is GRCh38.
GENOME_DIR="human_repbase"
READ1="path/to/read1.fastq.gz"
READ2="path/to/read2.fastq.gz"
OUT_PREFIX="out_prefix"

# Run STAR alignment
STAR \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --outFileNamePrefix "${OUT_PREFIX}" \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd

View on GitHub

Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (example using conda)
# conda create -n star_env star=2.4.0i -c bioconda -y
# conda activate star_env

# Define variables
STAR_INDEX_DIR="path/to/star_hg19_index" # Directory containing pre-built hg19 STAR index
READS_FASTQ="unmapped_filtered_reads.fastq.gz" # Input FASTQ file of unmapped reads
OUTPUT_PREFIX="aligned_to_hg19"
NUM_THREADS=8 # Adjust based on available resources

# --- Genome Index Generation (Run once if index is not available) ---
# To build the hg19 index, you would need hg19.fa and hg19.gtf files.
# Example:
# mkdir -p "${STAR_INDEX_DIR}"
# STAR --runMode genomeGenerate \
#      --genomeDir "${STAR_INDEX_DIR}" \
#      --genomeFastaFiles "path/to/hg19.fa" \
#      --sjdbGTFfile "path/to/hg19.gtf" \
#      --runThreadN "${NUM_THREADS}"

# --- Alignment Step ---
# Maps unmapped reads (filtered of repeat elements) to the human genome (hg19)
STAR --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${READS_FASTQ}" \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 20 # A common parameter to limit multi-mapping reads

View on GitHub

Args: --runThreadN 16 '--genomeDir genomedir '--readFilesIn /path/to/read1 /path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 1 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for input and output
# Replace /path/to/STAR_genome_index with the actual path to your STAR genome index
# This index should be built using a reference genome (e.g., GRCh38/hg38) and GTF annotation.
GENOME_DIR="/path/to/STAR_genome_index"
READ1="/path/to/read1"
READ2="/path/to/read2"
OUT_PREFIX="out_prefix"

# Execute STAR alignment
STAR \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --outFileNamePrefix "${OUT_PREFIX}" \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd

View on GitHub

Aligned reads were sorted with samtools (1.4.1)

samtools v1.4.1 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.4.1

# Sort aligned reads by coordinate
samtools sort -o output.sorted.bam input.bam

View on GitHub

Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.

eCLIP v0.1.5 GitHub

$ Bash example

# Install eclip (if not already installed)
# pip install eclip # or via conda

# Define input and output files
INPUT_R1="sorted_reads_R1.fastq.gz"
INPUT_R2="sorted_reads_R2.fastq.gz"
OUTPUT_PREFIX="collapsed_reads"
BARCODE_LENGTH=6 # Example barcode length, adjust as needed based on experimental design
MIN_READS_PER_BARCODE=1 # Example minimum reads per unique barcode, adjust as needed

# Execute barcodecollapsepe.py
barcodecollapsepe.py -i "${INPUT_R1}" -j "${INPUT_R2}" -o "${OUTPUT_PREFIX}" -b "${BARCODE_LENGTH}" -m "${MIN_READS_PER_BARCODE}"

View on GitHub

Args: -o preRmDup.bam -m metrics_file -b rmDup.bam

picard MarkDuplicates (Inferred with models/gemini-2.5-flash) v2.27.4 GitHub

$ Bash example

# Install Picard (if not already installed)
# conda install -c bioconda picard

# Execute Picard MarkDuplicates
# Assuming picard.jar is in the PATH or specified with its full path
java -jar $(which picard.jar) MarkDuplicates \
    I=rmDup.bam \
    O=preRmDup.bam \
    M=metrics_file \
    REMOVE_DUPLICATES=false \
    VALIDATION_STRINGENCY=SILENT \
    ASSUME_SORTED=true

View on GitHub

PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)

samtools v1.4.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.4.1

# Merge PCR de-duped reads from each inline barcode
# Replace input1.bam, input2.bam with the actual paths to your de-duped BAM files for each barcode
samtools merge merged.bam input1.bam input2.bam

View on GitHub

Merged alignments were split to keep just read2 using samtools (1.4.1) view.

samtools v1.4.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install samtools=1.4.1

# Split merged alignments to keep just read2
# Input: merged_alignments.bam (merged alignments)
# Output: read2_alignments.bam (alignments containing only read2)
samtools view -b -f 0x80 merged_alignments.bam > read2_alignments.bam

View on GitHub

Args: -h -b -f 128

Generic Command (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# The specific bioinformatics tool could not be inferred from the generic arguments: -h -b -f 128.
# This is a placeholder command. A specific tool and context (e.g., assay type, task) would be needed to generate a meaningful command.

# Example of how the arguments might be used with a hypothetical tool:
# placeholder_tool -h -b -f 128

Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).

CLIPper v1.2.2 GitHub

$ Bash example

# Install CLIPper if not already installed
# pip install clipper # Or clone from GitHub and install
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install

# Define input and output files
INPUT_BAM="read2.bam" # Placeholder for Read2 BAM file
OUTPUT_PREFIX="read2_clipper_peaks"
GENOME_SIZE="hg38" # Placeholder for genome size (e.g., hg38, mm10, or a custom value)

# Run CLIPper to identify peak clusters
# Using common parameters: -b for input BAM, -s for genome size, -o for output prefix
clipper.py -b "${INPUT_BAM}" -s "${GENOME_SIZE}" -o "${OUTPUT_PREFIX}"

View on GitHub

Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

eCLIP_BAM_Processor.py (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install dependencies for a custom Python script (example, adjust as needed)
# conda create -n eclip_env python=3.8
# conda activate eclip_env
# conda install -c bioconda pysam numpy scipy pandas # Common dependencies for BAM processing and pickle
#
# Reference genome: hg19 (Human Genome build 19)
# Source: UCSC Genome Browser (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz) or NCBI.

# Execute the inferred eCLIP BAM processing script
python eCLIP_BAM_Processor.py \
    --species hg19 \
    --bam path/to/input.bam \
    --timeout 3600000 \
    --maxgenes 1000000 \
    --save-pickle \
    --outfile path/to/output.bam

View on GitHub

Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

eCLIP v0.1.5 GitHub

$ Bash example

# Install eCLIP tools (example, actual installation might vary)
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# # Follow installation instructions, e.g., setting up Perl dependencies and adding eclip/bin to PATH

# The description mentions "overlap_peakfi_with_bam_PE.pl" which is often used
# to filter peak files based on BAM coverage prior to normalization.
# Assuming 'pre_normalized_peaks.narrowPeak' is the output of such a filtering step
# or the initial peak file to be normalized.

# Placeholder variables
IP_BAM="ip_sample_read2.bam"
INPUT_BAM="input_sample_read2.bam"
PEAK_FILE="pre_normalized_peaks.narrowPeak" # Input peak file for normalization
OUTPUT_PREFIX="normalized_peaks" # Prefix for output files (e.g., normalized_peaks.bed, normalized_peaks.log)

# Execute the normalization using peaksnormalize.pl
# Usage: peaksnormalize.pl <IP_BAM> <INPUT_BAM> <PEAK_FILE> <OUTPUT_PREFIX>
peaksnormalize.pl "${IP_BAM}" "${INPUT_BAM}" "${PEAK_FILE}" "${OUTPUT_PREFIX}"

View on GitHub

Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

eCLIP v0.1.5 GitHub

$ Bash example

# Install Perl (if not already available)
# sudo apt-get update && sudo apt-get install -y perl
#
# Clone the merge_peaks repository to get the script
# git clone https://github.com/yeolab/merge_peaks.git
#
# Add the script's directory to PATH or call it directly
# export PATH=$PATH:$(pwd)/merge_peaks/bin
#
# Example usage:
# Assuming 'rep1_normalized_peaks.bed' and 'rep2_normalized_peaks.bed' are the input normalized peak regions.
# The script merges overlapping regions from multiple replicate BED files based on L2 fold enrichment.
perl merge_peaks/bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
    rep1_normalized_peaks.bed \
    rep2_normalized_peaks.bed \
    > merged_replicate_peaks.bed

View on GitHub

Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

IDR v2.0.2 GitHub

$ Bash example

# Install IDR (if not already installed)
# conda install -c bioconda idr

# Install merge_peaks pipeline (if not already installed)
# git clone https://github.com/yeolab/merge_peaks.git
# export PATH=$PATH:/path/to/merge_peaks/scripts

# Define input files (placeholders)
INPUT_REP1_BED="replicate1.compressed.bed"
INPUT_REP2_BED="replicate2.compressed.bed"
GENOME_FASTA="/path/to/genome/hg38.fa" # Placeholder for reference genome fasta (e.g., hg38)
OUTPUT_PREFIX="idr_output"

# Step 1: Rank peaks by entropy score using make_informationcontent_from_peaks.pl
# This script is part of the merge_peaks pipeline (https://github.com/yeolab/merge_peaks).
# It takes a BED file and a genome fasta, and outputs a new BED file with an information content score (entropy) in the 5th column.
RANKED_REP1_BED="${INPUT_REP1_BED%.bed}.ranked.bed"
RANKED_REP2_BED="${INPUT_REP2_BED%.bed}.ranked.bed"

make_informationcontent_from_peaks.pl "${INPUT_REP1_BED}" "${GENOME_FASTA}" "${RANKED_REP1_BED}"
make_informationcontent_from_peaks.pl "${INPUT_REP2_BED}" "${GENOME_FASTA}" "${RANKED_REP2_BED}"

# Step 2: Run IDR (Irreproducible Discovery Rate) with the ranked peak files
# The --rank-by signal.value parameter tells IDR to use the 5th column (score) of the input BED files,
# which now contains the entropy score generated by make_informationcontent_from_peaks.pl.
idr --samples "${RANKED_REP1_BED}" "${RANKED_REP2_BED}" \
    --output-file "${OUTPUT_PREFIX}.idr" \
    --rank-by signal.value \
    --plot \
    --log-output-file "${OUTPUT_PREFIX}.log"

View on GitHub

Tools Used

eCLIP STAR

Raw Source Text

Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10
Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 '--genomeDir human_repbase '--readFilesIn path/to/read1 path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 30 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19). Args: --runThreadN 16 '--genomeDir genomedir '--readFilesIn /path/to/read1 /path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM   Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 1 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.4.1)
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Genome_build: hg19
Supplementary_files_format_and_content: BigWig, BED

← Back to Analysis