GSE126263 Processing Pipeline

RIP-Seq code_examples 25 steps

Publication

Characterization of an RNA binding protein interactome reveals a context-specific post-transcriptional landscape of MYC-amplified medulloblastoma.

Nature communications (2022) — PMID 36473869

Dataset

GSE126263

Musashi-1 is a master regulator of aberrant translation in Group 3 medulloblastoma

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).

eclipdemux v0.0.1 GitHub

$ Bash example

# Assuming eclipdemux is installed and in your PATH, or you have the script available.
# For example, if using the script directly:
# git clone https://github.com/yeolab/eclip.git
# cd eclip/scripts
# python eclipdemux.py --input_fastq input.fastq.gz --output_fastq output_demux.fastq.gz --barcode_length 6 --randomer_length 10

# Placeholder for input and output files
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="output_demux.fastq.gz"

# Inferring common parameters for barcode and randomer lengths
# (These values are examples; actual lengths would depend on library preparation)
BARCODE_LENGTH=6
RANDOMER_LENGTH=10

eclipdemux --input_fastq ${INPUT_FASTQ} --output_fastq ${OUTPUT_FASTQ} --barcode_length ${BARCODE_LENGTH} --randomer_length ${RANDOMER_LENGTH}

View on GitHub

Args: --length 10

fastp (Inferred with models/gemini-2.5-flash) v0.23.2 GitHub

$ Bash example

# Install fastp if not already installed
# conda install -c bioconda fastp

# Example usage: Filter reads to a minimum length of 10
# Replace input.fastq.gz and output.fastq.gz with actual file names
fastp --in1 input.fastq.gz --out1 output.fastq.gz --length_required 10

View on GitHub

Reads were then trimmed with cutadapt (1.9.1).

cutadapt v1.9.1 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.9.1

# Define input and output file names (replace with actual file paths)
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"
LOG_FILE="cutadapt_trimming.log"

# Define adapter sequences (replace with actual adapters if known)
# For eCLIP, common adapters might be specific to the library preparation.
# Example Illumina universal adapters (replace with your specific adapters):
# ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
# If adapters are not explicitly known, cutadapt can sometimes auto-detect or you might use a generic sequence.
# Using placeholders as no specific adapters were provided in the description.
ADAPTER_R1="YOUR_ADAPTER_SEQUENCE_R1"
ADAPTER_R2="YOUR_ADAPTER_SEQUENCE_R2"

# Define quality cutoff and minimum length (common values, adjust as needed)
QUALITY_CUTOFF=20 # Phred quality score cutoff
MIN_LENGTH=20   # Minimum read length after trimming

# Execute cutadapt for paired-end trimming
cutadapt -a "${ADAPTER_R1}" \
         -A "${ADAPTER_R2}" \
         -q "${QUALITY_CUTOFF}" \
         -m "${MIN_LENGTH}" \
         -o "${OUTPUT_R1}" \
         -p "${OUTPUT_R2}" \
         "${INPUT_R1}" "${INPUT_R2}" \
         &> "${LOG_FILE}"

echo "Trimming complete. Output files: ${OUTPUT_R1}, ${OUTPUT_R2}"
echo "Log file: ${LOG_FILE}"

View on GitHub

Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

eCLIP v0.1.5 GitHub

$ Bash example

# Clone the eCLIP pipeline repository (if not already installed)
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# git checkout 0.1.5 # Or the specific version
# export PATH=$(pwd)/bin:$PATH # Add eCLIP scripts to PATH

# Example input and output files
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="output.fastq.gz"

# Adapter sequences (these are typically generated or provided by the pipeline
# and contain the actual adapter sequences to be trimmed. The description
# implies these specific fasta files were prepared by a previous step
# within the eCLIP pipeline).
# For demonstration, assuming they exist in the current directory or a specified path.
G_ADAPTERS="g_adapters.fasta"
A_ADAPTERS="A_adapters.fasta"
a_ADAPTERS="a_adapters.fasta"

# Execute parsebarcodes.sh with the specified arguments
parsebarcodes.sh "${INPUT_FASTQ}" "${OUTPUT_FASTQ}" \
    --match-read-wildcards \
    -O 1 \
    --times 1 \
    -e 0.1 \
    --quality-cutoff 6 \
    -m 18 \
    -g "${G_ADAPTERS}" \
    -A "${A_ADAPTERS}" \
    -a "${a_ADAPTERS}"

View on GitHub

Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.

cutadapt v1.9.1 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.9.1

# Define input and output file names
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_FASTQ="trimmed_reads_double_ligation.fastq.gz"

# Define the adapter sequence associated with double-ligation events.
# This sequence needs to be determined based on the specific library preparation protocol.
# For eCLIP, this might be a specific internal adapter or the 3' adapter itself appearing internally.
# Placeholder: Replace 'ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION' with the actual sequence.
# A common Illumina 3' adapter is 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA'.
# If a specific internal adapter is used in the eCLIP protocol for double-ligation events,
# that sequence should be used here.
ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION="SPECIFIC_DOUBLE_LIGATION_ADAPTER_SEQUENCE"

# Execute cutadapt to remove the double-ligation adapter sequence
# -b ADAPTER_SEQUENCE: Trims the adapter from anywhere in the read (5', 3', or internal).
# -o: Specifies the output file.
cutadapt -b "${ADAPTER_SEQUENCE_FOR_DOUBLE_LIGATION}" \
         -o "${OUTPUT_FASTQ}" \
         "${INPUT_FASTQ}"

View on GitHub

Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

eCLIP v0.1.5 GitHub

$ Bash example

# Install cutadapt
# conda install -c bioconda cutadapt

# Placeholder for input and output files
# Replace 'input.fastq.gz' with your actual input FASTQ file
# Replace 'trimmed.fastq.gz' with your desired output FASTQ file

# A_adapters.fasta should be generated from parsebarcodes.sh as mentioned in the description.
# Example: parsebarcodes.sh <input_fastq> > A_adapters.fasta

cutadapt \
    --match-read-wildcards \
    -O 5 \
    --times 1 \
    -e 0.1 \
    --quality-cutoff 6 \
    -m 18 \
    -A A_adapters.fasta \
    -o trimmed.fastq.gz \
    input.fastq.gz

View on GitHub

Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

# Define variables
TRIMMED_READS="trimmed_reads.fastq.gz" # Placeholder for trimmed reads input
REPBASE_INDEX="/path/to/RepBase18.05_STAR_index" # Placeholder for STAR index built from RepBase 18.05
OUTPUT_PREFIX="mapped_to_repbase" # Prefix for output files
NUM_THREADS=8 # Placeholder for number of threads

# Run STAR alignment
STAR --genomeDir "${REPBASE_INDEX}" \
     --readFilesIn "${TRIMMED_READS}" \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate

View on GitHub

Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Placeholder: Replace with the actual path to your STAR genome index directory for human_repbase
GENOME_DIR="human_repbase"
READ1="path/to/read1"      # Placeholder: Replace with the actual path to your R1 FASTQ file
READ2="path/to/read2"      # Placeholder: Replace with the actual path to your R2 FASTQ file
OUT_PREFIX="out_prefix"    # Desired prefix for output files

# Run STAR alignment
STAR \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --outFileNamePrefix "${OUT_PREFIX}" \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd

View on GitHub

Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (version 2.4.0i)
# conda create -n star_env star=2.4.0i -c bioconda -c conda-forge
# conda activate star_env

# Placeholder for STAR genome index directory
# This directory should contain the STAR index files generated for hg19.
# Example command to build index (run once):
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/hg19_STAR_index \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/hg19.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN 8 # Adjust threads as needed

GENOME_DIR="/path/to/hg19_STAR_index" # Replace with actual path to hg19 STAR index
INPUT_READS="unmapped_filtered_reads.fastq.gz" # Placeholder for input FASTQ file (e.g., from previous filtering step)
OUTPUT_PREFIX="mapped_hg19" # Prefix for output files

STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_READS}" \
     --runThreadN 8 \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate

View on GitHub

Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Placeholder for STAR genome index generation (if not already done)
# This step needs to be performed once for a given genome and annotation.
# Example for human GRCh38:
# STAR --runThreadN 16 \
#      --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_genome_index_GRCh38 \
#      --genomeFastaFiles /path/to/GRCh38.fa \
#      --sjdbGTFfile /path/to/GRCh38.gtf

# Align reads using STAR
STAR \
    --runThreadN 16 \
    --genomeDir /path/to/STAR_genome_index_GRCh38 \
    --readFilesIn /path/to/read1 /path/to/read2 \
    --outFileNamePrefix out_prefix \
    --outReadsUnmapped Fastx \
    --outSAMtype BAM Unsorted \
    --outSAMattributes All \
    --outSAMunmapped Within \
    --outSAMattrRGline ID:foo \
    --outFilterType BySJout \
    --outFilterMultimapNmax 1 \
    --outFilterMultimapScoreRange 1 \
    --outFilterScoreMin 10 \
    --alignEndsType EndToEnd

View on GitHub

Aligned reads were sorted with samtools (1.4.1)

samtools v1.4.1 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.4.1

# Sort aligned reads
# Assuming 'aligned_reads.bam' is the input BAM file
# And 'sorted_reads.bam' will be the sorted output BAM file
samtools sort -o sorted_reads.bam aligned_reads.bam

View on GitHub

Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.

eCLIP v0.1.5 GitHub

$ Bash example

# Install eclip (assuming Python 3 is available)
# pip install eclip
# Or, if using conda:
# conda create -n eclip_env python=3.8
# conda activate eclip_env
# pip install eclip

# Define input and output files (placeholders)
INPUT_R1="sorted_reads_R1.fastq.gz"
INPUT_R2="sorted_reads_R2.fastq.gz"
OUTPUT_R1="collapsed_reads_R1.fastq.gz"
OUTPUT_R2="collapsed_reads_R2.fastq.gz"
BARCODE_LENGTH=10 # Common barcode length in eCLIP protocols (e.g., 10bp)
MIN_READS=2     # Minimum number of reads to collapse (default in some protocols)

# Execute barcodecollapsepe.py
barcodecollapsepe.py \
    --input "${INPUT_R1}" \
    --input2 "${INPUT_R2}" \
    --output "${OUTPUT_R1}" \
    --output2 "${OUTPUT_R2}" \
    --barcode-length "${BARCODE_LENGTH}" \
    --min-reads "${MIN_READS}"

View on GitHub

Args: -o preRmDup.bam -m metrics_file -b rmDup.bam

picard MarkDuplicates (Inferred with models/gemini-2.5-flash) v2.27.4 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Picard (e.g., via Conda):
# conda install -c bioconda picard

# Execute picard MarkDuplicates
# The arguments -b, -o, -m are mapped to Picard's I=, O=, M= respectively.
picard MarkDuplicates I=rmDup.bam O=preRmDup.bam M=metrics_file

View on GitHub

PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)

samtools v1.4.1 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.4.1

# Merge PCR de-duped BAM files
samtools merge merged.bam input1_deduped.bam input2_deduped.bam inputN_deduped.bam

View on GitHub

Merged alignments were split to keep just read2 using samtools (1.4.1) view.

samtools v1.4.1 GitHub

$ Bash example

# Define input and output file paths
INPUT_BAM="merged_alignments.bam" # Placeholder for your merged alignment BAM file
OUTPUT_READ2_BAM="read2_alignments.bam" # Placeholder for the output BAM containing only read2

# Split merged alignments to keep just read2 using samtools view
# -f 128: Selects alignments where the 0x80 flag (read is second in pair) is set.
# -b: Output in BAM format.
samtools view -f 128 -b "${INPUT_BAM}" > "${OUTPUT_READ2_BAM}"

View on GitHub

Args: -h -b -f 128

Unknown Tool (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# The tool could not be inferred from the provided description: "Args: -h -b -f 128".
# The arguments -h, -b, -f 128 are generic and do not point to a specific bioinformatics tool.
# Please provide more context or a tool name to generate a specific command.

# No specific installation instructions can be provided as the tool is unknown.

# Placeholder command reflecting the provided arguments:
unknown_tool -h -b -f 128

Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).

CLIPper v1.2.2 GitHub

$ Bash example

# Install CLIPper and its dependencies (e.g., using conda)
# conda create -n clipper_env python=3.8
# conda activate clipper_env
# pip install pysam numpy scipy

# Clone the CLIPper repository if not already available
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Run CLIPper to identify peak clusters
# Replace 'input_read2.bam' with your actual Read2 BAM file.
# Replace 'hg38' with the appropriate genome assembly if different.
# The output files will be prefixed with 'clipper_peaks'.
python clipper.py \
    -g hg38 \
    -o clipper_peaks \
    input_read2.bam

View on GitHub

Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

Gene-aware BAM Processor (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# This tool is inferred to be a custom Python script or a specialized gene-aware BAM processor, likely part of an eCLIP pipeline.
# The specific tool name and installation steps are not explicitly provided in the description.
# Installation instructions would depend on the specific script and its dependencies.
# Example (placeholder):
# git clone https://github.com/some/repo.git
# cd repo
# pip install -r requirements.txt

# Reference genome: hg19 (Human Genome build 19)
# Source: UCSC Genome Browser or NCBI

# Execute the inferred tool
gene_aware_bam_processor --species hg19 \
                             --bam path/to/input.bam \
                             --timeout 3600000 \
                             --maxgenes 1000000 \
                             --save-pickle \
                             --outfile path/to/output.bam

View on GitHub

Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

eCLIP v0.1.5 GitHub

$ Bash example

# Install eCLIP scripts (if not already installed)
# git clone https://github.com/yeolab/eclip.git
# export PATH=$PATH:$(pwd)/eclip/bin

# Define input files and output prefix
PEAK_FILE="peak_clusters.bed" # Input peak clusters file (e.g., from a previous peak calling step)
IP_BAM="ip.bam"             # Original IP BAM file (paired-end)
INPUT_BAM="input.bam"         # Original INPUT BAM file (paired-end)
OUTPUT_PREFIX="normalized_peaks" # Prefix for output files

# Filter IP and INPUT BAM files to retain only read2 (second in pair)
# The -f 0x80 flag for samtools view selects the second segment in a pair.
# The -b flag outputs in BAM format.
samtools view -b -f 0x80 "${IP_BAM}" > "${IP_BAM%.bam}_read2.bam"
samtools view -b -f 0x80 "${INPUT_BAM}" > "${INPUT_BAM%.bam}_read2.bam"

IP_READ2_BAM="${IP_BAM%.bam}_read2.bam"
INPUT_READ2_BAM="${INPUT_BAM%.bam}_read2.bam"

# Run peaksnormalize.pl for normalization
# Usage: peaksnormalize.pl <peakfile> <IP_BAM> <INPUT_BAM> <output_prefix>
# This script will output files such as <output_prefix>.normalized.bed, <output_prefix>.log, etc.
peaksnormalize.pl "${PEAK_FILE}" "${IP_READ2_BAM}" "${INPUT_READ2_BAM}" "${OUTPUT_PREFIX}"

# Optional: Clean up intermediate read2 BAM files
# rm "${IP_READ2_BAM}" "${INPUT_READ2_BAM}"

View on GitHub

Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

eCLIP v0.1.5 GitHub

$ Bash example

# Install eCLIP pipeline (version 0.1.5 or compatible)
# This script is typically part of the eCLIP pipeline distribution.
# You might need to clone the repository and add its scripts directory to your PATH,
# or call the script with its full path.
# Example:
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# # If a specific tag for v0.1.5 exists, use it:
# # git checkout v0.1.5
# # Otherwise, use a commit close to that version or the main branch.
# export PATH=$PATH:$(pwd)/scripts # Assuming the script is in a 'scripts' directory within the cloned repo

# Define input and output files
# These are placeholder names for normalized peak regions in BED format.
INPUT_PEAKS_REP1="replicate1_normalized_peaks.bed"
INPUT_PEAKS_REP2="replicate2_normalized_peaks.bed"
OUTPUT_MERGED_PEAKS="merged_replicate_overlapping_peaks.bed"

# Execute the merging script
# The script 'compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl'
# is designed to merge overlapping normalized peak regions from replicates.
# It likely takes multiple BED files as input and outputs a single merged BED file.
# The 'l2foldenrpeakfi' in the name suggests it processes L2 fold enrichment information.
perl /path/to/eclip/scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
    "${INPUT_PEAKS_REP1}" \
    "${INPUT_PEAKS_REP2}" \
    > "${OUTPUT_MERGED_PEAKS}"

View on GitHub

Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

IDR v2.0.2 GitHub

$ Bash example

# Install IDR (if not already installed)
# conda install -c bioconda idr

# Example input files (replace with actual paths to normalized peak files, 
# which have likely been processed by make_informationcontent_from_peaks.pl 
# to include entropy scores, potentially in the signal.value column).
REP1_PEAKS="replicate1_normalized_peaks.narrowPeak"
REP2_PEAKS="replicate2_normalized_peaks.narrowPeak"
OUTPUT_PREFIX="idr_reproducible_peaks"

# Run IDR to determine reproducible peaks
idr \
    --samples "${REP1_PEAKS}" "${REP2_PEAKS}" \
    --output-file "${OUTPUT_PREFIX}.idr" \
    --rank signal.value \
    --plot \
    --log-output-file "${OUTPUT_PREFIX}.log" \
    --soft-idr-threshold 0.05

View on GitHub

Reproducible peaks were annotated by overlapping peak regions with Gencode hg19 annotations

GENCODE v2.29.2 GitHub

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools=2.29.2

# Download Gencode hg19 annotation GTF
wget -nc http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
gunzip -f gencode.v19.annotation.gtf.gz

# Convert Gencode GTF to a BED file containing gene regions
# This extracts chromosome, 0-based start, 1-based end, gene_id, strand, and a placeholder score (e.g., 0)
# The gene_id is extracted from the attributes column.
awk -v OFS='\t' '$3 == "gene" {
    # Extract gene_id from the 9th column (attributes)
    gene_id = "";
    if (match($9, /gene_id "([^"]+)"/)) {
        gene_id = substr($9, RSTART + 9, RLENGTH - 10);
    }
    print $1, $4-1, $5, gene_id, 0, $7
}' gencode.v19.annotation.gtf > gencode.v19.genes.bed

# Placeholder for reproducible peaks file
# Replace 'reproducible_peaks.bed' with the actual path to your peak file
# Example: reproducible_peaks.bed could be the output of an IDR step.
# For eCLIP, this might be the merged peaks from yeolab/merge_peaks.
# Example: cp /path/to/your/reproducible_peaks.bed .

# Annotate reproducible peaks by overlapping with Gencode hg19 gene annotations
# -a: reproducible peaks file (e.g., reproducible_peaks.bed)
# -b: Gencode hg19 gene annotations BED file (gencode.v19.genes.bed)
# -wa: write the original entry in A for each overlap
# -wb: write the original entry in B for each overlap
# The output will contain the peak region followed by the overlapping gene region.
bedtools intersect -a reproducible_peaks.bed -b gencode.v19.genes.bed -wa -wb > annotated_peaks.bed

View on GitHub

Region-based fold enriched was calculated by mapped reads fold change between IP and Input.

deeptools bigwigCompare (Inferred with models/gemini-2.5-flash) v3.5.1 GitHub

$ Bash example

# Install deeptools if not already installed
# conda install -c bioconda deeptools=3.5.1

# Calculate genome-wide log2 fold enrichment between IP and Input bigWig files.
# This output bigWig file represents the 'region-based fold enriched' signal across the genome.
# It can then be used for further region-specific analysis (e.g., averaging over called peaks).
# Assuming 'ip.bigWig' and 'input.bigWig' are already normalized bigWig files (e.g., RPKM, CPM).
# The --pseudocount 1 is added to avoid issues with zero values when calculating log2.
# The bigWig files are assumed to be based on a common reference genome assembly, e.g., hg38.
deep_tools bigwigCompare \
    -b1 ip.bigWig \
    -b2 input.bigWig \
    --operation log2 \
    --pseudocount 1 \
    -o fold_enrichment.bigWig \
    --numberOfProcessors auto \
    --verbose

View on GitHub

Each read was assigned to a single region with the following descending priority order: CDS, 5âUTR, 3âUTR, Intron.

Feature Assignment Script (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Define input and output files
INPUT_BAM="aligned_reads.bam" # Placeholder: Replace with your actual sorted and indexed BAM file
OUTPUT_BAM_TAGGED="assigned_reads_tagged.bam"
OUTPUT_SUMMARY="read_assignment_summary.tsv"

# Reference files (placeholders - replace with actual paths to your reference files)
# For hg38 and GENCODE v44:
# GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
# GTF_FILE="path/to/gencode.v44.annotation.gtf"

# --- Installation (commented out) ---
# # Install necessary tools if not already available.
# # This script requires Python 3, pysam, pybedtools, bedtools, and samtools.
# # Example using conda:
# # conda create -n read_assign python=3.9 pysam pybedtools bedtools samtools -c bioconda -y
# # conda activate read_assign

# --- Step 1: Generate feature BED files from GTF ---
# This step extracts coordinates for CDS, 5'UTR, 3'UTR, and Introns from the GTF.
# This is a conceptual representation. In a real pipeline, you would parse
# your GTF file (e.g., gencode.v44.annotation.gtf) to generate these BED files.
# A robust implementation would handle overlapping features within the same category
# and accurately derive intron coordinates (e.g., from gene regions minus exons).

# For the purpose of this example, we create dummy BED files.
# Replace this section with your actual GTF parsing logic or pre-generated BED files.
echo -e "chr1\t1000\t2000\tgeneA_CDS\t0\t+" > cds.bed
echo -e "chr1\t500\t999\tgeneA_CDS_exon2\t0\t+" >> cds.bed # Example of multiple CDS regions
echo -e "chr1\t200\t499\tgeneA_5UTR\t0\t+" > utr5.bed
echo -e "chr1\t2001\t2500\tgeneA_3UTR\t0\t+" > utr3.bed
echo -e "chr1\t2501\t3000\tgeneA_Intron1\t0\t+" > intron.bed
echo -e "chr1\t3001\t4000\tgeneA_Intron2\t0\t+" >> intron.bed

# Sort and merge BED files to ensure proper intersection behavior and remove overlaps within categories.
# This is crucial for accurate feature assignment.
sort -k1,1 -k2,2n cds.bed | bedtools merge -i - > cds.merged.bed
sort -k1,1 -k2,2n utr5.bed | bedtools merge -i - > utr5.merged.bed
sort -k1,1 -k2,2n utr3.bed | bedtools merge -i - > utr3.merged.bed
sort -k1,1 -k2,2n intron.bed | bedtools merge -i - > intron.merged.bed

# --- Step 2: Prioritized Read Assignment Script ---
# This Python script assigns each read to a single region based on the defined priority.
# It uses pysam to read/write BAM files and pybedtools for efficient genomic interval intersection.
# Reads will be tagged with an 'RA' (Read Assignment) tag indicating their assigned feature type.

# Create the Python script inline for demonstration purposes.
cat << 'EOF' > assign_reads_priority.py
import pysam
import pybedtools
import sys
import os

def main():
    if len(sys.argv) != 8:
        print("Usage: python assign_reads_priority.py <input_bam> <output_bam> <output_summary> <cds_bed> <utr5_bed> <utr3_bed> <intron_bed>")
        sys.exit(1)

    input_bam = sys.argv[1]
    output_bam = sys.argv[2]
    output_summary = sys.argv[3]
    cds_bed_path = sys.argv[4]
    utr5_bed_path = sys.argv[5]
    utr3_bed_path = sys.argv[6]
    intron_bed_path = sys.argv[7]

    # Load feature BED files
    # Ensure BED files are sorted and merged for optimal performance
    cds_features = pybedtools.BedTool(cds_bed_path)
    utr5_features = pybedtools.BedTool(utr5_bed_path)
    utr3_features = pybedtools.BedTool(utr3_bed_path)
    intron_features = pybedtools.BedTool(intron_bed_path)

    # Open input and output BAM files
    samfile = pysam.AlignmentFile(input_bam, "rb")
    outfile = pysam.AlignmentFile(output_bam, "wb", template=samfile)

    assigned_counts = {'CDS': 0, '5UTR': 0, '3UTR': 0, 'Intron': 0, 'Unassigned': 0}
    total_reads = 0

    for read in samfile.fetch():
        total_reads += 1
        assigned = False
        # Create a pybedtools Interval for the read
        # Ensure coordinates are 0-based for pybedtools
        read_interval = pybedtools.create_interval_from_list([read.reference_name, str(read.reference_start), str(read.reference_end), read.query_name, '0', '-' if read.is_reverse else '+'])

        # Priority 1: CDS
        if cds_features.intersect(read_interval, wa=True, u=True):
            read.set_tag('RA', 'CDS', value_type='Z') # Read Assignment tag
            assigned_counts['CDS'] += 1
            assigned = True
        # Priority 2: 5'UTR (if not assigned yet)
        elif utr5_features.intersect(read_interval, wa=True, u=True):
            read.set_tag('RA', '5UTR', value_type='Z')
            assigned_counts['5UTR'] += 1
            assigned = True
        # Priority 3: 3'UTR (if not assigned yet)
        elif utr3_features.intersect(read_interval, wa=True, u=True):
            read.set_tag('RA', '3UTR', value_type='Z')
            assigned_counts['3UTR'] += 1
            assigned = True
        # Priority 4: Intron (if not assigned yet)
        elif intron_features.intersect(read_interval, wa=True, u=True):
            read.set_tag('RA', 'Intron', value_type='Z')
            assigned_counts['Intron'] += 1
            assigned = True
        else:
            read.set_tag('RA', 'Unassigned', value_type='Z')
            assigned_counts['Unassigned'] += 1

        outfile.write(read)

    samfile.close()
    outfile.close()

    # Write summary
    with open(output_summary, 'w') as f:
        f.write(f'Total_Reads\t{total_reads}\n')
        for category, count in assigned_counts.items():
            f.write(f'{category}\t{count}\n')

    # Index the output BAM
    pysam.index(output_bam)

    print("Read assignment complete. Output BAM tagged and indexed.")

if __name__ == "__main__":
    main()
EOF

# Execute the Python script
python assign_reads_priority.py \
    "${INPUT_BAM}" \
    "${OUTPUT_BAM_TAGGED}" \
    "${OUTPUT_SUMMARY}" \
    cds.merged.bed \
    utr5.merged.bed \
    utr3.merged.bed \
    intron.merged.bed

# Clean up temporary files (optional)
rm cds.bed utr5.bed utr3.bed intron.bed \
   cds.merged.bed utr5.merged.bed utr3.merged.bed intron.merged.bed
rm assign_reads_priority.py

View on GitHub

For each gene, reads were summed up across each region to calculate final region counts, and fold change.

featureCounts (Inferred with models/gemini-2.5-flash) v2.0.3

$ Bash example

# Install Subread (which includes featureCounts) if not already installed
# conda install -c bioconda subread

# Define input and output paths
BAM_FILE="aligned_reads.bam" # Replace with your actual aligned BAM file
GTF_FILE="Homo_sapiens.GRCh38.109.gtf" # Replace with your actual GTF annotation file (e.g., from Ensembl or Gencode)
OUTPUT_COUNTS_FILE="gene_region_counts.txt"

# Sum reads across gene regions to calculate final region counts
# -a: annotation file (GTF/GFF format)
# -o: output file for counts
# -t exon: specify feature type to count (e.g., 'exon' for gene-level counts)
# -g gene_id: specify attribute type to aggregate counts by (e.g., 'gene_id')
# -p: specify if reads are paired-end (remove if single-end)
# -s 0: strand-specific (0=unstranded, 1=stranded, 2=reverse stranded) - adjust based on library prep
featureCounts -a "${GTF_FILE}" \
              -o "${OUTPUT_COUNTS_FILE}" \
              -t "exon" \
              -g "gene_id" \
              -p \
              -s 0 \
              "${BAM_FILE}"

# Note: Calculating fold change is typically a subsequent step, often performed
# using statistical packages like DESeq2 or edgeR in R, which take the
# count matrix (e.g., gene_region_counts.txt) as input along with sample metadata.
# This part is not a direct bash command but a conceptual follow-up analysis.
# Example conceptual R code (not bash):
# library(DESeq2)
# counts_data <- read.table("gene_region_counts.txt", header=TRUE, row.names=1, skip=1) # Adjust skip based on featureCounts header
# # ... further DESeq2 steps to create a DESeqDataSet and calculate fold change ...

Tools Used

eCLIP STAR

Raw Source Text

Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10
Reads were then trimmed with cutadapt (1.9.1). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir human_repbase \  --readFilesIn path/to/read1 path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 /path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.4.1)
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Reproducible peaks were annotated by overlapping peak regions with Gencode hg19 annotations
Region-based fold enriched was calculated by mapped reads fold change between IP and Input. Each read was assigned to a single region with the following descending priority order: CDS, 5âUTR, 3âUTR, Intron. For each gene, reads were summed up across each region to calculate final region counts, and fold change.
Genome_build: hg19
Supplementary_files_format_and_content: tab-delimited text files include -log10pValues and Log2FoldChange values for each IDR peaks called after cutoff of 3,3 in each parameter

← Back to Analysis