GSE157259 Processing Pipeline

RIP-Seq code_examples 22 steps

Publication

RNA binding protein DDX5 restricts RORγt<sup>+</sup> T<sub>reg</sub> suppressor function to promote intestine inflammation.

Science advances (2023) — PMID 36724232

Dataset

GSE157259

DDX5 RNA interactome in cultured T cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).

eclipdemux v0.0.1 GitHub

$ Bash example

# Install eclipdemux (if not already installed)
# pip install eclipdemux

# Example usage of eclipdemux
# input_reads.fastq.gz: Your raw sequenced reads containing inline barcodes.
# demuxed_reads.fastq.gz: Output file with barcodes removed and randomers added to headers.
# inline_barcodes.txt: A file containing the inline barcode sequences to be removed.
# 10: Placeholder for the length of the randomer to be added to read headers.
eclipdemux -i input_reads.fastq.gz -o demuxed_reads.fastq.gz -b inline_barcodes.txt -r 10

View on GitHub

2
Args: --length 10

cutadapt (Inferred with models/gemini-2.5-flash) v4.1 GitHub
$ Bash example
```
# conda install -c bioconda cutadapt
cutadapt --minimum-length 10 -o output.fastq.gz input.fastq.gz
```
View on GitHub

Reads were then trimmed with cutadapt (1.9.1).

cutadapt v1.9.1 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.9.1

# Define input and output file names (placeholders)
# Replace with actual input FASTQ files
INPUT_R1="input_reads_R1.fastq.gz"
INPUT_R2="input_reads_R2.fastq.gz"
OUTPUT_R1="trimmed_reads_R1.fastq.gz"
OUTPUT_R2="trimmed_reads_R2.fastq.gz"

# Trim reads for quality and minimum length.
# -q 20,20: Trim low-quality bases from 5' and 3' ends with a quality threshold of 20.
# -m 20: Discard reads shorter than 20 bp after trimming.
# No specific adapters were mentioned in the description, so adapter trimming is omitted.
cutadapt -q 20,20 -m 20 -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" "${INPUT_R1}" "${INPUT_R2}"

View on GitHub

Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

eCLIP v0.1.5 GitHub

$ Bash example

# Clone the eCLIP pipeline repository to access parsebarcodes.sh
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# # Ensure parsebarcodes.sh is executable and in PATH, or call it directly
# # chmod +x bin/parsebarcodes.sh

# Placeholder for input raw reads file (e.g., a FASTQ file containing reads with barcodes/adapters)
input_fastq="raw_reads.fastq"

# Placeholder for output prefix
# parsebarcodes.sh typically outputs processed FASTQ files (e.g., processed_reads.fastq) and a log file.
output_prefix="processed_reads"

# Adapter files (placeholders - these would be specific files from the eCLIP pipeline resources)
# These files contain the adapter sequences to be matched and trimmed.
# They are typically located in the 'resources' or 'adapters' directory of the eCLIP pipeline.
g_adapters_fasta="g_adapters.fasta" # Example: eclip/resources/g_adapters.fasta
A_adapters_fasta="A_adapters.fasta" # Example: eclip/resources/A_adapters.fasta
a_adapters_fasta="a_adapters.fasta" # Example: eclip/resources/a_adapters.fasta

# Execute parsebarcodes.sh with the specified arguments
# The script processes the input_fastq, identifies and trims adapters/barcodes,
# and outputs processed reads to files named with the output_prefix.
parsebarcodes.sh \
    "${input_fastq}" \
    "${output_prefix}" \
    --match-read-wildcards \
    -O 1 \
    --times 1 \
    -e 0.1 \
    --quality-cutoff 6 \
    -m 18 \
    -g "${g_adapters_fasta}" \
    -A "${A_adapters_fasta}" \
    -a "${a_adapters_fasta}"

View on GitHub

Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.

cutadapt v1.9.1 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.9.1

# Define input and output files
# INPUT_FASTQ should be the output from the previous trimming step or raw reads if this is the first trimming.
INPUT_FASTQ="input_reads_from_previous_trimming.fastq.gz"
OUTPUT_FASTQ="trimmed_reads_final.fastq.gz"

# Define the 3' adapter sequence commonly associated with double-ligation events in eCLIP.
# This sequence is a placeholder for the actual 3' adapter used in the library preparation (e.g., Illumina TruSeq Small RNA 3' Adapter).
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Define minimum read length after trimming. Common for eCLIP is 18bp.
MIN_READ_LENGTH=18

# Execute cutadapt to remove the 3' adapter and filter by minimum length.
# The -a flag specifies a 3' adapter to be removed.
# The -m flag specifies the minimum length of reads to be kept after trimming.
cutadapt -a "${ADAPTER_3PRIME}" \
         -m "${MIN_READ_LENGTH}" \
         -o "${OUTPUT_FASTQ}" \
         "${INPUT_FASTQ}"

View on GitHub

Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

eCLIP v0.1.5 GitHub

$ Bash example

# Install cutadapt if not already available
# conda install -c bioconda cutadapt

# Assuming 'input_reads.fastq.gz' are the raw reads and 'trimmed_reads.fastq.gz' will be the output.
# A_adapters.fasta is generated by parsebarcodes.sh as described within the eclip 0.1.5+ pipeline.
cutadapt \
  --match-read-wildcards \
  -O 1 \
  --times 1 \
  -e 0.1 \
  --quality-cutoff 6 \
  -m 18 \
  -A A_adapters.fasta \
  -o trimmed_reads.fastq.gz \
  input_reads.fastq.gz

View on GitHub

Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05).

STAR v2.4.0i GitHub

$ Bash example

# Assume STAR is installed and available in PATH.
# Assume the RepBase 18.05 mouse repeat element database FASTA file (e.g., RepBase18.05_mouse.fasta) has been indexed by STAR.
# Example command to generate STAR index (if not already done):
# mkdir -p /path/to/repbase_18.05_star_index
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/repbase_18.05_star_index \
#      --genomeFastaFiles /path/to/RepBase18.05_mouse.fasta \
#      --genomeSAindexNbases 10 # Adjust based on repeat element lengths and total size

# Placeholder for STAR genome index directory for RepBase 18.05
REPBASE_STAR_INDEX="/path/to/repbase_18.05_star_index"

# Placeholder for input trimmed reads file (single-end or first read of paired-end)
TRIMMED_READS_FASTQ="trimmed_reads.fastq.gz"

# If paired-end reads, use:
# TRIMMED_READS_FASTQ_R1="trimmed_reads_R1.fastq.gz"
# TRIMMED_READS_FASTQ_R2="trimmed_reads_R2.fastq.gz"
# And modify --readFilesIn to "${TRIMMED_READS_FASTQ_R1}" "${TRIMMED_READS_FASTQ_R2}"

# Placeholder for output prefix
OUTPUT_PREFIX="star_repbase_mapped"

# Placeholder for number of threads
NUM_THREADS=8

# STAR mapping command for repeat elements
# Key parameters for repeat mapping:
# --alignIntronMax 1: Disables splicing, as repeats are not expected to have introns.
# --outFilterMultimapNmax 1000: Allows reads to map to up to 1000 locations, crucial for highly repetitive sequences.
# --outSAMtype BAM SortedByCoordinate: Outputs a sorted BAM file.
STAR --runThreadN ${NUM_THREADS} \
     --genomeDir ${REPBASE_STAR_INDEX} \
     --readFilesIn ${TRIMMED_READS_FASTQ} \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1000 \
     --alignIntronMax 1 \
     --outSAMattributes NH HI AS nM NM MD

View on GitHub

Args: --runThreadN 16 \ --genomeDir mouse_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install STAR (example, uncomment if needed)
# conda install -c bioconda star

# Define variables (placeholders)
# 'mouse_repbase' implies a custom STAR genome index built for mouse (e.g., GRCm39) 
# and potentially including repetitive elements from databases like Repbase.
# Replace with the actual path to your pre-built STAR genome index.
GENOME_DIR="path/to/mouse_genome_index_GRCm39_with_repbase"
READ1="path/to/read1.fastq.gz"
READ2="path/to/read2.fastq.gz"
OUTPUT_PREFIX="out_prefix"

# Run STAR alignment
STAR \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd

View on GitHub

Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm10).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR if not already available
# conda install -c bioconda star

# Define variables
READS_FILE="unmapped_filtered_reads.fastq.gz" # Placeholder for input FASTQ file
GENOME_DIR="/path/to/mm10_star_index" # Placeholder for pre-built STAR genome index for mm10
OUTPUT_PREFIX="aligned_reads_" # Prefix for output files
THREADS=8 # Number of threads to use

# Note: The STAR genome index for mm10 needs to be built once prior to alignment.
# Example command to build the index (replace with actual paths):
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/Mus_musculus.GRCm38.dna.primary_assembly.fa \
#      --sjdbGTFfile /path/to/Mus_musculus.GRCm38.102.gtf \
#      --runThreadN ${THREADS}

# Align unmapped reads to the mouse genome (mm10) using STAR
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READS_FILE} \
     --runThreadN ${THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --readFilesCommand zcat

View on GitHub

Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Replace 'path/to/STAR_genome_index' with the actual path to your STAR genome index directory.
# This index should be built for the relevant reference genome (e.g., hg38/GRCh38 for human).
STAR_GENOME_DIR="path/to/STAR_genome_index"

# Replace '/path/to/read1' and '/path/to/read2' with the actual paths to your input FASTQ files.
# Ensure these paths are correct for your data.
READ1_FASTQ="/path/to/read1"
READ2_FASTQ="/path/to/read2"

# Replace 'out_prefix' with your desired prefix for output files (e.g., sample_aligned).
OUTPUT_PREFIX="out_prefix"

STAR \
    --runThreadN 16 \
    --genomeDir "${STAR_GENOME_DIR}" \
    --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
    --outFileNamePrefix "${OUTPUT_PREFIX}" \
    --outReadsUnmapped Fastx \
    --outSAMtype BAM Unsorted \
    --outSAMattributes All \
    --outSAMunmapped Within \
    --outSAMattrRGline ID:foo \
    --outFilterType BySJout \
    --outFilterMultimapNmax 1 \
    --outFilterMultimapScoreRange 1 \
    --outFilterScoreMin 10 \
    --alignEndsType EndToEnd

View on GitHub

Aligned reads were sorted with samtools (1.4.1)

samtools v1.4.1 GitHub

$ Bash example

# Install samtools (version 1.4.1)
# conda install -c bioconda samtools=1.4.1

# Sort aligned reads by coordinate
samtools sort -o output.sorted.bam input.bam

View on GitHub

Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.

eCLIP v0.1.5 GitHub

$ Bash example

# Install eCLIP (assuming a specific version or cloning the repository)
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# # If using a specific release, checkout the tag, e.g., git checkout v0.1.5
# # Ensure Python environment is set up with necessary dependencies

# Define input and output file paths
INPUT_READ1="sorted_reads_R1.fastq.gz"
INPUT_READ2="sorted_reads_R2.fastq.gz"
OUTPUT_PREFIX="collapsed_reads"
STATS_FILE="collapsed_reads_stats.txt"

# Execute barcodecollapsepe.py to collapse sorted paired-end reads
# The script is located within the cloned eclip repository at eclip/eclip/barcodecollapsepe.py
python eclip/eclip/barcodecollapsepe.py \
  -r1 "${INPUT_READ1}" \
  -r2 "${INPUT_READ2}" \
  -o "${OUTPUT_PREFIX}" \
  -s "${STATS_FILE}" \
  -b 10 \
  -m 10

View on GitHub

Args: -o preRmDup.bam -m metrics_file -b rmDup.bam

sambamba (Inferred with models/gemini-2.5-flash) v0.8.2 GitHub

$ Bash example

# Install sambamba if not already available
# conda install -c bioconda sambamba

# Define input and output files based on the description's arguments
# -b: Input BAM file (e.g., a BAM file that still contains duplicates)
# -o: Output BAM file (e.g., a BAM file with duplicates removed)
# -m: Metrics file (sambamba markdup outputs statistics to stderr, which can be redirected)
INPUT_BAM="rmDup.bam" # Placeholder for the input BAM file
OUTPUT_BAM="preRmDup.bam" # Placeholder for the output BAM file after deduplication
METRICS_FILE="metrics_file" # Placeholder for the deduplication metrics file

# Execute sambamba markdup to remove duplicates
# The command assumes that rmDup.bam is the input and preRmDup.bam is the output.
# Deduplication statistics are redirected to the metrics_file.
sambamba markdup "${INPUT_BAM}" "${OUTPUT_BAM}" 2> "${METRICS_FILE}"

View on GitHub

PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)

samtools v1.4.1 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.4.1

# Merge PCR de-duped reads from multiple inline barcodes
# Replace input_barcode_1.bam, input_barcode_2.bam, etc. with your actual input BAM files.
samtools merge merged.bam input_barcode_1.bam input_barcode_2.bam input_barcode_N.bam

View on GitHub

Merged alignments were split to keep just read2 using samtools (1.4.1) view.

samtools v1.4.1 GitHub

$ Bash example

# samtools version 1.4.1
# Input: merged.bam (placeholder for merged alignments)
# Output: read2.bam (placeholder for alignments containing only read2)

samtools view -f 0x80 -b -o read2.bam merged.bam

View on GitHub

Args: -h -b -f 128

Unknown (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# The specific tool for these arguments is not provided in the description.
# This command uses a placeholder 'unknown_tool'.
unknown_tool -h -b -f 128

Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).

CLIPper v1.2.2 GitHub

$ Bash example

# Install CLIPper (if not already installed)
# pip install clipper # Or clone from GitHub and run setup.py

# Define input and output files
INPUT_BAM="read2.bam" # Placeholder for "Read2 BAM files"
OUTPUT_DIR="clipper_peaks"
SPECIES="hg38" # Placeholder for species, e.g., hg38, mm10. Adjust as per experiment.

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run CLIPper to identify peak clusters
# CLIPper version 1.2.2
# Note: For eCLIP, it's common to use a control BAM (e.g., input, IgG) with the -c flag.
# Example: clipper.py -s "${SPECIES}" -o "${OUTPUT_DIR}" -c "control.bam" "${INPUT_BAM}"
clipper.py -s "${SPECIES}" -o "${OUTPUT_DIR}" "${INPUT_BAM}"

View on GitHub

Args: --species mm10 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

gene_bam_processor (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This is a placeholder for a custom script or tool that processes BAM files
# based on gene information and saves intermediate data in pickle format.
# The specific installation instructions would depend on the actual script/tool.
# For example, if it's a Python script:
# # conda create -n gene_bam_env python=3.9
# # conda activate gene_bam_env
# # pip install pandas numpy # (example dependencies)

# Reference genome for mm10 (Mouse)
# Source: UCSC Genome Browser (e.g., for genome annotation like GTF/GFF for gene definitions)
# MM10_GENOME_ANNOTATION="/path/to/mm10.ncbiRefSeq.gtf" # Placeholder for actual annotation file

# Execute the inferred gene_bam_processor
# Assuming the script is named 'gene_bam_processor.py'
python gene_bam_processor.py \
    --species mm10 \
    --bam path/to/input.bam \
    --timeout 3600000 \
    --maxgenes 1000000 \
    --save-pickle \
    --outfile path/to/output.bam

Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

eCLIP v0.1.5 GitHub

$ Bash example

# Clone the eCLIP repository if not already available
# git clone https://github.com/yeolab/eclip.git
# cd eclip/scripts

# Example usage of peaksnormalize.pl
# Assuming initial_peaks.bed is the input peak file (peak clusters),
# ip_read2.bam is the IP read2 BAM file (containing only read2s),
# and input_read2.bam is the INPUT read2 BAM file (containing only read2s).
# The script will output normalized peak files with the specified prefix (e.g., normalized_peaks.bed).
perl peaksnormalize.pl initial_peaks.bed ip_read2.bam input_read2.bam normalized_peaks

View on GitHub

Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

eCLIP v0.1.5 GitHub

$ Bash example

# Install Perl if not available
# sudo apt-get update && sudo apt-get install perl
#
# Clone the eCLIP repository to get the script
# git clone https://github.com/yeolab/eclip.git
# cd eclip
#
# Define input and output files (placeholders)
REPLICATE1_PEAKS="replicate1_normalized_peaks.bed"
REPLICATE2_PEAKS="replicate2_normalized_peaks.bed"
OUTPUT_PREFIX="merged_reproducible_peaks"
MIN_OVERLAP_REPLICATES=2 # Example: require overlap in at least 2 replicates
MIN_LOG2_FOLD_ENRICHMENT=1.0 # Example: require a minimum log2 fold enrichment of 1.0

# Path to the script (assuming eclip repo is cloned and current directory is eclip)
SCRIPT_PATH="./scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl"

# Execute the script to merge overlapping normalized peak regions
perl "${SCRIPT_PATH}" \
    "${REPLICATE1_PEAKS}" \
    "${REPLICATE2_PEAKS}" \
    "${OUTPUT_PREFIX}" \
    "${MIN_OVERLAP_REPLICATES}" \
    "${MIN_LOG2_FOLD_ENRICHMENT}"

View on GitHub

Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

IDR v2.0.2 GitHub

$ Bash example

# Define input files and output prefix
REP1_PEAKS="rep1.normalized.peaks.bed" # Placeholder for normalized peak file from replicate 1
REP2_PEAKS="rep2.normalized.peaks.bed" # Placeholder for normalized peak file from replicate 2
OUTPUT_PREFIX="idr_reproducible_peaks"

# --- Step 1: Rank peak files by entropy score ---
# make_informationcontent_from_peaks.pl is included within the merge_peaks pipeline (https://github.com/yeolab/merge_peaks).
# Ensure the script is in your PATH or provide its full path.
# Example installation (assuming merge_peaks is cloned):
# git clone https://github.com/yeolab/merge_peaks.git
# export PATH=$(pwd)/merge_peaks/bin:$PATH

echo "Ranking replicate 1 peaks..."
make_informationcontent_from_peaks.pl "${REP1_PEAKS}" > "${OUTPUT_PREFIX}.rep1.ranked.bed"

echo "Ranking replicate 2 peaks..."
make_informationcontent_from_peaks.pl "${REP2_PEAKS}" > "${OUTPUT_PREFIX}.rep2.ranked.bed"

# --- Step 2: Run IDR to determine reproducible peaks ---
# IDR (Irreproducible Discovery Rate) version 2.0.2
# Installation:
# conda create -n idr_env idr=2.0.2 -c bioconda
# conda activate idr_env

echo "Running IDR (2.0.2) to find reproducible peaks..."
idr --input-file-type narrowPeak \
    --rank-by signal.value \
    --output-file "${OUTPUT_PREFIX}.idr.peaks.bed" \
    --plot \
    --log-output-file "${OUTPUT_PREFIX}.idr.log" \
    --soft-idr-threshold 0.05 \
    "${OUTPUT_PREFIX}.rep1.ranked.bed" \
    "${OUTPUT_PREFIX}.rep2.ranked.bed"

echo "IDR analysis complete. Reproducible peaks saved to ${OUTPUT_PREFIX}.idr.peaks.bed"

View on GitHub

Reproducible peaks were annotated by overlapping peak regions with Gencode M10 annotations

GENCODE v2.30.0 GitHub

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools

# Download Gencode M10 annotation GTF file
# wget -O gencode.vM10.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/gencode.vM10.annotation.gtf.gz
# gunzip gencode.vM10.annotation.gtf.gz

# Convert Gencode GTF to BED format for bedtools intersection.
# This command extracts gene, transcript, and exon features, converting 1-based start to 0-based.
# It also extracts gene_id, gene_name, and strand to create a 6-column BED file.
awk -F'\t' '($3 == "gene" || $3 == "transcript" || $3 == "exon") {
    OFS="\t";
    split($9, a, "; ");
    gene_id=""; gene_name="";
    for (i=1; i<=length(a); i++) {
        if (a[i] ~ /gene_id/) { gene_id = substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); }
        if (a[i] ~ /gene_name/) { gene_name = substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); }
    }
    print $1, $4-1, $5, gene_id":"gene_name, "0", $7
}' gencode.vM10.annotation.gtf > gencode.vM10.annotation.bed

# Placeholder for input reproducible peak regions file (e.g., from IDR or peak calling)
# reproducible_peaks.bed

# Annotate reproducible peaks by overlapping with Gencode M10 annotations.
# -a: input peak file (e.g., reproducible_peaks.bed)
# -b: Gencode M10 annotation BED file
# -wao: write original entry in A and overlap from B, report 0 for no overlap.
#       This ensures all peaks are reported, with annotation details appended if an overlap exists.
bedtools intersect -a reproducible_peaks.bed -b gencode.vM10.annotation.bed -wao > annotated_reproducible_peaks.bed

View on GitHub

Tools Used

eCLIP STAR

Raw Source Text

Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10
Reads were then trimmed with cutadapt (1.9.1). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir mouse_repbase \  --readFilesIn path/to/read1 path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm10). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 /path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.4.1)
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species mm10 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Reproducible peaks were annotated by overlapping peak regions with Gencode M10 annotations
Genome_build: mm10
Supplementary_files_format_and_content: bigWig files; tab-delimited text file includes -log10pValues and Log2FoldChange values for each IDR peaks called after cutoff of 3,3 in each parameter

← Back to Analysis