GSE77633 Processing Pipeline

RIP-Seq code_examples 7 steps

Publication

Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP).

Nature methods (2016) — PMID 27018577

Dataset

Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites [iCLIP]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Sequencing reads from CLIP-seq libraries were first trimmed of polyA tails, adapters, and low-quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output filenames (placeholders)
# Replace 'input.fastq.gz' with your actual input FASTQ file
# Replace 'output.fastq.gz' with your desired output FASTQ file
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="output.fastq.gz"

cutadapt \
    --match-read-wildcards \
    --times 2 \
    -e 0 \
    -O 5 \
    --quality-cutoff 6 \
    -m 18 \
    -b TCGTATGCCGTCTTCTGCTTG \
    -b ATCTCGTATGCCGTCTTCTGCTTG \
    -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
    -b TGGAATTCTCGGGTGCCAAGG \
    -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
    -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
    -o "${OUTPUT_FASTQ}" \
    "${INPUT_FASTQ}"

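As a quick sanity check on the -m 18 setting, the sketch below counts how many records in a FASTQ stream are shorter than a given minimum length. The function name count_short_reads is a hypothetical helper, not part of the original pipeline, and the sketch assumes standard four-line FASTQ records.

```shell
# Count FASTQ records whose sequence is shorter than a minimum length.
# Hypothetical helper (not from the source); assumes four-line FASTQ records on stdin.
count_short_reads() {
    min_len="$1"
    awk -v min="${min_len}" '
        NR % 4 == 2 && length($0) < min { n++ }   # line 2 of each record is the sequence
        END { print n + 0 }                        # print 0 when nothing matched
    '
}

# Usage with a gzipped file (placeholder filename):
# gzip -dc output.fastq.gz | count_short_reads 18
```

After trimming with -m 18, the count reported for the trimmed file should be zero; a non-zero count would suggest the length filter was not applied.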

Reads were then mapped against a database of repetitive elements derived from RepBase18.05.

bowtie2 (Inferred with models/gemini-2.5-flash) v2.4.5 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# Define input reads (adjust for single-end or paired-end as needed)
# For paired-end reads:
READS_R1="reads_R1.fastq.gz"
READS_R2="reads_R2.fastq.gz"

# For single-end reads (uncomment and adjust if applicable):
# READS_SINGLE="reads.fastq.gz"

OUTPUT_SAM="mapped_to_repeats.sam"
UNMAPPED_PREFIX="unmapped_to_repeats" # Prefix for unmapped reads files (e.g., unmapped_to_repeats_1.fastq.gz, unmapped_to_repeats_2.fastq.gz)

# Define repetitive elements database (RepBase18.05)
# RepBase is a commercial database. Users typically obtain a license or use derived public datasets.
# For demonstration, assume 'RepBase18.05.fasta' is available in the working directory.
REPEATS_FASTA="RepBase18.05.fasta"
REPEATS_INDEX_PREFIX="RepBase18.05_index"

# Build Bowtie2 index for repetitive elements
# This step only needs to be run once per reference database
# bowtie2-build "${REPEATS_FASTA}" "${REPEATS_INDEX_PREFIX}"

# Map reads against the repetitive elements database (Paired-end example)
# Using --very-sensitive-local for robust mapping to potentially fragmented repeats
# --un-conc-gz to output unmapped reads (paired-end) for subsequent mapping to the main genome
bowtie2 \
  --very-sensitive-local \
  -x "${REPEATS_INDEX_PREFIX}" \
  -1 "${READS_R1}" \
  -2 "${READS_R2}" \
  --un-conc-gz "${UNMAPPED_PREFIX}" \
  -S "${OUTPUT_SAM}"

# If using single-end reads, the command would be:
# bowtie2 \
#   --very-sensitive-local \
#   -x "${REPEATS_INDEX_PREFIX}" \
#   -U "${READS_SINGLE}" \
#   --un-gz "${UNMAPPED_PREFIX}.fastq.gz" \
#   -S "${OUTPUT_SAM}"

# Optional: Convert SAM to BAM and sort
# samtools view -bS "${OUTPUT_SAM}" | samtools sort -o "${OUTPUT_SAM%.sam}.bam"
# samtools index "${OUTPUT_SAM%.sam}.bam"

Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).

Bowtie v1.0.0 GitHub

$ Bash example

# Install Bowtie (version 1.0.0 might require specific steps or older environments)
# For example, if using conda:
# conda create -n bowtie1_env bowtie=1.0.0
# conda activate bowtie1_env

# Align reads using Bowtie 1.0.0
# Assuming 'repbase_index' is the prefix for the Bowtie index files (e.g., repbase_index.1.ebwt, repbase_index.2.ebwt, etc.)
# and 'reads.fastq' contains the input reads.
# The output will be in SAM format due to the -S flag.
bowtie -S -q -p 16 -e 100 -l 20 repbase_index reads.fastq > aligned_reads.sam
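
The next step aligns only the reads that did not map to Repbase, so those reads have to be recovered from the alignment output. The sketch below is one possible way to do that: it selects SAM records with FLAG bit 0x4 (unmapped) set and writes them back out as FASTQ. The function name sam_unmapped_to_fastq and the filenames are hypothetical; Bowtie's own --un option, which writes unaligned reads to a file during alignment, is an alternative.

```shell
# Convert unmapped SAM records (FLAG bit 0x4 set) back to FASTQ.
# Hypothetical helper (not from the source); reads SAM on stdin, writes FASTQ on stdout.
sam_unmapped_to_fastq() {
    awk -F '\t' '
        /^@/ { next }                  # skip SAM header lines
        int($2 / 4) % 2 == 1 {         # bit 0x4 set: read is unmapped
            printf "@%s\n%s\n+\n%s\n", $1, $10, $11
        }
    '
}

# Usage (placeholder filenames):
# sam_unmapped_to_fastq < aligned_reads.sam > unmapped_reads.fastq
```

The bit test uses integer arithmetic rather than a bitwise function so that it works in any POSIX awk.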

Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.

STAR v2.3.0e

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star=2.3.0e

# Define variables (replace with actual paths)
GENOME_DIR="/path/to/hg19_STAR_index"
INPUT_FASTQ="unmapped_reads.fastq"
OUTPUT_PREFIX="aligned_to_hg19_"

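# Run STAR alignment
# Note: --outSAMtype BAM SortedByCoordinate is likely unavailable in STAR 2.3.0e
# (it appears to have been introduced in the 2.4 series). With an older version,
# drop that line and sort the SAM output with samtools instead.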
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${INPUT_FASTQ}" \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1

Reads that were PCR replicates were removed from each CLIP-seq library using a custom script.

CLIP-seq v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input and output file names (placeholders)
# Note: the source describes a custom script. samtools markdup is a generic stand-in
# and, as invoked here, identifies duplicates by position only, not by barcode.
INPUT_BAM="aligned_reads.bam"
TEMP_NAMESORT_BAM="temp_namesort.bam"
TEMP_FIXMATE_BAM="temp_fixmate.bam"
TEMP_COORDSORT_BAM="temp_coordsort.bam"
FINAL_COORD_SORTED_BAM="deduplicated_reads_coord_sorted.bam"

# 1. Sort by read name
# samtools fixmate requires name-sorted (or name-collated) input.
samtools sort -n -o "${TEMP_NAMESORT_BAM}" "${INPUT_BAM}"

# 2. Add mate score tags and fix mate information
# The -m flag adds the tags that samtools markdup uses to choose which read to keep.
samtools fixmate -m "${TEMP_NAMESORT_BAM}" "${TEMP_FIXMATE_BAM}"

# 3. Sort by coordinate
# samtools markdup requires coordinate-sorted input.
samtools sort -o "${TEMP_COORDSORT_BAM}" "${TEMP_FIXMATE_BAM}"

# 4. Remove PCR duplicates
# -r: Remove duplicate reads (instead of just marking them).
# -s: Print basic statistics to stderr (captured in the .stats file below).
samtools markdup -r -s "${TEMP_COORDSORT_BAM}" "${FINAL_COORD_SORTED_BAM}" 2> "${FINAL_COORD_SORTED_BAM%.bam}.stats"

# 5. Index the final BAM file
# Indexing allows for fast retrieval of reads by genomic location.
samtools index "${FINAL_COORD_SORTED_BAM}"

# Clean up temporary files
rm "${TEMP_NAMESORT_BAM}" "${TEMP_FIXMATE_BAM}" "${TEMP_COORDSORT_BAM}"

Briefly, one read with a unique barcode was kept at each nucleotide position when more than one read with the same barcode mapped to the same location.

umi_tools dedup (Inferred with models/gemini-2.5-flash) v1.1.2 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Placeholder for input and output files
INPUT_BAM="aligned_reads_with_umis.bam"
OUTPUT_BAM="deduplicated_reads.bam"

# Deduplicate reads based on Unique Molecular Identifier (UMI) and mapping position.
# The input BAM must be coordinate-sorted and indexed.
# This command assumes that the UMI is stored in a BAM tag named 'UB' (an assumption;
# adjust --umi-tag to match your data), which requires --extract-umi-method=tag.
# If the UMI is instead appended to the read name (the umi_tools default, as written
# by 'umi_tools extract'), drop both tag options and set the separator if needed:
# umi_tools dedup -I "${INPUT_BAM}" -S "${OUTPUT_BAM}" --umi-separator='_'
# umi_tools dedup groups reads by mapping position and UMI, then keeps one
# representative read per group.
umi_tools dedup \
    -I "${INPUT_BAM}" \
    -S "${OUTPUT_BAM}" \
    --extract-umi-method=tag \
    --umi-tag=UB \
    --log "${OUTPUT_BAM%.bam}.log"
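
The collapsing rule described above (keep one read per barcode at each position) can be sketched without any dedicated tool. The example assumes a tab-separated table with the columns chromosome, strand, position, barcode, and read name; that layout and the function name collapse_by_barcode are illustrative assumptions, not the custom script used by the authors.

```shell
# Keep the first read seen for each (chrom, strand, position, barcode) key.
# Hypothetical helper; input is tab-separated: chrom, strand, position, barcode, read name.
collapse_by_barcode() {
    # seen[key]++ is 0 (false) the first time a key appears, so only first hits print
    awk -F '\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

# Usage (placeholder filenames):
# collapse_by_barcode < reads_with_barcodes.tsv > collapsed_reads.tsv
```

Reads at the same position with different barcodes are all retained, since they are treated as distinct molecules rather than PCR replicates.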

Clusters were then assigned using the CLIPper software with parameters --bonferroni --superlocal --threshold- software (Lovci et al., 2013).

CLIPper v1.0 (Inferred from Lovci et al., 2013) GitHub

$ Bash example

# Install CLIPper (assuming Python environment)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install # Or ensure clipper.py is in your PATH or run directly

# Example usage of CLIPper for cluster calling
# Replace 'treatment.bam' with your deduplicated, coordinate-sorted, indexed BAM file.
# The invocation below follows the usage shown in the CLIPper README
# (clipper -b <bam> -s <species> -o <output>); verify it against 'clipper --help'
# for your installed version.
# The '--threshold-' parameter in the description is truncated, so it is omitted here
# rather than guessed.
clipper \
    -b treatment.bam \
    -s hg19 \
    -o clipper_clusters.bed \
    --bonferroni \
    --superlocal
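
For readers unfamiliar with the correction that the --bonferroni flag is named after, the sketch below shows the textbook Bonferroni adjustment: each p-value is multiplied by the number of tests and capped at 1. This is a generic illustration, not CLIPper's internal implementation, and the function name bonferroni_adjust and the one-p-value-per-line input format are assumptions.

```shell
# Textbook Bonferroni adjustment: p_adj = min(1, p * number_of_tests).
# Hypothetical helper; reads one p-value per line on stdin.
bonferroni_adjust() {
    awk '
        { p[NR] = $1 }                     # store every p-value; NR ends as the test count
        END {
            for (i = 1; i <= NR; i++) {
                adj = p[i] * NR
                if (adj > 1) adj = 1       # adjusted p-values cannot exceed 1
                printf "%.4g\n", adj
            }
        }
    '
}

# Usage (placeholder filename):
# bonferroni_adjust < cluster_pvalues.txt
```

With many candidate clusters tested genome-wide, this adjustment is conservative, which trades sensitivity for fewer false-positive clusters.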

Tools Used

STAR CLIP-seq

Raw Source Text

Sequencing reads from CLIP-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.
Reads that were PCR replicates were removed from each CLIP-seq library using a custom script. Briefly one read with a unique barcode was kept at each nucleotide position when more than one with the same barcode was mapped to the same location
Clusters were then assigned using the CLIPper software with parameters --bonferroni --superlocal --threshold- software (Lovci et al., 2013).
Genome_build: hg19
Supplementary_files_format_and_content: bigWig, bigBed format, contains clusters of predicted RBFOX2 binding
