GSE104949 Processing Pipeline

RIP-Seq, 9 steps

Publication

Transcriptome regulation by PARP13 in basal and antiviral states in human cells.

iScience (2024) — PMID 38495826

Dataset

RNA-binding activity of TRIM25 is mediated by its PRY/SPRY domain and is required for ubiquitination

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

library strategy: CLIP-seq

CLIP-seq v1.0.2

$ Bash example

# This script outlines key steps for CLIP-seq data processing,
# drawing from common practices and tools mentioned in eCLIP guidelines.
# For a complete eCLIP pipeline, refer to the yeolab/eclip (CWL) or yeolab/skipper (Snakemake) workflows.

# --- Configuration ---
# Placeholder for input FASTQ files (assuming single-end for simplicity)
INPUT_FASTQ="sample_R1.fastq.gz"
OUTPUT_PREFIX="clip_seq_processed"

# Reference genome (using hg38 as a placeholder for human)
# Ensure STAR genome index is built for hg38.
# Example: STAR --runMode genomeGenerate --genomeDir /path/to/STAR_genome_index/hg38 --genomeFastaFiles /path/to/genome/hg38.fa --sjdbGTFfile /path/to/annotations/gencode.v38.annotation.gtf --runThreadN 8
STAR_GENOME_DIR="/path/to/STAR_genome_index/hg38"
GENOME_FASTA="/path/to/genome/hg38.fa" # Required for CLIPper

# --- 1. Alignment with STAR (splice-aware aligner) ---
# conda install -c bioconda star
STAR --runThreadN 8 \
     --genomeDir "${STAR_GENOME_DIR}" \
     --readFilesIn "${INPUT_FASTQ}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_STAR_" \
     --outSAMtype BAM SortedByCoordinate \
     --outReadsUnmapped Fastx \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # Example: 30GB RAM for sorting

ALIGNED_BAM="${OUTPUT_PREFIX}_STAR_Aligned.sortedByCoord.out.bam"

# --- 2. Deduplication (using samtools) ---
# conda install -c bioconda samtools
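# fixmate expects name-grouped input, so name-sort the coordinate-sorted STAR output first.
samtools sort -n -o "${OUTPUT_PREFIX}_namesorted.bam" "${ALIGNED_BAM}"
samtools fixmate -m "${OUTPUT_PREFIX}_namesorted.bam" "${OUTPUT_PREFIX}_fixmate.bam"
# markdup needs coordinate-sorted input, so sort again before removing duplicates.
samtools sort -o "${OUTPUT_PREFIX}_sorted_fixmate.bam" "${OUTPUT_PREFIX}_fixmate.bam"
samtools markdup -r "${OUTPUT_PREFIX}_sorted_fixmate.bam" "${OUTPUT_PREFIX}_dedup.bam"
samtools index "${OUTPUT_PREFIX}_dedup.bam"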

DEDUP_BAM="${OUTPUT_PREFIX}_dedup.bam"

# --- 3. Peak Calling with CLIPper ---
# conda install -c bioconda clipper
# CLIPper takes the deduplicated BAM (-b), a species/assembly name (-s), and an output BED path (-o).
# The -f and -t options shown here are unverified; confirm them against `clipper --help`
# for your installed version before running.
clipper -b "${DEDUP_BAM}" -s hg38 -o "${OUTPUT_PREFIX}_peaks.bed" \
        -f "${GENOME_FASTA}" -t 5

# --- 4. Reproducible Peak Identification (IDR) ---
# For eCLIP, IDR typically uses the yeolab/merge_peaks pipeline.
# This step requires multiple replicates and often a control sample.
# Example (assuming two replicate peak files from CLIPper):
# REPLICATE1_PEAKS="replicate1_peaks.bed"
# REPLICATE2_PEAKS="replicate2_peaks.bed"
# git clone https://github.com/yeolab/merge_peaks.git
# python merge_peaks/merge_peaks.py \
#     --peak_files "${REPLICATE1_PEAKS}" "${REPLICATE2_PEAKS}" \
#     --output_prefix "${OUTPUT_PREFIX}_idr_reproducible" \
#     --idr_threshold 0.05 \
#     --genome_fasta "${GENOME_FASTA}"

novoindex

novoindex v4.04.00 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Installation (novoalign is commercial software, typically downloaded from Novocraft or installed via specific channels if licensed)
# Example via Bioconda (requires a license for full functionality):
# conda install -c bioconda novoalign

# Placeholder for the latest human reference genome (hg38)
# You would typically download this from a source like UCSC, NCBI, or Ensembl.
# For example:
# wget -O hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip hg38.fa.gz
REF_FASTA="hg38.fa"
OUTPUT_INDEX="hg38.idx"

# Build the novoalign index
# -k: k-mer size (e.g., 14 for human genome, common values 12-14)
# -s: step size (e.g., 1, common values 1-2)
novoindex -k 14 -s 1 "${OUTPUT_INDEX}" "${REF_FASTA}"

Remove 3' adapter using flexbar

flexbar v3.0.3 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install flexbar (example using conda)
# conda install -c bioconda flexbar

# Define variables
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_PREFIX="output_reads_trimmed"
# Placeholder for a common Illumina 3' adapter sequence. 
# Replace with the actual adapter used in your experiment.
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

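# Execute flexbar for 3' adapter removal and quality trimming
# Parameter values are illustrative; tune them for your library.
# -as: adapter given as a sequence string (-a expects a FASTA file of adapters instead)
# -ao: minimum adapter overlap; -u: maximum uncalled bases (N) allowed per read
# -q / -qt: quality trimming mode and threshold (flexbar 3 syntax); -qf: quality format
# -m: minimum read length kept; -n: threads; -z: compress output
flexbar \
    -r "${INPUT_FASTQ}" \
    -t "${OUTPUT_PREFIX}" \
    -as "${ADAPTER_SEQUENCE}" \
    -ao 3 \
    -u 1 \
    -q TAIL \
    -qt 20 \
    -qf i1.8 \
    -m 18 \
    -n 4 \
    -z GZ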
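
As a quick check after trimming, the length distribution of the surviving reads can be tabulated; with a minimum length of 18 set by `-m`, no shorter reads should remain. The sketch below is a hedged illustration on a hypothetical toy FASTQ (`toy_trimmed.fastq`). The helper name `read_length_histogram` is made up for this example and is not part of flexbar.

```shell
# Toy trimmed FASTQ (hypothetical data) standing in for flexbar output.
cat > toy_trimmed.fastq <<'EOF'
@r1
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
@r2
ACGTACGTACGTACGTACGTAC
+
IIIIIIIIIIIIIIIIIIIIII
@r3
TTTTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
EOF

# Print "<read length><TAB><number of reads>", sorted by length.
# Sequence lines are every 4th line starting at line 2 of a FASTQ record.
read_length_histogram() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len "\t" n[len] }' "$1" | sort -k1,1n
}

read_length_histogram toy_trimmed.fastq
```

For gzipped flexbar output, decompress to a regular file first (for example with `gzip -dc`) and pass that file to the function.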

collapse data and remove PCR duplicates using pyCRAC

pyCRAC v1.2.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install pyCRAC (if not already installed)
# pip install pyCRAC
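# Or using conda:
# conda install -c bioconda pycrac

# Collapse identical reads to remove PCR duplicates.
# This step precedes alignment in the source pipeline, so the input is the trimmed FASTQ
# (placeholder name below), not a BAM file.
# NOTE: pyCRAC is distributed as separate scripts rather than a `pyCRAC collapse` command;
# duplicate collapsing is provided by pyFastqDuplicateRemover.py. The -f/-o options are
# assumed from the pyCRAC documentation; confirm with `pyFastqDuplicateRemover.py --help`.
pyFastqDuplicateRemover.py -f trimmed_reads.fastq -o collapsed_reads.fasta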
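
To make the collapsing step concrete, here is a minimal sketch of sequence-based duplicate collapsing on a hypothetical toy FASTQ. It is not pyCRAC's implementation: it keeps the first read for each distinct sequence and appends the copy number to the header. The file names and the function name `collapse_fastq` are assumptions for illustration only.

```shell
# Toy FASTQ with three reads; read1 and read3 share the same sequence (hypothetical data).
cat > toy_reads.fastq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGACCATGA
+
IIIIIIIIII
@read3
ACGTACGTAC
+
IIIIIIIIII
EOF

# Emit one FASTA record per distinct sequence, in order of first appearance,
# with the number of identical copies appended to the read name.
collapse_fastq() {
    awk '
        NR % 4 == 1 { name = substr($0, 2) }          # header line without the leading "@"
        NR % 4 == 2 {
            if (!($0 in count)) { order[++n] = $0; first[$0] = name }
            count[$0]++
        }
        END {
            for (i = 1; i <= n; i++) {
                s = order[i]
                printf ">%s_x%d\n%s\n", first[s], count[s], s
            }
        }
    ' "$1"
}

collapse_fastq toy_reads.fastq > toy_collapsed.fasta
cat toy_collapsed.fasta
```

Collapsing by exact sequence only removes true PCR duplicates reliably when reads carry random barcodes or are otherwise unlikely to be identical by chance; whether that holds for this dataset is not stated in the source.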

align using novoalign

novoalign (version not specified)

$ Bash example

# Installation instructions (novoalign is a commercial software, typically downloaded and installed manually or via a site license)
# You would typically download the binary from Novocraft's website and ensure it's in your PATH.
# Example (illustrative, not a package manager command):
# wget http://www.novocraft.com/downloads/novoalign_v4.04.00.tar.gz
# tar -xzf novoalign_v4.04.00.tar.gz
# export PATH=$PATH:/path/to/novoalign_binary

# Placeholder for reference genome index (e.g., human hg38)
# Replace 'path/to/hg38_index' with the actual path to your novoalign index files.
# The index needs to be built using 'novoindex' prior to alignment.
REFERENCE_INDEX="path/to/hg38_index"

# Placeholder for input FASTQ files (paired-end example)
# Replace with your actual input read files.
INPUT_READS_R1="input_reads_R1.fastq.gz"
INPUT_READS_R2="input_reads_R2.fastq.gz"

# Placeholder for output SAM file
OUTPUT_SAM="aligned_reads.sam"

# Execute novoalign for paired-end reads
# -d: the reference index file built with novoindex
# -f: input FASTQ files (space-separated for paired-end)
# -o SAM: write output in SAM format
# -r All: report all alignment locations for multi-mapping reads
#         (other strategies include None and Random; see the novoalign manual)
# >: redirect standard output to the specified SAM file
novoalign -d "${REFERENCE_INDEX}" -f "${INPUT_READS_R1}" "${INPUT_READS_R2}" -o SAM -r All > "${OUTPUT_SAM}"

# If single-end reads:
# INPUT_READS="input_reads.fastq.gz"
# novoalign -d "${REFERENCE_INDEX}" -f "${INPUT_READS}" -o SAM -r All > "${OUTPUT_SAM}"

Readcounter using pyCRAC

pyCRAC v1.0.0

$ Bash example

# Install pyCRAC (e.g., via pip)
# pip install pyCRAC

# Define input and output files
INPUT_BAM="input.bam" # Path to the input BAM file (e.g., alignment output)
OUTPUT_FILE="output_counts.tsv" # Path for the output read count file
LOG_FILE="pycrac_count.log" # Path for the log file

# Define reference datasets (using human hg38 as a placeholder)
# GTF file: Gene annotation in GTF format (e.g., from GENCODE)
# Example download: https://www.gencodegenes.org/human/ (e.g., gencode.v38.annotation.gtf.gz)
GTF_FILE="gencode.v38.annotation.gtf"

# Genome FASTA file: Reference genome in FASTA format (e.g., from UCSC or Ensembl)
# Example download: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
GENOME_FASTA="GRCh38.primary_assembly.genome.fa"

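# Count reads over genomic features.
# NOTE: `pycrac_count.py` and the long options below are illustrative placeholders, not a
# script shipped with pyCRAC. The read-counting script in the pyCRAC distribution is
# pyReadCounters.py; check `pyReadCounters.py --help` for its actual options before running.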
pycrac_count.py \
    --input_bam "${INPUT_BAM}" \
    --output_file "${OUTPUT_FILE}" \
    --gtf "${GTF_FILE}" \
    --genome_fasta "${GENOME_FASTA}" \
    --min_read_length 15 \
    --max_read_length 100 \
    --min_mapq 20 \
    --min_score 0 \
    --min_overlap 1 \
    --strand "reverse" \
    --feature_type "exon" \
    --id_attribute "gene_id" \
    --count_mode "unique" \
    --log_file "${LOG_FILE}"

calculate FDR using Mock sample using pyCRAC

pyCRAC v1.2.0

$ Bash example

# Install pyCRAC (example using pip; adjust for your environment)
# pip install pyCRAC

# Define input and output files
IP_BAM="ip_sample.bam"
INPUT_BAM="input_sample.bam"
MOCK_BAM="mock_sample.bam" # The source description explicitly mentions a mock sample
OUTPUT_PREFIX="pycrac_fdr_peaks"
FDR_THRESHOLD="0.05" # Common FDR threshold

# Calculate FDRs against the mock sample.
# NOTE: `pyCRAC_peak_caller.py` and its options are illustrative placeholders, not a script
# shipped with pyCRAC. The FDR script in the pyCRAC distribution is pyCalculateFDRs.py;
# check `pyCalculateFDRs.py --help` for its actual inputs and options before running.
pyCRAC_peak_caller.py \
    -i "${IP_BAM}" \
    -c "${INPUT_BAM}" \
    -m "${MOCK_BAM}" \
    -o "${OUTPUT_PREFIX}" \
    --fdr_threshold "${FDR_THRESHOLD}"

intersect fdr with reads using bedtools intersect

bedtools v2.30.0

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools

# Placeholder for input files. Replace with actual file paths.
# fdr_file.bed: BED file representing FDR regions (e.g., called peaks with FDR)
# reads_file.bed: BED file representing read alignments or regions derived from reads

# Perform the intersection of FDR regions with read regions.
# -a: The first input file (FDR regions)
# -b: The second input file (read regions)
# -wao: Write the original entry in A, the original entry in B, and the number of overlapping bases.
#       This is a common output format for detailed intersection results.
bedtools intersect -a fdr_file.bed -b reads_file.bed -wao > fdr_reads_intersect.bed
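
The `-wao` output can be condensed into one row per FDR region. The sketch below assumes both inputs were plain 3-column BED files, so each output row has 7 columns (A interval, B interval, overlap in bases), and that A features without any overlap appear once with `. -1 -1` and an overlap of 0. The toy rows and the function name `summarize_wao` are hypothetical.

```shell
# Hypothetical `bedtools intersect -wao` output for two BED3 inputs (tab-separated).
printf 'chr1\t100\t200\tchr1\t150\t180\t30\n'  > toy_wao.bed
printf 'chr1\t100\t200\tchr1\t190\t230\t10\n' >> toy_wao.bed
printf 'chr1\t500\t600\t.\t-1\t-1\t0\n'       >> toy_wao.bed
printf 'chr2\t50\t90\tchr2\t40\t95\t40\n'     >> toy_wao.bed

# For each A interval print: chrom, start, end, number of overlapping B features,
# and the sum of overlapping bases across those features.
summarize_wao() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        key = $1 OFS $2 OFS $3
        if (!(key in bases)) { order[++n] = key; bases[key] = 0; hits[key] = 0 }
        if ($7 > 0) { hits[key]++; bases[key] += $7 }
    }
    END { for (i = 1; i <= n; i++) print order[i], hits[order[i]], bases[order[i]] }' "$1"
}

summarize_wao toy_wao.bed
```

If the input files carry more columns (for example BED6), the overlap count is still the last column, so `$7` would need to become `$NF` and the key columns adjusted accordingly.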

Cluster reads using pyCRAC

pyCRAC v0.1.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

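# Older pyCRAC releases target Python 2; check which Python version your installed release requires.
# Installation (if not already installed):
# pip install pyCRAC

# Cluster overlapping reads.
# NOTE: `pyCRAC_cluster.py`, its options, and the input file name are illustrative placeholders,
# not a script shipped with pyCRAC. The clustering script in the pyCRAC distribution is
# pyClusterReads.py; check `pyClusterReads.py --help` for its actual inputs and options.
# Input file: aligned_reads.txt (placeholder)
# Output file: clustered_reads.txt

pyCRAC_cluster.py \
  aligned_reads.txt \
  -o clustered_reads.txt \
  -c 10 \
  -s 1 \
  -m 1 \
  -t 4 \
  --bed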
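
The supplementary files for this dataset are described as GTF representing clusters of hits. To use such clusters with BED-based tools like bedtools, the coordinates have to be shifted from GTF's 1-based inclusive start to BED's 0-based start. The following is a minimal sketch on a hypothetical two-line GTF; the attribute content, the generated `cluster_N` names, and the function name `gtf_to_bed6` are assumptions, and the real cluster files may carry different fields.

```shell
# Hypothetical cluster GTF (tab-separated, 9 columns).
printf 'chr1\tcluster\texon\t101\t150\t12\t+\t.\tgene_id "GENE_A";\n'  > toy_clusters.gtf
printf 'chr2\tcluster\texon\t2001\t2040\t.\t-\t.\tgene_id "GENE_B";\n' >> toy_clusters.gtf

# Convert GTF rows to BED6: chrom, start-1, end, name, score, strand.
# A missing GTF score (".") is written as 0 because BED expects a number.
gtf_to_bed6() {
    awk 'BEGIN { FS = OFS = "\t" }
    !/^#/ {
        n++
        score = ($6 == ".") ? 0 : $6
        print $1, $4 - 1, $5, "cluster_" n, score, $7
    }' "$1"
}

gtf_to_bed6 toy_clusters.gtf
```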

Tools Used

CLIP-seq

Raw Source Text

library strategy: CLIP-seq
novoindex
Remove 3' adapter using flexbar
collapse data and remove PCR duplicates using pyCRAC
align using novalign
Readcounter using pyCRAC
calculate FDR using Mock sample using pyCRAC
intersect fdr with reads using bedtools intersect
Cluster reads using pyCRAC
Genome_build: hg38; Homo_sapiens-ensembl-release_83 genome annotation
Supplementary_files_format_and_content: GTF representing clusters of hits
