GSE104949 Processing Pipeline
RIP-Seq
9 steps
Publication
Transcriptome regulation by PARP13 in basal and antiviral states in human cells. iScience (2024), PMID 38495826
Dataset
GSE104949: RNA-binding activity of TRIM25 is mediated by its PRY/SPRY domain and is required for ubiquitination
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
library strategy: CLIP-seq
$ Bash example
# This script outlines key steps for CLIP-seq data processing,
# drawing from common practices and tools mentioned in eCLIP guidelines.
# For a complete eCLIP pipeline, refer to the yeolab/eclip (CWL) or
# yeolab/skipper (Snakemake) workflows.

# --- Configuration ---
# Placeholder for input FASTQ files (assuming single-end for simplicity)
INPUT_FASTQ="sample_R1.fastq.gz"
OUTPUT_PREFIX="clip_seq_processed"

# Reference genome (using hg38 as a placeholder for human)
# Ensure a STAR genome index is built for hg38. Example:
# STAR --runMode genomeGenerate \
#   --genomeDir /path/to/STAR_genome_index/hg38 \
#   --genomeFastaFiles /path/to/genome/hg38.fa \
#   --sjdbGTFfile /path/to/annotations/gencode.v38.annotation.gtf \
#   --runThreadN 8
STAR_GENOME_DIR="/path/to/STAR_genome_index/hg38"
GENOME_FASTA="/path/to/genome/hg38.fa"  # Required for CLIPper

# --- 1. Alignment with STAR (splice-aware aligner) ---
# conda install -c bioconda star
# --readFilesCommand zcat is needed because the input FASTQ is gzipped.
STAR --runThreadN 8 \
  --genomeDir "${STAR_GENOME_DIR}" \
  --readFilesIn "${INPUT_FASTQ}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}_STAR_" \
  --outSAMtype BAM SortedByCoordinate \
  --outReadsUnmapped Fastx \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 3 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM 30000000000  # Example: 30GB RAM for sorting
ALIGNED_BAM="${OUTPUT_PREFIX}_STAR_Aligned.sortedByCoord.out.bam"

# --- 2. Deduplication (using samtools) ---
# conda install -c bioconda samtools
# samtools fixmate expects name-sorted input, so sort by read name first,
# then sort by coordinate again before marking duplicates.
samtools sort -n -o "${OUTPUT_PREFIX}_namesorted.bam" "${ALIGNED_BAM}"
samtools fixmate -m "${OUTPUT_PREFIX}_namesorted.bam" "${OUTPUT_PREFIX}_fixmate.bam"
samtools sort -o "${OUTPUT_PREFIX}_sorted_fixmate.bam" "${OUTPUT_PREFIX}_fixmate.bam"
samtools markdup -r "${OUTPUT_PREFIX}_sorted_fixmate.bam" "${OUTPUT_PREFIX}_dedup.bam"
samtools index "${OUTPUT_PREFIX}_dedup.bam"
DEDUP_BAM="${OUTPUT_PREFIX}_dedup.bam"

# --- 3. Peak Calling with CLIPper ---
# conda install -c bioconda clipper
# CLIPper takes the deduplicated BAM as input.
# The '-s' parameter specifies the genome assembly (e.g., hg38, mm10).
# The '-f' and '-t' options are kept as inferred in the original snippet
# (genome FASTA and a peak-calling threshold, e.g., 5 for 5-fold enrichment);
# confirm them with `clipper --help` before use.
clipper -b "${DEDUP_BAM}" -s hg38 -o "${OUTPUT_PREFIX}_peaks.bed" \
  -f "${GENOME_FASTA}" -t 5

# --- 4. Reproducible Peak Identification (IDR) ---
# For eCLIP, IDR typically uses the yeolab/merge_peaks pipeline.
# This step requires multiple replicates and often a control sample.
# Example (assuming two replicate peak files from CLIPper; option names are
# illustrative and should be checked against the merge_peaks documentation):
# REPLICATE1_PEAKS="replicate1_peaks.bed"
# REPLICATE2_PEAKS="replicate2_peaks.bed"
# git clone https://github.com/yeolab/merge_peaks.git
# python merge_peaks/merge_peaks.py \
#   --peak_files "${REPLICATE1_PEAKS}" "${REPLICATE2_PEAKS}" \
#   --output_prefix "${OUTPUT_PREFIX}_idr_reproducible" \
#   --idr_threshold 0.05 \
#   --genome_fasta "${GENOME_FASTA}"
2
novoindex
$ Bash example
# Installation (novoalign is commercial software, typically downloaded from
# Novocraft or installed via specific channels if licensed)
# Example via Bioconda (requires a license for full functionality):
# conda install -c bioconda novoalign

# Placeholder for the latest human reference genome (hg38)
# You would typically download this from a source like UCSC, NCBI, or Ensembl.
# For example:
# wget -O hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip hg38.fa.gz
REF_FASTA="hg38.fa"
OUTPUT_INDEX="hg38.idx"

# Build the novoalign index
# -k: k-mer size (e.g., 14 for human genome, common values 12-14)
# -s: step size (e.g., 1, common values 1-2)
novoindex -k 14 -s 1 "${OUTPUT_INDEX}" "${REF_FASTA}"
3
Remove 3' adapter using flexbar
$ Bash example
# Install flexbar (example using conda)
# conda install -c bioconda flexbar

# Define variables
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_PREFIX="output_reads_trimmed"

# Placeholder for a common Illumina 3' adapter sequence.
# Replace with the actual adapter used in your experiment.
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Execute flexbar for 3' adapter removal and quality trimming.
# Parameter values are illustrative; tune them for your library.
# -as passes the adapter as a sequence (-a expects a FASTA file of adapters).
# -q/-qt use flexbar 3.x syntax (trimming mode plus Phred threshold);
# check `flexbar --help` if you run an older release.
flexbar \
  -r "${INPUT_FASTQ}" \
  -t "${OUTPUT_PREFIX}" \
  -as "${ADAPTER_SEQUENCE}" \
  -ao 3 \
  -u 1 \
  -q TAIL \
  -qt 20 \
  -qf i1.8 \
  -m 18 \
  -n 4 \
  -z GZ
4
collapse data and remove PCR duplicates using pyCRAC
pyCRAC v1.2.0 (Inferred with models/gemini-2.5-flash)
$ Bash example
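# Install pyCRAC (if not already installed)
# pip install pyCRAC
# Or using conda:
# conda install -c bioconda pycrac

# Collapse identical reads to remove PCR duplicates.
# In this pipeline the step runs before alignment, so the input is the
# adapter-trimmed FASTQ rather than a BAM file.
# pyCRAC ships pyFastqDuplicateRemover.py for this purpose; confirm the
# options below with `pyFastqDuplicateRemover.py --help` for your version.
# Input:  trimmed_reads.fastq (placeholder name)
# Output: collapsed_reads.fasta (placeholder name)
pyFastqDuplicateRemover.py -f trimmed_reads.fastq -o collapsed_reads.fasta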
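As a tool-independent illustration of what collapsing means in this step, the minimal sketch below keeps one record per distinct read sequence in an uncompressed 4-line FASTQ file. It is not pyCRAC's implementation: the function name `collapse_fastq` and the demo file are hypothetical, and the sketch ignores random barcodes, which dedicated tools may take into account.

```shell
set -eu

# Keep the first FASTQ record seen for each distinct read sequence.
# Assumes strict 4-line records (header, sequence, '+', qualities).
collapse_fastq() {
  awk '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 {
      if (!(seq in seen)) {
        seen[seq] = 1
        print header; print seq; print plus; print $0
      }
    }
  ' "$1"
}

# Tiny demo input: reads r1 and r3 share the same sequence.
demo_dir="$(mktemp -d)"
printf '%s\n' \
  '@r1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@r2' 'TTTTCCCC' '+' 'IIIIIIII' \
  '@r3' 'ACGTACGT' '+' 'IIIIIIII' > "${demo_dir}/demo.fastq"

collapse_fastq "${demo_dir}/demo.fastq" > "${demo_dir}/collapsed.fastq"
collapsed_reads=$(( $(wc -l < "${demo_dir}/collapsed.fastq") / 4 ))
echo "reads after collapsing: ${collapsed_reads}"
```

Collapsing before alignment, as the step order of this pipeline suggests, means each distinct sequence is aligned only once.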
5
align using novoalign
$ Bash example
# Installation: novoalign is commercial software, typically downloaded and
# installed manually or via a site license. Download the binary from
# Novocraft's website and make sure it is in your PATH.
# Example (illustrative, not a package manager command):
# wget http://www.novocraft.com/downloads/novoalign_v4.04.00.tar.gz
# tar -xzf novoalign_v4.04.00.tar.gz
# export PATH=$PATH:/path/to/novoalign_binary

# Placeholder for reference genome index (e.g., human hg38)
# Replace 'path/to/hg38_index' with the actual path to your novoalign index.
# The index needs to be built using 'novoindex' prior to alignment.
REFERENCE_INDEX="path/to/hg38_index"

# Placeholder for input FASTQ files (paired-end example)
# Replace with your actual input read files.
INPUT_READS_R1="input_reads_R1.fastq.gz"
INPUT_READS_R2="input_reads_R2.fastq.gz"

# Placeholder for output SAM file
OUTPUT_SAM="aligned_reads.sam"

# Execute novoalign for paired-end reads
# -d: reference index file built with novoindex
# -f: input FASTQ files (space-separated for paired-end)
# -o SAM: output format as SAM
# -r All: report all alignments of multi-mapping reads
#         (alternatives include None and Random)
# >: redirect standard output to the specified SAM file
novoalign -d "${REFERENCE_INDEX}" \
  -f "${INPUT_READS_R1}" "${INPUT_READS_R2}" \
  -o SAM -r All > "${OUTPUT_SAM}"

# If single-end reads:
# INPUT_READS="input_reads.fastq.gz"
# novoalign -d "${REFERENCE_INDEX}" -f "${INPUT_READS}" -o SAM -r All > "${OUTPUT_SAM}"
6
Readcounter using pyCRAC
pyCRAC v1.0.0
$ Bash example
# Install pyCRAC (e.g., via pip)
# pip install pyCRAC

# Define input and output files
INPUT_SAM="aligned_reads.sam"   # Alignment output (e.g., from novoalign)
OUTPUT_PREFIX="output_counts"   # Prefix for the read-count output files

# Gene annotation in GTF format. The source reports the Ensembl release 83
# annotation for hg38; the file name below is a placeholder.
GTF_FILE="Homo_sapiens.GRCh38.83.gtf"

# Count reads over annotated features with pyReadCounters.py.
# The option names below should be confirmed with `pyReadCounters.py --help`
# for your installed pyCRAC version; filters such as read length or mapping
# quality are left at their defaults here.
pyReadCounters.py \
  -f "${INPUT_SAM}" \
  --file_type=sam \
  --gtf="${GTF_FILE}" \
  -o "${OUTPUT_PREFIX}"
7
calculate FDR using Mock sample using pyCRAC
pyCRAC v1.2.0
$ Bash example
# Install pyCRAC (example using pip, adjust as needed for specific environment)
# pip install pyCRAC

# Define input and output files (all names are placeholders)
READS_GTF="ip_sample_reads.gtf"              # Read intervals in GTF format
ANNOTATION_GTF="Homo_sapiens.GRCh38.83.gtf"  # Ensembl release 83 annotation
CHROM_LENGTHS="chromosome_lengths.txt"       # Chromosome names and lengths
OUTPUT_FILE="pycrac_fdr_peaks.gtf"
FDR_THRESHOLD="0.05"                         # Common FDR threshold

# Calculate FDRs with pyCalculateFDRs.py.
# The source states that the FDR was calculated "using Mock sample" but does
# not say how the Mock library enters the calculation, so it is not wired in
# here. Confirm the option names with `pyCalculateFDRs.py --help`.
pyCalculateFDRs.py \
  -f "${READS_GTF}" \
  --gtf="${ANNOTATION_GTF}" \
  -c "${CHROM_LENGTHS}" \
  -o "${OUTPUT_FILE}" \
  -m "${FDR_THRESHOLD}"
8
intersect fdr with reads using bedtools intersect
$ Bash example
# Install bedtools (if not already installed)
# conda install -c bioconda bedtools

# Placeholder for input files. Replace with actual file paths.
# fdr_file.bed: BED file representing FDR regions (e.g., called peaks with FDR)
# reads_file.bed: BED file representing read alignments or regions derived from reads

# Perform the intersection of FDR regions with read regions.
# -a: The first input file (FDR regions)
# -b: The second input file (read regions)
# -wao: Write the original entry in A, the original entry in B, and the
#       number of overlapping bases.
# This is a common output format for detailed intersection results.
bedtools intersect -a fdr_file.bed -b reads_file.bed -wao > fdr_reads_intersect.bed
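The last column written by `-wao` is the number of bases shared by each A/B pair, with 0 reported for A entries that overlap nothing. For intuition, the minimal sketch below reproduces that arithmetic for two intervals on the same chromosome. The helper `overlap_bases` is hypothetical, assumes 0-based, half-open BED coordinates, and does not call bedtools.

```shell
set -eu

# Number of bases shared by two half-open intervals [s1, e1) and [s2, e2).
overlap_bases() {
  s1=$1; e1=$2; s2=$3; e2=$4
  # The overlap starts at the larger start and ends at the smaller end.
  if [ "$s1" -gt "$s2" ]; then start=$s1; else start=$s2; fi
  if [ "$e1" -lt "$e2" ]; then end=$e1; else end=$e2; fi
  if [ "$end" -gt "$start" ]; then
    echo $(( end - start ))
  else
    echo 0
  fi
}

partial=$(overlap_bases 100 200 150 300)   # intervals share bases 150..199
disjoint=$(overlap_bases 100 200 200 300)  # adjacent intervals share nothing
echo "partial=${partial} disjoint=${disjoint}"
```

Because BED ends are exclusive, two features that merely touch (one ends at 200, the next starts at 200) count as non-overlapping.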
9
Cluster reads using pyCRAC
pyCRAC (Inferred with models/gemini-2.5-flash) v0.1.0 (Inferred with models/gemini-2.5-flash)
$ Bash example
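# Older pyCRAC releases target Python 2; check which interpreter your
# installed version requires.
# Installation (if not already installed):
# pip install pyCRAC

# Cluster overlapping reads with pyClusterReads.py.
# Input:  reads.gtf (read intervals in GTF format; placeholder name)
# Output: clustered_reads.gtf (the source lists "GTF representing clusters
#         of hits" as the supplementary file format)
# Confirm the option names with `pyClusterReads.py --help`; thresholds such
# as minimum cluster height are left at their defaults here.
pyClusterReads.py \
  -f reads.gtf \
  --gtf=Homo_sapiens.GRCh38.83.gtf \
  -o clustered_reads.gtf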
Raw Source Text
library strategy: CLIP-seq
novoindex
Remove 3' adapter using flexbar
collapse data and remove PCR duplicates using pyCRAC
align using novalign
Readcounter using pyCRAC
calculate FDR using Mock sample using pyCRAC
intersect fdr with reads using bedtools intersect
Cluster reads using pyCRAC
Genome_build: hg38; Homo_sapiens-ensembl-release_83 genome annotation
Supplementary_files_format_and_content: GTF representing clusters of hits