GSE104949 Processing Pipeline

RIP-Seq code_examples 9 steps

Publication

Transcriptome regulation by PARP13 in basal and antiviral states in human cells.

iScience (2024) — PMID 38495826

Dataset

GSE104949

RNA-binding activity of TRIM25 is mediated by its PRY/SPRY domain and is required for ubiquitination

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    library strategy: CLIP-seq

    $ Bash example
    # This script outlines key steps for CLIP-seq data processing,
    # drawing from common practices and tools mentioned in eCLIP guidelines.
    # For a complete eCLIP pipeline, refer to the yeolab/eclip (CWL) or yeolab/skipper (Snakemake) workflows.
    
    # --- Configuration ---
    # Placeholder for input FASTQ files (assuming single-end for simplicity)
    INPUT_FASTQ="sample_R1.fastq.gz"
    OUTPUT_PREFIX="clip_seq_processed"
    
    # Reference genome (using hg38 as a placeholder for human)
    # Ensure STAR genome index is built for hg38.
    # Example: STAR --runMode genomeGenerate --genomeDir /path/to/STAR_genome_index/hg38 --genomeFastaFiles /path/to/genome/hg38.fa --sjdbGTFfile /path/to/annotations/gencode.v38.annotation.gtf --runThreadN 8
    STAR_GENOME_DIR="/path/to/STAR_genome_index/hg38"
    GENOME_FASTA="/path/to/genome/hg38.fa" # Required for CLIPper
    
    # --- 1. Alignment with STAR (splice-aware aligner) ---
    # conda install -c bioconda star
    STAR --runThreadN 8 \
         --genomeDir "${STAR_GENOME_DIR}" \
         --readFilesIn "${INPUT_FASTQ}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_STAR_" \
         --outSAMtype BAM SortedByCoordinate \
         --outReadsUnmapped Fastx \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 30000000000 # Example: 30GB RAM for sorting
    
    ALIGNED_BAM="${OUTPUT_PREFIX}_STAR_Aligned.sortedByCoord.out.bam"
    
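    # --- 2. Deduplication (using samtools) ---
    # conda install -c bioconda samtools
    # samtools fixmate expects name-grouped input, so name-sort the
    # coordinate-sorted STAR output first, then re-sort by coordinate for markdup.
    samtools sort -n -o "${OUTPUT_PREFIX}_namesorted.bam" "${ALIGNED_BAM}"
    samtools fixmate -m "${OUTPUT_PREFIX}_namesorted.bam" "${OUTPUT_PREFIX}_fixmate.bam"
    samtools sort -o "${OUTPUT_PREFIX}_sorted_fixmate.bam" "${OUTPUT_PREFIX}_fixmate.bam"
    samtools markdup -r "${OUTPUT_PREFIX}_sorted_fixmate.bam" "${OUTPUT_PREFIX}_dedup.bam"
    samtools index "${OUTPUT_PREFIX}_dedup.bam"
    
    DEDUP_BAM="${OUTPUT_PREFIX}_dedup.bam"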
    
    # --- 3. Peak Calling with CLIPper ---
    # conda install -c bioconda clipper
    # CLIPper takes the deduplicated BAM (-b), a species/assembly name (-s),
    # and an output BED path (-o).
    # The -f and -t options are carried over from the inferred example and are
    # unverified; check their meaning with `clipper --help` before relying on them.
    clipper -b "${DEDUP_BAM}" -s hg38 -o "${OUTPUT_PREFIX}_peaks.bed" \
            -f "${GENOME_FASTA}" -t 5
    
    # --- 4. Reproducible Peak Identification (IDR) ---
    # For eCLIP, reproducible peaks are typically identified with the yeolab/merge_peaks pipeline.
    # This step requires at least two replicates and usually a control sample.
    # The invocation below is an unverified placeholder: the script name and its
    # options are assumptions, so follow the merge_peaks README for actual usage.
    # REPLICATE1_PEAKS="replicate1_peaks.bed"
    # REPLICATE2_PEAKS="replicate2_peaks.bed"
    # git clone https://github.com/yeolab/merge_peaks.git
    # python merge_peaks/merge_peaks.py \
    #     --peak_files "${REPLICATE1_PEAKS}" "${REPLICATE2_PEAKS}" \
    #     --output_prefix "${OUTPUT_PREFIX}_idr_reproducible" \
    #     --idr_threshold 0.05 \
    #     --genome_fasta "${GENOME_FASTA}"
  2. 2

    novoindex

    novoindex (Inferred with models/gemini-2.5-flash) v4.04.00 GitHub
    $ Bash example
    # Installation (novoalign is commercial software, typically downloaded from Novocraft or installed via specific channels if licensed)
    # Example via Bioconda (requires a license for full functionality):
    # conda install -c bioconda novoalign
    
    # Placeholder for the latest human reference genome (hg38)
    # You would typically download this from a source like UCSC, NCBI, or Ensembl.
    # For example:
    # wget -O hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    # gunzip hg38.fa.gz
    REF_FASTA="hg38.fa"
    OUTPUT_INDEX="hg38.idx"
    
    # Build the novoalign index
    # -k: k-mer size (e.g., 14 for human genome, common values 12-14)
    # -s: step size (e.g., 1, common values 1-2)
    novoindex -k 14 -s 1 "${OUTPUT_INDEX}" "${REF_FASTA}"
  3. 3

    Remove 3’adapter using flexbar

    flexbar v3.0.3 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install flexbar (example using conda)
    # conda install -c bioconda flexbar
    
    # Define variables
    INPUT_FASTQ="input_reads.fastq.gz"
    OUTPUT_PREFIX="output_reads_trimmed"
    # Placeholder for a common Illumina 3' adapter sequence.
    # Replace with the actual adapter used in your experiment.
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # Execute flexbar for 3' adapter removal and quality trimming.
    # Parameter values are illustrative; the source does not list the settings used.
    # -as passes the adapter as a sequence (-a would expect a FASTA file).
    # In flexbar 3.x, -q selects the trimming mode and -qt sets the quality threshold.
    flexbar \
        -r "${INPUT_FASTQ}" \
        -t "${OUTPUT_PREFIX}" \
        -as "${ADAPTER_SEQUENCE}" \
        -ao 3 \
        -u 1 \
        -q TAIL \
        -qt 20 \
        -qf i1.8 \
        -m 18 \
        -n 4 \
        -z GZ
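
The minimum-length filter (-m 18) above can be checked by tabulating read lengths in the trimmed output. The following is a minimal sketch, assuming standard four-line FASTQ records; `read_length_histogram` is a hypothetical helper name and is not part of flexbar.

```shell
# Print "length<TAB>count" for every read length seen in a FASTQ stream (stdin).
# Assumes four lines per record, so the sequence is every line where NR % 4 == 2.
read_length_histogram() {
    awk 'NR % 4 == 2 { counts[length($0)]++ }
         END { for (len in counts) printf "%d\t%d\n", len, counts[len] }' |
        sort -n
}

# Usage with gzip-compressed trimmed reads (file name is a placeholder):
# gzip -dc output_reads_trimmed.fastq.gz | read_length_histogram
```

If trimming behaved as intended, no length below the -m value should appear in the first column.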
  4. 4

    collapse data and remove PCR duplicates using pyCRAC

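    pyCRAC v1.2.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install pyCRAC (if not already installed)
    # pip install pyCRAC
    # # Or using conda:
    # # conda install -c bioconda pycrac
    
    # Collapse identical reads to remove PCR duplicates.
    # In the source step order this runs after adapter trimming and before
    # alignment, so the input is the trimmed FASTQ rather than a BAM.
    # Assumption: pyCRAC's pyFastqDuplicateRemover.py with -f (input) and
    # -o (output); confirm the flags with pyFastqDuplicateRemover.py --help.
    gzip -dc output_reads_trimmed.fastq.gz > output_reads_trimmed.fastq
    pyFastqDuplicateRemover.py -f output_reads_trimmed.fastq -o collapsed_reads.fasta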
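
To make the idea of collapsing concrete, the sketch below groups identical sequences and records how often each was seen. It only illustrates the concept, assuming four-line FASTQ records; it is not pyCRAC's implementation or output format, and `collapse_identical_reads` and the `>seqN_count` naming are hypothetical.

```shell
# Collapse identical sequences from a FASTQ stream (stdin) into FASTA records.
# Each header is ">seq<rank>_<count>", ordered by descending count, then sequence.
collapse_identical_reads() {
    awk 'NR % 4 == 2 { counts[$0]++ }
         END { for (seq in counts) printf "%d\t%s\n", counts[seq], seq }' |
        LC_ALL=C sort -k1,1nr -k2,2 |
        awk -F '\t' '{ printf ">seq%d_%d\n%s\n", NR, $1, $2 }'
}

# Usage (file name is a placeholder):
# gzip -dc output_reads_trimmed.fastq.gz | collapse_identical_reads > collapsed.fasta
```

Collapsing on the exact sequence treats every identical read as a PCR duplicate, which is only appropriate when the reads still carry whatever random barcode the library protocol used.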
  5. 5

    align using novoalign

    novoalign (version not specified)
    $ Bash example
    # Installation: novoalign is commercial software, typically downloaded from
    # Novocraft's website and installed manually or via a site license.
    # After unpacking the release archive, add the directory containing the
    # novoalign binary to your PATH (the path below is a placeholder):
    # export PATH=$PATH:/path/to/novoalign_binary
    
    # Placeholder for reference genome index (e.g., human hg38)
    # Replace 'path/to/hg38_index' with the actual path to your novoalign index files.
    # The index needs to be built using 'novoindex' prior to alignment.
    REFERENCE_INDEX="path/to/hg38_index"
    
    # Placeholder for input FASTQ files (paired-end example)
    # Replace with your actual input read files.
    INPUT_READS_R1="input_reads_R1.fastq.gz"
    INPUT_READS_R2="input_reads_R2.fastq.gz"
    
    # Placeholder for output SAM file
    OUTPUT_SAM="aligned_reads.sam"
    
    # Execute novoalign for paired-end reads
    # -d: specify the reference index file built with novoindex
    # -f: specify input FASTQ files (space-separated for paired-end)
    # -o SAM: output format as SAM
    # -r All: report all alignments for multi-mapping reads (other strategies include None and Random)
    # >: redirect standard output to the specified SAM file
    novoalign -d "${REFERENCE_INDEX}" -f "${INPUT_READS_R1}" "${INPUT_READS_R2}" -o SAM -r All > "${OUTPUT_SAM}"
    
    # If single-end reads:
    # INPUT_READS="input_reads.fastq.gz"
    # novoalign -d "${REFERENCE_INDEX}" -f "${INPUT_READS}" -o SAM -r All > "${OUTPUT_SAM}"
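
A quick sanity check on the alignment output is to count mapped and unmapped records in the SAM file. This is a minimal sketch using only awk; `sam_mapping_summary` is a hypothetical helper name. It counts SAM records rather than reads, so with -r All a multi-mapping read contributes one record per reported alignment.

```shell
# Summarise a SAM stream (stdin): total, mapped and unmapped alignment records.
sam_mapping_summary() {
    awk -F '\t' '
        /^@/ { next }                              # skip header lines
        {
            total++
            # FLAG bit 0x4 set means the record is unmapped
            if (int($2 / 4) % 2 == 1) unmapped++
        }
        END { printf "total=%d mapped=%d unmapped=%d\n", total, total - unmapped, unmapped }'
}

# Usage:
# sam_mapping_summary < aligned_reads.sam
```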
    
  6. 6

    Readcounter using pyCRAC

    pyCRAC v1.0.0
    $ Bash example
    # Install pyCRAC (e.g., via pip)
    # pip install pyCRAC
    
    # Define input and output files
    INPUT_SAM="aligned_reads.sam" # Alignment output from novoalign (SAM)
    OUTPUT_PREFIX="readcounts"    # Hypothetical name for the output
    
    # Reference annotation. The source lists hg38 with the Ensembl release 83
    # annotation; the file name below is a placeholder.
    GTF_FILE="Homo_sapiens.GRCh38.83.gtf"
    
    # Count reads over annotated features with pyReadCounters.py.
    # Assumption: the flags below (-f, --file_type, --gtf, -o) are not taken
    # from the source; confirm them with pyReadCounters.py --help.
    pyReadCounters.py \
        -f "${INPUT_SAM}" \
        --file_type sam \
        --gtf "${GTF_FILE}" \
        -o "${OUTPUT_PREFIX}"
  7. 7

    calculate FDR using Mock sample using pyCRAC

    pyCRAC v1.2.0
    $ Bash example
    # Install pyCRAC (example using pip, adjust as needed for specific environment)
    # pip install pyCRAC
    
    # Define input and output files (all names are placeholders)
    READS_GTF="ip_sample_reads.gtf"       # Read intervals for the IP sample
    CHROM_LENGTHS="chromosome_lengths.txt" # Tab-separated chromosome names and lengths
    GTF_FILE="Homo_sapiens.GRCh38.83.gtf"
    OUTPUT_FILE="ip_sample_FDRs.gtf"
    FDR_THRESHOLD="0.05" # Common FDR threshold
    
    # Calculate FDRs with pyCalculateFDRs.py.
    # Assumption: the flags below (-f, -c, --gtf, -m, -o) are not taken from
    # the source; confirm them with pyCalculateFDRs.py --help.
    # The source does not specify how the Mock sample enters this calculation,
    # so it is not represented in this command.
    pyCalculateFDRs.py \
        -f "${READS_GTF}" \
        -c "${CHROM_LENGTHS}" \
        --gtf "${GTF_FILE}" \
        -m "${FDR_THRESHOLD}" \
        -o "${OUTPUT_FILE}"
    
  8. 8

    intersect fdr with reads using bedtools intersect

    bedtools v2.30.0 GitHub
    $ Bash example
    # Install bedtools (if not already installed)
    # conda install -c bioconda bedtools
    
    # Placeholder for input files. Replace with actual file paths.
    # fdr_file.bed: BED file representing FDR regions (e.g., called peaks with FDR)
    # reads_file.bed: BED file representing read alignments or regions derived from reads
    
    # Perform the intersection of FDR regions with read regions.
    # -a: The first input file (FDR regions)
    # -b: The second input file (read regions)
    # -wao: Write the original entry in A, the original entry in B, and the number of overlapping bases.
    #       This is a common output format for detailed intersection results.
    bedtools intersect -a fdr_file.bed -b reads_file.bed -wao > fdr_reads_intersect.bed
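
The -wao output above has one line per A/B pair, so a per-interval summary usually needs one more pass. The sketch below is one way to do that with awk, assuming the A file is BED-like with chromosome, start and end in the first three columns and that the overlap length is the last column, as -wao writes it. `overlap_per_interval` is a hypothetical helper name.

```shell
# Summarise `bedtools intersect -wao` output (stdin) per A interval.
# Output columns: chrom, start, end, number of overlapping B entries, total overlapping bases.
overlap_per_interval() {
    awk -F '\t' -v OFS='\t' '
        {
            key = $1 OFS $2 OFS $3
            if (!(key in bases)) order[++n] = key   # remember first-seen order
            bases[key] += $NF                       # last column is the overlap in bp
            if ($NF > 0) hits[key]++                # A entries without overlap report 0
        }
        END {
            for (i = 1; i <= n; i++)
                print order[i], hits[order[i]] + 0, bases[order[i]]
        }'
}

# Usage:
# overlap_per_interval < fdr_reads_intersect.bed
```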
  9. 9

    Cluster reads using pyCRAC

    pyCRAC (Inferred with models/gemini-2.5-flash) v0.1.0 (Inferred with models/gemini-2.5-flash)
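    $ Bash example
    # Older pyCRAC releases were written for Python 2; check which interpreter
    # the release you install requires.
    # Installation (if not already installed):
    # pip install pyCRAC
    
    # Cluster overlapping reads with pyClusterReads.py.
    # Input: read intervals in GTF format (file name is a placeholder)
    # Output: clusters in GTF format, in line with the "GTF representing
    # clusters of hits" supplementary files listed for this dataset.
    # Assumption: the flags below (-f, --gtf, -o) are not taken from the
    # source; confirm them with pyClusterReads.py --help.
    pyClusterReads.py \
        -f reads.gtf \
        --gtf Homo_sapiens.GRCh38.83.gtf \
        -o clustered_reads.gtf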
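
As a rough illustration of what clustering reads means, the sketch below merges overlapping intervals from a coordinate-sorted BED stream and reports how many intervals fell into each cluster. It is a simplified stand-in, not pyCRAC's algorithm: it ignores strand, uses plain overlap as the only criterion, and `cluster_sorted_intervals` is a hypothetical helper name.

```shell
# Merge overlapping intervals from a BED stream (stdin) sorted by chrom, then start.
# Output columns: chrom, cluster start, cluster end, number of intervals in the cluster.
cluster_sorted_intervals() {
    awk -F '\t' -v OFS='\t' '
        function emit() { if (n > 0) print chrom, start, end, n }
        # A new chromosome, or a start at/after the current end, opens a new cluster
        # (BED intervals are half-open, so touching intervals do not overlap).
        $1 != chrom || $2 >= end {
            emit(); chrom = $1; start = $2; end = $3; n = 1; next
        }
        { if ($3 > end) end = $3; n++ }
        END { emit() }'
}

# Usage (input must already be sorted, e.g. with sort -k1,1 -k2,2n):
# cluster_sorted_intervals < reads_sorted.bed
```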

Raw Source Text
library strategy: CLIP-seq
novoindex
Remove 3’adapter using flexbar
collapse data and remove PCR duplicates using pyCRAC
align using novalign
Readcounter using pyCRAC
calculate FDR using Mock sample using pyCRAC
intersect fdr with reads using bedtools intersect
Cluster reads using pyCRAC
Genome_build: hg38; Homo_sapiens-ensembl-release_83 genome annotation
Supplementary_files_format_and_content: GTF representing clusters of hits