GSE220185 Processing Pipeline

RIP-Seq, 4 steps

Publication

Proteomic discovery of chemical probes that perturb protein complexes in human cells.

Molecular Cell (2023), PMID 37084731

Dataset

GSE220185

Proteomic discovery of chemical probes that perturb protein complexes in human cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Processed using Skipper https://github.com/YeoLab/skipper

Skipper (version not specified)

$ Bash example

# Install Snakemake and clone the Skipper workflow.
# A dedicated conda environment for Snakemake and its dependencies is recommended.
# conda create -n skipper_env -c conda-forge -c bioconda snakemake mamba -y
# conda activate skipper_env

# Clone the Skipper workflow repository
# git clone https://github.com/YeoLab/skipper.git
# cd skipper

# The Skipper workflow uses conda environments defined within its rules.
# These environments will be created automatically by Snakemake if --use-conda is specified.

# Execution of the Skipper workflow
# This is a generic Snakemake command. The Snakefile name, the config file format, and
# the config keys shown here are illustrative placeholders; check the Skipper repository
# README for the actual entry point and settings.
#
# Reference genome: this dataset was processed against hg38 (see "Assembly" in the raw source text).
# In this illustrative config it would be set as genome: hg38.
#
# Replace 'N' with the desired number of CPU cores.
# Replace 'path/to/your/config.yaml' with the path to your specific configuration file.
# Replace 'path/to/your/output_directory' with your desired output location.
# A minimal config.yaml might look like:
# samples:
#   sample1:
#     R1: "path/to/sample1_R1.fastq.gz"
#     R2: "path/to/sample1_R2.fastq.gz" # if paired-end
# genome: "hg38" # or "mm10", etc.
# annotation: "path/to/your/gtf_or_gff"
#
# For a full run, you would typically copy and modify the example config file from the Skipper repository.

snakemake --snakefile Snakefile --cores N --use-conda --configfile path/to/your/config.yaml --directory path/to/your/output_directory

Skipper trims the reads, extracts UMIs, aligns the reads to the genome, and deduplicates them.

Skipper, Snakemake workflow (dependencies: Snakemake 6.0.5, cutadapt 3.4, STAR 2.7.9a, UMI-tools 1.1.2, Picard 2.25.7)

$ Bash example

# Install Git and Conda if not already present
# sudo apt-get update && sudo apt-get install -y git
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# bash miniconda.sh -b -p $HOME/miniconda
# export PATH="$HOME/miniconda/bin:$PATH"
# conda init bash
# source ~/.bashrc

# Clone the Skipper pipeline repository
# git clone https://github.com/YeoLab/skipper.git
# cd skipper

# Create and activate the conda environment for the pipeline
# (the file name and environment name are assumptions; use the ones the repository provides)
# conda env create -f environment.yaml
# conda activate skipper_env

# --- Configuration for Skipper ---
# Create a config.yaml file with your specific parameters.
# The key names below are illustrative, not confirmed Skipper settings;
# replace placeholder paths and values with your actual data.
cat << 'EOF' > config.yaml
GENOME_DIR: "/path/to/STAR_index/hg38" # Path to STAR genome index directory (e.g., built from hg38)
GENOME_FASTA: "/path/to/genome/hg38.fa" # Path to the genome FASTA file (e.g., hg38.fa)
GTF: "/path/to/annotations/gencode.v38.annotation.gtf" # Path to the gene annotation GTF file (e.g., gencode.v38.annotation.gtf)

ADAPTER_FWD: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Forward adapter sequence for trimming
ADAPTER_REV: "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Reverse adapter sequence for trimming

UMI_PATTERN: "NNNNNNNN" # UMI pattern (e.g., 8 N's for an 8-bp UMI)
# BARCODE_PATTERN: "NNNNNN" # Optional: Barcode pattern if present, otherwise comment out or leave empty

DEDUP_METHOD: "directional" # Deduplication method: "directional", "unique", or "position"

# Define your samples and input directory
SAMPLES: ["sample1", "sample2"] # List of sample names (e.g., corresponding to sample1_R1.fastq.gz, sample2_R1.fastq.gz)
INPUT_DIR: "/path/to/raw_fastq" # Directory containing your raw FASTQ files
OUTPUT_DIR: "results" # Directory for output files
EOF

# --- Run the Skipper pipeline ---
# This command executes the Snakemake workflow using the specified configuration.
# Adjust --cores based on available CPU resources.
snakemake --use-conda --cores 8 --configfile config.yaml all
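
The four stages named in this step (trim, extract UMI, align, deduplicate) can be sketched as explicit commands for one single-end sample. This is a minimal dry-run sketch that uses the tools listed above as dependencies; whether Skipper invokes them this way is an assumption, and the function names, file names, and index path are hypothetical. The adapter and UMI pattern are the placeholder values from the config above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: the four stages for one single-end sample.
# Arguments: sample prefix, STAR index directory, thread count (default 8).
skipper_like_stages() {
  local sample="$1" star_index="$2" threads="${3:-8}"

  # 1. Trim the 3-prime adapter (placeholder adapter sequence).
  run cutadapt -j "$threads" -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
      -o "${sample}.trimmed.fastq.gz" "${sample}.fastq.gz"

  # 2. Move the UMI from the read sequence into the read name (placeholder 8-nt UMI).
  run umi_tools extract --bc-pattern=NNNNNNNN \
      --stdin "${sample}.trimmed.fastq.gz" --stdout "${sample}.umi.fastq.gz"

  # 3. Align to the genome and index the sorted BAM.
  run STAR --runThreadN "$threads" --genomeDir "$star_index" \
      --readFilesIn "${sample}.umi.fastq.gz" --readFilesCommand zcat \
      --outSAMtype BAM SortedByCoordinate --outFileNamePrefix "${sample}."
  run samtools index "${sample}.Aligned.sortedByCoord.out.bam"

  # 4. Collapse PCR duplicates that share a mapping position and UMI.
  run umi_tools dedup -I "${sample}.Aligned.sortedByCoord.out.bam" \
      -S "${sample}.dedup.bam"
}

skipper_like_stages sample1 /path/to/STAR_index/hg38
```

By default the script only prints the five commands, which makes it easy to review them before running with DRY_RUN=0 in an environment where the tools are installed.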

To find enriched windows, Skipper models the crosslinking-induced truncations (CITs) of the CLIP libraries with a beta-binomial model that includes the GC bias of each window as a variable.

PureCLIP (Inferred with models/gemini-2.5-flash) v1.3.1

$ Bash example

# Install PureCLIP via Bioconda
# conda install -c bioconda pureclip

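# Example command to call crosslink sites with PureCLIP.
# Note: the source text attributes enriched-window calling to Skipper's beta-binomial model;
# PureCLIP is an inferred stand-in that uses a hidden Markov model instead.
# PureCLIP takes one BAM file plus its index, so merge replicate BAMs first
# (e.g., samtools merge, then samtools index).
# Replace merged_clip.bam with your aligned CLIP BAM file, hg38.fa with your reference
# genome FASTA file, and output_crosslink_sites.bed with your desired output file name.
# Adjust -nt for the number of threads to use.

pureclip -i merged_clip.bam -bai merged_clip.bam.bai -g hg38.fa -o output_crosslink_sites.bed -nt 8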
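
The beta-binomial window model named in this step can be written out explicitly. The form below is a generic parametrization, not a formula taken from the source: it assumes that k_w counts the CLIP reads among n_w total reads in window w, that mu is the expected CLIP fraction for the GC-content bin the window falls into, and that rho is the overdispersion. All of these symbols, and the idea of binning windows by GC content, are illustrative assumptions.

```latex
\begin{align}
P(k_w \mid n_w, \mu, \rho) &= \binom{n_w}{k_w}\,
  \frac{B(k_w + \alpha,\; n_w - k_w + \beta)}{B(\alpha, \beta)}, \\
\alpha &= \mu\,\frac{1-\rho}{\rho}, \qquad
\beta = (1-\mu)\,\frac{1-\rho}{\rho}, \\
\mathbb{E}[k_w] &= n_w\,\mu, \qquad
\operatorname{Var}[k_w] = n_w\,\mu\,(1-\mu)\,\bigl[1 + (n_w - 1)\,\rho\bigr].
\end{align}
```

As rho approaches 0 the variance reduces to the binomial variance, so rho measures how much noisier the window counts are than pure sampling noise would predict.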

Overdispersion parameters are estimated from two input replicates.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.42.0

$ Bash example

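# Install R and DESeq2 if not already available
# conda create -n deseq2_env -c conda-forge -c bioconda r-base r-essentials bioconductor-deseq2
# conda activate deseq2_env

# Example R script to estimate dispersion parameters using DESeq2.
# Note: DESeq2 fits negative-binomial dispersions, whereas the source text describes a
# beta-binomial model, so this is an approximate stand-in rather than Skipper's own estimate.
# This script assumes a count matrix and a sample information file as input.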

cat << 'EOF' > estimate_overdispersion.R
library(DESeq2)

# --- User-defined parameters ---
count_matrix_file <- "counts.tsv" # Combined count matrix from replicates
sample_info_file <- "sample_info.csv" # Sample metadata

output_dispersion_file <- "overdispersion_estimates.tsv"
# -------------------------------

# Load count data
count_data <- read.delim(count_matrix_file, row.names = 1, check.names = FALSE)
# Ensure counts are integers
count_data <- round(count_data)
count_data[is.na(count_data)] <- 0
count_data <- as.matrix(count_data)

# Load sample information
sample_info <- read.csv(sample_info_file, row.names = 1)

# Ensure sample names match and are in the same order
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# Create DESeqDataSet object
# For simple overdispersion estimation, a minimal design is sufficient.
# If comparing two replicates, the design might be ~1 or ~replicate_group
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ 1) # A simple design to estimate dispersions

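# Estimate size factors and dispersions
dds <- estimateSizeFactors(dds)
# The dispersion-trend fit can fail on very small inputs (such as the dummy data below);
# in that case fall back to the gene-wise estimates, as DESeq2's error message suggests.
dds <- tryCatch(
  estimateDispersions(dds),
  error = function(e) {
    dds <- estimateDispersionsGeneEst(dds)
    dispersions(dds) <- mcols(dds)$dispGeneEst
    dds
  }
)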

# Extract dispersion estimates
dispersions <- mcols(dds)$dispersion
names(dispersions) <- rownames(dds)

# Save dispersions
write.table(as.data.frame(dispersions), file = output_dispersion_file,
            sep = "\t", quote = FALSE, col.names = NA)

message(paste("Overdispersion parameters estimated and saved to:", output_dispersion_file))
EOF

# Create dummy count data (replace with actual data)
echo -e "gene\trep1\trep2" > counts.tsv
echo -e "geneA\t100\t120" >> counts.tsv
echo -e "geneB\t50\t65" >> counts.tsv
echo -e "geneC\t200\t210" >> counts.tsv
echo -e "geneD\t10\t15" >> counts.tsv

# Create dummy sample info (replace with actual data)
echo -e "sample,condition" > sample_info.csv
echo -e "rep1,control" >> sample_info.csv
echo -e "rep2,control" >> sample_info.csv

# Execute the R script
Rscript estimate_overdispersion.R
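
Estimating overdispersion from two input replicates can also be illustrated directly under a beta-binomial view. In the sketch below, each window's reads are split between the two replicates; if the rep1 count out of n total reads has variance n·mu·(1−mu)·[1 + (n−1)·rho], then rho can be estimated by comparing the observed squared deviations with the binomial variance. This method-of-moments estimator is an assumption chosen for simplicity, not necessarily the estimator Skipper uses, and the function name, file layout, and counts are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Method-of-moments overdispersion (rho) for a beta-binomial split of reads
# between two replicates. Input: TSV with a header and columns window, rep1, rep2.
estimate_rho() {
  awk -F'\t' '
    NR == 1 { next }                          # skip the header line
    { x1[NR] = $2; n[NR] = $2 + $3; s1 += $2; sn += $2 + $3 }
    END {
      mu = s1 / sn                            # expected rep1 share, from library totals
      for (i in n) {
        if (n[i] < 2) continue                # fewer than 2 reads: no information on rho
        b = n[i] * mu * (1 - mu)              # binomial variance for this window
        num += (x1[i] - n[i] * mu) ^ 2 - b    # excess over the binomial variance
        den += b * (n[i] - 1)
      }
      rho = (den > 0) ? num / den : 0
      if (rho < 0) rho = 0                    # clamp: no evidence of overdispersion
      printf "%.4f\n", rho
    }
  ' "$1"
}

# Toy counts for five windows (made-up values, for illustration only)
demo_counts="$(mktemp)"
trap 'rm -f "$demo_counts"' EXIT
printf 'window\trep1\trep2\n' > "$demo_counts"
printf 'w1\t10\t10\nw2\t30\t10\nw3\t5\t15\nw4\t20\t20\nw5\t15\t25\n' >> "$demo_counts"

estimate_rho "$demo_counts"   # prints 0.0809
```

For the toy counts the replicate totals are equal, so mu is 0.5, the summed excess variance is 110, and the normalizing sum is 1360, giving 110/1360 ≈ 0.0809.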

Tools Used

Skipper

Raw Source Text

Processed using Skipper https://github.com/YeoLab/skipper
Skipper trims the read, extract UMI, align to the genome and deduplicate
To find enriched windows, it models the CLIP libraries crosslinking-induced truncation(CITs)  with beta-binomial model with GC-bias of each window as a varaible
overdispersion parameters are estimated from two input replicates
Assembly: hg38
Supplementary files format and content: wig files represent crosslinking-induced truncation(CITs) for plus and minus strands
Supplementary files format and content: tsv file represents reproducible genomic enriched window
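
The strand-specific wig files described above can be given a quick sanity check by summing the signal per chromosome. This is a minimal sketch assuming standard variableStep or fixedStep WIG records; the function name and the toy file contents are hypothetical and do not come from the dataset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sum the signal per chromosome in a WIG file (variableStep or fixedStep).
wig_sum_by_chrom() {
  awk '
    /^(track|browser|#)/ { next }                    # skip header and comment lines
    /^(variableStep|fixedStep)/ {                    # declaration line: remember mode and chrom
      mode = $1
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "chrom") chrom = kv[2]
      }
      next
    }
    NF == 0 { next }                                 # ignore blank lines
    mode == "variableStep" { total[chrom] += $2 }    # data lines are: position value
    mode == "fixedStep"    { total[chrom] += $1 }    # data lines are: value
    END { for (c in total) printf "%s\t%g\n", c, total[c] }
  ' "$1" | sort -k1,1
}

# Toy input (made-up track, for illustration only)
demo_wig="$(mktemp)"
trap 'rm -f "$demo_wig"' EXIT
cat > "$demo_wig" << 'EOF'
track type=wiggle_0 name="toy_plus"
variableStep chrom=chr1 span=1
100 3
101 5
fixedStep chrom=chr2 start=50 step=1
2
4
EOF

wig_sum_by_chrom "$demo_wig"
```

Running the same function on the plus-strand and minus-strand files and comparing the per-chromosome totals is a cheap way to spot a truncated or mislabeled download.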