GSE220185 Processing Pipeline
RIP-Seq
4 steps
Publication
Proteomic discovery of chemical probes that perturb protein complexes in human cells. Molecular Cell (2023), PMID 37084731
Dataset
GSE220185: Proteomic discovery of chemical probes that perturb protein complexes in human cells
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Processed using Skipper https://github.com/YeoLab/skipper
$ Bash example
# Installation of Snakemake and cloning the Skipper workflow
# It is recommended to use a dedicated conda environment for Snakemake and its dependencies.
# conda create -n skipper_env snakemake mamba -y
# conda activate skipper_env

# Clone the Skipper workflow repository
# git clone https://github.com/YeoLab/skipper.git
# cd skipper

# The Skipper workflow uses conda environments defined within its rules.
# These environments will be created automatically by Snakemake if --use-conda is specified.

# Execution of the Skipper workflow
# This is a generic command. Actual parameters (input files, genome, etc.)
# would be specified in a configuration file (e.g., config/config.yaml).
#
# Reference dataset: For eCLIP, a genome assembly like hg38 or mm10 is typically required.
# This would be specified in the config.yaml (e.g., genome: hg38).
#
# Replace 'N' with the desired number of CPU cores.
# Replace 'path/to/your/config.yaml' with the path to your specific configuration file.
# Replace 'path/to/your/output_directory' with your desired output location.

# A minimal config.yaml might look like:
# samples:
#   sample1:
#     R1: "path/to/sample1_R1.fastq.gz"
#     R2: "path/to/sample1_R2.fastq.gz"  # if paired-end
# genome: "hg38"  # or "mm10", etc.
# annotation: "path/to/your/gtf_or_gff"
#
# For a full run, you would typically copy and modify the example config file from the Skipper repository.
snakemake --snakefile Snakefile --cores N --use-conda --configfile path/to/your/config.yaml --directory path/to/your/output_directory
2
Skipper trims the reads, extracts UMIs, aligns the reads to the genome, and deduplicates them.
Skipper: Snakemake workflow (dependencies: Snakemake 6.0.5, cutadapt 3.4, STAR 2.7.9a, UMI-tools 1.1.2, Picard 2.25.7)
$ Bash example
# Install Git and Conda if not already present
# sudo apt-get update && sudo apt-get install -y git
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# bash miniconda.sh -b -p $HOME/miniconda
# export PATH="$HOME/miniconda/bin:$PATH"
# conda init bash
# source ~/.bashrc

# Clone the Skipper pipeline repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create and activate the conda environment for the pipeline
# conda env create -f environment.yaml
# conda activate skipper_env

# --- Configuration for Skipper ---
# Create a config.yaml file with your specific parameters.
# Replace placeholder paths and values with your actual data.
cat << EOF > config.yaml
GENOME_DIR: "/path/to/STAR_index/hg38"  # Path to STAR genome index directory (e.g., built from hg38)
GENOME_FASTA: "/path/to/genome/hg38.fa"  # Path to the genome FASTA file (e.g., hg38.fa)
GTF: "/path/to/annotations/gencode.v38.annotation.gtf"  # Path to the gene annotation GTF file
ADAPTER_FWD: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # Forward adapter sequence for trimming
ADAPTER_REV: "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"  # Reverse adapter sequence for trimming
UMI_PATTERN: "NNNNNNNN"  # UMI pattern (e.g., 8 N's for an 8-bp UMI)
# BARCODE_PATTERN: "NNNNNN"  # Optional: Barcode pattern if present, otherwise comment out or leave empty
DEDUP_METHOD: "directional"  # Deduplication method: "directional", "unique", or "position"

# Define your samples and input directory
SAMPLES: ["sample1", "sample2"]  # List of sample names (e.g., corresponding to sample1_R1.fastq.gz, sample2_R1.fastq.gz)
INPUT_DIR: "/path/to/raw_fastq"  # Directory containing your raw FASTQ files
OUTPUT_DIR: "results"  # Directory for output files
EOF

# --- Run the Skipper pipeline ---
# This command executes the Snakemake workflow using the specified configuration.
# Adjust --cores based on available CPU resources.
snakemake --use-conda --cores 8 --configfile config.yaml all
3
To find enriched windows, Skipper models the crosslinking-induced truncations (CITs) of the CLIP libraries with a beta-binomial model, using the GC bias of each window as a variable.
PureCLIP v1.3.1 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install PureCLIP and samtools via Bioconda
# conda install -c bioconda pureclip samtools

# Example commands to call enriched sites using PureCLIP.
# PureCLIP takes a single BAM file plus its index, so the replicates are merged first.
# Replace the input_clip_replicateX.bam names with your aligned CLIP-seq BAM files,
# hg38.fa with your reference genome FASTA file,
# and output_enriched_windows.bed with your desired output file name.
# Adjust -nt for the number of threads to use.
samtools merge merged_clip.bam input_clip_replicate1.bam input_clip_replicate2.bam
samtools index merged_clip.bam
pureclip -i merged_clip.bam -bai merged_clip.bam.bai -g hg38.fa -o output_enriched_windows.bed -nt 8
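The beta-binomial enrichment test described in this step can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not Skipper's implementation: the function names, the parameterization by an overdispersion value `rho`, and the GC-bin baseline table `HYPOTHETICAL_BASELINE_BY_GC_BIN` (with made-up values) are all introduced here for illustration. The idea is that, for one window, the IP read count out of the combined IP and input reads is compared against the fraction expected for windows of similar GC content.

```python
from math import exp, lgamma

# Hypothetical expected IP fraction per GC bin (illustrative values only).
HYPOTHETICAL_BASELINE_BY_GC_BIN = {"low": 0.45, "mid": 0.50, "high": 0.55}


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k: int, n: int, alpha: float, beta: float) -> float:
    """Log probability of k successes in n trials under BetaBinomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def enrichment_pvalue(ip_reads: int, input_reads: int,
                      expected_ip_fraction: float, rho: float) -> float:
    """Upper-tail p-value: probability of seeing at least ip_reads IP reads
    among ip_reads + input_reads, given the expected fraction and overdispersion."""
    if not 0.0 < expected_ip_fraction < 1.0:
        raise ValueError("expected_ip_fraction must lie strictly between 0 and 1")
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    n = ip_reads + input_reads
    # Convert (mean, overdispersion) into the usual alpha/beta shape parameters.
    alpha = expected_ip_fraction * (1.0 - rho) / rho
    beta = (1.0 - expected_ip_fraction) * (1.0 - rho) / rho
    tail = sum(exp(betabinom_logpmf(k, n, alpha, beta)) for k in range(ip_reads, n + 1))
    return min(1.0, tail)


def window_pvalue(ip_reads: int, input_reads: int, gc_bin: str, rho: float) -> float:
    """Look up the GC-dependent baseline, then test the window for enrichment."""
    return enrichment_pvalue(ip_reads, input_reads,
                             HYPOTHETICAL_BASELINE_BY_GC_BIN[gc_bin], rho)


if __name__ == "__main__":
    # A window with 18 IP reads and 2 input reads in the "mid" GC bin.
    print(window_pvalue(18, 2, "mid", rho=0.05))
```

A larger `rho` widens the tails of the distribution, so the same read counts yield a less extreme p-value; this is why the overdispersion has to be estimated from the data rather than assumed.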
4
Overdispersion parameters are estimated from two input replicates.
$ Bash example
# Install R and DESeq2 if not already available
# conda create -n deseq2_env r-base r-essentials bioconductor-deseq2
# conda activate deseq2_env

# Example R script to estimate overdispersion parameters using DESeq2
# This script assumes a count matrix and a sample information file as input.
cat << 'EOF' > estimate_overdispersion.R
library(DESeq2)

# --- User-defined parameters ---
count_matrix_file <- "counts.tsv"         # Combined count matrix from replicates
sample_info_file <- "sample_info.csv"     # Sample metadata
output_dispersion_file <- "overdispersion_estimates.tsv"
# -------------------------------

# Load count data
count_data <- read.delim(count_matrix_file, row.names = 1, check.names = FALSE)

# Ensure counts are integers
count_data <- round(count_data)
count_data[is.na(count_data)] <- 0
count_data <- as.matrix(count_data)

# Load sample information
sample_info <- read.csv(sample_info_file, row.names = 1)

# Ensure sample names match and are in the same order
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# Create DESeqDataSet object
# For simple overdispersion estimation, a minimal design is sufficient.
# If comparing two replicates, the design might be ~1 or ~replicate_group
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ 1)  # A simple design to estimate dispersions

# Run DESeq2 pipeline (this includes dispersion estimation)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)  # Explicitly estimate dispersions

# Extract dispersion estimates
dispersions <- mcols(dds)$dispersion
names(dispersions) <- rownames(dds)

# Save dispersions
write.table(as.data.frame(dispersions), file = output_dispersion_file,
            sep = "\t", quote = FALSE, col.names = NA)
message(paste("Overdispersion parameters estimated and saved to:", output_dispersion_file))
EOF

# Create dummy count data (replace with actual data)
# Note: a toy matrix this small may be too small for DESeq2's dispersion trend fit;
# it only illustrates the expected file layout.
echo -e "gene\trep1\trep2" > counts.tsv
echo -e "geneA\t100\t120" >> counts.tsv
echo -e "geneB\t50\t65" >> counts.tsv
echo -e "geneC\t200\t210" >> counts.tsv
echo -e "geneD\t10\t15" >> counts.tsv

# Create dummy sample info (replace with actual data)
echo -e "sample,condition" > sample_info.csv
echo -e "rep1,control" >> sample_info.csv
echo -e "rep2,control" >> sample_info.csv

# Execute the R script
Rscript estimate_overdispersion.R
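As a complement to the DESeq2 example, the following Python sketch shows one way a beta-binomial overdispersion value could be estimated from two input replicates. It uses a simple method-of-moments estimator chosen here for illustration; the source does not state which estimator Skipper uses, and the function name `estimate_overdispersion` and the per-window `(rep1_reads, rep2_reads)` input layout are assumptions. For each window with `n` total reads, the squared deviation of the replicate 1 count from its expectation, scaled by the binomial variance, has expected value `1 + (n - 1) * rho`, which can be solved for `rho` after summing over windows.

```python
from typing import Iterable, Tuple


def estimate_overdispersion(replicate_counts: Iterable[Tuple[int, int]]) -> float:
    """Method-of-moments estimate of the beta-binomial overdispersion rho.

    replicate_counts holds one (rep1_reads, rep2_reads) pair per genomic window.
    Windows with fewer than two reads carry no information about rho and are skipped.
    """
    windows = [(a, b) for a, b in replicate_counts if a + b >= 2]
    total_reads = sum(a + b for a, b in windows)
    if total_reads == 0:
        raise ValueError("need at least one window with two or more reads")

    # Pooled fraction of reads that come from replicate 1.
    p = sum(a for a, _ in windows) / total_reads
    if p in (0.0, 1.0):
        raise ValueError("both replicates must contribute reads")

    excess = 0.0  # summed excess of observed over binomial variance
    weight = 0    # summed (n - 1) terms
    for a, b in windows:
        n = a + b
        scaled_sq_dev = (a - n * p) ** 2 / (n * p * (1.0 - p))
        excess += scaled_sq_dev - 1.0
        weight += n - 1

    # Clamp to the valid range; negative values mean no detectable overdispersion.
    return min(max(excess / weight, 0.0), 1.0)


if __name__ == "__main__":
    toy_windows = [(8, 2), (3, 7), (5, 5), (1, 9)]  # made-up counts
    print(estimate_overdispersion(toy_windows))
```

The estimate ignores the uncertainty in the pooled fraction `p`, which is acceptable when many windows are available; a maximum-likelihood fit would be the more rigorous alternative.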
Tools Used
Raw Source Text
Processed using Skipper https://github.com/YeoLab/skipper
Skipper trims the read, extract UMI, align to the genome and deduplicate
To find enriched windows, it models the CLIP libraries crosslinking-induced truncation(CITs) with beta-binomial model with GC-bias of each window as a varaible
overdispersion parameters are estimated from two input replicates
Assembly: hg38
Supplementary files format and content: wig files represent crosslinking-induced truncation(CITs) for plus and minus strands
Supplementary files format and content: tsv file represents reproducible genomic enriched window