GSE220185 Processing Pipeline

RIP-Seq, 4 steps

Publication

Proteomic discovery of chemical probes that perturb protein complexes in human cells.

Molecular Cell (2023), PMID 37084731

Dataset

GSE220185

Proteomic discovery of chemical probes that perturb protein complexes in human cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Processed using Skipper https://github.com/YeoLab/skipper

    Skipper (version not specified)
    $ Bash example
    # Installation of Snakemake and cloning Skipper workflow
    # It is recommended to use a dedicated conda environment for Snakemake and its dependencies.
    # conda create -n skipper_env -c conda-forge -c bioconda snakemake mamba -y
    # conda activate skipper_env
    
    # Clone the Skipper workflow repository
    # git clone https://github.com/YeoLab/skipper.git
    # cd skipper
    
    # The Skipper workflow uses conda environments defined within its rules.
    # These environments will be created automatically by Snakemake if --use-conda is specified.
    
    # Execution of Skipper workflow
    # This is a generic command. Actual parameters (input files, genome, etc.)
    # would be specified in a configuration file (e.g., config/config.yaml).
    #
    # Reference dataset: the GEO record lists hg38 as the assembly.
    # This would be specified in the config.yaml (e.g., genome: hg38).
    #
    # Replace 'N' with the desired number of CPU cores.
    # Replace 'path/to/your/config.yaml' with the path to your specific configuration file.
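    # Replace 'path/to/your/output_directory' with the working directory Snakemake should run in;
    # relative paths in the workflow, including outputs, are resolved against it.
    # A minimal config.yaml might look like this (key names are illustrative, not confirmed against Skipper):
    # samples:
    #   sample1:
    #     R1: "path/to/sample1_R1.fastq.gz"
    #     R2: "path/to/sample1_R2.fastq.gz" # if paired-end
    # genome: "hg38" # assembly listed in the GEO record
    # annotation: "path/to/your/gtf_or_gff"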
    #
    # For a full run, you would typically copy and modify the example config file from the Skipper repository.
    
    snakemake --snakefile Snakefile --cores N --use-conda --configfile path/to/your/config.yaml --directory path/to/your/output_directory
  2.

    Skipper trims the reads, extracts UMIs, aligns to the genome, and deduplicates.

    Skipper (Snakemake workflow; dependencies: Snakemake 6.0.5, cutadapt 3.4, STAR 2.7.9a, UMI-tools 1.1.2, Picard 2.25.7)
    $ Bash example
    # Install Git and Conda if not already present
    # sudo apt-get update && sudo apt-get install -y git
    # wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    # bash miniconda.sh -b -p $HOME/miniconda
    # export PATH="$HOME/miniconda/bin:$PATH"
    # conda init bash
    # source ~/.bashrc
    
    # Clone the Skipper pipeline repository
    # git clone https://github.com/YeoLab/skipper.git
    # cd skipper
    
    # Create and activate the conda environment for the pipeline
    # (assumes the repository ships an environment file; use the environment name declared in it)
    # conda env create -f environment.yaml
    # conda activate skipper_env
    
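    # --- Configuration for Skipper ---
    # Create a config.yaml file with your specific parameters.
    # The keys below are illustrative placeholders; check the Skipper README for the actual configuration format.
    # Replace placeholder paths and values with your actual data.
    cat << 'EOF' > config.yaml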
    GENOME_DIR: "/path/to/STAR_index/hg38" # Path to STAR genome index directory (e.g., built from hg38)
    GENOME_FASTA: "/path/to/genome/hg38.fa" # Path to the genome FASTA file (e.g., hg38.fa)
    GTF: "/path/to/annotations/gencode.v38.annotation.gtf" # Path to the gene annotation GTF file (e.g., gencode.v38.annotation.gtf)
    
    ADAPTER_FWD: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Forward adapter sequence for trimming
    ADAPTER_REV: "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Reverse adapter sequence for trimming
    
    UMI_PATTERN: "NNNNNNNN" # UMI pattern (e.g., 8 N's for an 8-bp UMI)
    # BARCODE_PATTERN: "NNNNNN" # Optional: Barcode pattern if present, otherwise comment out or leave empty
    
    DEDUP_METHOD: "directional" # Deduplication method; UMI-tools supports "unique", "percentile", "cluster", "adjacency", and "directional"
    
    # Define your samples and input directory
    SAMPLES: ["sample1", "sample2"] # List of sample names (e.g., corresponding to sample1_R1.fastq.gz, sample2_R1.fastq.gz)
    INPUT_DIR: "/path/to/raw_fastq" # Directory containing your raw FASTQ files
    OUTPUT_DIR: "results" # Directory for output files
    EOF
    
    # --- Run the Skipper pipeline ---
    # This command executes the Snakemake workflow using the specified configuration.
    # Adjust --cores based on available CPU resources.
    snakemake --use-conda --cores 8 --configfile config.yaml all
  3.

    To find enriched windows, Skipper models crosslinking-induced truncations (CITs) in the CLIP libraries with a beta-binomial model that uses the GC bias of each window as a variable.

    PureCLIP (Inferred with models/gemini-2.5-flash) v1.3.1
    $ Bash example
    # Install PureCLIP and samtools via Bioconda
    # conda install -c conda-forge -c bioconda pureclip samtools
    
    # Note: PureCLIP is an inferred stand-in. It calls crosslink sites with a hidden Markov model,
    # not the beta-binomial window model described above.
    # PureCLIP takes one BAM plus its index, so replicates are merged first.
    # Replace the input BAMs with your aligned CLIP-seq files and hg38.fa with your reference FASTA.
    # Adjust -nt for the number of threads to use.
    
    samtools merge -f merged_clip.bam input_clip_replicate1.bam input_clip_replicate2.bam
    samtools index merged_clip.bam
    pureclip -i merged_clip.bam -bai merged_clip.bam.bai -g hg38.fa -o output_crosslink_sites.bed -nt 8
  4.

    Overdispersion parameters are estimated from two input replicates.

    DESeq2 (Inferred with models/gemini-2.5-flash) v1.42.0 GitHub
    $ Bash example
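    # Install R and DESeq2 if not already available
    # conda create -n deseq2_env -c conda-forge -c bioconda r-base r-essentials bioconductor-deseq2
    # conda activate deseq2_env
    
    # Example R script to estimate overdispersion parameters using DESeq2
    # Note: DESeq2 is an inferred stand-in. It fits negative-binomial dispersions, whereas the
    # source describes a beta-binomial model.
    # This script assumes a count matrix and a sample information file as input.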
    
    cat << 'EOF' > estimate_overdispersion.R
    library(DESeq2)
    
    # --- User-defined parameters ---
    count_matrix_file <- "counts.tsv" # Combined count matrix from replicates
    sample_info_file <- "sample_info.csv" # Sample metadata
    
    output_dispersion_file <- "overdispersion_estimates.tsv"
    # -------------------------------
    
    # Load count data
    count_data <- read.delim(count_matrix_file, row.names = 1, check.names = FALSE)
    # Ensure counts are integers
    count_data <- round(count_data)
    count_data[is.na(count_data)] <- 0
    count_data <- as.matrix(count_data)
    
    # Load sample information
    sample_info <- read.csv(sample_info_file, row.names = 1)
    
    # Ensure sample names match and are in the same order
    sample_info <- sample_info[colnames(count_data), , drop = FALSE]
    
    # Create DESeqDataSet object
    # For simple overdispersion estimation, a minimal design is sufficient.
    # If comparing two replicates, the design might be ~1 or ~replicate_group
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = ~ 1) # A simple design to estimate dispersions
    
    # Estimate size factors, then dispersions
    dds <- estimateSizeFactors(dds)
    dds <- tryCatch(
      estimateDispersions(dds),
      error = function(e) {
        # The dispersion-trend fit needs many rows; on tiny inputs such as the
        # dummy matrix in this example, fall back to gene-wise estimates only.
        d <- estimateDispersionsGeneEst(dds)
        dispersions(d) <- mcols(d)$dispGeneEst
        d
      }
    )
    
    # Extract dispersion estimates as a one-column table keyed by row name
    disp_est <- data.frame(dispersion = dispersions(dds), row.names = rownames(dds))
    
    # Save dispersions
    write.table(disp_est, file = output_dispersion_file,
                sep = "\t", quote = FALSE, col.names = NA)
    
    message("Overdispersion parameters estimated and saved to: ", output_dispersion_file)
    EOF
    
    # Create dummy count data (replace with actual data)
    echo -e "gene\trep1\trep2" > counts.tsv
    echo -e "geneA\t100\t120" >> counts.tsv
    echo -e "geneB\t50\t65" >> counts.tsv
    echo -e "geneC\t200\t210" >> counts.tsv
    echo -e "geneD\t10\t15" >> counts.tsv
    
    # Create dummy sample info (replace with actual data)
    echo -e "sample,condition" > sample_info.csv
    echo -e "rep1,control" >> sample_info.csv
    echo -e "rep2,control" >> sample_info.csv
    
    # Execute the R script
    Rscript estimate_overdispersion.R
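
Step 2 mentions UMI-based deduplication. As a toy illustration of the idea only, not Skipper's implementation, the sketch below collapses reads that share chromosome, position, strand, and UMI. The input rows are hypothetical, and real tools such as UMI-tools can additionally merge UMIs that differ by sequencing errors.

```shell
# Toy input (hypothetical reads): chrom, 5' position, strand, UMI, tab-separated
reads=$(printf 'chr1\t100\t+\tACGT\nchr1\t100\t+\tACGT\nchr1\t100\t+\tTTGA\nchr1\t250\t-\tACGT\n')

# Keep only the first read seen for each (chrom, position, strand, UMI) key
dedup=$(printf '%s\n' "$reads" | awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++')

# Report the surviving reads and how many distinct molecules they represent
printf '%s\n' "$dedup"
dedup_count=$(printf '%s\n' "$dedup" | wc -l | tr -d ' ')
echo "unique molecules: $dedup_count"
```

The second read is dropped as a PCR duplicate of the first, while the read with a different UMI at the same position is kept as a separate molecule.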

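Steps 3 and 4 refer to a beta-binomial model with overdispersion. For reference, the standard beta-binomial distribution and its mean/overdispersion form are written out below. Reading k as the IP read count in a window and n as the total IP plus input reads in that window is an assumption made for illustration; the source does not spell out Skipper's exact parameterization or how GC bias enters (for example, as a covariate of the mean).

```latex
% Beta-binomial probability mass function
P(K = k \mid n, \alpha, \beta)
  = \binom{n}{k} \frac{B(k + \alpha,\; n - k + \beta)}{B(\alpha, \beta)},
  \qquad k = 0, 1, \dots, n

% Mean and overdispersion reparameterization
\mu = \frac{\alpha}{\alpha + \beta}, \qquad \rho = \frac{1}{\alpha + \beta + 1}

% Moments
\mathbb{E}[K] = n\mu, \qquad
\operatorname{Var}(K) = n\mu(1 - \mu)\bigl[1 + (n - 1)\rho\bigr]
```

As ρ approaches 0 the variance reduces to the binomial case; a positive ρ inflates it, which is the kind of quantity step 4 describes estimating from the two input replicates.
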
Raw Source Text
Processed using Skipper https://github.com/YeoLab/skipper
Skipper trims the read, extract UMI, align to the genome and deduplicate
To find enriched windows, it models the CLIP libraries crosslinking-induced truncation(CITs)  with beta-binomial model with GC-bias of each window as a varaible
overdispersion parameters are estimated from two input replicates
Assembly: hg38
Supplementary files format and content: wig files represent crosslinking-induced truncation(CITs) for plus and minus strands
Supplementary files format and content: tsv file represents reproducible genomic enriched window
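
The supplementary wig files hold per-position CIT signal for each strand. A minimal sketch for totaling that signal per chromosome follows, assuming variableStep wig formatting (the deposited files may instead use fixedStep, which would need different parsing). The toy track is hypothetical.

```shell
# Toy variableStep wig (hypothetical values): each data line is "position value"
wig=$(printf 'variableStep chrom=chr1\n100 3\n101 1\nvariableStep chrom=chr2\n50 2\n')

# Remember the chromosome from each declaration line, skip header lines,
# and sum the value column per chromosome
totals=$(printf '%s\n' "$wig" | awk '
  $1 == "variableStep" {
    for (i = 2; i <= NF; i++) if ($i ~ /^chrom=/) chrom = substr($i, 7)
    next
  }
  /^(track|browser|#)/ { next }
  { sum[chrom] += $2 }
  END { for (c in sum) print c "\t" sum[c] }
' | sort)

printf '%s\n' "$totals"
```

Running the same command on the plus-strand and minus-strand files separately gives a quick per-strand sanity check of the CIT tracks.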