GSE224998 Processing Pipeline

RIP-Seq code_examples 6 steps

Publication

Skipper analysis of eCLIP datasets enables sensitive detection of constrained translation factor binding sites.

Cell Genomics (2023), PMID 37388912

Dataset

RPS19 binding with and without Diamond-Blackfan anemia variants

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

library strategy: eCLIP

eCLIP (version: latest, pre-2021)

$ Bash example

# Clone the Yeo Lab eCLIP CWL workflow repository
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Define illustrative input parameters for the eCLIP workflow in a YAML file (e.g., inputs.yaml).
# The field names below are placeholders and may not match the inputs the workflow actually declares.
# Check the CWL definitions and example input files in the cloned repository, and replace paths and values with your own.
cat << EOF > inputs.yaml
fastq_r1:
  class: File
  path: /path/to/your/sample_R1.fastq.gz # Placeholder: Replace with your R1 FASTQ file
fastq_r2:
  class: File
  path: /path/to/your/sample_R2.fastq.gz # Placeholder: Replace with your R2 FASTQ file (or omit if single-end)
genome_fasta:
  class: File
  path: /path/to/your/reference/hg38.fa # Placeholder: Replace with your reference genome FASTA (e.g., hg38)
genome_gtf:
  class: File
  path: /path/to/your/reference/gencode.v38.annotation.gtf # Placeholder: Replace with your genome annotation GTF (e.g., Gencode v38)
adapter_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Placeholder: Replace with your actual adapter sequence
output_prefix: "eclip_sample" # Placeholder: Desired prefix for output files
EOF

# Install cwltool if not already installed
# pip install cwltool

# Execute the eCLIP CWL workflow using cwltool
# 'eclip.cwl' stands in for the workflow file; substitute the actual CWL file name and path from the cloned repository
cwltool eclip.cwl inputs.yaml

Data was processed using Skipper, available at: http://github.com/yeolab/skipper

Skipper (version not specified)

$ Bash example

# Skipper is a Snakemake workflow for eCLIP data processing.
# To use Skipper, first clone the repository:
# git clone http://github.com/yeolab/skipper
# cd skipper

# Ensure Snakemake is installed:
# conda install -c bioconda -c conda-forge snakemake

# A configuration file is required to run the Snakemake workflow.
# It specifies input data, reference genomes, and other parameters.
# The 'config.yaml' layout below is illustrative only: its keys are assumptions and may not match
# what Skipper expects, so follow the repository README for the real format.
# ---
# samples:
#   sample_name_1:
#     R1: "path/to/sample_1_R1.fastq.gz"
#     R2: "path/to/sample_1_R2.fastq.gz" # Optional, if paired-end
#   sample_name_2:
#     R1: "path/to/sample_2_R1.fastq.gz"
# genome_assembly: "hg38" # This dataset was processed against GRCh38 (hg38)
# adapters: "path/to/adapters.fa" # Path to adapter sequences
# output_dir: "results"
# ---

# Execute the Skipper Snakemake workflow.
# Replace '8' with the desired number of CPU cores.
# Ensure 'config.yaml' is properly configured for your data.
snakemake --cores 8 --configfile config.yaml

Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.

eCLIP v0.2.2 (inferred with models/gemini-2.5-flash)

$ Bash example

# Install skewer (if not already installed)
# conda install -c bioconda skewer

# Define input and output files
INPUT_READS="input_reads.fastq.gz"
OUTPUT_PREFIX="output_trimmed"

# Define the 3' adapter sequence to trim.
# The value below is the Illumina TruSeq read 1 adapter, used here as a placeholder;
# replace it with the adapter(s) from your eCLIP library preparation.
# -x also accepts a FASTA file when several adapter or barcode sequences must be trimmed.
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
# Note: skewer's -y option sets the adapter for the second read of a pair; it does not trim a
# 5' barcode or UMI. In this pipeline, UMIs are extracted separately with fastp.

# Define trimming parameters (illustrative values, not confirmed settings for this dataset)
MIN_READ_LENGTH=18  # Minimum read length kept after trimming (skewer -l)
QUALITY_THRESHOLD=20 # Trim the 3' end until this base quality is reached (skewer -q)
THREADS=8 # Number of threads to use

# Execute skewer for single-end 3' adapter trimming
# -m selects the trimming mode (tail = 3' end); -z writes gzip-compressed output
skewer -x "${ADAPTER_SEQUENCE}" \
       -m tail \
       -l ${MIN_READ_LENGTH} \
       -q ${QUALITY_THRESHOLD} \
       -t ${THREADS} \
       -z \
       -o "${OUTPUT_PREFIX}" \
       "${INPUT_READS}"

# With -z, the single-end output is named output_trimmed-trimmed.fastq.gz.
# For paired-end input (-m pe with two FASTQ files), the outputs are
# output_trimmed-trimmed-pair1.fastq.gz and output_trimmed-trimmed-pair2.fastq.gz.
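
After trimming, it can help to check how many reads survived. The sketch below is not part of Skipper or the eCLIP workflow; the function names count_fastq_reads and trimming_retention are hypothetical helpers, and the code assumes standard four-line FASTQ records.

```shell
# Count reads in a FASTQ file (plain or gzip-compressed).
# Assumes four lines per record (no wrapped sequence lines).
count_fastq_reads() {
    local fq="$1"
    local lines
    case "${fq}" in
        *.gz) lines=$(gzip -cd "${fq}" | wc -l) ;;
        *)    lines=$(wc -l < "${fq}") ;;
    esac
    echo $(( lines / 4 ))
}

# Print the percentage of reads retained after trimming (one decimal place),
# or "NA" if the input file contains no reads.
trimming_retention() {
    local before after
    before=$(count_fastq_reads "$1")
    after=$(count_fastq_reads "$2")
    awk -v b="${before}" -v a="${after}" \
        'BEGIN { if (b == 0) print "NA"; else printf "%.1f\n", 100 * a / b }'
}

# Usage (file names follow the placeholders in the skewer example):
# trimming_retention input_reads.fastq.gz output_trimmed-trimmed.fastq.gz
```

A very low retention percentage usually points to a wrong adapter sequence or an overly strict minimum length rather than a problem with the library itself.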

Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp

fastp v0.23.2

$ Bash example

# Install fastp (e.g., using conda)
# conda install -c bioconda fastp

# Define input and output file paths
INPUT_R1="raw_reads_R1.fastq.gz"
INPUT_R2="raw_reads_R2.fastq.gz"
OUTPUT_R1="umi_extracted_R1.fastq.gz"
OUTPUT_R2="umi_extracted_R2.fastq.gz"
REPORT_JSON="fastp.json"
REPORT_HTML="fastp.html"

# Extract UMIs from raw sequencing reads using fastp
# This command assumes UMIs are located at the beginning of Read 1 (e.g., 10 bp long)
# and moves them to the read name. Adjust --umi_loc and --umi_len as per experimental design.
# fastp joins the prefix and the UMI with an underscore, so the prefix is given without a trailing separator.
fastp \
  -i "${INPUT_R1}" \
  -o "${OUTPUT_R1}" \
  -I "${INPUT_R2}" \
  -O "${OUTPUT_R2}" \
  --umi \
  --umi_loc read1 \
  --umi_len 10 \
  --umi_prefix UMI \
  --json "${REPORT_JSON}" \
  --html "${REPORT_HTML}" \
  -w 8
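
To confirm that UMI extraction behaved as expected, the read names in the output can be inspected. The following is a minimal sketch, assuming read names of the form "@<name>:UMI_<SEQ>" (the layout fastp documents when a prefix of UMI is used); umi_length_table is a hypothetical helper name, not a fastp feature.

```shell
# Tally the lengths of UMIs embedded in FASTQ read names.
# Assumption: the UMI is the last colon-separated field of the read name
# and starts with "<prefix>_" (default prefix: UMI).
# Prints "<umi_length><TAB><read_count>" lines plus a "missing" line
# counting reads whose name carries no UMI tag.
umi_length_table() {
    local fq="$1"
    local prefix="${2:-UMI}"
    local reader="cat"
    case "${fq}" in *.gz) reader="gzip -cd" ;; esac
    ${reader} "${fq}" | awk -v p="${prefix}_" '
        NR % 4 == 1 {
            n = split($1, parts, ":")    # $1 is the read name without the comment
            last = parts[n]
            if (index(last, p) == 1) {
                counts[length(last) - length(p)]++
            } else {
                missing++
            }
        }
        END {
            for (len in counts) printf "%d\t%d\n", len, counts[len]
            printf "missing\t%d\n", missing
        }
    ' | sort -n
}

# Usage (file name follows the placeholder in the fastp example):
# umi_length_table umi_extracted_R1.fastq.gz
```

With --umi_len 10, nearly all reads would be expected to fall in the length-10 row; a large "missing" count suggests the prefix or UMI location does not match the library design.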

Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}

STAR v2.7.x

$ Bash example

# Install STAR using conda
# conda install -c bioconda star

# Example variables (replace with actual paths and values)
STAR_GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Path to STAR genome index directory
OUTPUT_PREFIX="sample_aligned" # STAR appends output names directly to this prefix (e.g., sample_alignedAligned.out.bam)
# Read files are held in a Bash array so that each path reaches STAR as a separate argument.
# List both files for paired-end reads, or a single file for single-end reads.
READ_FILES=(sample_R1.fastq.gz sample_R2.fastq.gz)
THREADS=8
REPLICATE_LABEL="sample_rep1"

STAR \
    --alignEndsType EndToEnd \
    --genomeDir "${STAR_GENOME_DIR}" \
    --genomeLoad NoSharedMemory \
    --outBAMcompression 10 \
    --outFileNamePrefix "${OUTPUT_PREFIX}" \
    --winAnchorMultimapNmax 100 \
    --outFilterMultimapNmax 100 \
    --outFilterMultimapScoreRange 1 \
    --outSAMmultNmax 1 \
    --outMultimapperOrder Random \
    --outFilterScoreMin 10 \
    --outFilterType BySJout \
    --limitOutSJcollapsed 5000000 \
    --outReadsUnmapped None \
    --outSAMattrRGline ID:"${REPLICATE_LABEL}" \
    --outSAMattributes All \
    --outSAMmode Full \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --readFilesCommand zcat \
    --outStd Log \
    --readFilesIn "${READ_FILES[@]}" \
    --runMode alignReads \
    --runThreadN "${THREADS}"
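
STAR writes a summary file ending in Log.final.out next to the BAM, which is a quick way to check alignment rates. The sketch below is a hypothetical helper (star_log_value is not part of STAR or Skipper) that assumes the usual "label | value" layout of that file.

```shell
# Print the value recorded for one metric in a STAR Log.final.out file.
# Assumes each metric line has the form "<label> |<TAB><value>".
star_log_value() {
    local log="$1"
    local label="$2"
    awk -F'[|]' -v want="${label}" '
        {
            key = $1
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # strip padding around the label
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # strip padding around the value
            if (key == want) { print val; exit }
        }
    ' "${log}"
}

# Usage (path depends on the --outFileNamePrefix you chose):
# star_log_value sample_alignedLog.final.out "Uniquely mapped reads %"
```

Because the command above keeps reads that map to up to 100 loci, the multi-mapping metrics in the same file are worth checking alongside the unique-mapping rate.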

Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper

Skipper (unversioned)

$ Bash example

# Install Snakemake if not already installed
# conda install -c bioconda -c conda-forge snakemake

# Clone the Skipper workflow
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create a configuration file (config.yaml) for your eCLIP experiment.
# This file defines input samples, genome assembly, and other parameters.
# The keys below are illustrative assumptions and may not match the format Skipper expects;
# consult the repository README and adjust paths and values as needed.
cat << EOF > config.yaml
samples:
  sample1:
    ip: "path/to/sample1_ip.bam"
    input: "path/to/sample1_input.bam"
  sample2:
    ip: "path/to/sample2_ip.bam"
    input: "path/to/sample2_input.bam"
genome: "hg38" # This dataset used GRCh38 (hg38); change for other assemblies or organisms
annotation: "path/to/gencode.vXX.annotation.gtf" # Placeholder: e.g., GENCODE v38 for hg38
# Other parameters for peak calling, IDR, repeat element analysis, etc.
# Refer to the Skipper documentation for a complete list of configurable options.
EOF

# Run the Skipper workflow.
# This command will execute the Snakemake pipeline, which includes
# custom scripts for identifying reproducible enriched windows (peaks)
# and analyzing repetitive elements based on the provided configuration.
# The specific outputs related to "reproducible enriched windows" and
# "repetitive elements" will be generated in the designated output directory
# as defined by the workflow and configuration.
snakemake --snakefile Snakefile --configfile config.yaml --cores 8 # Adjust cores as needed

Tools Used

eCLIP Skipper STAR

Raw Source Text

library strategy: eCLIP
Data was processed using Skipper, available at: http://github.com/yeolab/skipper
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper
Assembly: GRCh38
Supplementary files format and content: reproducible enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
Supplementary files format and content: reproducible enriched re files contain p values, q values, enrichment, and annotations for significantly bound repetitive elements
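
The supplementary tables described above can be filtered further on their q values. This is a minimal sketch, assuming a tab-separated file with a header row; the column name qvalue and the function name filter_windows_by_q are assumptions, so check the actual header of the downloaded file first.

```shell
# Keep the header plus rows whose q value is at or below a cutoff.
# Arguments: <tsv file> [q-value column name, default "qvalue"] [cutoff, default 0.05]
# The column is located by name in the header row; returns non-zero if it is absent.
filter_windows_by_q() {
    local tsv="$1"
    local qcol="${2:-qvalue}"
    local cutoff="${3:-0.05}"
    awk -F'\t' -v col="${qcol}" -v max="${cutoff}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($idx + 0) <= (max + 0)    # numeric comparison, also handles values like 1e-5
    ' "${tsv}"
}

# Usage (hypothetical file name; decompress .gz files first):
# filter_windows_by_q reproducible_enriched_windows.tsv qvalue 0.01
```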
