GSE232597 Processing Pipeline

RIP-Seq code_examples 5 steps

Publication

Large-scale evaluation of the ability of RNA-binding proteins to activate exon inclusion.

Nature biotechnology (2024) — PMID 38168984

Dataset

Systematic identification of RNA-binding proteins and tethered domains that activate exon splicing inclusion [eCLIP-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Data was processed using Skipper, available at: http://github.com/yeolab/skipper

Skipper (version not specified)

$ Bash example

# Clone the Skipper workflow repository
git clone https://github.com/yeolab/skipper.git
cd skipper

# --- Configuration ---
# Skipper is a Snakemake workflow configured through a file that specifies
# input data, genome assembly, and other parameters.
# The config.yaml below is a hypothetical placeholder: its file name and keys are illustrative,
# not Skipper's documented schema. Check the repository README for the actual configuration format.
#
# # config.yaml
# # Define samples and their input FASTQ files
# samples:
#   sample1:
#     R1: "path/to/sample1_R1.fastq.gz"
#     R2: "path/to/sample1_R2.fastq.gz" # Optional for paired-end
#   sample2:
#     R1: "path/to/sample2_R1.fastq.gz"
#     R2: "path/to/sample2_R2.fastq.gz"
#
# # Specify the genome assembly and paths to its index/annotation files
# genome: "hg38" # Placeholder: Replace with actual genome assembly (e.g., hg38, mm10)
# genome_dir: "/path/to/genome/index/files" # Placeholder: Path to STAR index, etc.
# annotation_gtf: "/path/to/annotation.gtf" # Placeholder: Path to GTF file
#
# # Other parameters like adapter sequences, trim settings, etc.
# # Refer to the Skipper documentation for a complete list of configurable options.
#
# --- Environment Setup (example using conda) ---
# # It is recommended to create a dedicated conda environment for Snakemake and its dependencies.
# # conda create -n skipper_env snakemake mamba -c conda-forge -c bioconda
# # conda activate skipper_env

# --- Workflow Execution ---
# Run the Snakemake workflow.
# The number of cores should be adjusted based on available resources.
# The --use-conda flag ensures that dependencies are managed via conda environments defined in the workflow.
snakemake --use-conda --cores 8


Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.

eCLIP v0.2.2 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install skewer if not already installed
# conda install -c bioconda skewer

# Define input and output file names
INPUT_FASTQ="input.fastq.gz" # Placeholder for input FASTQ file
OUTPUT_PREFIX="trimmed_reads" # Prefix for output trimmed FASTQ files

# Define common Illumina 3' adapter sequence
# This is a widely used Illumina 3' adapter sequence. 
# For specific experiments, this sequence might vary.
ADAPTER_3_PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# 5' barcode handling.
# In eCLIP, the 5' end of a read often carries an experiment-specific barcode or a random N-mer (UMI).
# skewer's -y option is the adapter for the second read of a pair, not a 5' barcode option,
# so it is not used in this single-end example. Random N-mers are typically handled by a
# UMI-aware tool (see the fastp step) rather than by adapter trimming.

# Minimum read length after trimming, a common setting for eCLIP data
MIN_LENGTH=18

# Execute skewer for 3' adapter trimming
# -x: Specifies the 3' adapter sequence to trim
# -l: Sets the minimum read length after trimming. Reads shorter than this are discarded.
#     (-m selects the trimming mode and does not take a length.)
# -z: Compresses the output in gzip format.
# -o: Specifies the base name for the output files.
#     For single-end input with -z, skewer writes "${OUTPUT_PREFIX}-trimmed.fastq.gz".
skewer -x "${ADAPTER_3_PRIME}" -l "${MIN_LENGTH}" -z -o "${OUTPUT_PREFIX}" "${INPUT_FASTQ}"
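
A quick sanity check after trimming is to confirm that no surviving read is shorter than the minimum length. The sketch below is an illustration only: it builds a small toy FASTQ (the file and its reads are hypothetical, not from this dataset) and counts reads below the threshold with awk. For a real gzipped output, the file would first be decompressed, for example with `gzip -dc`.

```shell
#!/usr/bin/env bash
set -euo pipefail

MIN_LENGTH=18            # same threshold as the trimming step
TOY_FASTQ="$(mktemp)"    # hypothetical stand-in for a trimmed FASTQ file

# Write a three-read toy FASTQ (uncompressed for simplicity)
cat > "${TOY_FASTQ}" << 'EOF'
@read1
ACGTACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTAC
+
IIIIIIIIII
@read3
ACGTACGTACGTACGTAC
+
IIIIIIIIIIIIIIIIII
EOF

# In a 4-line FASTQ record, the sequence is the 2nd line of each record
TOTAL_READS=$(awk 'NR % 4 == 2' "${TOY_FASTQ}" | wc -l | tr -d ' ')
SHORT_READS=$(awk -v min="${MIN_LENGTH}" 'NR % 4 == 2 && length($0) < min' "${TOY_FASTQ}" | wc -l | tr -d ' ')

echo "total=${TOTAL_READS} short=${SHORT_READS}"
rm -f "${TOY_FASTQ}"
```

A non-zero short-read count on real trimmed output would suggest that the length filter was not applied as intended.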

Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp

fastp v0.23.4 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install fastp (if not already installed)
# conda install -c bioconda fastp

# Define input and output file names
INPUT_READ1="raw_read1.fastq.gz"
INPUT_READ2="raw_read2.fastq.gz"
OUTPUT_READ1="umi_extracted_read1.fastq.gz"
OUTPUT_READ2="umi_extracted_read2.fastq.gz"

# Note: The UMI location (--umi_loc) and length (--umi_len) are assumptions;
# the source description does not state them.
# A UMI at the beginning of read1 is a common layout, and 12 is a placeholder length.
# Adjust both parameters based on the specific library preparation protocol.
# UMI processing is enabled with --umi; --umi_loc and --umi_len configure it.
# fastp also performs adapter trimming and quality filtering by default,
# so this command does more than UMI extraction unless those steps are disabled.

fastp \
  --in1 "${INPUT_READ1}" \
  --in2 "${INPUT_READ2}" \
  --out1 "${OUTPUT_READ1}" \
  --out2 "${OUTPUT_READ2}" \
  --umi \
  --umi_loc read1 \
  --umi_len 12
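
To make the effect of UMI extraction concrete, the following sketch mimics the general idea on a single toy read using awk: the first few bases are removed from the sequence and quality lines and attached to the read name. It is a simplified illustration, not fastp itself; the toy read, the 4-base UMI length, and the colon separator in the read name are assumptions, and fastp's exact output format should be checked against its documentation.

```shell
#!/usr/bin/env bash
set -euo pipefail

UMI_LEN=4                # toy value chosen for readability
TOY_FASTQ="$(mktemp)"    # hypothetical stand-in for a raw FASTQ file

cat > "${TOY_FASTQ}" << 'EOF'
@read1 1:N:0
AACCGGGTTTACGT
+
IIIIFFFFFFFFFF
EOF

# For each 4-line record: cut the first UMI_LEN bases off the sequence and
# quality lines, and append the UMI to the first field of the read name.
EXTRACTED=$(awk -v n="${UMI_LEN}" '
  NR % 4 == 1 { header = $0 }
  NR % 4 == 2 { umi = substr($0, 1, n); seq = substr($0, n + 1) }
  NR % 4 == 0 {
    split(header, parts, " ")
    rest = substr(header, length(parts[1]) + 1)
    print parts[1] ":" umi rest
    print seq
    print "+"
    print substr($0, n + 1)
  }
' "${TOY_FASTQ}")

echo "${EXTRACTED}"
rm -f "${TOY_FASTQ}"
```

Carrying the UMI in the read name keeps it available after alignment, which is what allows PCR duplicates to be collapsed later.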


Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}

STAR v2.7.10a (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for the pipeline (replace with actual paths and values)
STAR_INDEX_DIR="/path/to/STAR_index/GRCh38" # Placeholder: Path to the STAR genome index directory (e.g., for human GRCh38)
INPUT_FASTQ="reads.fastq.gz"             # Placeholder: Path to the input gzipped FASTQ file
OUTPUT_PREFIX="aligned_reads"            # Placeholder: Prefix for output files; STAR appends names such as Aligned.out.bam directly to it
REPLICATE_LABEL="sample1_rep1"           # Placeholder: Unique ID for the read group (e.g., sample_replicate_id)
NUM_THREADS="8"                          # Placeholder: Number of threads to use

# Execute STAR alignment command
STAR \
    --alignEndsType EndToEnd \
    --genomeDir "${STAR_INDEX_DIR}" \
    --genomeLoad NoSharedMemory \
    --outBAMcompression 10 \
    --outFileNamePrefix "${OUTPUT_PREFIX}" \
    --winAnchorMultimapNmax 100 \
    --outFilterMultimapNmax 100 \
    --outFilterMultimapScoreRange 1 \
    --outSAMmultNmax 1 \
    --outMultimapperOrder Random \
    --outFilterScoreMin 10 \
    --outFilterType BySJout \
    --limitOutSJcollapsed 5000000 \
    --outReadsUnmapped None \
    --outSAMattrRGline ID:"${REPLICATE_LABEL}" \
    --outSAMattributes All \
    --outSAMmode Full \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --readFilesCommand zcat \
    --outStd Log \
    --readFilesIn "${INPUT_FASTQ}" \
    --runMode alignReads \
    --runThreadN "${NUM_THREADS}"
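
The STAR options in the source text contain Snakemake placeholders such as {params.star_sjdb} and {wildcards.replicate_label}. The sketch below shows one way to map those placeholders to shell variables and assemble the arguments in a bash array, printing the command as a dry run instead of executing it. All values are hypothetical, and only the placeholder-bearing options are included; the fixed options from the source text would be added to the same array.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values standing in for the Snakemake placeholders
STAR_SJDB="/path/to/STAR_index/GRCh38"   # {params.star_sjdb}
OUTPREFIX="sample1_rep1."                # {params.outprefix}
REPLICATE_LABEL="sample1_rep1"           # {wildcards.replicate_label}
FQ="reads.fastq.gz"                      # {input.fq}
THREADS=8                                # {threads}

# An array keeps each argument intact and makes the command easy to log
STAR_ARGS=(
  --runMode alignReads
  --genomeDir "${STAR_SJDB}"
  --outFileNamePrefix "${OUTPREFIX}"
  --outSAMattrRGline "ID:${REPLICATE_LABEL}"
  --readFilesCommand zcat
  --readFilesIn "${FQ}"
  --runThreadN "${THREADS}"
)

# Dry run: print the command instead of executing it
printf 'STAR'
printf ' %q' "${STAR_ARGS[@]}"
printf '\n'
```

To run the alignment for real, the three printf lines would be replaced with `STAR "${STAR_ARGS[@]}"`.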


Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper

Skipper (version not specified)

$ Bash example

# Install Snakemake if not already installed
# conda create -n skipper_env snakemake -c bioconda -c conda-forge
# conda activate skipper_env

# Clone the Skipper workflow
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create a configuration file (config.yaml) for your experiment.
# NOTE: The keys below are hypothetical placeholders, not Skipper's documented schema;
# consult the repository README for the actual configuration and sample manifest format.
# The source text states only that Skipper's custom scripts call reproducible enriched
# windows and repetitive elements; the peak-file inputs shown here are illustrative assumptions.
# Replace the placeholder paths and 'your_sample_name' with actual values.
cat << 'EOF' > config.yaml
genome: hg38 # Placeholder: the source text lists the assembly as GRCh38
samples:
  - name: your_sample_name
    replicate1: path/to/your/peak_files/sample_rep1.bed # Placeholder: replicate 1
    replicate2: path/to/your/peak_files/sample_rep2.bed # Placeholder: replicate 2
    control: path/to/your/control_peak_files/control.bed # Placeholder: input control
# Add other necessary configurations as per the Skipper documentation,
# e.g., paths to genome indices and annotation files.
EOF

# Execute the Snakemake workflow. Which steps run (enriched-window calling,
# repetitive-element analysis) depends on the workflow's default target.
# Adjust --cores based on available resources.
snakemake --cores 8 --configfile config.yaml --use-conda


Tools Used

Skipper, eCLIP, skewer, fastp, STAR

Raw Source Text

Data was processed using Skipper, available at: http://github.com/yeolab/skipper
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper
Assembly: GRCh38
Supplementary files format and content: single-replicate enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
Supplementary files format and content: reproducible enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
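
Because the enriched window files carry q values and enrichment per window, they can be filtered further with standard command-line tools. The sketch below applies a stricter q value cutoff to a toy tab-separated table. The column names, column order, and values are hypothetical and do not reflect the actual supplementary file layout, which should be checked against the file header before adapting this.

```shell
#!/usr/bin/env bash
set -euo pipefail

Q_CUTOFF=0.001             # hypothetical stricter threshold
TOY_WINDOWS="$(mktemp)"    # hypothetical stand-in for an enriched window file

# Toy table: name, enrichment, p value, q value (assumed column order)
printf 'name\tenrichment\tpvalue\tqvalue\n' >  "${TOY_WINDOWS}"
printf 'win1\t3.2\t0.0001\t0.004\n'         >> "${TOY_WINDOWS}"
printf 'win2\t1.8\t0.002\t0.03\n'           >> "${TOY_WINDOWS}"
printf 'win3\t5.0\t0.00001\t0.0005\n'       >> "${TOY_WINDOWS}"

# Skip the header row and keep windows whose q value (column 4) is below the cutoff
KEPT_NAMES=$(awk -F'\t' -v q="${Q_CUTOFF}" 'NR > 1 && $4 + 0 < q + 0 { print $1 }' "${TOY_WINDOWS}")
KEPT_COUNT=$(printf '%s\n' "${KEPT_NAMES}" | grep -c . || true)

echo "kept=${KEPT_COUNT}: ${KEPT_NAMES}"
rm -f "${TOY_WINDOWS}"
```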
