GSE232599 Processing Pipeline


Publication

Large-scale evaluation of the ability of RNA-binding proteins to activate exon inclusion.

Nature biotechnology (2024) — PMID 38168984

Dataset

Systematic identification of RNA-binding proteins and tethered domains that activate exon splicing inclusion

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Data was processed using Skipper, available at: http://github.com/yeolab/skipper

Skipper (version not specified)

$ Bash example

# Install Snakemake (if not already installed)
# It is recommended to use a dedicated conda environment for Snakemake.
# conda create -n snakemake -c conda-forge -c bioconda snakemake
# conda activate snakemake

# Clone the Skipper workflow repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create or modify a configuration file (e.g., config.yaml)
# This file defines input files, reference genome, and other parameters for the workflow.
# The layout below is a hypothetical illustration; the key names are assumptions, so
# check the Skipper repository README for the configuration format it actually expects.
#
# genome: "hg38" # Reference genome; the dataset description reports assembly GRCh38 (hg38).
# fastq_dir: "/path/to/your/fastq_files" # Directory containing raw FASTQ files
# output_dir: "skipper_results" # Directory for output files
# threads: 10 # Number of CPU threads to use
#
# samples:
#   sample_RBP_rep1:
#     RBP: "YourRBP"
#     replicate: "rep1"
#     fastq: ["sample_RBP_rep1_R1.fastq.gz", "sample_RBP_rep1_R2.fastq.gz"] # Adjust for single-end or paired-end
#     control: "input_control_sample_name" # Name of the corresponding input control sample
#   input_control_sample_name:
#     RBP: "Input"
#     replicate: "rep1"
#     fastq: ["input_control_sample_name_R1.fastq.gz", "input_control_sample_name_R2.fastq.gz"]
#   # Add more samples as needed following this structure.
#
# Ensure that the 'fastq' paths are relative to 'fastq_dir' or absolute paths.
# The workflow will automatically manage software dependencies using conda environments defined in the workflow.

# Execute the Skipper workflow
# Ensure you are in the 'skipper' directory where the Snakefile and config.yaml are located.
# The --use-conda flag tells Snakemake to manage environments via conda.
# The --cores flag specifies the number of CPU cores to use.
# The --configfile flag points to your configuration file.
snakemake --use-conda --cores 10 --configfile config.yaml
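
Before running the workflow or the standalone commands in the later steps, it can help to confirm that the required command-line tools are visible on PATH. The sketch below is a hypothetical preflight check and is not part of Skipper. `missing_tools` is a helper name introduced here, and the tool list is taken from the steps described on this page. When Snakemake manages environments through --use-conda, only snakemake itself may need to be on PATH.

```shell
# missing_tools (hypothetical helper): print each listed command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the processing steps on this page.
missing="$(missing_tools snakemake skewer fastp STAR)"
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing" >&2
fi
```

If the check reports nothing, all four commands were found.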


Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.

eCLIP v0.2.2

$ Bash example

# Install skewer if not already available
# conda install -c bioconda skewer

# Define input and output file names
INPUT_R1="reads_R1.fastq.gz"
INPUT_R2="reads_R2.fastq.gz"
OUTPUT_PREFIX="trimmed_reads"

# Define a 3' adapter sequence
# Assumption: this is the standard Illumina TruSeq adapter (Read 1). Replace it with
# the adapter and barcode sequences used in your eCLIP library preparation.
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Execute skewer for adapter and quality trimming
# -x: 3' adapter sequence
# -q: Trim the 3' end until a base of this quality or higher is reached
#     (use -Q instead to filter reads by mean quality)
# -l: Minimum read length allowed after trimming
# -m pe: Paired-end trimming mode (for single-end reads, use "-m tail" and one input file)
# -o: Output file prefix
skewer -x "${ADAPTER_3PRIME}" -q 20 -l 18 -m pe -o "${OUTPUT_PREFIX}" "${INPUT_R1}" "${INPUT_R2}"
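
To illustrate what 3' adapter trimming does to a single-end read, here is a minimal sketch using awk. It is not how skewer works internally: it only finds exact, full-length matches of the adapter, whereas skewer also handles mismatches and partial adapters. `trim_adapter` is a hypothetical helper introduced for this example, and the short adapter and read below are made-up test data.

```shell
# trim_adapter ADAPTER MIN_LEN (hypothetical helper)
# Reads 4-line FASTQ records from stdin, cuts each read at the first exact
# occurrence of ADAPTER, and drops reads shorter than MIN_LEN after trimming.
trim_adapter() {
    awk -v adapter="$1" -v min_len="$2" '
        NR % 4 == 1 { name = $0; next }            # read name
        NR % 4 == 2 { seq = $0; next }             # sequence
        NR % 4 == 3 { plus = $0; next }            # "+" separator
        {
            pos = index(seq, adapter)              # 0 when the adapter is absent
            keep = (pos > 0) ? pos - 1 : length(seq)
            if (keep >= min_len + 0) {
                print name
                print substr(seq, 1, keep)
                print plus
                print substr($0, 1, keep)          # quality string, same length
            }
        }
    '
}

# Made-up read: a 20-base insert followed by a 13-base adapter fragment.
printf '@r1\nACGTACGTACGTACGTACGTAGATCGGAAGAGC\n+\nIIIIIIIIIIIIIIIIIIIIJJJJJJJJJJJJJ\n' |
    trim_adapter AGATCGGAAGAGC 18
```

The example keeps the 20-base insert and its first 20 quality characters. A read whose insert is shorter than the minimum length would be discarded.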


Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp

fastp v0.23.4

$ Bash example

# Install fastp using conda
# conda install -c bioconda fastp

# Define input and output file names
# Replace with actual input file paths for your raw sequencing reads
INPUT_READ1="raw_sequencing_R1.fastq.gz"
INPUT_READ2="raw_sequencing_R2.fastq.gz" # If paired-end, provide R2 file

# Define output file names for reads with UMIs extracted
OUTPUT_READ1="umi_extracted_R1.fastq.gz"
OUTPUT_READ2="umi_extracted_R2.fastq.gz" # If paired-end, provide R2 output file

# Define report file names for fastp's quality control and UMI processing summary
REPORT_JSON="fastp_umi_report.json"
REPORT_HTML="fastp_umi_report.html"

# Execute fastp to extract UMIs from raw sequencing reads
# This command assumes UMIs are located at the beginning of Read 1
# and are 10 base pairs long. fastp removes the UMI from the read and appends it
# to the first part of the read name (e.g., @read_id:UMI_ATGC...).
#
# IMPORTANT: Adjust --umi_loc and --umi_len based on your specific library
# preparation protocol and UMI design. Valid locations are 'index1', 'index2',
# 'read1', 'read2', 'per_index', and 'per_read'.
#
# --umi is required to enable UMI processing.
# --umi_prefix is joined to the UMI with an underscore, so "UMI" yields UMI_ATGC...
fastp \
    --in1 "${INPUT_READ1}" \
    --in2 "${INPUT_READ2}" \
    --out1 "${OUTPUT_READ1}" \
    --out2 "${OUTPUT_READ2}" \
    --umi \
    --umi_loc read1 \
    --umi_len 10 \
    --umi_prefix "UMI" \
    --json "${REPORT_JSON}" \
    --html "${REPORT_HTML}" \
    --thread 8 # Adjust number of threads as needed for your system
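
As a rough illustration of what UMI extraction does to each record, the sketch below moves the first bases of a read into its name using awk. It is a simplified stand-in, not fastp's implementation: `extract_umi` is a hypothetical helper, it handles single-end input only, and the UMI length of 10 and the example read are assumptions.

```shell
# extract_umi UMI_LEN (hypothetical helper)
# Reads 4-line FASTQ records from stdin, removes the first UMI_LEN bases of each
# read, and appends them to the first field of the read name.
extract_umi() {
    awk -v n="$1" '
        NR % 4 == 1 { header = $0; next }          # read name line
        NR % 4 == 2 {                              # sequence line
            umi = substr($0, 1, n)
            split_at = index(header, " ")          # end of the first name field
            if (split_at > 0)
                header = substr(header, 1, split_at - 1) ":" umi substr(header, split_at)
            else
                header = header ":" umi
            print header
            print substr($0, n + 1)
            next
        }
        NR % 4 == 3 { print; next }                # "+" separator
        { print substr($0, n + 1) }                # quality string, trimmed to match
    '
}

# Made-up read: a 10-base UMI followed by an 8-base insert.
printf '@read1 1:N:0\nACGTACGTACTTTTGGGG\n+\nIIIIIIIIIIJJJJJJJJ\n' | extract_umi 10
```

For the example record, the read name becomes `@read1:ACGTACGTAC 1:N:0`, and the sequence and quality lines shrink to the remaining 8 positions.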


Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}

STAR v2.5.2b

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

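# Example values for the Snakemake placeholders in the STAR command line quoted above.
# All paths and names here are hypothetical.
STAR_GENOME_DIR="/path/to/STAR_genome_index_hg38" # {params.star_sjdb}, e.g., from ENCODE or custom build
INPUT_FASTQ="sample_R1.fastq.gz"                  # {input.fq}: gzip-compressed reads (read via zcat)
OUTPUT_PREFIX="sample_aligned."                   # {params.outprefix}: STAR appends names such as Aligned.out.bam
NUM_THREADS="8"                                   # {threads}
REPLICATE_LABEL="sample_rep1"                     # {wildcards.replicate_label}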

STAR --alignEndsType EndToEnd \
     --genomeDir "${STAR_GENOME_DIR}" \
     --genomeLoad NoSharedMemory \
     --outBAMcompression 10 \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --winAnchorMultimapNmax 100 \
     --outFilterMultimapNmax 100 \
     --outFilterMultimapScoreRange 1 \
     --outSAMmultNmax 1 \
     --outMultimapperOrder Random \
     --outFilterScoreMin 10 \
     --outFilterType BySJout \
     --limitOutSJcollapsed 5000000 \
     --outReadsUnmapped None \
     --outSAMattrRGline ID:"${REPLICATE_LABEL}" \
     --outSAMattributes All \
     --outSAMmode Full \
     --outSAMtype BAM Unsorted \
     --outSAMunmapped Within \
     --readFilesCommand zcat \
     --outStd Log \
     --readFilesIn "${INPUT_FASTQ}" \
     --runMode alignReads \
     --runThreadN "${NUM_THREADS}"


Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper

Skipper (latest version)

$ Bash example

# Clone the Skipper repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

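# Create and activate the conda environment
# Assumption: the file name environment.yaml and the environment name skipper_env
# are placeholders. Check the repository README for the actual setup instructions.
# conda env create -f environment.yaml
# conda activate skipper_env

# --- Placeholder for config.yaml ---
# Create a configuration file (config.yaml) with your specific inputs and parameters.
# This is an example; the key names are illustrative and may not match the
# configuration format that Skipper actually expects.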
#
# Example config.yaml content:
#
# samples:
#   sample1:
#     replicates:
#       rep1:
#         ip_bam: "path/to/sample1_rep1_ip.bam"
#         input_bam: "path/to/sample1_rep1_input.bam"
#       rep2:
#         ip_bam: "path/to/sample1_rep2_ip.bam"
#         input_bam: "path/to/sample1_rep2_input.bam"
#
# genome: "hg38" # Or mm10, etc.
# genome_fasta: "path/to/genome.fa"
# genome_chrom_sizes: "path/to/genome.chrom.sizes"
# blacklist_bed: "path/to/blacklist.bed"
# repeatmasker_bed: "path/to/repeatmasker.bed" # For repetitive elements annotation
#
# output_dir: "results"
#
# --- End of config.yaml placeholder ---

# Execute the Skipper Snakemake workflow
# Per the dataset description, Skipper's custom scripts call reproducible enriched
# windows and repetitive elements. The settings are read from config.yaml.
snakemake --snakefile Snakefile --configfile config.yaml --cores 8 --use-conda


Tools Used

Skipper eCLIP STAR

Raw Source Text

Data was processed using Skipper, available at: http://github.com/yeolab/skipper
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper
Assembly: GRCh38
Supplementary files format and content: single-replicate enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
Supplementary files format and content: reproducible enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
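
Because the enriched window files carry q values, a common first step is to keep only windows below a chosen threshold. The sketch below assumes a tab-delimited file with a header row and a column literally named `qvalue`. Both are assumptions, as is the helper name `filter_by_column`, so inspect the header of the actual supplementary files before use.

```shell
# filter_by_column COLUMN MAX (hypothetical helper)
# Reads a tab-delimited table with a header row from stdin and prints the header
# plus every row whose value in COLUMN is less than or equal to MAX.
filter_by_column() {
    awk -F '\t' -v col="$1" -v max="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($idx + 0) <= (max + 0)
    '
}

# Made-up table with three windows; the column names are illustrative.
printf 'name\tenrichment\tqvalue\nw1\t3.2\t0.001\nw2\t1.1\t0.2\nw3\t2.5\t0.04\n' |
    filter_by_column qvalue 0.05
```

With the made-up table, the header and rows w1 and w3 are printed, and w2 is dropped.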
