GSE232597 Processing Pipeline
Publication
Large-scale evaluation of the ability of RNA-binding proteins to activate exon inclusion. Nature Biotechnology (2024), PMID 38168984
Dataset
GSE232597: Systematic identification of RNA-binding proteins and tethered domains that activate exon splicing inclusion [eCLIP-seq]
Processing Steps
1
Data was processed using Skipper, available at: http://github.com/yeolab/skipper
$ Bash example
# Clone the Skipper workflow repository
git clone https://github.com/yeolab/skipper.git
cd skipper

# --- Configuration ---
# Skipper is a Snakemake workflow. The configuration sketched below is a
# hypothetical placeholder, not Skipper's documented format: take the real
# configuration file name, keys, and sample manifest layout from the
# repository README.
#
# # config.yaml (hypothetical layout)
# samples:
#   sample1:
#     R1: "path/to/sample1_R1.fastq.gz"
#     R2: "path/to/sample1_R2.fastq.gz"   # Only for paired-end data
#   sample2:
#     R1: "path/to/sample2_R1.fastq.gz"
#     R2: "path/to/sample2_R2.fastq.gz"
#
# genome: "GRCh38"                           # Assembly stated in the source
# genome_dir: "/path/to/genome/index/files"  # Placeholder: STAR index directory
# annotation_gtf: "/path/to/annotation.gtf"  # Placeholder: GTF annotation

# --- Environment setup (example using conda) ---
# conda create -n skipper_env snakemake mamba -c conda-forge -c bioconda
# conda activate skipper_env

# --- Workflow execution ---
# Adjust --cores to the available resources.
# --use-conda lets Snakemake build the conda environments the workflow defines.
# This call assumes a Snakefile in the working directory; if the repository
# names its workflow file differently, pass it with -s.
snakemake --use-conda --cores 8
2
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
eCLIP v0.2.2 (inferred with models/gemini-2.5-flash)
$ Bash example
# Install skewer if not already installed
# conda install -c bioconda skewer

# Define input and output file names
INPUT_FASTQ="input.fastq.gz"     # Placeholder for input FASTQ file
OUTPUT_PREFIX="trimmed_reads"    # Base name for output files

# Common Illumina 3' adapter sequence (assumption: the source does not give
# the adapter or barcode sequences, so replace this with the protocol's own).
ADAPTER_3_PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Minimum read length after trimming (assumed value, not stated in the source)
MIN_LENGTH=18

# Run skewer in single-end mode
# -x: 3' adapter sequence to trim
# -l: minimum read length after trimming; shorter reads are discarded
# -z: gzip-compress the output
# -o: base name for output files (skewer writes ${OUTPUT_PREFIX}-trimmed.fastq.gz)
# Note: in skewer, -m selects the trimming mode and -y is the adapter for the
# second mate of paired-end reads; neither sets a minimum length or a 5' barcode.
# Barcode sequences are experiment-specific and must come from the library protocol.
skewer -x "${ADAPTER_3_PRIME}" -l "${MIN_LENGTH}" -z -o "${OUTPUT_PREFIX}" "${INPUT_FASTQ}"
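To make the trimming step concrete, here is a minimal sketch of what 3' adapter trimming with a length floor does to individual reads. It is a toy stand-in, not skewer's algorithm: it only cuts at an exact adapter match, whereas skewer tolerates mismatches and also uses base qualities. The adapter prefix, the 18-nt floor, and the reads are all illustrative assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values only (assumptions): a short adapter prefix and an 18-nt floor.
adapter="AGATCGGAAGAGC"
min_len=18

# Toy reads, one sequence per line (FASTQ headers and qualities omitted for brevity).
reads="ACGTACGTACGTACGTACGTACGTAGATCGGAAGAGCACAC
ACGTACGTAGATCGGAAGAGCACACGTCTG
TTTTGGGGCCCCAAAATTTTGGGG"

# Cut each read at the first exact adapter match, then drop reads that end up too short.
trimmed=$(printf '%s\n' "$reads" | awk -v a="$adapter" -v m="$min_len" '{
    i = index($0, a)                      # 1-based adapter position, 0 if absent
    if (i > 0) $0 = substr($0, 1, i - 1)  # keep only the insert before the adapter
    if (length($0) >= m) print
}')

printf '%s\n' "$trimmed"
```

The first read keeps its 24-nt insert, the second is discarded because only 8 nt remain after trimming, and the third passes through unchanged because it contains no adapter.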
3
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp.
$ Bash example
# Install fastp (if not already installed)
# conda install -c bioconda fastp

# Define input and output file names
INPUT_READ1="raw_read1.fastq.gz"
INPUT_READ2="raw_read2.fastq.gz"
OUTPUT_READ1="umi_extracted_read1.fastq.gz"
OUTPUT_READ2="umi_extracted_read2.fastq.gz"

# Note: the UMI location (--umi_loc) and length (--umi_len) below are assumed;
# the source does not state them. Adjust both to the library preparation protocol.
# --umi enables UMI preprocessing; --umi_loc and --umi_len only take effect with it.
# fastp also applies its default adapter trimming and quality filtering
# unless those are explicitly disabled.
fastp \
  --in1 "${INPUT_READ1}" \
  --in2 "${INPUT_READ2}" \
  --out1 "${OUTPUT_READ1}" \
  --out2 "${OUTPUT_READ2}" \
  --umi \
  --umi_loc read1 \
  --umi_len 12
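As a rough illustration of what UMI extraction means for a single read, the sketch below moves the leading bases of a toy FASTQ record into the read name and clips the sequence and quality strings to match. The 5-nt UMI length, the record, and the `:` separator in the header are assumptions chosen for the example; fastp's exact header formatting may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

umi_len=5   # hypothetical UMI length for the toy record below

# One toy FASTQ record: header, sequence, separator, qualities.
fastq='@read1 1:N:0:ATCACG
GATTAACGTACGTACG
+
IIIIIFFFFFFFFFFF'

# Move the first umi_len bases into the read name; clip sequence and qualities to match.
extracted=$(printf '%s\n' "$fastq" | awk -v n="$umi_len" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 {
        umi = substr($0, 1, n)
        split(header, parts, " ")
        rest = substr(header, length(parts[1]) + 1)   # comment field, if any
        print parts[1] ":" umi rest                   # read ID gains the UMI suffix
        print substr($0, n + 1)                       # sequence without the UMI
    }
    NR % 4 == 3 { print }
    NR % 4 == 0 { print substr($0, n + 1) }           # qualities clipped the same way
')

printf '%s\n' "$extracted"
```

Keeping the UMI in the read name lets it travel with the read through alignment, so duplicates can later be collapsed by UMI and mapping position.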
4
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for the pipeline (replace with actual paths and values)
STAR_INDEX_DIR="/path/to/STAR_index/GRCh38"  # Placeholder: STAR genome index directory
INPUT_FASTQ="reads.fastq.gz"                 # Placeholder: gzipped input FASTQ file
OUTPUT_PREFIX="aligned_reads"                # Placeholder: prefix for output files
REPLICATE_LABEL="sample1_rep1"               # Placeholder: read group ID
NUM_THREADS="8"                              # Placeholder: number of threads

# Execute STAR alignment command
STAR \
  --alignEndsType EndToEnd \
  --genomeDir "${STAR_INDEX_DIR}" \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --winAnchorMultimapNmax 100 \
  --outFilterMultimapNmax 100 \
  --outFilterMultimapScoreRange 1 \
  --outSAMmultNmax 1 \
  --outMultimapperOrder Random \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --limitOutSJcollapsed 5000000 \
  --outReadsUnmapped None \
  --outSAMattrRGline ID:"${REPLICATE_LABEL}" \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --readFilesCommand zcat \
  --outStd Log \
  --readFilesIn "${INPUT_FASTQ}" \
  --runMode alignReads \
  --runThreadN "${NUM_THREADS}"
5
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper.
$ Bash example
# Install Snakemake if not already installed
# conda create -n skipper_env snakemake -c bioconda -c conda-forge
# conda activate skipper_env

# Clone the Skipper workflow
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Enriched-window calling and repetitive-element analysis run inside the
# Skipper workflow itself; the source names no separate peak caller for this step.
# The config.yaml below is a hypothetical placeholder, not Skipper's documented
# format: take the real configuration file name, keys, and sample manifest
# layout from the repository README.
cat << EOF > config.yaml
genome: GRCh38  # Assembly stated in the source
samples:
  - name: your_sample_name
    replicate1: path/to/sample_rep1.bam  # Placeholder: aligned reads, replicate 1
    replicate2: path/to/sample_rep2.bam  # Placeholder: aligned reads, replicate 2
    control: path/to/control.bam         # Placeholder: matched input control
EOF

# Run the workflow. Adjust --cores to the available resources.
# This call assumes a Snakefile in the working directory; if the repository
# names its workflow file differently, pass it with -s.
snakemake --cores 8 --configfile config.yaml --use-conda
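The idea of a reproducible enriched window can be sketched without running the workflow. The toy example below assumes, purely for illustration, that a window counts as reproducible when it is called enriched in at least two replicates; the source does not state the criterion Skipper applies, and the window IDs are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical window IDs called enriched in each replicate (each ID at most once per replicate).
rep1="win_0001 win_0002 win_0007"
rep2="win_0002 win_0007 win_0009"
rep3="win_0002 win_0004"

# Pool the per-replicate calls, count replicates per window, keep windows seen at least twice.
# The variables are left unquoted on purpose so each ID becomes its own line.
reproducible=$(printf '%s\n' $rep1 $rep2 $rep3 | sort | uniq -c | awk '$1 >= 2 { print $2 }')

printf '%s\n' "$reproducible"
```

Here `win_0002` (three replicates) and `win_0007` (two replicates) are kept, while windows seen in a single replicate are dropped.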
Raw Source Text
Data was processed using Skipper, available at: http://github.com/yeolab/skipper
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper
Assembly: GRCh38
Supplementary files format and content: single-replicate enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
Supplementary files format and content: reproducible enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
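Since the supplementary files report p values, q values, and enrichment per window, a common first step is to subset them by q value. The sketch below assumes a tab-separated file with a header row and a column named `q_value`; the column names, their order, the toy rows, and the 0.05 cutoff are all hypothetical, so check the actual file header before reusing it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical tab-separated layout; real column names and order may differ.
table=$(printf 'window\tgene\tenrichment\tp_value\tq_value\n'
        printf 'win_0002\tGENE_A\t3.1\t1e-6\t0.0004\n'
        printf 'win_0007\tGENE_B\t1.4\t0.03\t0.21\n'
        printf 'win_0009\tGENE_C\t2.2\t2e-4\t0.018\n')

# Locate the q_value column by name, then keep the header and rows below the cutoff.
significant=$(printf '%s\n' "$table" | awk -F'\t' -v cutoff=0.05 '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "q_value") col = i; print; next }
    ($col + 0) < cutoff
')

printf '%s\n' "$significant"
```

Looking the column up by name rather than by position keeps the filter working if the file carries extra annotation columns.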