GSE232599 Processing Pipeline
Publication
Large-scale evaluation of the ability of RNA-binding proteins to activate exon inclusion. Nature Biotechnology (2024); PMID 38168984
Dataset
GSE232599: Systematic identification of RNA-binding proteins and tethered domains that activate exon splicing inclusion
Processing Steps
1
Data was processed using Skipper, available at: http://github.com/yeolab/skipper
$ Bash example
# Install Snakemake (if not already installed).
# A dedicated conda environment for Snakemake is recommended.
# conda create -n snakemake snakemake
# conda activate snakemake

# Clone the Skipper workflow repository.
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create or modify a configuration file (e.g., config.yaml) that defines the
# input files, reference genome, and other workflow parameters.
# NOTE: the file name and keys below are illustrative placeholders, not
# Skipper's documented schema. Check the repository README for the actual
# configuration and sample-manifest format before running.
#
# genome: "hg38"                          # Reference genome (this dataset used GRCh38)
# fastq_dir: "/path/to/your/fastq_files"  # Directory containing raw FASTQ files
# output_dir: "skipper_results"           # Directory for output files
# threads: 10                             # Number of CPU threads to use
#
# samples:
#   sample_RBP_rep1:
#     RBP: "YourRBP"
#     replicate: "rep1"
#     fastq: ["sample_RBP_rep1_R1.fastq.gz", "sample_RBP_rep1_R2.fastq.gz"]  # Adjust for single-end or paired-end
#     control: "input_control_sample_name"  # Name of the matching input control sample
#   input_control_sample_name:
#     RBP: "Input"
#     replicate: "rep1"
#     fastq: ["input_control_sample_name_R1.fastq.gz", "input_control_sample_name_R2.fastq.gz"]
#   # Add more samples as needed following this structure.
#
# The 'fastq' paths must be either relative to 'fastq_dir' or absolute.

# Execute the workflow from the directory that contains the Snakefile and
# the configuration file (assumed here to be the 'skipper' directory).
# --use-conda   lets Snakemake build the conda environments defined by the workflow.
# --cores       sets the number of CPU cores to use.
# --configfile  points to your configuration file.
snakemake --use-conda --cores 10 --configfile config.yaml
2
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
$ Bash example
# Install skewer if not already available.
# conda install -c bioconda skewer

# Define input and output file names.
INPUT_R1="reads_R1.fastq.gz"
INPUT_R2="reads_R2.fastq.gz"
OUTPUT_PREFIX="trimmed_reads"

# 3' adapter to trim. The sequence below is the Illumina TruSeq Read 1
# adapter, used here as a placeholder. eCLIP libraries carry protocol-specific
# adapters and barcodes, so substitute the sequences from your library prep.
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Run skewer for adapter and quality trimming.
# -x: adapter sequence for the first read
# -q: trim the 3' end until this base quality (or higher) is reached
# -l: minimum read length to keep after trimming
# -m pe: paired-end trimming mode (use 'tail' or 'any' for single-end input)
# -o: output file prefix
skewer -x "${ADAPTER_3PRIME}" -q 20 -l 18 -m pe -o "${OUTPUT_PREFIX}" "${INPUT_R1}" "${INPUT_R2}"
3
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp.
$ Bash example
# Install fastp using conda.
# conda install -c bioconda fastp

# Define input files. Replace with the paths to your raw sequencing reads.
INPUT_READ1="raw_sequencing_R1.fastq.gz"
INPUT_READ2="raw_sequencing_R2.fastq.gz"   # Only for paired-end data

# Define output files for reads with UMIs extracted.
OUTPUT_READ1="umi_extracted_R1.fastq.gz"
OUTPUT_READ2="umi_extracted_R2.fastq.gz"   # Only for paired-end data

# Define report files for fastp's quality-control and UMI-processing summary.
REPORT_JSON="fastp_umi_report.json"
REPORT_HTML="fastp_umi_report.html"

# Run fastp to extract UMIs from the raw reads.
# This example assumes the UMI is the first 10 bases of Read 1. fastp moves
# the UMI into the read name, joined to the prefix by an underscore
# (e.g., @read_id:UMI_ACGTACGTAC).
#
# IMPORTANT: --umi must be given to enable UMI processing. Adjust --umi_loc
# and --umi_len to match your library preparation protocol and UMI design.
# Valid locations include 'read1', 'read2', 'index1', and 'index2'.
fastp \
  --in1 "${INPUT_READ1}" \
  --in2 "${INPUT_READ2}" \
  --out1 "${OUTPUT_READ1}" \
  --out2 "${OUTPUT_READ2}" \
  --umi \
  --umi_loc read1 \
  --umi_len 10 \
  --umi_prefix "UMI" \
  --json "${REPORT_JSON}" \
  --html "${REPORT_HTML}" \
  --thread 8   # Adjust the number of threads for your system
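A quick sanity check after UMI extraction is to confirm that read names actually carry the UMI tag. The sketch below is not part of Skipper; the function name `count_umi_tagged_reads` is hypothetical, and it assumes fastp's naming scheme in which the UMI is appended to the first token of the read name as `:<prefix>_<UMI>`. It prints the number of tagged reads and the total number of reads, separated by a tab.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Skipper or fastp): count reads whose name
# carries a UMI tag of the form ":<prefix>_<UMI>".
# Usage: count_umi_tagged_reads reads.fastq.gz [prefix]
count_umi_tagged_reads() {
    local fastq_gz="$1"
    local prefix="${2:-UMI}"   # assumed to match the --umi_prefix given to fastp
    gzip -cd "$fastq_gz" | awk -v tag=":${prefix}_" '
        # FASTQ records span four lines; the read name is the first of each.
        NR % 4 == 1 {
            total++
            split($0, fields, " ")             # fields[1] is the read name proper
            if (index(fields[1], tag) > 0) tagged++
        }
        END { printf "%d\t%d\n", tagged, total }
    '
}
```

For example, `count_umi_tagged_reads umi_extracted_R1.fastq.gz UMI` should report the same number in both columns if every read received a UMI.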
4
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
$ Bash example
# Install STAR if not already installed.
# conda install -c bioconda star

# Example values for the placeholders in the command above.
STAR_GENOME_DIR="/path/to/STAR_genome_index_hg38"   # e.g., from ENCODE or a custom build
INPUT_FASTQ="sample_R1.fastq.gz"
OUTPUT_PREFIX="sample_aligned"
NUM_THREADS="8"
REPLICATE_LABEL="sample_rep1"

STAR --alignEndsType EndToEnd \
  --genomeDir "${STAR_GENOME_DIR}" \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --winAnchorMultimapNmax 100 \
  --outFilterMultimapNmax 100 \
  --outFilterMultimapScoreRange 1 \
  --outSAMmultNmax 1 \
  --outMultimapperOrder Random \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --limitOutSJcollapsed 5000000 \
  --outReadsUnmapped None \
  --outSAMattrRGline ID:"${REPLICATE_LABEL}" \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --readFilesCommand zcat \
  --outStd Log \
  --readFilesIn "${INPUT_FASTQ}" \
  --runMode alignReads \
  --runThreadN "${NUM_THREADS}"
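Because these settings allow up to 100 multimapping loci per read, it can be useful to check the mapping summary that STAR writes to `<prefix>Log.final.out`. The following is a minimal sketch, not part of Skipper: the helper name `star_log_value` is hypothetical, and it assumes the usual `label | value` layout of that log file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value recorded for one label in a STAR
# Log.final.out file, where each statistic is written as "label |<TAB>value".
# Usage: star_log_value <Log.final.out> "<label>"
star_log_value() {
    local log="$1"
    local key="$2"
    awk -F'|' -v key="$key" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)      # strip padding around the label
        }
        name == key {
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", val)       # strip the tab before the value
            print val
            exit
        }
    ' "$log"
}
```

With the output prefix used in the example above, `star_log_value sample_alignedLog.final.out "Uniquely mapped reads %"` would print the unique-mapping rate, and the label `% of reads mapped to multiple loci` gives the multimapping rate.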
5
Custom scripts, run as part of Skipper, called reproducible enriched windows and repetitive elements.
$ Bash example
# Clone the Skipper repository.
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create and activate a conda environment. The file name 'environment.yaml'
# and the environment name 'skipper_env' are placeholders; use whatever the
# repository README specifies.
# conda env create -f environment.yaml
# conda activate skipper_env

# --- Placeholder for config.yaml ---
# Create a configuration file with your inputs and parameters.
# NOTE: the keys below are illustrative, not Skipper's documented schema;
# the actual parameters depend on your data and on the repository README.
#
# samples:
#   sample1:
#     replicates:
#       rep1:
#         ip_bam: "path/to/sample1_rep1_ip.bam"
#         input_bam: "path/to/sample1_rep1_input.bam"
#       rep2:
#         ip_bam: "path/to/sample1_rep2_ip.bam"
#         input_bam: "path/to/sample1_rep2_input.bam"
#
# genome: "hg38"   # This dataset used GRCh38
# genome_fasta: "path/to/genome.fa"
# genome_chrom_sizes: "path/to/genome.chrom.sizes"
# blacklist_bed: "path/to/blacklist.bed"
# repeatmasker_bed: "path/to/repeatmasker.bed"   # For repetitive-element annotation
#
# output_dir: "results"
#
# --- End of config.yaml placeholder ---

# Execute the Snakemake workflow. Per the processing description, Skipper
# calls enriched windows, identifies windows that are reproducible across
# replicates, and quantifies repetitive elements.
snakemake --snakefile Snakefile --configfile config.yaml --cores 8 --use-conda
Raw Source Text
Data was processed using Skipper, available at: http://github.com/yeolab/skipper
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper
Assembly: GRCh38
Supplementary files format and content: single-replicate enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
Supplementary files format and content: reproducible enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
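Since the enriched window files report q values, a common first step is to keep only windows below a chosen threshold. The sketch below assumes the file is tab-separated with a header row and that the q-value column is named `qvalue`; both the column name and the function name `filter_windows_by_q` are assumptions, so check the header of the actual supplementary file first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the header plus every row whose q value is at or
# below a threshold. The default column name "qvalue" is an assumption.
# Usage: filter_windows_by_q <table.tsv> <threshold> [column_name]
filter_windows_by_q() {
    local table="$1"
    local threshold="$2"
    local column="${3:-qvalue}"
    awk -F'\t' -v thr="$threshold" -v col="$column" '
        NR == 1 {
            # Locate the q-value column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) {
                print "column not found: " col > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        # Adding 0 forces a numeric comparison (handles values such as 1e-6).
        ($idx + 0) <= (thr + 0)
    ' "$table"
}
```

For a gzip-compressed file, the table can be streamed in, for example `zcat windows.tsv.gz | filter_windows_by_q - 0.05`, where `windows.tsv.gz` is a placeholder file name.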