GSE213867 Processing Pipeline
Publication
Skipper analysis of eCLIP datasets enables sensitive detection of constrained translation factor binding sites. Cell Genomics (2023). PMID 37388912
Dataset
GSE213867: Skipper analysis of RNA-protein interactions highlights depletion of genetic variation in translation factor binding sites
Processing Steps
1
Data was processed using Skipper, available at: http://github.com/yeolab/skipper
$ Bash example
# Clone the Skipper repository (if not already cloned)
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Install Snakemake and dependencies (if not already installed)
# conda create -n skipper_env snakemake mamba -c conda-forge -c bioconda
# conda activate skipper_env

# Placeholder configuration:
# The workflow needs a configuration that names the input files, the reference
# genome, and other parameters. The keys below are illustrative only; the layout
# Skipper actually expects is described in the repository's documentation.
# samples:
#   sample1:
#     R1: "path/to/sample1_R1.fastq.gz"
#     R2: "path/to/sample1_R2.fastq.gz"   # only for paired-end data
# genome: "path/to/reference_genome.fa"          # placeholder for the reference genome
# annotation: "path/to/genome_annotation.gtf"    # placeholder for the genome annotation
# output_dir: "results"

# Run the Skipper Snakemake workflow.
# Adjust --cores to the available resources and make sure config.yaml
# matches your data and references.
snakemake --use-conda --cores 8 --configfile config.yaml
2
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
$ Bash example
# Install skewer (e.g., via conda)
# conda create -n skewer_env skewer=0.2.2 -c bioconda -c conda-forge
# conda activate skewer_env

# Placeholder adapter and barcode sequences, typically supplied as a FASTA file.
# Example content for adapters.fa (replace with the actual sequences):
# >adapter1
# AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
# >barcode_sequence_example
# GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

# Define input and output file names.
# The length and quality thresholds are illustrative; the source does not state them.
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_PREFIX="trimmed_reads"
ADAPTER_FILE="adapters.fa"
THREADS=8
MIN_LENGTH=18
MIN_QUALITY=20

# Trim adapter and barcode sequences with skewer.
# -m pe selects paired-end mode (head, tail, and any are the single-end modes);
# -z gzip-compresses the output so the file names below end in .gz.
skewer -x "${ADAPTER_FILE}" \
  -m pe \
  -l "${MIN_LENGTH}" \
  -q "${MIN_QUALITY}" \
  -t "${THREADS}" \
  -z \
  -o "${OUTPUT_PREFIX}" \
  "${INPUT_R1}" \
  "${INPUT_R2}"

# Expected output files:
# trimmed_reads-trimmed-pair1.fastq.gz
# trimmed_reads-trimmed-pair2.fastq.gz
# trimmed_reads-trimmed.log
3
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp.
$ Bash example
# Install fastp (if not already installed)
# conda install -c bioconda fastp

# Define input and output file names (placeholders; replace with your own)
INPUT_R1="raw_reads_R1.fastq.gz"
INPUT_R2="raw_reads_R2.fastq.gz"
OUTPUT_R1="umi_extracted_R1.fastq.gz"
OUTPUT_R2="umi_extracted_R2.fastq.gz"
REPORT_JSON="fastp_report.json"
REPORT_HTML="fastp_report.html"

# Extract UMIs from raw sequencing reads using fastp.
# This command assumes the UMI is the first 10 bp of Read 1 (not stated in the source).
# --umi switches UMI processing on; without it the --umi_* options have no effect.
# The UMI is removed from the read sequence and appended to the read name as UMI_<sequence>.
# Quality filtering, adapter trimming, and length filtering are also applied;
# their thresholds are illustrative, not taken from the source.
fastp \
  --in1 "${INPUT_R1}" \
  --out1 "${OUTPUT_R1}" \
  --in2 "${INPUT_R2}" \
  --out2 "${OUTPUT_R2}" \
  --umi \
  --umi_loc read1 \
  --umi_len 10 \
  --umi_prefix "UMI" \
  --qualified_quality_phred 20 \
  --unqualified_percent_limit 50 \
  --length_required 50 \
  --detect_adapter_for_pe \
  --json "${REPORT_JSON}" \
  --html "${REPORT_HTML}"
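To make concrete what UMI extraction does to a single read, the sketch below imitates the step with awk on one toy FASTQ record: the leading bases are cut from the sequence and quality lines and appended to the read name. The 10 bp UMI length, the `UMI_` tag, the `extract_umi` helper, and the toy record are all assumptions for illustration; in the pipeline fastp performs this step itself.

```shell
#!/usr/bin/env sh
set -eu

# Hypothetical UMI length; the source text does not state it.
UMI_LEN=10

# Move the first UMI_LEN bases of each read into its name.
# Input is assumed to be plain 4-line FASTQ records on stdin.
extract_umi() {
  awk -v n="$UMI_LEN" '
    NR % 4 == 1 { name = $0; next }            # header: hold it until the sequence is read
    NR % 4 == 2 {
      umi = substr($0, 1, n)                   # leading bases are treated as the UMI
      split(name, parts, " ")                  # read ID is the first whitespace-delimited token
      rest = substr(name, length(parts[1]) + 1)
      print parts[1] ":UMI_" umi rest          # tag the read ID with the UMI
      print substr($0, n + 1)                  # sequence without the UMI
      next
    }
    NR % 4 == 3 { print; next }                # separator line is passed through
    NR % 4 == 0 { print substr($0, n + 1) }    # qualities trimmed to match the sequence
  '
}

# Toy record: a 10 bp UMI followed by a 12 bp insert.
record=$(printf '@read1 1:N:0\nACGTACGTACTTTTGGGGCCCC\n+\nIIIIIIIIIIJJJJJJJJJJJJ\n' | extract_umi)
printf '%s\n' "$record"
```

Keeping the UMI in the read name lets it travel with the read through alignment, so that later steps can still tell PCR duplicates apart.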
4
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
$ Bash example
# Install STAR using conda
# conda install -c bioconda star

# Define variables (replace with actual paths and values)
genome_dir="/path/to/STAR_genome_index/GRCh38"
output_prefix="sample_aligned"
input_fastq="sample.fastq.gz"
num_threads=8
replicate_id="sample1_rep1"

# Run STAR alignment
STAR \
  --alignEndsType EndToEnd \
  --genomeDir "${genome_dir}" \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix "${output_prefix}" \
  --winAnchorMultimapNmax 100 \
  --outFilterMultimapNmax 100 \
  --outFilterMultimapScoreRange 1 \
  --outSAMmultNmax 1 \
  --outMultimapperOrder Random \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --limitOutSJcollapsed 5000000 \
  --outReadsUnmapped None \
  --outSAMattrRGline ID:"${replicate_id}" \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --readFilesCommand zcat \
  --outStd Log \
  --readFilesIn "${input_fastq}" \
  --runMode alignReads \
  --runThreadN "${num_threads}"
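The curly-brace fields in the parameter list ({params.star_sjdb}, {wildcards.replicate_label}, {input.fq}, and so on) are Snakemake placeholders that the workflow fills in once per replicate. The sketch below shows that substitution as a dry run over two replicates, printing each command instead of executing it. The replicate labels, paths, and the `print_star_cmd` helper are hypothetical; the STAR options themselves are copied from the source text.

```shell
#!/usr/bin/env sh
set -eu

# Hypothetical values standing in for the Snakemake placeholders.
star_sjdb="/path/to/star_index"   # {params.star_sjdb}
threads=8                         # {threads}

# Print the STAR command line for one replicate without running it.
# Arguments: replicate label, input FASTQ, output prefix.
print_star_cmd() {
  printf '%s ' STAR --runMode alignReads --runThreadN "$threads" \
    --genomeDir "$star_sjdb" --genomeLoad NoSharedMemory \
    --readFilesIn "$2" --readFilesCommand zcat \
    --alignEndsType EndToEnd \
    --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 \
    --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 \
    --outMultimapperOrder Random \
    --outFilterScoreMin 10 --outFilterType BySJout \
    --limitOutSJcollapsed 5000000 --outReadsUnmapped None \
    --outSAMattrRGline "ID:$1" \
    --outSAMattributes All --outSAMmode Full \
    --outSAMtype BAM Unsorted --outSAMunmapped Within \
    --outBAMcompression 10 --outStd Log \
    --outFileNamePrefix "$3"
  printf '\n'
}

# Dry run: one command per replicate, each tagged with its own read group ID.
last_cmd=""
for label in rep1 rep2; do
  last_cmd=$(print_star_cmd "$label" "trimmed/${label}.fastq.gz" "aligned/${label}.")
  printf '%s\n' "$last_cmd"
done
```

The printed lines are for inspection only; paths containing spaces would need quoting before the output could be pasted back into a shell.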
5
Reproducible enriched windows and repetitive elements were called by custom scripts that are part of Skipper.
$ Bash example
# Clone the Skipper repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create a Conda environment for Snakemake
# conda create -n skipper_env snakemake -c conda-forge -c bioconda -y
# conda activate skipper_env

# --- Configuration for the Skipper workflow ---
# Create a configuration with your sample and genome information.
# The keys below are illustrative only; the layout Skipper actually expects
# is described in the repository's documentation.
# samples:
#   sample_rep1:
#     r1: "/path/to/sample_rep1_R1.fastq.gz"
#     r2: "/path/to/sample_rep1_R2.fastq.gz"   # only for paired-end data
#   sample_rep2:
#     r1: "/path/to/sample_rep2_R1.fastq.gz"
#     r2: "/path/to/sample_rep2_R2.fastq.gz"   # only for paired-end data
#   input_rep1:                                # control/input sample
#     r1: "/path/to/input_rep1_R1.fastq.gz"
#     r2: "/path/to/input_rep1_R2.fastq.gz"    # only for paired-end data
#   input_rep2:                                # control/input sample
#     r1: "/path/to/input_rep2_R1.fastq.gz"
#     r2: "/path/to/input_rep2_R2.fastq.gz"    # only for paired-end data
#
# genome_assembly: "hg38"                                   # e.g., hg38, mm10
# genome_dir: "/path/to/STAR/index/for/hg38"                # STAR genome index
# gtf_file: "/path/to/gencode.vXX.annotation.gtf"           # GTF annotation file
# repeat_masker_file: "/path/to/hg38.fa.out.bed"            # RepeatMasker BED file
#
# Prepare all reference files (STAR index, GTF, RepeatMasker BED) beforehand.
# Example for hg38:
# STAR index: build from a GRCh38 FASTA with STAR --runMode genomeGenerate
# GTF: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
# RepeatMasker BED: can be exported from the UCSC Table Browser (track: RepeatMasker, output format: BED)

# Run the Skipper Snakemake workflow.
# In this sketch the 'all' target is assumed to include the calling of
# reproducible enriched windows and repetitive elements.
snakemake --snakefile Snakefile --configfile config.yaml --cores 8 all
Raw Source Text
Data was processed using Skipper, available at: http://github.com/yeolab/skipper
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper
Assembly: GRCh38
Supplementary files format and content: reproducible enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
Supplementary files format and content: reproducible enriched re files contain p values, q values, enrichment, and annotations for significantly bound repetitive elements
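Because the supplementary tables carry q values and enrichment alongside the annotations, a common first step is to keep only the rows below a stricter q-value cutoff. The sketch below does this with awk on a toy tab-separated table, locating the column by its header name. The column names (`qvalue`, `enrichment`, `feature`), the `filter_by_column` helper, and every value in the toy table are made up for illustration; check the header of the actual files before adapting it.

```shell
#!/usr/bin/env sh
set -eu

# Print the header plus the rows of a tab-separated table whose value in the
# named column is below a cutoff. The column is found by header name, so the
# column order in the file does not matter.
filter_by_column() {
  awk -F '\t' -v col="$1" -v cutoff="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print; next
    }
    ($idx + 0) < (cutoff + 0)                  # numeric comparison, handles 1e-6 style values
  '
}

# Toy table with made-up values and hypothetical column names.
toy=$(printf 'name\tpvalue\tqvalue\tenrichment\tfeature\n'
      printf 'w1\t1e-8\t1e-6\t4.2\tCDS\n'
      printf 'w2\t1e-3\t0.04\t1.9\tUTR3\n'
      printf 'w3\t1e-5\t0.002\t3.1\tUTR5\n')

# Keep windows with q < 0.01.
strict=$(printf '%s\n' "$toy" | filter_by_column qvalue 0.01)
printf '%s\n' "$strict"
```

For a real file, replace the toy table with something like `filter_by_column qvalue 0.01 < windows.tsv`, where the file name is again a placeholder.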