GSE230717 Processing Pipeline
Publication
Large-scale map of RNA-binding protein interactomes across the mRNA life cycle.Molecular cell (2024) — PMID 39303721
Processing Steps
Generate Jupyter Notebook-
1
Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper
$ Bash example
git clone https://github.com/YeoLab/skipper.git cd skipper # Configure the workflow by editing config/config.yaml and config/samples.yaml # Example reference genome (hg38 inferred as common default for human eCLIP): # In config/config.yaml, set: # species: human # assembly: hg38 # genome_fasta: /path/to/your/hg38.fasta # gtf: /path/to/your/hg38.gtf # star_index: /path/to/your/star_index_hg38 # ... other reference paths ... # In config/samples.yaml, define your samples and their fastq files. # (Optional) Create and activate the Snakemake environment # conda env create -f envs/snakemake.yaml # conda activate snakemake # Execute the Skipper eCLIP processing workflow # The --use-conda flag will create environments for individual rules as needed. # Adjust --cores based on available resources. snakemake --use-conda --cores 8
-
2
Unique Molecular Identifiers (UMIs) were extracted using fastp 0.11.5 (https://github.com/OpenGene/fastp)
$ Bash example
# Install fastp (if not already installed) # conda install -c bioconda fastp # Example command for UMI extraction using fastp. # This command assumes UMIs are at the beginning of Read 1 and are 10 bases long. # Adjust input/output filenames, UMI location (--umi_loc), and UMI length (--umi_len) as per your library preparation. # fastp will prepend the UMI sequence to the read ID in the output FASTQ files. fastp \ -i input_read1.fastq.gz \ -o output_read1_umi_extracted.fastq.gz \ -I input_read2.fastq.gz \ -O output_read2_umi_extracted.fastq.gz \ --umi \ --umi_loc per_read \ --umi_len 10 \ --thread 8 \ --json fastp.json \ --html fastp.html
-
3
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples)
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt # Define input and output files INPUT_FASTQ="input_umi_extracted_reads.fastq" OUTPUT_FASTQ="trimmed_reads.fastq" # Define adapter sequences and minimum read length # These are placeholder sequences. Actual sequences should be obtained from the library preparation protocol # or the specific eCLIP protocol used (e.g., from the Yeo lab). ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example: Illumina Universal Adapter ADAPTER_5PRIME="GTTCAGAGTTCTACAGTCCGACGATC" # Example: A common linker or 5' barcode sequence to be trimmed after UMI extraction MIN_READ_LENGTH=18 # Execute cutadapt to trim adapter and barcode sequences cutadapt \ -a "${ADAPTER_3PRIME}" \ -g "${ADAPTER_5PRIME}" \ -m ${MIN_READ_LENGTH} \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
4
Mapping was then performed against the full human genome (hg38) including a database of splice junctions with STAR (v 2.7.6) allowing up to 100 multimapped regions.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.7.6 # Define variables READ1_FASTQ="path/to/your/read1.fastq.gz" # Replace with your R1 FASTQ file READ2_FASTQ="path/to/your/read2.fastq.gz" # Replace with your R2 FASTQ file (remove if single-end) STAR_GENOME_DIR="path/to/your/star_hg38_index" # Replace with path to pre-built STAR hg38 genome index OUTPUT_PREFIX="sample_name" # Prefix for output files NUM_THREADS=8 # Number of threads to use # Note: The STAR genome index (STAR_GENOME_DIR) must be pre-built using hg38 FASTA and a GTF/GFF file for splice junctions. # Example command for building index (run once, adjust sjdbOverhang based on read length - 1): # STAR --runMode genomeGenerate \ # --genomeDir ${STAR_GENOME_DIR} \ # --genomeFastaFiles path/to/hg38.fa \ # --sjdbGTFfile path/to/gencode.vXX.annotation.gtf \ # --sjdbOverhang 100 \ # --runThreadN ${NUM_THREADS} # Perform mapping with STAR STAR --genomeDir ${STAR_GENOME_DIR} \ --readFilesIn ${READ1_FASTQ} ${READ2_FASTQ} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outFilterMultimapNmax 100 \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --runThreadN ${NUM_THREADS} -
5
Reads were then PCR deduplicated using UMIcollaps (https://github.com/Daniel-Liu-c0deb0t/UMICollapse
$ Bash example
# Clone the repository if not already available # git clone https://github.com/Daniel-Liu-c0deb0t/UMICollapse.git # cd UMICollapse # Execute UMIcollaps for PCR deduplication # Assuming input_reads.fastq is the input file and deduplicated_reads.fastq is the desired output. # The UMI length (-l) is an example (default or common value); adjust if the actual UMI length is known from experimental design. python UMICollapse.py -i input_reads.fastq -o deduplicated_reads.fastq -l 10
-
6
Enriched âwindowsâ (IP versus SM-Input) was called on deduplicated reads using a GC-bias aware beta-binomial model.
clipper (Inferred with models/gemini-2.5-flash) vPython 3.8 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install clipper (if not already installed) # git clone https://github.com/yeolab/clipper.git # cd clipper # # Ensure dependencies are met, e.g., pysam, numpy, scipy # pip install pysam numpy scipy # Define input and output files IP_BAM="path/to/deduplicated_IP.bam" # Immunoprecipitated reads, deduplicated SM_INPUT_BAM="path/to/deduplicated_SM_Input.bam" # Size-matched input reads, deduplicated GENOME_ASSEMBLY="hg38" # Placeholder: Replace with the actual genome assembly (e.g., hg19, mm10) OUTPUT_PREFIX="eCLIP_peaks" # Run clipper to call enriched windows using a GC-bias aware beta-binomial model python clipper.py \ -s ${GENOME_ASSEMBLY} \ -u ${IP_BAM} \ -c ${SM_INPUT_BAM} \ -o ${OUTPUT_PREFIX} -
7
Each window (~100 b.p.) were partitioned from Gencode v38 and associated with a specific type of genomic region (CDS, UTR, proximal introns near splice site⦠etc).
$ Bash example
# Download Gencode v38 annotation GTF wget -nc https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz gunzip -f gencode.v38.annotation.gtf.gz # Download hg38 chrom.sizes for bedtools makewindows wget -nc http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # # Installation for bedtools: # # conda install -c bioconda bedtools # Extract specific genomic regions from Gencode v38 GTF into BED format # Note: This is a simplified representation. A robust pipeline would use dedicated GTF parsers # or more complex scripting to accurately define all region types, especially introns. # Extract CDS regions awk '$3 == "CDS"' gencode.v38.annotation.gtf | \ awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \ sed 's/;//g; s/"//g' | \ sort -k1,1 -k2,2n > gencode.v38.cds.bed # Extract UTR regions awk '$3 == "UTR"' gencode.v38.annotation.gtf | \ awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \ sed 's/;//g; s/"//g' | \ sort -k1,1 -k2,2n > gencode.v38.utr.bed # Extract Exon regions (often used to derive introns) awk '$3 == "exon"' gencode.v38.annotation.gtf | \ awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \ sed 's/;//g; s/"//g' | \ sort -k1,1 -k2,2n > gencode.v38.exons.bed # Conceptual step: Derive intron regions and proximal introns near splice sites. # This typically involves custom scripting based on exon coordinates (e.g., using bedtools complement # on exons within transcripts, then defining proximal regions +/- X bp around splice sites). # For this example, we assume these files (e.g., gencode.v38.introns.bed, gencode.v38.proximal_introns.bed) # would be generated by a preceding step or custom script. # Create genome-wide windows of ~100 b.p. bedtools makewindows -g hg38.chrom.sizes -w 100 > genome.100bp_windows.bed # Associate each window with a specific type of genomic region # This step involves intersecting the windows with the derived region BED files # and assigning a primary region type based on overlap hierarchy (e.g., CDS > UTR > Intron). # This is often done with a custom script after multiple bedtools intersect/map calls. # Example of intersecting windows with CDS regions: bedtools intersect -a genome.100bp_windows.bed -b gencode.v38.cds.bed -wa -wb > windows_overlapping_cds.bed # Further processing (e.g., with a Python/R script) would then assign a single region type # to each window based on the overlaps with CDS, UTR, Intron, etc., following a defined hierarchy. # For instance: # python assign_region_type.py genome.100bp_windows.bed windows_overlapping_cds.bed windows_overlapping_utr.bed ... > final_annotated_windows.bed -
8
Windows are filtered using FDR < 0.2 and only reproducible windows between two replicates were used.
$ Bash example
# Clone the merge_peaks repository # git clone https://github.com/yeolab/merge_peaks.git # cd merge_peaks # Example usage of merge_replicates.py # This command filters windows (peaks) from two replicate BED files # using an FDR threshold of 0.2 and identifies reproducible windows. # Replace 'replicate1_peaks.bed' and 'replicate2_peaks.bed' with your actual input peak files. # The output will be prefixed with 'reproducible_windows'. python merge_replicates.py --fdr_threshold 0.2 replicate1_peaks.bed replicate2_peaks.bed --output_prefix reproducible_windows
-
9
Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper
$ Bash example
# Clone the Skipper workflow repository git clone https://github.com/YeoLab/skipper.git cd skipper # --- Configuration Setup (User needs to adapt these files) --- # The workflow requires a configuration file (e.g., config/config.yaml) # and a samplesheet (e.g., config/samples.tsv). # # Example config/config.yaml (adjust genome_dir and other parameters as needed): # genome_assembly: "hg38" # genome_dir: "/path/to/your/genome/data/hg38" # # Example config/samples.tsv (tab-separated, adjust paths to your FASTQ files): # sample_id fastq_r1 fastq_r2 replicate # sample1_rep1 /path/to/sample1_rep1_R1.fastq.gz /path/to/sample1_rep1_R2.fastq.gz 1 # sample1_rep2 /path/to/sample1_rep2_R1.fastq.gz /path/to/sample1_rep2_R2.fastq.gz 2 # # --- Environment Setup (Optional, but recommended) --- # # Install mamba (if not already installed) # # conda install -c conda-forge mamba # # # Create and activate the Snakemake environment # # mamba env create -f envs/snakemake.yaml # # conda activate snakemake # --- Run the Skipper eCLIP workflow --- # Adjust --cores based on available resources. # Ensure config/config.yaml and config/samples.tsv are properly configured. snakemake --use-conda --cores 8 --configfile config/config.yaml
Raw Source Text
Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper Unique Molecular Identifiers (UMIs) were extracted using fastp 0.11.5 (https://github.com/OpenGene/fastp) Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) Mapping was then performed against the full human genome (hg38) including a database of splice junctions with STAR (v 2.7.6) allowing up to 100 multimapped regions. Reads were then PCR deduplicated using UMIcollaps (https://github.com/Daniel-Liu-c0deb0t/UMICollapse Enriched âwindowsâ (IP versus SM-Input) was called on deduplicated reads using a GC-bias aware beta-binomial model. Each window (~100 b.p.) were partitioned from Gencode v38 and associated with a specific type of genomic region (CDS, UTR, proximal introns near splice site⦠etc). Windows are filtered using FDR < 0.2 and only reproducible windows between two replicates were used. Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper Assembly: hg38 Supplementary files format and content: bed files contain Skipper windows.