GSE240521 Processing Pipeline
Publication
High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues. Nature Communications (2024), PMID 39152130
Dataset
GSE240521: An in situ method for identification of transcriptome-wide protein-RNA interactions in cells [eCLIP-seq]
Processing Steps
1
Data were processed using the eCLIP pipeline, available at: http://github.com/yeolab/eclip
$ Bash example
# Install cwltool if not already installed
# pip install cwltool

# Clone the eCLIP pipeline repository
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Placeholders for input data and reference files; replace with real paths.
# This dataset was aligned to mm10.
# READ1_FASTQ="/path/to/sample_R1.fastq.gz"
# READ2_FASTQ="/path/to/sample_R2.fastq.gz"
# REFERENCE_GENOME_FASTA="/path/to/mm10.fa"
# REFERENCE_GENOME_GTF="/path/to/mm10.gtf"  # or another appropriate annotation file

# Create an inputs.yaml that matches the inputs declared by the workflow file.
# The field names below are illustrative only and are NOT taken from the repository;
# check the example job files shipped with the eCLIP repo for the real input names.
# cat << EOF > inputs.yaml
# read1:
#   class: File
#   path: ${READ1_FASTQ}
# read2:
#   class: File
#   path: ${READ2_FASTQ}
# genome_fasta:
#   class: File
#   path: ${REFERENCE_GENOME_FASTA}
# genome_gtf:
#   class: File
#   path: ${REFERENCE_GENOME_GTF}
# output_prefix: "my_eclip_output"
# EOF

# Execute the eCLIP CWL pipeline.
# 'eclip.cwl' is a placeholder for the workflow file; this command assumes it
# and 'inputs.yaml' are in the current directory.
cwltool eclip.cwl inputs.yaml
2
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract.
$ Bash example
# Install UMI-tools
# conda install -c bioconda umi-tools

# Example: extract UMIs (assuming 12 bp at the start of Read 1) from raw sequencing reads
# and append them to the read name.
# Input:  raw_reads_R1.fastq.gz
# Output: umi_extracted_reads_R1.fastq.gz
umi_tools extract \
  --bc-pattern="^(?P<umi_1>.{12})" \
  --extract-method=regex \
  -I raw_reads_R1.fastq.gz \
  -S umi_extracted_reads_R1.fastq.gz \
  --log=umi_tools_extract.log
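To make the effect of UMI extraction concrete, the sketch below moves a fixed-length prefix of each read into the read name, joined by an underscore, and trims the sequence and quality strings to match. It is a simplified illustration of the transformation, not a replacement for umi_tools extract; the function name `extract_umi` and the 4-base UMI in the demo are hypothetical.

```shell
#!/usr/bin/env bash
# Move the first N bases of every read (FASTQ on stdin) into the read name.
extract_umi() {
  local n="$1"
  awk -v n="$n" '
    NR % 4 == 1 { header = $0; next }            # read name line, held until the UMI is known
    NR % 4 == 2 {                                # sequence line
      umi = substr($0, 1, n)
      split(header, parts, " ")                  # parts[1] is the read ID without the comment
      rest = substr(header, length(parts[1]) + 1)
      print parts[1] "_" umi rest                # ID_UMI followed by the original comment
      print substr($0, n + 1)                    # sequence with the UMI removed
      next
    }
    NR % 4 == 0 { print substr($0, n + 1); next }  # quality string trimmed to match
    { print }                                      # "+" separator line
  '
}

# Demo on a single made-up record with a 4-base UMI.
printf '@read1 1:N:0\nAACCGGTTAC\n+\nIIIIIIIIII\n' | extract_umi 4
```

Keeping the UMI in the read name is what later allows deduplication to recover it from the aligned BAM without a separate lookup table.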
3
After UMI extraction, reads were trimmed of adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
INPUT_READS="sample_umi_extracted.fastq.gz"
OUTPUT_TRIMMED_READS="sample_trimmed.fastq.gz"
# Optional. Note: reads in which no adapter is found are diverted to this file
# and no longer appear in the main output; drop --untrimmed-output to keep them.
OUTPUT_UNTRIMMED_READS="sample_untrimmed.fastq.gz"

# Placeholder adapter sequences (standard Illumina TruSeq). This record does not
# list the adapters, so substitute those of the actual library preparation.
# Illumina TruSeq indexed adapter (read-through at the 3' end of Read 1)
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
# Illumina TruSeq Universal Adapter (contains the P5 sequence)
ADAPTER_5PRIME="AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"

# Illustrative trimming parameters (not stated in this record)
MIN_LENGTH=18      # Minimum read length after trimming
QUALITY_CUTOFF=20  # Phred quality cutoff for 3' end trimming
THREADS=8          # Number of CPU threads to use

# Execute cutadapt for adapter and barcode trimming
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME}" \
  -o "${OUTPUT_TRIMMED_READS}" \
  --untrimmed-output "${OUTPUT_UNTRIMMED_READS}" \
  --minimum-length "${MIN_LENGTH}" \
  --quality-cutoff "${QUALITY_CUTOFF}" \
  --cores "${THREADS}" \
  "${INPUT_READS}"
4
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
$ Bash example
# Install STAR (example)
# conda install -c bioconda star

# Define variables
# Placeholder for input trimmed reads. Adjust for paired-end if necessary.
READS_R1="trimmed_reads_R1.fastq.gz"
# READS_R2="trimmed_reads_R2.fastq.gz"  # Uncomment for paired-end

# Placeholder for the RepBase FASTA file. RepBase is distributed by GIRI (girinst.org)
# and may require a license; obtain it separately.
REPBASE_FASTA="RepBase.fasta"
# Directory for the STAR genome index
STAR_INDEX_DIR="RepBase_STAR_index"
# Prefix for output files
OUTPUT_PREFIX="repbase_filtered"
# Number of threads to use
THREADS=8  # Adjust as needed

# --- Step 1: Build STAR index for RepBase (if not already built) ---
# This step is typically performed once. If the index already exists, skip it.
# --genomeSAindexNbases must be reduced for small references; 10 is an illustrative value.
# STAR --runMode genomeGenerate \
#   --genomeDir "${STAR_INDEX_DIR}" \
#   --genomeFastaFiles "${REPBASE_FASTA}" \
#   --runThreadN "${THREADS}" \
#   --genomeSAindexNbases 10

# --- Step 2: Map trimmed reads against RepBase and keep the unmapped reads ---
# --readFilesCommand zcat is needed because the input FASTQ is gzip-compressed.
STAR --genomeDir "${STAR_INDEX_DIR}" \
  --readFilesIn "${READS_R1}" \
  --readFilesCommand zcat \
  --runThreadN "${THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMultimapNmax 30 \
  --alignEndsType EndToEnd \
  --outFilterMultimapScoreRange 1 \
  --outSAMmode Full \
  --outFilterType BySJout \
  --outSAMtype BAM Unsorted \
  --outFilterScoreMin 10 \
  --outReadsUnmapped Fastx \
  --outSAMattributes All

# The unmapped reads are written uncompressed to ${OUTPUT_PREFIX}Unmapped.out.mate1
# (and .mate2 for paired-end).
5
Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for clarity
# Replace /path/to/STAR_index/mm10 with the actual path to your mm10 STAR genome index
GENOME_DIR="/path/to/STAR_index/mm10"
# Reads left unmapped by the RepBase step (STAR writes this file uncompressed).
# If your input is gzip-compressed instead, add: --readFilesCommand zcat
INPUT_FASTQ="repbase_filteredUnmapped.out.mate1"
# Prefix for output files (e.g., aligned_mm10_Aligned.out.bam, aligned_mm10_Unmapped.out.mate1)
OUTPUT_PREFIX="aligned_mm10_"

# Run STAR alignment; --outFilterMultimapNmax 1 keeps uniquely mapped reads only
STAR \
  --runThreadN 8 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${INPUT_FASTQ}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMultimapNmax 1 \
  --alignEndsType EndToEnd \
  --outFilterMultimapScoreRange 1 \
  --outSAMmode Full \
  --outFilterType BySJout \
  --outSAMtype BAM Unsorted \
  --outFilterScoreMin 10 \
  --outReadsUnmapped Fastx \
  --outSAMattributes All
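The RepBase and mm10 alignments in steps 4 and 5 use the same STAR filtering flags and differ only in --outFilterMultimapNmax (30 for the repeat filter, 1 for the genome). One way to keep the two calls in sync is to generate the shared flags from a single helper. This is a sketch: the helper name `star_filter_args` is hypothetical, and the demo only prints the command lines rather than running STAR.

```shell
#!/usr/bin/env bash
# Print the STAR flags listed in steps 4 and 5, one token per line.
# The only argument is the value for --outFilterMultimapNmax.
star_filter_args() {
  local multimap_nmax="$1"
  printf '%s\n' \
    --outFilterMultimapNmax "$multimap_nmax" \
    --alignEndsType EndToEnd \
    --outFilterMultimapScoreRange 1 \
    --outSAMmode Full \
    --outFilterType BySJout \
    --outSAMtype BAM Unsorted \
    --outFilterScoreMin 10 \
    --outReadsUnmapped Fastx \
    --outSAMattributes All
}

# Collect the tokens into arrays so they can be passed as "${array[@]}".
mapfile -t repbase_args < <(star_filter_args 30)
mapfile -t genome_args  < <(star_filter_args 1)

# Demo: show the flag sets that would be appended to each STAR call.
echo "RepBase flags: ${repbase_args[*]}"
echo "mm10 flags:    ${genome_args[*]}"
```

In a real run the arrays would be appended to the STAR command, for example `STAR --genomeDir "${GENOME_DIR}" --readFilesIn "${INPUT_FASTQ}" "${genome_args[@]}"`.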
6
PCR duplicates were removed from uniquely mapped reads with umi_tools.
$ Bash example
# Install UMI-tools and samtools (example using conda)
# conda install -c bioconda umi-tools samtools

# Check UMI-tools version
# umi_tools --version

# umi_tools dedup needs a coordinate-sorted, indexed BAM; STAR wrote an unsorted one.
# Placeholder filenames are used for input and output.
samtools sort -o mapped_reads.bam aligned_mm10_Aligned.out.bam
samtools index mapped_reads.bam

# Remove PCR duplicates from uniquely mapped reads.
# This assumes the UMI is the last part of the read name, joined by "_"
# (the default written by umi_tools extract). Adjust --umi-separator, or use
# --extract-umi-method=tag with --umi-tag if the UMIs are stored in a BAM tag.
umi_tools dedup \
  -I mapped_reads.bam \
  -S deduplicated_reads.bam \
  --extract-umi-method=read_id \
  --umi-separator='_' \
  --output-stats=umi_tools_dedup_stats \
  --log=umi_tools_dedup.log
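The idea behind UMI-based deduplication can be sketched with a few lines of awk: reads that share chromosome, position, strand and UMI are treated as copies of one molecule, and only the first is kept. This is a deliberately simplified, exact-match version. umi_tools dedup by default uses its "directional" method, which also groups UMIs that differ by a sequencing error, so the sketch is not equivalent to the tool. The four-column input layout and the function name `dedup_exact_umi` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Input (stdin): tab-separated chrom, position, strand, read name ending in _UMI.
# Output: the first read seen for each (chrom, position, strand, UMI) combination.
dedup_exact_umi() {
  awk -F'\t' '
    {
      n = split($4, name, "_")             # the UMI is the text after the last "_"
      key = $1 FS $2 FS $3 FS name[n]
      if (!(key in seen)) { seen[key] = 1; print }
    }
  '
}

# Demo with made-up reads: read2 duplicates read1; read3 and read4 differ
# from read1 by UMI and by strand respectively, so they are kept.
printf 'chr1\t100\t+\tread1_AAAA\nchr1\t100\t+\tread2_AAAA\nchr1\t100\t+\tread3_CCCC\nchr1\t100\t-\tread4_AAAA\n' \
  | dedup_exact_umi
```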
7
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
$ Bash example
# Clone the CLIPper repository and install it following its README
# (the installation provides a `clipper` command)
# git clone https://github.com/YeoLab/clipper.git
# cd clipper

# Example usage for eCLIP peak calling on the mouse mm10 assembly used in this dataset.
# CLIPper calls peaks on the IP BAM alone; comparison against the size-matched input
# happens in the next step.
# Assumes the installed CLIPper version ships annotations for mm10.
# Placeholder filenames are used for input and output.
clipper \
  -b deduplicated_reads.bam \
  -s mm10 \
  -o clipper_peaks.bed
8
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
$ Bash example
# --- Installation (commented out) ---
# Ensure Perl is installed
# sudo apt-get update && sudo apt-get install -y perl
# Clone the eCLIP repository to access the script
# git clone https://github.com/yeolab/eclip.git

# --- Define Input Files ---
# Path to the script inside the clone; locate it with:
#   find eclip -name overlap_peakfi_with_bam.pl
SCRIPT="/path/to/overlap_peakfi_with_bam.pl"
# IP (CLIP) BAM and size-matched input BAM
IP_BAM="path/to/your/ip.bam"
SM_INPUT_BAM="path/to/your/sm_input.bam"
# Input peak file (e.g., output from CLIPper)
PEAK_FILE="path/to/your/input_peaks.bed"
# Text files holding the number of mapped reads in each library
IP_READNUM="path/to/your/ip.readnum.txt"
INPUT_READNUM="path/to/your/sm_input.readnum.txt"
# Output file for the input-normalized clusters
OUTPUT_FILE="enriched_clusters.bed"

# --- Execute the Script ---
# ASSUMPTION: the script takes positional arguments rather than named flags.
# The order shown here is not confirmed by this record; check the usage notes
# at the top of the script (or the CWL wrapper in the repository) before running.
perl "${SCRIPT}" \
  "${IP_BAM}" \
  "${SM_INPUT_BAM}" \
  "${PEAK_FILE}" \
  "${IP_READNUM}" \
  "${INPUT_READNUM}" \
  "${OUTPUT_FILE}"
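Enrichment over SMInput is reported as a log2 fold change of library-size-normalized read counts within each cluster. The sketch below shows that arithmetic for a single cluster. It is an illustration of the quantity, not the Perl script's implementation: the function name `log2_fold_enrichment`, the pseudocount of 1 for clusters with no input reads, and the example counts are all assumptions, and the p-value that accompanies the fold change is not computed here.

```shell
#!/usr/bin/env bash
# Arguments: IP reads in cluster, total IP reads, input reads in cluster, total input reads.
# Prints log2( (ip / ip_total) / (input / input_total) ) with three decimals.
log2_fold_enrichment() {
  awk -v ip="$1" -v ip_total="$2" -v inp="$3" -v inp_total="$4" 'BEGIN {
    if (inp == 0) inp = 1    # assumed pseudocount so the ratio stays finite
    printf "%.3f\n", log((ip / ip_total) / (inp / inp_total)) / log(2)
  }'
}

# Demo with made-up counts: 80 IP reads vs 10 input reads in libraries of equal depth.
log2_fold_enrichment 80 1000000 10 1000000
```

Normalizing by library size matters because the IP and SMInput libraries are rarely sequenced to the same depth; without it, a deeper IP library would look enriched everywhere.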
9
Overlapping enriched clusters (peaks) were merged with a custom Perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
$ Bash example
# Install Perl if not already available
# sudo apt-get update && sudo apt-get install perl  # For Debian/Ubuntu
# yum install perl                                  # For CentOS/RHEL

# The script ships with the eCLIP repository; after cloning, locate it with:
#   find eclip -name compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
SCRIPT="/path/to/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl"

# Example usage with placeholder filenames.
# ASSUMPTION: the script takes the input-normalized peak BED followed by the
# output BED as positional arguments; confirm against the script header before running.
perl "${SCRIPT}" enriched_clusters.bed enriched_clusters.compressed.bed
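What merging overlapping clusters means can be shown with a sort-and-sweep over a BED file. The sketch below merges clusters that overlap on the same chromosome and strand and carries forward the scores of the member with the highest log2 fold enrichment. That tie-breaking rule, the six-column layout (strand in column 6) and the function name `merge_overlapping_peaks` are assumptions chosen for illustration; the Perl script's own rule may differ.

```shell
#!/usr/bin/env bash
# Input (stdin): tab-separated BED with chrom, start, end, -log10(p), log2 fold, strand.
# Output: one line per group of overlapping clusters on the same chromosome and strand.
merge_overlapping_peaks() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k6,6 -k2,2n | awk -F'\t' -v OFS='\t' '
    function flush() { if (have) print chrom, s, e, p, fc, strand }
    {
      if (have && $1 == chrom && $6 == strand && $2 + 0 < e + 0) {
        if ($3 + 0 > e + 0) e = $3                  # extend the merged interval
        if ($5 + 0 > fc + 0) { fc = $5; p = $4 }    # keep scores of the most enriched member
      } else {
        flush()
        chrom = $1; s = $2; e = $3; p = $4; fc = $5; strand = $6; have = 1
      }
    }
    END { flush() }
  '
}

# Demo with made-up clusters: the first two overlap on the + strand and are merged.
printf 'chr1\t100\t200\t5\t3\t+\nchr1\t150\t250\t8\t4\t+\nchr1\t300\t400\t2\t1\t+\n' \
  | merge_overlapping_peaks
```

BED intervals are half-open, so two clusters that merely touch (one ends where the next starts) are left separate.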
Raw Source Text
Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Uniquely mapped reads were removed of PCR duplicates with umi_tools
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
Assembly: mm10
Supplementary files format and content: bigwigs contain RPM-normalized read densities of uniquely-mapped reads
Supplementary files format and content: BED files contain CLIPper peak clusters. Columns 4 and 5 describe the -log10(p-value) and log2(fold) enrichment IP over corresponding SMInput.
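The supplementary file descriptions translate into two small calculations, sketched below. The first filters a peak BED on columns 4 and 5; the defaults of 3 and 3 (p-value at most 0.001, at least 8-fold enrichment) are thresholds commonly applied to eCLIP peaks and are an assumption here, since this record does not state a cutoff. The second computes the reads-per-million scale factor used for RPM-normalized densities from a count of uniquely mapped reads. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Keep BED lines whose column 4 (-log10 p-value) and column 5 (log2 fold
# enrichment) both reach the given minimums (defaults: 3 and 3).
filter_peaks() {
  local min_neglog10p="${1:-3}" min_log2fc="${2:-3}"
  awk -F'\t' -v p="$min_neglog10p" -v fc="$min_log2fc" \
    '$4 + 0 >= p + 0 && $5 + 0 >= fc + 0'
}

# Reads-per-million scale factor: multiply raw coverage by this value.
rpm_scale_factor() {
  awk -v n="$1" 'BEGIN { printf "%.6f\n", 1000000 / n }'
}

# Demo with made-up peaks: only the first line passes both thresholds.
printf 'chr1\t100\t200\t5.2\t3.4\t+\nchr1\t300\t400\t1.1\t4.0\t+\nchr2\t50\t90\t6.0\t2.0\t-\n' \
  | filter_peaks
# Demo: scale factor for a library with 2,000,000 uniquely mapped reads.
rpm_scale_factor 2000000
```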