GSE166402 Processing Pipeline
Publication
Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.Nature methods (2021) — PMID 33963355
Dataset
GSE166402Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [STAMP_new_eclip]
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi_tools=1.0.0 # Define input and output file names (placeholders) INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" OUTPUT_R1="output_R1.fastq.gz" OUTPUT_R2="output_R2.fastq.gz" LOG_FILE="umi_tools_extract.log" # Extract UMIs from read sequences and add them to read headers. # This command assumes UMIs are 10bp long and located at the start of Read 1. # Adjust --bc-pattern if UMI length or location is different (e.g., NNNNNNNNNN for 10bp UMI). # If UMIs are in a separate index read, the command would be different (e.g., --bc-pattern-r2). umi_tools extract \ --bc-pattern=NNNNNNNNNN \ --read1-in "${INPUT_R1}" \ --read2-in "${INPUT_R2}" \ --read1-out "${OUTPUT_R1}" \ --read2-out "${OUTPUT_R2}" \ --log "${LOG_FILE}" -
2
Args: --random-seed 1 --bc-pattern NNNNNNNNNN
$ Bash example
# Install umi_tools # conda install -c bioconda umi_tools=1.1.2 # Execute umi_tools extract to extract UMIs based on the pattern umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN -I input.fastq.gz -S output.fastq.gz
-
3
Reads were then trimmed with cutadapt (1.14).
$ Bash example
# Install cutadapt via conda # conda create -n cutadapt_env cutadapt=1.14 -y # conda activate cutadapt_env # Example command for trimming adapters and quality # Replace 'input_reads.fastq.gz' with your actual input file path. # Replace 'trimmed_reads.fastq.gz' with your desired output file path. # The adapter sequence 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA' is a common Illumina 3' adapter. Adjust if a different adapter was used. # -q 20: Trims low-quality bases from the 3' end with a quality threshold of 20. # -m 20: Discards reads shorter than 20 bp after trimming. cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -q 20 -m 20 -o trimmed_reads.fastq.gz input_reads.fastq.gz
-
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install cutadapt # conda install -c bioconda cutadapt=4.0 # Download adapter sequence file (if not already present) # The description refers to 'InvRNA*.fasta', and the example inputs contain 'InvRNA_adapters.fasta'. # wget https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA_adapters.fasta # Example input files (replace 'raw_reads.fastq.gz' with your actual input FASTQ file) # raw_reads.fastq.gz: Input FASTQ file containing sequencing reads cutadapt \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -a InvRNA_adapters.fasta \ -o trimmed_reads.fastq.gz \ raw_reads.fastq.gz
-
5
Reads were then trimmed once more with cutadapt (1.14) to remove double-ligation events.
$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt=1.14 # Define input and output file paths INPUT_R1="input_reads_R1.fastq.gz" INPUT_R2="input_reads_R2.fastq.gz" OUTPUT_R1="trimmed_reads_R1.fastq.gz" OUTPUT_R2="trimmed_reads_R2.fastq.gz" # Define the adapter sequence to remove. # This sequence should correspond to the adapter involved in double-ligation events. # For eCLIP, this is typically the 3' adapter sequence. # Example: ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" ADAPTER_SEQUENCE="YOUR_3_PRIME_ADAPTER_SEQUENCE" # Trim reads to remove double-ligation events using cutadapt # -a ADAPTER_SEQUENCE: Trims the 3' end of R1 if ADAPTER_SEQUENCE is found. # -A ADAPTER_SEQUENCE: Trims the 3' end of R2 if ADAPTER_SEQUENCE is found. # -o: Output file for R1 # -p: Output file for R2 (paired-end) cutadapt -a "${ADAPTER_SEQUENCE}" -A "${ADAPTER_SEQUENCE}" \ -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \ "${INPUT_R1}" "${INPUT_R2}" -
6
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=1.18 # Download example input files (replace with actual paths if already available) # wget -P . https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA_adapter.fasta # wget -P . https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA_R1.fastq.gz cutadapt \ -O 5 \ --quality-cutoff 6 \ -m 18 \ -a file:InvRNA_adapter.fasta \ -o InvRNA_R1_trimmed.fastq.gz \ InvRNA_R1.fastq.gz
-
7
Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05).
$ Bash example
# Install STAR if not already available # conda install -c bioconda star=2.4.0i # Define variables TRIMMED_READS="trimmed_reads.fastq.gz" # Placeholder for actual trimmed reads file REPBASE_STAR_INDEX="/path/to/RepBase_18.05_STAR_index" # Placeholder for the STAR index built from RepBase 18.05 OUTPUT_DIR="star_repbase_mapping" THREADS=8 # Adjust as needed # Create output directory mkdir -p "${OUTPUT_DIR}" # Run STAR mapping STAR --runThreadN "${THREADS}" \ --genomeDir "${REPBASE_STAR_INDEX}" \ --readFilesIn "${TRIMMED_READS}" \ --outFileNamePrefix "${OUTPUT_DIR}/star_repbase_" \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 6 -
8
Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables for clarity # Replace 'path/to/human_repbase_STAR_index' with the actual path to your STAR genome index for human_repbase. GENOME_DIR="path/to/human_repbase_STAR_index" # Replace 'path/to/read1.fastq.gz' with the actual path to your input read file(s). # For paired-end reads, it would be "path/to/read1.fastq.gz path/to/read2.fastq.gz" READ_FILES="path/to/read1.fastq.gz" OUT_PREFIX="out_prefix" # Execute STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ_FILES}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
9
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.4.0i # Define variables STAR_VERSION="2.4.0i" GENOME_DIR="/path/to/hg19_star_index" # Placeholder for indexed hg19 genome directory READS_FILE="input_unmapped_filtered_reads.fastq.gz" # Placeholder for input FASTQ file OUTPUT_PREFIX="aligned_reads_hg19" # --- Genome Indexing (Run once for hg19 if index doesn't exist) --- # This step requires the hg19 FASTA and GTF files to build the STAR index. # Example: # GENOME_FASTA="/path/to/hg19.fa" # GENOME_GTF="/path/to/hg19.gtf" # mkdir -p "${GENOME_DIR}" # STAR --runMode genomeGenerate \ # --genomeDir "${GENOME_DIR}" \ # --genomeFastaFiles "${GENOME_FASTA}" \ # --sjdbGTFfile "${GENOME_GTF}" \ # --sjdbOverhang 100 \ # --runThreadN 8 # Adjust threads as needed based on available resources # --- Alignment Step --- # Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19). STAR --runMode alignReads \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_FILE}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}." \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped None \ --runThreadN 8 # Adjust threads as needed based on available resources -
10
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables # Placeholder for human GRCh38 genome index. Replace with your actual path. GENOME_DIR="/path/to/STAR_genome_index_GRCh38" # Placeholder for input FASTQ files. Assuming paired-end reads based on common RNA-seq practices and 'readFilesIn' plural. READ1_FILE="/path/to/sample_R1.fastq.gz" READ2_FILE="/path/to/sample_R2.fastq.gz" # Placeholder for output file prefix. OUTPUT_PREFIX="sample_aligned_" # Execute STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1_FILE}" "${READ2_FILE}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
11
Aligned reads were sorted with samtools (1.6)
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.6 # Sort aligned reads (e.g., BAM file) samtools sort -o sorted_reads.bam aligned_reads.bam # Index the sorted BAM file (common practice after sorting) samtools index sorted_reads.bam
-
12
Sorted reads were collapsed with umi_tools (1.0.0).
$ Bash example
# Install umi_tools (if not already installed) # conda install -c bioconda umi_tools=1.0.0 # Placeholder for input and output files # Replace 'input_sorted.bam' with your actual sorted BAM file. # Replace 'output_deduplicated.bam' with your desired output file name. # This command assumes UMIs are already extracted and stored in the 'RX' tag (default for umi_tools extract). # The '--method=directional' is a common and robust deduplication strategy. umi_tools dedup \ --input input_sorted.bam \ --output output_deduplicated.bam \ --extract-umi-method=tag \ --umi-tag=RX \ --method=directional
-
13
Args: --random-seed 1 --method unique
Generic Data Processing Tool (Inferred with models/gemini-2.5-flash) vUnknown$ Bash example
# This command represents a step in a bioinformatics pipeline. # The specific tool is not explicitly stated in the description. # The arguments '--random-seed 1' and '--method unique' suggest a data processing # or analysis step involving a random seed for reproducibility and a 'unique' # method for handling or identifying unique features/items. # # Please replace 'your_actual_tool_name' with the specific tool # that uses these arguments in your pipeline context. your_actual_tool_name --random-seed 1 --method unique
-
14
BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
bash # Install CLIPper (if not already installed) # conda install -c bioconda clipper # Placeholder for genome size file (e.g., for hg38) # Download hg38.chrom.sizes from UCSC or generate it # Example: wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes GENOME_SIZE_FILE="hg38.chrom.sizes" # Replace with actual path to your genome size file INPUT_BAM="input.bam" # Replace with your input BAM file OUTPUT_PREFIX="output_peaks" # Prefix for output files # Execute CLIPper to identify peak clusters # Using common parameters for eCLIP peak calling (FDR 0.05, window size 20, extension size 100) clipper -b "${INPUT_BAM}" \ -s "${GENOME_SIZE_FILE}" \ -o "${OUTPUT_PREFIX}" \ -f 0.05 \ -w 20 \ -e 100 \ -p 8 # Example: use 8 processes -
15
Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
clipper (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Installation instructions (commented out) # Clone the clipper repository # git clone https://github.com/yeolab/clipper.git # cd clipper # Install dependencies (e.g., via pip or conda) # conda create -n clipper_env python=3.8 # conda activate clipper_env # pip install -r requirements.txt # Ensure the clipper script is executable or run with python. # The --species hg19 argument indicates the human genome build 19. # For clipper, this typically refers to a pre-indexed genome directory or a configuration # that points to the hg19 genome FASTA and its annotations (e.g., GTF/GFF). # Make sure the necessary reference files for hg19 are accessible to clipper. # Execution command python clipper.py \ --species hg19 \ --bam path/to/input.bam \ --timeout 3600000 \ --maxgenes 1000000 \ --save-pickle \ --outfile path/to/output.bam -
16
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# Install eclip (if not already installed) # The eclip repository (https://github.com/yeolab/eclip) contains the peaksnormalize.pl script. # Ensure Perl and necessary dependencies (e.g., bedtools, samtools) are available. # For example, using conda: # conda create -n eclip_env python=3.8 # conda activate eclip_env # conda install -c bioconda bedtools samtools # git clone https://github.com/yeolab/eclip.git # export PATH=$PATH:$(pwd)/eclip/bin # Define input files (placeholders based on description) IP_BAM="ip_sample.bam" # BAM file for IP sample INPUT_BAM="input_sample.bam" # BAM file for INPUT sample PEAK_FILE="input_peaks.bed" # File containing peak clusters (e.g., BED format) OUTPUT_PREFIX="normalized_peaks" # Prefix for output files # Execute the peak normalization command peaksnormalize.pl "${PEAK_FILE}" "${IP_BAM}" "${INPUT_BAM}" "${OUTPUT_PREFIX}" -
17
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Install dependencies # This script is part of the 'merge_peaks' repository, which is often used within the eCLIP pipeline. # It may rely on standard Perl modules and external tools like bedtools. # conda install -c bioconda bedtools # Clone the merge_peaks repository to obtain the script # git clone https://github.com/yeolab/merge_peaks.git # SCRIPT_DIR="merge_peaks" # Directory where the repository was cloned # Placeholder for input normalized peak regions from replicates # These would be generated by previous steps in the eCLIP pipeline (e.g., peak calling and normalization) # Example: two replicate peak files in BED format INPUT_REPLICATE1_PEAKS="sample_rep1_normalized_peaks.bed" INPUT_REPLICATE2_PEAKS="sample_rep2_normalized_peaks.bed" # If more replicates, add them as additional arguments: INPUT_REPLICATE3_PEAKS="sample_rep3_normalized_peaks.bed" # Output merged peak file OUTPUT_MERGED_PEAKS="sample_merged_replicate_overlapping_peaks.bed" # Path to the script # Adjust SCRIPT_PATH if the script is in your system's PATH or a different location SCRIPT_PATH="${SCRIPT_DIR}/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl" # Execute the script to merge overlapping normalized peak regions # The script typically takes multiple input BED files as arguments and outputs a single merged BED file to stdout. perl "${SCRIPT_PATH}" "${INPUT_REPLICATE1_PEAKS}" "${INPUT_REPLICATE2_PEAKS}" > "${OUTPUT_MERGED_PEAKS}" -
18
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Install IDR (if not already installed) # conda install -c bioconda idr # Install merge_peaks (if not already installed) # git clone https://github.com/yeolab/merge_peaks.git # export PATH=$PATH:/path/to/merge_peaks/bin # Define input peak files (replace with actual file paths) REP1_PEAKS="rep1_peaks.compressed.bed" REP2_PEAKS="rep2_peaks.compressed.bed" OUTPUT_PREFIX="idr_reproducible_peaks" # Step 1: Rank normalized peak files by entropy score # make_informationcontent_from_peaks.pl adds entropy as the 5th column (1-indexed) make_informationcontent_from_peaks.pl "${REP1_PEAKS}" > "${REP1_PEAKS}.entropy" make_informationcontent_from_peaks.pl "${REP2_PEAKS}" > "${REP2_PEAKS}.entropy" # Sort the entropy-scored peaks by the 5th column (entropy score) in reverse numerical order # IDR expects the score column to be sorted for optimal performance, though it can sort internally. # The merge_peaks pipeline often sorts explicitly. sort -k5gr "${REP1_PEAKS}.entropy" > "${REP1_PEAKS}.entropy.sorted" sort -k5gr "${REP2_PEAKS}.entropy" > "${REP2_PEAKS}.entropy.sorted" # Step 2: Run IDR to determine reproducible peaks # --rank-column 5 specifies that the 5th column (entropy score) should be used for ranking idr --samples "${REP1_PEAKS}.entropy.sorted" "${REP2_PEAKS}.entropy.sorted" \ --output-file "${OUTPUT_PREFIX}" \ --rank-column 5 \ --plot \ --log-output-file "${OUTPUT_PREFIX}.log" # Clean up intermediate files (optional) # rm "${REP1_PEAKS}.entropy" "${REP2_PEAKS}.entropy" # rm "${REP1_PEAKS}.entropy.sorted" "${REP2_PEAKS}.entropy.sorted"
Raw Source Text
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) Reads were then trimmed once more with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19). Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Aligned reads were sorted with samtools (1.6) Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. Genome_build: hg19 Supplementary_files_format_and_content: BigWig, BED