GSE155728 Processing Pipeline
Publication
Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.Nature methods (2021) — PMID 33963355
Dataset
GSE155728Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [tia1_eclip]
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).
$ Bash example
# Clone the eclip repository to get the eclipdemux script # git clone https://github.com/yeolab/eclip.git # cd eclip/workflow/scripts # Example usage of eclipdemux.py # This command assumes a 6bp randomer followed by a 4bp barcode at the 5' end of the read. # The randomer will be moved to the read header, and the barcode will be trimmed. python eclipdemux.py \ --input reads.fastq.gz \ --output demuxed_reads.fastq.gz \ --barcode-length 4 \ --randomer-length 6 -
2
Args: --length 10
Generic Tool (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This is a placeholder command as the specific tool could not be inferred from "Args: --length 10". # The --length argument typically specifies a minimum length for reads or a specific feature. generic_tool --length 10
-
3
Reads were then trimmed with cutadapt (1.14).
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=1.14 # Trim reads for quality and minimum length (common default parameters) # Replace 'raw_reads.fastq.gz' with your input file and 'trimmed_reads.fastq.gz' with your desired output file. # If you have specific adapter sequences, add them using -a (3' adapter) and/or -g (5' adapter). cutadapt -q 20 -m 20 -o trimmed_reads.fastq.gz raw_reads.fastq.gz
-
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt # Define input and output files (placeholders) INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="trimmed.fastq.gz" # Adapter sequences generated by parsebarcodes.sh G_ADAPTERS="g_adapters.fasta" A_ADAPTERS="A_adapters.fasta" a_ADAPTERS="a_adapters.fasta" # Execute cutadapt with parameters from the description cutadapt \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -g "${G_ADAPTERS}" \ -A "${A_ADAPTERS}" \ -a "${a_ADAPTERS}" \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
5
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=1.9.1 # Trim reads to remove 3' adapter sequences, addressing potential double-ligation events. # Replace 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' with the actual 3' adapter sequence used in the experiment. # Replace 'input_reads.fastq.gz' and 'output_trimmed_reads.fastq.gz' with your actual input and output file names. cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \ -o output_trimmed_reads.fastq.gz \ input_reads.fastq.gz -
6
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt=1.18 # Define input and output files INPUT_FASTQ="input_reads.fastq.gz" # Placeholder for input FASTQ file OUTPUT_FASTQ="output_trimmed_reads.fastq.gz" # Placeholder for output trimmed FASTQ file ADAPTER_FILE="A_adapters.fasta" # Fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline # Execute cutadapt command for adapter trimming and quality filtering cutadapt \ --match-read-wildcards \ -O 5 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -A "${ADAPTER_FILE}" \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
7
Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05).
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star # Define paths for input and output # Replace with actual paths to your trimmed reads and STAR index TRIMMED_READS="path/to/trimmed_reads.fastq.gz" # Assuming single-end reads. For paired-end, use "read1.fastq.gz read2.fastq.gz" STAR_INDEX_DIR="path/to/repbase_18.05_star_index" # This directory should contain the STAR index built from RepBase 18.05 OUTPUT_PREFIX="mapped_repbase_18.05" # Prefix for output files NUM_THREADS=8 # Number of threads to use, adjust as needed # Create output directory if it doesn't exist mkdir -p "$(dirname ${OUTPUT_PREFIX})" # Map trimmed reads to the human-specific repeat element database (RepBase 18.05) using STAR STAR --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${TRIMMED_READS}" \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 100 # Allow reads to map to up to 100 locations, common for repeat mapping -
8
Args: --runThreadN 16 '--genomeDir human_repbase '--readFilesIn path/to/read1 path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 30 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd
STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables (replace with actual paths and values) # GENOME_DIR refers to the directory containing the STAR genome index files. # 'human_repbase' is specified in the description, implying a pre-built index. # A common human reference assembly for STAR indices is GRCh38. GENOME_DIR="human_repbase" READ1="path/to/read1.fastq.gz" READ2="path/to/read2.fastq.gz" OUT_PREFIX="out_prefix" # Run STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
9
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19).
$ Bash example
# Install STAR (example using conda) # conda create -n star_env star=2.4.0i -c bioconda -y # conda activate star_env # Define variables STAR_INDEX_DIR="path/to/star_hg19_index" # Directory containing pre-built hg19 STAR index READS_FASTQ="unmapped_filtered_reads.fastq.gz" # Input FASTQ file of unmapped reads OUTPUT_PREFIX="aligned_to_hg19" NUM_THREADS=8 # Adjust based on available resources # --- Genome Index Generation (Run once if index is not available) --- # To build the hg19 index, you would need hg19.fa and hg19.gtf files. # Example: # mkdir -p "${STAR_INDEX_DIR}" # STAR --runMode genomeGenerate \ # --genomeDir "${STAR_INDEX_DIR}" \ # --genomeFastaFiles "path/to/hg19.fa" \ # --sjdbGTFfile "path/to/hg19.gtf" \ # --runThreadN "${NUM_THREADS}" # --- Alignment Step --- # Maps unmapped reads (filtered of repeat elements) to the human genome (hg19) STAR --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${READS_FASTQ}" \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 20 # A common parameter to limit multi-mapping reads -
10
Args: --runThreadN 16 '--genomeDir genomedir '--readFilesIn /path/to/read1 /path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 1 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables for input and output # Replace /path/to/STAR_genome_index with the actual path to your STAR genome index # This index should be built using a reference genome (e.g., GRCh38/hg38) and GTF annotation. GENOME_DIR="/path/to/STAR_genome_index" READ1="/path/to/read1" READ2="/path/to/read2" OUT_PREFIX="out_prefix" # Execute STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
11
Aligned reads were sorted with samtools (1.4.1)
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.4.1 # Sort aligned reads by coordinate samtools sort -o output.sorted.bam input.bam
-
12
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.
$ Bash example
# Install eclip (if not already installed) # pip install eclip # or via conda # Define input and output files INPUT_R1="sorted_reads_R1.fastq.gz" INPUT_R2="sorted_reads_R2.fastq.gz" OUTPUT_PREFIX="collapsed_reads" BARCODE_LENGTH=6 # Example barcode length, adjust as needed based on experimental design MIN_READS_PER_BARCODE=1 # Example minimum reads per unique barcode, adjust as needed # Execute barcodecollapsepe.py barcodecollapsepe.py -i "${INPUT_R1}" -j "${INPUT_R2}" -o "${OUTPUT_PREFIX}" -b "${BARCODE_LENGTH}" -m "${MIN_READS_PER_BARCODE}" -
13
Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
$ Bash example
# Install Picard (if not already installed) # conda install -c bioconda picard # Execute Picard MarkDuplicates # Assuming picard.jar is in the PATH or specified with its full path java -jar $(which picard.jar) MarkDuplicates \ I=rmDup.bam \ O=preRmDup.bam \ M=metrics_file \ REMOVE_DUPLICATES=false \ VALIDATION_STRINGENCY=SILENT \ ASSUME_SORTED=true -
14
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.4.1 # Merge PCR de-duped reads from each inline barcode # Replace input1.bam, input2.bam with the actual paths to your de-duped BAM files for each barcode samtools merge merged.bam input1.bam input2.bam
-
15
Merged alignments were split to keep just read2 using samtools (1.4.1) view.
$ Bash example
# Install samtools if not already installed # conda install samtools=1.4.1 # Split merged alignments to keep just read2 # Input: merged_alignments.bam (merged alignments) # Output: read2_alignments.bam (alignments containing only read2) samtools view -b -f 0x80 merged_alignments.bam > read2_alignments.bam
-
16
Args: -h -b -f 128
Generic Command (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# The specific bioinformatics tool could not be inferred from the generic arguments: -h -b -f 128. # This is a placeholder command. A specific tool and context (e.g., assay type, task) would be needed to generate a meaningful command. # Example of how the arguments might be used with a hypothetical tool: # placeholder_tool -h -b -f 128
-
17
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
# Install CLIPper if not already installed # pip install clipper # Or clone from GitHub and install # git clone https://github.com/yeolab/clipper.git # cd clipper # python setup.py install # Define input and output files INPUT_BAM="read2.bam" # Placeholder for Read2 BAM file OUTPUT_PREFIX="read2_clipper_peaks" GENOME_SIZE="hg38" # Placeholder for genome size (e.g., hg38, mm10, or a custom value) # Run CLIPper to identify peak clusters # Using common parameters: -b for input BAM, -s for genome size, -o for output prefix clipper.py -b "${INPUT_BAM}" -s "${GENOME_SIZE}" -o "${OUTPUT_PREFIX}" -
18
Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
eCLIP_BAM_Processor.py (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install dependencies for a custom Python script (example, adjust as needed) # conda create -n eclip_env python=3.8 # conda activate eclip_env # conda install -c bioconda pysam numpy scipy pandas # Common dependencies for BAM processing and pickle # # Reference genome: hg19 (Human Genome build 19) # Source: UCSC Genome Browser (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz) or NCBI. # Execute the inferred eCLIP BAM processing script python eCLIP_BAM_Processor.py \ --species hg19 \ --bam path/to/input.bam \ --timeout 3600000 \ --maxgenes 1000000 \ --save-pickle \ --outfile path/to/output.bam -
19
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# Install eCLIP tools (example, actual installation might vary) # git clone https://github.com/yeolab/eclip.git # cd eclip # # Follow installation instructions, e.g., setting up Perl dependencies and adding eclip/bin to PATH # The description mentions "overlap_peakfi_with_bam_PE.pl" which is often used # to filter peak files based on BAM coverage prior to normalization. # Assuming 'pre_normalized_peaks.narrowPeak' is the output of such a filtering step # or the initial peak file to be normalized. # Placeholder variables IP_BAM="ip_sample_read2.bam" INPUT_BAM="input_sample_read2.bam" PEAK_FILE="pre_normalized_peaks.narrowPeak" # Input peak file for normalization OUTPUT_PREFIX="normalized_peaks" # Prefix for output files (e.g., normalized_peaks.bed, normalized_peaks.log) # Execute the normalization using peaksnormalize.pl # Usage: peaksnormalize.pl <IP_BAM> <INPUT_BAM> <PEAK_FILE> <OUTPUT_PREFIX> peaksnormalize.pl "${IP_BAM}" "${INPUT_BAM}" "${PEAK_FILE}" "${OUTPUT_PREFIX}" -
20
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Install Perl (if not already available) # sudo apt-get update && sudo apt-get install -y perl # # Clone the merge_peaks repository to get the script # git clone https://github.com/yeolab/merge_peaks.git # # Add the script's directory to PATH or call it directly # export PATH=$PATH:$(pwd)/merge_peaks/bin # # Example usage: # Assuming 'rep1_normalized_peaks.bed' and 'rep2_normalized_peaks.bed' are the input normalized peak regions. # The script merges overlapping regions from multiple replicate BED files based on L2 fold enrichment. perl merge_peaks/bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \ rep1_normalized_peaks.bed \ rep2_normalized_peaks.bed \ > merged_replicate_peaks.bed -
21
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Install IDR (if not already installed) # conda install -c bioconda idr # Install merge_peaks pipeline (if not already installed) # git clone https://github.com/yeolab/merge_peaks.git # export PATH=$PATH:/path/to/merge_peaks/scripts # Define input files (placeholders) INPUT_REP1_BED="replicate1.compressed.bed" INPUT_REP2_BED="replicate2.compressed.bed" GENOME_FASTA="/path/to/genome/hg38.fa" # Placeholder for reference genome fasta (e.g., hg38) OUTPUT_PREFIX="idr_output" # Step 1: Rank peaks by entropy score using make_informationcontent_from_peaks.pl # This script is part of the merge_peaks pipeline (https://github.com/yeolab/merge_peaks). # It takes a BED file and a genome fasta, and outputs a new BED file with an information content score (entropy) in the 5th column. RANKED_REP1_BED="${INPUT_REP1_BED%.bed}.ranked.bed" RANKED_REP2_BED="${INPUT_REP2_BED%.bed}.ranked.bed" make_informationcontent_from_peaks.pl "${INPUT_REP1_BED}" "${GENOME_FASTA}" "${RANKED_REP1_BED}" make_informationcontent_from_peaks.pl "${INPUT_REP2_BED}" "${GENOME_FASTA}" "${RANKED_REP2_BED}" # Step 2: Run IDR (Irreproducible Discovery Rate) with the ranked peak files # The --rank-by signal.value parameter tells IDR to use the 5th column (score) of the input BED files, # which now contains the entropy score generated by make_informationcontent_from_peaks.pl. idr --samples "${RANKED_REP1_BED}" "${RANKED_REP2_BED}" \ --output-file "${OUTPUT_PREFIX}.idr" \ --rank-by signal.value \ --plot \ --log-output-file "${OUTPUT_PREFIX}.log"
Raw Source Text
Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10 Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline) Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline) Trimmed reads were then mapped with STAR (2.4.0i) against a human-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 '--genomeDir human_repbase '--readFilesIn path/to/read1 path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 30 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19). Args: --runThreadN 16 '--genomeDir genomedir '--readFilesIn /path/to/read1 /path/to/read2 '--outFileNamePrefix out_prefix '--outReadsUnmapped Fastx '--outSAMtype BAM Unsorted '--outSAMattributes All '--outSAMunmapped Within '--outSAMattrRGline ID:foo '--outFilterType BySJout '--outFilterMultimapNmax 1 '--outFilterMultimapScoreRange 1 '--outFilterScoreMin 10 '--alignEndsType EndToEnd Aligned reads were sorted with samtools (1.4.1) Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam) Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128 Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. Genome_build: hg19 Supplementary_files_format_and_content: BigWig, BED