GSE112353 Processing Pipeline
Publication
Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.Nature methods (2021) — PMID 33963355
Dataset
GSE112353Active ribosome profiling with RiboLace and standard ribosome profiling in HEK-293 cells
Processing Steps
Generate Jupyter Notebook-
1
Raw reads were processed by removing 5â adapters, discarding reads shorter than 20 nucleotides and trimming the first nucleotide (using Trimmomatic v0.36).
Trimmomatic v0.36$ Bash example
# Install Trimmomatic (if not already installed) # conda install -c bioconda trimmomatic=0.36 # Define input and output files INPUT_FASTQ="raw_reads.fastq" OUTPUT_FASTQ="trimmed_reads.fastq" # Define the path to the Trimmomatic JAR file. # Replace '/path/to/trimmomatic-0.36.jar' with the actual path on your system. # If installed via conda, it might be in $CONDA_PREFIX/share/trimmomatic-0.36/trimmomatic-0.36.jar TRIMMOMATIC_JAR="/path/to/trimmomatic-0.36.jar" # Define the adapter file. # This file typically comes with Trimmomatic installation (e.g., adapters/TruSeq3-SE.fa) # or needs to be provided by the user. The description mentions "removing 5' adapters" # without specifying which, so TruSeq3-SE.fa is used as a common placeholder for single-end reads. ADAPTER_FILE="/path/to/adapters/TruSeq3-SE.fa" # Execute Trimmomatic for single-end (SE) reads # Parameters: # ILLUMINACLIP: removes adapter sequences. Format: <fastaFile>:<seedMismatches>:<palindromeClipThreshold>:<simpleClipThreshold> # HEADCROP: trims a fixed number of bases from the start of the read. # MINLEN: discards reads shorter than a specified length. java -jar "$TRIMMOMATIC_JAR" SE \ "$INPUT_FASTQ" \ "$OUTPUT_FASTQ" \ ILLUMINACLIP:"$ADAPTER_FILE":2:30:10 \ HEADCROP:1 \ MINLEN:20 -
2
Reads mapping on rRNAs and tRNAs (downloaded from the SILVA rRNA and the Genomic tRNA databases respectively) were removed.
$ Bash example
# Define reference files and input/output RRNA_TRNA_FA="rRNA_tRNA.fa" # Combined rRNA and tRNA sequences from SILVA and Genomic tRNA databases BOWTIE2_INDEX_PREFIX="rRNA_tRNA_index" INPUT_FASTQ="input_reads.fastq.gz" # Replace with your actual input file (e.g., input_R1.fastq.gz for paired-end) OUTPUT_FASTQ="filtered_reads.fastq.gz" # For single-end, or filtered_R%.fastq.gz for paired-end NUM_THREADS=8 # Adjust as needed # --- Installation (commented out) --- # conda install -c bioconda bowtie2 # --- Prepare reference index (if not already done) --- # Download and combine rRNA and tRNA sequences (e.g., from SILVA and Genomic tRNA databases) # Example for creating the combined reference and index: # wget -O silva_rRNA.fa.gz "https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz" # wget -O gtrnadb.fa.gz "http://gtrnadb.ucsc.edu/GtRNAdb-12.0/genomes/human/hg38-tRNAs.fa.gz" # Example for human # gunzip -c silva_rRNA.fa.gz gtrnadb.fa.gz > $RRNA_TRNA_FA # bowtie2-build $RRNA_TRNA_FA $BOWTIE2_INDEX_PREFIX # --- Execute read mapping and removal --- # Align reads to the rRNA/tRNA reference and output unaligned reads # --un-gz: write reads that did not align to a gzipped FASTQ file # -p: number of threads # -x: basename of the index # -U: single-end reads input (use -1 R1.fastq.gz -2 R2.fastq.gz for paired-end) # --very-fast: a preset for faster alignment, suitable for filtering # Redirect stdout (SAM output) to /dev/null as we only care about the --un-gz output # For single-end reads: bowtie2 --very-fast -p $NUM_THREADS -x $BOWTIE2_INDEX_PREFIX -U $INPUT_FASTQ --un-gz $OUTPUT_FASTQ > /dev/null # For paired-end reads, replace the above command with: # bowtie2 --very-fast -p $NUM_THREADS -x $BOWTIE2_INDEX_PREFIX -1 input_R1.fastq.gz -2 input_R2.fastq.gz --un-conc-gz filtered_R%.fastq.gz > /dev/null
-
3
The remaining reads were aligned to the human transcriptome (Ensembl v85, corresponding to GENCODE 25) with Bowtie2 (v2.2.6) employing the default settings.
$ Bash example
# Install Bowtie2 # conda install -c bioconda bowtie2=2.2.6 # Placeholder for reference transcriptome FASTA file (Ensembl v85, GENCODE 25) # Download the human transcriptome FASTA for Ensembl v85 / GENCODE 25 # For example, from Ensembl FTP or GENCODE website # wget -O human_transcriptome_ensembl_v85_gencode_25.fa "https://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz" # This is an example, actual URL might vary or require specific GENCODE version # gunzip human_transcriptome_ensembl_v85_gencode_25.fa.gz # Build Bowtie2 index for the human transcriptome bowtie2-build human_transcriptome_ensembl_v85_gencode_25.fa human_transcriptome_ensembl_v85_gencode_25_index # Align remaining reads to the human transcriptome using Bowtie2 with default settings # Assuming 'remaining_reads.fastq' is the input file and 'aligned_reads.sam' is the output bowtie2 -x human_transcriptome_ensembl_v85_gencode_25_index -U remaining_reads.fastq -S aligned_reads.sam
-
4
All reads aligning to the very same region were collapsed to avoid potential PCR duplicates, and only strand-specific reads were kept.
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools=1.19 # Define input and output file paths INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned BAM file OUTPUT_DEDUP_BAM="deduplicated_reads.bam" NUM_THREADS=8 # Adjust as needed # Collapse reads to avoid PCR duplicates and keep strand-specific reads # The -r flag removes duplicate reads instead of just marking them. # samtools markdup is inherently strand-aware for paired-end reads, # respecting the strand information during deduplication. samtools markdup -r -@ "${NUM_THREADS}" "${INPUT_BAM}" "${OUTPUT_DEDUP_BAM}" -
5
The number of ribosome protected fragments was determined with samtools idxstats.
$ Bash example
# Install samtools (e.g., via conda) # conda install -c bioconda samtools=1.19 # Determine the number of ribosome protected fragments using samtools idxstats # Input: aligned.bam (a BAM file containing ribosome protected fragments) # Output: ribosome_protected_fragments.txt (a file with chromosome, length, mapped reads, unmapped reads) samtools idxstats aligned.bam > ribosome_protected_fragments.txt
Tools Used
Raw Source Text
Raw reads were processed by removing 5â adapters, discarding reads shorter than 20 nucleotides and trimming the first nucleotide (using Trimmomatic v0.36). Reads mapping on rRNAs and tRNAs (downloaded from the SILVA rRNA and the Genomic tRNA databases respectively) were removed. The remaining reads were aligned to the human transcriptome (Ensembl v85, corresponding to GENCODE 25) with Bowtie2 (v2.2.6) employing the default settings. All reads aligning to the very same region were collapsed to avoid potential PCR duplicates, and only strand-specific reads were kept. The number of ribosome protected fragments was determined with samtools idxstats. Genome_build: GRCh38.p7 Supplementary_files_format_and_content: HEK_count_table.txt: tab-delimited text file including transcript annotation (columns 1-8) and sample-specific counts of aligned reads