GSE112353 Processing Pipeline

OTHER code_examples 5 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

Active ribosome profiling with RiboLace and standard ribosome profiling in HEK-293 cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw reads were processed by removing 5â adapters, discarding reads shorter than 20 nucleotides and trimming the first nucleotide (using Trimmomatic v0.36).

Trimmomatic v0.36

$ Bash example

# Install Trimmomatic (if not already installed)
# conda install -c bioconda trimmomatic=0.36

# Define input and output files
INPUT_FASTQ="raw_reads.fastq"
OUTPUT_FASTQ="trimmed_reads.fastq"

# Define the path to the Trimmomatic JAR file.
# Replace '/path/to/trimmomatic-0.36.jar' with the actual path on your system.
# If installed via conda, it might be in $CONDA_PREFIX/share/trimmomatic-0.36/trimmomatic-0.36.jar
TRIMMOMATIC_JAR="/path/to/trimmomatic-0.36.jar"

# Define the adapter file.
# This file typically comes with Trimmomatic installation (e.g., adapters/TruSeq3-SE.fa)
# or needs to be provided by the user. The description mentions "removing 5' adapters"
# without specifying which, so TruSeq3-SE.fa is used as a common placeholder for single-end reads.
ADAPTER_FILE="/path/to/adapters/TruSeq3-SE.fa"

# Execute Trimmomatic for single-end (SE) reads
# Parameters:
#   ILLUMINACLIP: removes adapter sequences. Format: <fastaFile>:<seedMismatches>:<palindromeClipThreshold>:<simpleClipThreshold>
#   HEADCROP: trims a fixed number of bases from the start of the read.
#   MINLEN: discards reads shorter than a specified length.
java -jar "$TRIMMOMATIC_JAR" SE \
    "$INPUT_FASTQ" \
    "$OUTPUT_FASTQ" \
    ILLUMINACLIP:"$ADAPTER_FILE":2:30:10 \
    HEADCROP:1 \
    MINLEN:20

Reads mapping on rRNAs and tRNAs (downloaded from the SILVA rRNA and the Genomic tRNA databases respectively) were removed.

Bowtie2 (Inferred with models/gemini-2.5-flash) v2.5.2 GitHub

$ Bash example

# Define reference files and input/output
RRNA_TRNA_FA="rRNA_tRNA.fa" # Combined rRNA and tRNA sequences from SILVA and Genomic tRNA databases
BOWTIE2_INDEX_PREFIX="rRNA_tRNA_index"
INPUT_FASTQ="input_reads.fastq.gz" # Replace with your actual input file (e.g., input_R1.fastq.gz for paired-end)
OUTPUT_FASTQ="filtered_reads.fastq.gz" # For single-end, or filtered_R%.fastq.gz for paired-end
NUM_THREADS=8 # Adjust as needed

# --- Installation (commented out) ---
# conda install -c bioconda bowtie2

# --- Prepare reference index (if not already done) ---
# Download and combine rRNA and tRNA sequences (e.g., from SILVA and Genomic tRNA databases)
# Example for creating the combined reference and index:
# wget -O silva_rRNA.fa.gz "https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz"
# wget -O gtrnadb.fa.gz "http://gtrnadb.ucsc.edu/GtRNAdb-12.0/genomes/human/hg38-tRNAs.fa.gz" # Example for human
# gunzip -c silva_rRNA.fa.gz gtrnadb.fa.gz > $RRNA_TRNA_FA
# bowtie2-build $RRNA_TRNA_FA $BOWTIE2_INDEX_PREFIX

# --- Execute read mapping and removal ---
# Align reads to the rRNA/tRNA reference and output unaligned reads
# --un-gz: write reads that did not align to a gzipped FASTQ file
# -p: number of threads
# -x: basename of the index
# -U: single-end reads input (use -1 R1.fastq.gz -2 R2.fastq.gz for paired-end)
# --very-fast: a preset for faster alignment, suitable for filtering
# Redirect stdout (SAM output) to /dev/null as we only care about the --un-gz output
# For single-end reads:
bowtie2 --very-fast -p $NUM_THREADS -x $BOWTIE2_INDEX_PREFIX -U $INPUT_FASTQ --un-gz $OUTPUT_FASTQ > /dev/null

# For paired-end reads, replace the above command with:
# bowtie2 --very-fast -p $NUM_THREADS -x $BOWTIE2_INDEX_PREFIX -1 input_R1.fastq.gz -2 input_R2.fastq.gz --un-conc-gz filtered_R%.fastq.gz > /dev/null

View on GitHub

The remaining reads were aligned to the human transcriptome (Ensembl v85, corresponding to GENCODE 25) with Bowtie2 (v2.2.6) employing the default settings.

Bowtie2 v2.2.6 GitHub

$ Bash example

# Install Bowtie2
# conda install -c bioconda bowtie2=2.2.6

# Placeholder for reference transcriptome FASTA file (Ensembl v85, GENCODE 25)
# Download the human transcriptome FASTA for Ensembl v85 / GENCODE 25
# For example, from Ensembl FTP or GENCODE website
# wget -O human_transcriptome_ensembl_v85_gencode_25.fa "https://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz" # This is an example, actual URL might vary or require specific GENCODE version
# gunzip human_transcriptome_ensembl_v85_gencode_25.fa.gz

# Build Bowtie2 index for the human transcriptome
bowtie2-build human_transcriptome_ensembl_v85_gencode_25.fa human_transcriptome_ensembl_v85_gencode_25_index

# Align remaining reads to the human transcriptome using Bowtie2 with default settings
# Assuming 'remaining_reads.fastq' is the input file and 'aligned_reads.sam' is the output
bowtie2 -x human_transcriptome_ensembl_v85_gencode_25_index -U remaining_reads.fastq -S aligned_reads.sam

View on GitHub

All reads aligning to the very same region were collapsed to avoid potential PCR duplicates, and only strand-specific reads were kept.

samtools markdup (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Define input and output file paths
INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned BAM file
OUTPUT_DEDUP_BAM="deduplicated_reads.bam"
NUM_THREADS=8 # Adjust as needed

# Collapse reads to avoid PCR duplicates and keep strand-specific reads
# The -r flag removes duplicate reads instead of just marking them.
# samtools markdup is inherently strand-aware for paired-end reads, 
# respecting the strand information during deduplication.
samtools markdup -r -@ "${NUM_THREADS}" "${INPUT_BAM}" "${OUTPUT_DEDUP_BAM}"

View on GitHub

The number of ribosome protected fragments was determined with samtools idxstats.

samtools v1.19 GitHub

$ Bash example

# Install samtools (e.g., via conda)
# conda install -c bioconda samtools=1.19

# Determine the number of ribosome protected fragments using samtools idxstats
# Input: aligned.bam (a BAM file containing ribosome protected fragments)
# Output: ribosome_protected_fragments.txt (a file with chromosome, length, mapped reads, unmapped reads)
samtools idxstats aligned.bam > ribosome_protected_fragments.txt

View on GitHub

Tools Used

Bowtie2

Raw Source Text

Raw reads were processed by removing 5â adapters, discarding reads shorter than 20 nucleotides and trimming the first nucleotide (using Trimmomatic v0.36). Reads mapping on rRNAs and tRNAs (downloaded from the SILVA rRNA and the Genomic tRNA databases respectively) were removed. The remaining reads were aligned to the human transcriptome (Ensembl v85, corresponding to GENCODE 25) with Bowtie2 (v2.2.6) employing the default settings. All reads aligning to the very same region were collapsed to avoid potential PCR duplicates, and only strand-specific reads were kept.  The number of ribosome protected fragments was determined with samtools idxstats.
Genome_build: GRCh38.p7
Supplementary_files_format_and_content: HEK_count_table.txt: tab-delimited text file including transcript annotation (columns 1-8) and sample-specific counts of aligned reads

← Back to Analysis