GSE112353 Processing Pipeline

OTHER code_examples 5 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

GSE112353

Active ribosome profiling with RiboLace and standard ribosome profiling in HEK-293 cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Raw reads were processed by removing 5’ adapters, discarding reads shorter than 20 nucleotides and trimming the first nucleotide (using Trimmomatic v0.36).

    Trimmomatic v0.36
    $ Bash example
    # Install Trimmomatic (if not already installed)
    # conda install -c bioconda trimmomatic=0.36
    
    # Define input and output files
    INPUT_FASTQ="raw_reads.fastq"
    OUTPUT_FASTQ="trimmed_reads.fastq"
    
    # Define the path to the Trimmomatic JAR file.
    # Replace '/path/to/trimmomatic-0.36.jar' with the actual path on your system.
    # If installed via conda, it might be in $CONDA_PREFIX/share/trimmomatic-0.36/trimmomatic-0.36.jar
    TRIMMOMATIC_JAR="/path/to/trimmomatic-0.36.jar"
    
    # Define the adapter file.
    # This file typically comes with Trimmomatic installation (e.g., adapters/TruSeq3-SE.fa)
    # or needs to be provided by the user. The description mentions "removing 5' adapters"
    # without specifying which, so TruSeq3-SE.fa is used as a common placeholder for single-end reads.
    ADAPTER_FILE="/path/to/adapters/TruSeq3-SE.fa"
    
    # Execute Trimmomatic for single-end (SE) reads
    # Parameters:
    #   ILLUMINACLIP: removes adapter sequences. Format: <fastaFile>:<seedMismatches>:<palindromeClipThreshold>:<simpleClipThreshold>
    #   HEADCROP: trims a fixed number of bases from the start of the read.
    #   MINLEN: discards reads shorter than a specified length.
    java -jar "$TRIMMOMATIC_JAR" SE \
        "$INPUT_FASTQ" \
        "$OUTPUT_FASTQ" \
        ILLUMINACLIP:"$ADAPTER_FILE":2:30:10 \
        HEADCROP:1 \
        MINLEN:20
  2. 2

    Reads mapping on rRNAs and tRNAs (downloaded from the SILVA rRNA and the Genomic tRNA databases respectively) were removed.

    Bowtie2 (Inferred with models/gemini-2.5-flash) v2.5.2 GitHub
    $ Bash example
    # Define reference files and input/output
    RRNA_TRNA_FA="rRNA_tRNA.fa" # Combined rRNA and tRNA sequences from SILVA and Genomic tRNA databases
    BOWTIE2_INDEX_PREFIX="rRNA_tRNA_index"
    INPUT_FASTQ="input_reads.fastq.gz" # Replace with your actual input file (e.g., input_R1.fastq.gz for paired-end)
    OUTPUT_FASTQ="filtered_reads.fastq.gz" # For single-end, or filtered_R%.fastq.gz for paired-end
    NUM_THREADS=8 # Adjust as needed
    
    # --- Installation (commented out) ---
    # conda install -c bioconda bowtie2
    
    # --- Prepare reference index (if not already done) ---
    # Download and combine rRNA and tRNA sequences (e.g., from SILVA and Genomic tRNA databases)
    # Example for creating the combined reference and index:
    # wget -O silva_rRNA.fa.gz "https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz"
    # wget -O gtrnadb.fa.gz "http://gtrnadb.ucsc.edu/GtRNAdb-12.0/genomes/human/hg38-tRNAs.fa.gz" # Example for human
    # gunzip -c silva_rRNA.fa.gz gtrnadb.fa.gz > $RRNA_TRNA_FA
    # bowtie2-build $RRNA_TRNA_FA $BOWTIE2_INDEX_PREFIX
    
    # --- Execute read mapping and removal ---
    # Align reads to the rRNA/tRNA reference and output unaligned reads
    # --un-gz: write reads that did not align to a gzipped FASTQ file
    # -p: number of threads
    # -x: basename of the index
    # -U: single-end reads input (use -1 R1.fastq.gz -2 R2.fastq.gz for paired-end)
    # --very-fast: a preset for faster alignment, suitable for filtering
    # Redirect stdout (SAM output) to /dev/null as we only care about the --un-gz output
    # For single-end reads:
    bowtie2 --very-fast -p $NUM_THREADS -x $BOWTIE2_INDEX_PREFIX -U $INPUT_FASTQ --un-gz $OUTPUT_FASTQ > /dev/null
    
    # For paired-end reads, replace the above command with:
    # bowtie2 --very-fast -p $NUM_THREADS -x $BOWTIE2_INDEX_PREFIX -1 input_R1.fastq.gz -2 input_R2.fastq.gz --un-conc-gz filtered_R%.fastq.gz > /dev/null
  3. 3

    The remaining reads were aligned to the human transcriptome (Ensembl v85, corresponding to GENCODE 25) with Bowtie2 (v2.2.6) employing the default settings.

    $ Bash example
    # Install Bowtie2
    # conda install -c bioconda bowtie2=2.2.6
    
    # Placeholder for reference transcriptome FASTA file (Ensembl v85, GENCODE 25)
    # Download the human transcriptome FASTA for Ensembl v85 / GENCODE 25
    # For example, from Ensembl FTP or GENCODE website
    # wget -O human_transcriptome_ensembl_v85_gencode_25.fa "https://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz" # This is an example, actual URL might vary or require specific GENCODE version
    # gunzip human_transcriptome_ensembl_v85_gencode_25.fa.gz
    
    # Build Bowtie2 index for the human transcriptome
    bowtie2-build human_transcriptome_ensembl_v85_gencode_25.fa human_transcriptome_ensembl_v85_gencode_25_index
    
    # Align remaining reads to the human transcriptome using Bowtie2 with default settings
    # Assuming 'remaining_reads.fastq' is the input file and 'aligned_reads.sam' is the output
    bowtie2 -x human_transcriptome_ensembl_v85_gencode_25_index -U remaining_reads.fastq -S aligned_reads.sam
  4. 4

    All reads aligning to the very same region were collapsed to avoid potential PCR duplicates, and only strand-specific reads were kept.

    samtools markdup (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.19
    
    # Define input and output file paths
    INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned BAM file
    OUTPUT_DEDUP_BAM="deduplicated_reads.bam"
    NUM_THREADS=8 # Adjust as needed
    
    # Collapse reads to avoid PCR duplicates and keep strand-specific reads
    # The -r flag removes duplicate reads instead of just marking them.
    # samtools markdup is inherently strand-aware for paired-end reads, 
    # respecting the strand information during deduplication.
    samtools markdup -r -@ "${NUM_THREADS}" "${INPUT_BAM}" "${OUTPUT_DEDUP_BAM}"
  5. 5

    The number of ribosome protected fragments was determined with samtools idxstats.

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools (e.g., via conda)
    # conda install -c bioconda samtools=1.19
    
    # Determine the number of ribosome protected fragments using samtools idxstats
    # Input: aligned.bam (a BAM file containing ribosome protected fragments)
    # Output: ribosome_protected_fragments.txt (a file with chromosome, length, mapped reads, unmapped reads)
    samtools idxstats aligned.bam > ribosome_protected_fragments.txt

Tools Used

Raw Source Text
Raw reads were processed by removing 5’ adapters, discarding reads shorter than 20 nucleotides and trimming the first nucleotide (using Trimmomatic v0.36). Reads mapping on rRNAs and tRNAs (downloaded from the SILVA rRNA and the Genomic tRNA databases respectively) were removed. The remaining reads were aligned to the human transcriptome (Ensembl v85, corresponding to GENCODE 25) with Bowtie2 (v2.2.6) employing the default settings. All reads aligning to the very same region were collapsed to avoid potential PCR duplicates, and only strand-specific reads were kept.  The number of ribosome protected fragments was determined with samtools idxstats.
Genome_build: GRCh38.p7
Supplementary_files_format_and_content: HEK_count_table.txt: tab-delimited text file including transcript annotation (columns 1-8) and sample-specific counts of aligned reads
← Back to Analysis