GSE230349 Processing Pipeline

OTHER code_examples 8 steps

Publication

Seryl-tRNA synthetase promotes translational readthrough by mRNA binding and involvement of the selenocysteine incorporation machinery.

Nucleic acids research (2023) — PMID 37739431

Dataset

GSE230349

Seryl-tRNA synthetase promotes translational readthrough via mRNA binding by involving the selenocysteine incorporation machinery

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Ribosomal footprints were analyzed as described by Ingolia et al. with these modifications: Trimgalore was used to trim off adapters and clip the first nucleotide off the 5’ end.

    Trim Galore v0.6.10 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Trim Galore (requires Cutadapt and FastQC)
    # conda install -c bioconda trim-galore
    
    # Define input and output file names
    INPUT_FASTQ="ribo_footprints.fastq.gz"
    OUTPUT_DIR="."
    
    # Trim adapters and clip the first nucleotide from the 5' end
    # --illumina: Trims standard Illumina adapters
    # --clip_r1 1: Clips 1 bp from the 5' end of Read 1 (assuming single-end reads or primary read in paired-end)
    # --output_dir: Specifies the output directory
    # Trim Galore automatically appends '_trimmed.fq.gz' to the output file name.
    trim_galore --illumina --clip_r1 1 --output_dir "${OUTPUT_DIR}" "${INPUT_FASTQ}"
  2. 2

    Reads were then mapped to ribosomal RNA using bowtie2 (39) and unmapped reads were further mapped to the human transcriptome (v19) with STAR aligner (33).

    STAR vNot specified
    $ Bash example
    # Install STAR (e.g., using Bioconda)
    # conda install -c bioconda star
    
    # --- Reference Data Setup ---
    # The description mentions "human transcriptome (v19)". This typically refers to
    # the GRCh37/hg19 genome assembly and GENCODE v19 annotation.
    # A STAR genome index needs to be pre-built using the genome FASTA and GTF files.
    
    # Placeholder for STAR genome index directory
    STAR_INDEX_DIR="/path/to/STAR_human_GRCh37_gencode_v19_index"
    
    # Example commands to build the STAR index (uncomment and modify if needed):
    # GENOME_FASTA="/path/to/Homo_sapiens.GRCh37.dna.primary_assembly.fa" # e.g., from Ensembl
    # GTF_FILE="/path/to/gencode.v19.annotation.gtf" # e.g., from GENCODE
    # 
    # mkdir -p ${STAR_INDEX_DIR}
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${STAR_INDEX_DIR} \
    #      --genomeFastaFiles ${GENOME_FASTA} \
    #      --sjdbGTFfile ${GTF_FILE} \
    #      --sjdbOverhang 100 # Recommended: read_length - 1
    
    # --- Input and Output Setup ---
    # The input is "unmapped reads" from a previous bowtie2 step. Assuming these are in FASTQ format.
    # If the input is a BAM file of unmapped reads, it would first need to be converted to FASTQ.
    INPUT_FASTQ="unmapped_reads_from_rRNA_mapping.fastq.gz" # Placeholder for input FASTQ file
    OUTPUT_DIR="star_human_transcriptome_alignment"
    OUTPUT_PREFIX="${OUTPUT_DIR}/human_transcriptome_aligned"
    
    # Create output directory
    mkdir -p ${OUTPUT_DIR}
    
    # --- STAR Alignment Command ---
    # Maps unmapped reads to the human transcriptome (genome + GTF) using STAR.
    # The `quantMode TranscriptomeSAM` option outputs alignments in transcriptome coordinates,
    # which is useful for downstream quantification tools.
    STAR --genomeDir ${STAR_INDEX_DIR} \
         --readFilesIn ${INPUT_FASTQ} \
         --runThreadN 8 \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.05 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --quantMode TranscriptomeSAM
    
    # The output will include:
    # - ${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam: Genome-aligned reads, sorted by coordinate.
    # - ${OUTPUT_PREFIX}ReadsPerGene.out.tab: Gene counts (if quantMode GeneCounts was also used).
    # - ${OUTPUT_PREFIX}Log.final.out: Summary of the alignment run.
    # - ${OUTPUT_PREFIX}Transcriptome.out.bam: Reads aligned to transcriptome coordinates (due to quantMode TranscriptomeSAM).
  3. 3

    Expected read length distribution was tested with the R package RiboProfiling.

    R vNot specified
    $ Bash example
    # Install R and RiboProfiling if not already present
    # conda install -c conda-forge r-base
    # R -e "install.packages('BiocManager')"
    # R -e "BiocManager::install('RiboProfiling')"
    
    # Define input and output files
    INPUT_BAM="input.bam" # Placeholder for the input BAM file
    GENOME_GTF="genome.gtf" # Placeholder: RiboProfiling's read_bam typically requires a GTF annotation file
    GENOME_FASTA="genome.fasta" # Placeholder: RiboProfiling's read_bam typically requires a FASTA genome file
    OUTPUT_PLOT="read_length_distribution.pdf" # Output PDF file for the plot
    
    # Create an R script to analyze read length distribution using RiboProfiling
    cat <<EOF > analyze_read_lengths.R
    library(RiboProfiling)
    library(ggplot2) # Used for flexible plotting of general read length distribution
    
    # Create a Ribo object from the BAM file, GTF, and FASTA.
    # This step is central to RiboProfiling analysis. Parameters like min/max read length
    # are often specified for ribosome profiling data, using typical values here.
    ribo_obj <- read_bam(
      bam_file = "${INPUT_BAM}",
      gtf_file = "${GENOME_GTF}",
      fasta_file = "${GENOME_FASTA}",
      min_read_length = 15, # Typical minimum read length for ribosome profiling
      max_read_length = 40  # Typical maximum read length for ribosome profiling
    )
    
    # Get read lengths from the Ribo object.
    # The `get_read_lengths` function extracts lengths, which can be P-site lengths
    # or raw read lengths depending on the object's processing. For a general
    # distribution, we combine all extracted lengths.
    read_lengths_data <- get_read_lengths(ribo_obj)
    
    # The result is a list of numeric vectors (lengths per transcript/region).
    # Unlist to get a single vector of all read lengths for global distribution.
    all_read_lengths <- unlist(read_lengths_data)
    
    # Plot the distribution using ggplot2.
    # While RiboProfiling has `plot_read_lengths`, it's often tailored for P-site offsets.
    # For a general histogram of all read lengths, ggplot2 offers more flexibility
    # and is a standard R plotting package commonly used in such workflows.
    pdf("${OUTPUT_PLOT}")
    ggplot(data.frame(Length = all_read_lengths), aes(x = Length)) +
      geom_histogram(binwidth = 1, fill = "steelblue", color = "black") +
      labs(title = "Read Length Distribution (from RiboProfiling object)", x = "Read Length (bp)", y = "Frequency") +
      theme_minimal()
    dev.off()
    EOF
    
    # Execute the R script
    Rscript analyze_read_lengths.R
  4. 4

    To center ribosomes and obtain a list of genes with P-sites in their 3’ UTR, we used functionalities within Ribowaltz (40) and a custom python script by Scott Adamson, UConn, and Jax Laboratories.

    Python vNot specified (likely Python 3.x)
    $ Bash example
    # Installation instructions for Python and potential dependencies (if known).
    # As this is a custom script, specific dependencies are not mentioned.
    # A general Python environment is assumed.
    # conda create -n ribo_env python=3.x
    # conda activate ribo_env
    # pip install pandas numpy # Common data handling libraries, might be used
    
    # Execution of the custom Python script.
    # This script was developed by Scott Adamson, UConn, and Jax Laboratories.
    # Replace 'path/to/custom_script.py' with the actual path to this script.
    #
    # Input files:
    #   - 'input_ribo_seq.bam': Aligned ribosome profiling reads (e.g., output from STAR aligner).
    #   - 'genome_annotation.gtf': Genome annotation file (e.g., from Ensembl or UCSC, for the latest assembly like hg38 or mm10).
    #
    # Output file:
    #   - 'output_psites_3UTR.tsv': A list of genes with P-sites in their 3' UTR.
    #
    # The exact parameters will depend on the custom script's implementation.
    # These are illustrative placeholders based on the description's goals.
    
    python path/to/custom_script.py \
        --ribo_bam input_ribo_seq.bam \
        --annotation genome_annotation.gtf \
        --output_file output_psites_3UTR.tsv \
        --center_ribosomes_method "ribowaltz_like_method" \
        --psite_3UTR_threshold 10 # Example parameter for 3' UTR P-site identification
  5. 5

    Observed/expected ratios were calculated using a custom script by Scott Adamson.

    custom script by Scott Adamson vN/A
    $ Bash example
    # This step uses a custom script by Scott Adamson to calculate observed/expected ratios.
    # The exact script name, parameters, and dependencies are not specified in the description.
    # Placeholder for custom script execution:
    # python /path/to/scott_adamson_script.py --input_observed_counts observed.txt --output_ratios ratios.txt
  6. 6

    Obs/exp indicates the number of observed codons in the A-site of the ribosome vs the calculated hypothetical expected number of observations.

    riboWaltz (Inferred with models/gemini-2.5-flash) vNot specified
    $ Bash example
    # Install R and Bioconductor if not already present
    # sudo apt-get update && sudo apt-get install -y r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("riboWaltz")'
    # R -e 'BiocManager::install("GenomicFeatures")'
    # R -e 'BiocManager::install("Biostrings")'
    # R -e 'install.packages("data.table")'
    
    # Define input files (placeholders - replace with actual paths)
    # ALIGNED_BAM: Path to the aligned Ribo-seq BAM file.
    # REFERENCE_GTF: Path to the gene annotation file (GTF/GFF format).
    # REFERENCE_FASTA: Path to the reference genome or transcriptome FASTA file.
    # For example, for human GRCh38:
    # ALIGNED_BAM="/path/to/your/aligned_riboseq_reads.bam"
    # REFERENCE_GTF="/path/to/Homo_sapiens.GRCh38.109.gtf"
    # REFERENCE_FASTA="/path/to/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    
    ALIGNED_BAM="aligned_riboseq_reads.bam"
    REFERENCE_GTF="reference.gtf"
    REFERENCE_FASTA="reference.fasta"
    
    # Define output file
    OUTPUT_FILE="codon_occupancy_obs_exp.tsv"
    
    # Set environment variables for the R script
    export ALIGNED_BAM
    export REFERENCE_GTF
    export REFERENCE_FASTA
    export OUTPUT_FILE
    
    # Create an R script to perform the Obs/Exp calculation
    cat << 'EOF' > analyze_ribo_occupancy.R
    library(riboWaltz)
    library(GenomicFeatures)
    library(Biostrings)
    library(data.table) # For faster data manipulation
    
    # --- Configuration --- 
    aligned_bam_file <- Sys.getenv("ALIGNED_BAM")
    reference_gtf_file <- Sys.getenv("REFERENCE_GTF")
    reference_fasta_file <- Sys.getenv("REFERENCE_FASTA")
    output_file <- Sys.getenv("OUTPUT_FILE")
    
    # --- Load Annotation and Sequences ---
    message("Loading GTF annotation...")
    txdb <- makeTxDbFromGFF(reference_gtf_file)
    message("Loading FASTA sequences...")
    fasta_seq <- readDNAStringSet(reference_fasta_file)
    names(fasta_seq) <- sub(" .*", "", names(fasta_seq)) # Clean names
    
    # --- Process Ribo-seq Data ---
    message("Reading BAM file and inferring P-sites...")
    # Define read length range for Ribo-seq (adjust as per your data)
    min_read_length <- 25
    max_read_length <- 35
    
    # Read BAM file into a list of data frames (one per sample)
    # `read_bam` processes the BAM file and infers P-site offsets if not provided.
    # This step can be computationally intensive for large files.
    df_list <- read_bam(file = aligned_bam_file,
                        annotation = txdb,
                        fasta_sequences = fasta_seq,
                        min_read_length = min_read_length,
                        max_read_length = max_read_length)
    
    # Infer P-site offsets (if `read_bam` didn't fully infer or for refinement)
    message("Inferring P-site offsets...")
    psite_mapping_list <- psite_mapping(df_list)
    
    # Get P-site and A-site information for each read
    message("Extracting P-site and A-site information...")
    psite_info_list <- psite_info(df_list,
                                  annotation = txdb,
                                  fasta_sequences = fasta_seq,
                                  p_shift = psite_mapping_list)
    
    # --- Calculate Observed A-site Codon Counts ---
    message("Calculating observed A-site codon counts...")
    # `codon_usage_asite` calculates observed codon usage at the A-site.
    # It returns a list of data frames, one per sample.
    observed_codon_usage <- codon_usage_asite(psite_info_list,
                                              fasta_sequences = fasta_seq)
    
    # For a single sample, extract the data frame
    sample_name <- names(observed_codon_usage)[1]
    observed_df <- observed_codon_usage[[sample_name]]
    
    # --- Calculate Expected Codon Frequencies from Transcriptome ---
    message("Calculating expected codon frequencies from transcriptome...")
    # This involves extracting CDS sequences and calculating overall codon frequencies.
    cds_seqs <- getCDS(txdb, fasta_seq)
    # Concatenate all CDS sequences into a single string for overall codon frequency
    all_cds_string <- paste(as.character(cds_seqs), collapse = "")
    
    # Calculate codon frequencies from the entire CDS pool
    codon_table <- oligonucleotideFrequency(DNAString(all_cds_string), width = 3, step = 3)
    expected_codon_df <- data.frame(
      codon = names(codon_table),
      expected_count = as.numeric(codon_table),
      expected_frequency = as.numeric(codon_table) / sum(codon_table)
    )
    
    # Ensure all 64 codons are present, filling with 0 if not observed in CDS
    all_possible_codons <- c(
      "AAA", "AAG", "AAC", "AAT", "ACA", "ACG", "ACC", "ACT", "AGA", "AGG", "AGC", "AGT",
      "ATA", "ATG", "ATC", "ATT", "CAA", "CAG", "CAC", "CAT", "CCA", "CCG", "CCC", "CCT",
      "CGA", "CGG", "CGC", "CGT", "CTA", "CTG", "CTC", "CTT", "GAA", "GAG", "GAC", "GAT",
      "GCA", "GCG", "GCC", "GCT", "GGA", "GGG", "GGC", "GGT", "GTA", "GTG", "GTC", "GTT",
      "TAA", "TAG", "TAC", "TAT", "TCA", "TCG", "TCC", "TCT", "TGA", "TGG", "TGC", "TGT",
      "TTA", "TTG", "TTC", "TTT"
    )
    expected_codon_df_full <- data.frame(codon = all_possible_codons)
    expected_codon_df_full <- merge(expected_codon_df_full, expected_codon_df, by = "codon", all.x = TRUE)
    expected_codon_df_full[is.na(expected_codon_df_full)] <- 0
    expected_codon_df_full$expected_frequency <- expected_codon_df_full$expected_count / sum(expected_codon_df_full$expected_count)
    
    # --- Combine Observed and Expected and Calculate Obs/Exp Ratio ---
    message("Calculating Obs/Exp ratios...")
    result_df <- merge(observed_df, expected_codon_df_full, by = "codon", all.x = TRUE)
    
    # Rename columns for clarity
    colnames(result_df)[colnames(result_df) == "count"] <- "observed_asite_count"
    colnames(result_df)[colnames(result_df) == "frequency"] <- "observed_asite_frequency"
    
    # Calculate Obs/Exp ratio
    # Handle division by zero for codons with zero expected frequency
    result_df$obs_exp_ratio <- ifelse(result_df$expected_frequency > 0,
                                      result_df$observed_asite_frequency / result_df$expected_frequency,
                                      NA) # Set to NA if expected frequency is zero
    
    # Select and reorder columns for output
    final_output <- result_df[, c("codon", "observed_asite_count", "observed_asite_frequency",
                                  "expected_count", "expected_frequency", "obs_exp_ratio")]
    
    # --- Write Output ---
    message(paste("Writing results to", output_file))
    fwrite(final_output, file = output_file, sep = "\t", quote = FALSE)
    message("Analysis complete.")
    EOF
    
    Rscript analyze_ribo_occupancy.R
  7. 7

    If ribosome pausing occurs, obs/exp > 1.

    (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # This step describes a condition for identifying ribosome pausing, not a specific tool execution.
    # Assuming an upstream process has generated a file (e.g., 'ribosome_pausing_data.tsv')
    # containing observed (obs) and expected (exp) ribosome occupancy counts, along with other data.
    # This example uses 'awk' to filter for sites where obs/exp > 1.
    # Replace 'ribosome_pausing_data.tsv' with your actual input file and adjust column numbers ($3, $4) as needed.
    # Assuming column 3 contains observed counts and column 4 contains expected counts.
    
    awk 'BEGIN {OFS="\t"} NR==1 {print; next} $3/$4 > 1 {print}' ribosome_pausing_data.tsv > pausing_sites.tsv
  8. 8

    Library strategy: Ribo-Seq

    $ Bash example
    # --- Ribo-seq Data Processing Workflow ---
    # This script outlines a typical workflow for Ribo-seq (Ribosome Profiling) data analysis.
    # "Ribo-seq" refers to the sequencing strategy itself, and this workflow uses common
    # bioinformatics tools for its processing, rather than a single software package named "Ribo-seq".
    
    # --- Configuration ---
    # Placeholder paths for reference genome and annotation. Replace with actual paths.
    GENOME_FASTA="/path/to/your/reference/genome.fa" # e.g., /path/to/hg38.fa
    GENOME_GTF="/path/to/your/reference/annotation.gtf" # e.g., /path/to/gencode.v38.annotation.gtf
    STAR_INDEX_DIR="/path/to/your/star_index" # Directory for STAR genome index
    
    # Common Illumina adapter sequence for Ribo-seq. Verify with your library prep protocol.
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # Sample-specific identifiers and input files.
    SAMPLE_ID="sample_name"
    INPUT_FASTQ_R1="${SAMPLE_ID}_R1.fastq.gz"
    # Ribo-seq is typically single-end. If paired-end, adjust INPUT_FASTQ_R2 and Trim Galore! command.
    # INPUT_FASTQ_R2="${SAMPLE_ID}_R2.fastq.gz"
    OUTPUT_DIR="${SAMPLE_ID}_riboseq_analysis"
    
    mkdir -p "${OUTPUT_DIR}"
    cd "${OUTPUT_DIR}"
    
    # --- 1. Quality Control (FastQC) ---
    # FastQC provides a general overview of read quality before and after trimming.
    # conda install -c bioconda fastqc
    fastqc "${INPUT_FASTQ_R1}" -o .
    # If paired-end: fastqc "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}" -o .
    
    # --- 2. Adapter Trimming and Quality Filtering (Trim Galore!) ---
    # Trim Galore! is a wrapper around Cutadapt. It removes adapter sequences and performs quality trimming.
    # Ribo-seq reads are typically short (e.g., 28-30 nt), so length filtering is crucial.
    # Reads shorter than 25 nt are often discarded as they are unlikely to be ribosome footprints.
    # conda install -c bioconda trim-galore
    trim_galore --fastqc --output_dir . \
                --adapter "${ADAPTER_SEQUENCE}" \
                --length 25 \
                "${INPUT_FASTQ_R1}"
    # If paired-end:
    # trim_galore --fastqc --output_dir . \
    #             --adapter "${ADAPTER_SEQUENCE}" \
    #             --length 25 \
    #             --paired "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}"
    
    TRIMMED_FASTQ="${SAMPLE_ID}_R1_trimmed.fq.gz" # Adjust name if paired-end
    
    # --- 3. Genome Indexing (STAR) ---
    # This step is performed once per reference genome and annotation.
    # conda install -c bioconda star
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${STAR_INDEX_DIR}" \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GENOME_GTF}" \
    #      --sjdbOverhang 100 # Recommended for RNA-seq, adjust for Ribo-seq if needed (e.g., read length - 1)
    
    # --- 4. Alignment to Reference Genome (STAR) ---
    # Align trimmed reads to the reference genome. Ribo-seq reads are typically short and expected to map uniquely.
    # Key parameters for Ribo-seq: --outFilterMultimapNmax 1 (unique mapping), --alignIntronMax 1 (no splicing).
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${TRIMMED_FASTQ}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${SAMPLE_ID}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 2 \
         --outFilterMatchNmin 25 \
         --outFilterScoreMinOverLread 0 \
         --outFilterMatchNminOverLread 0 \
         --alignIntronMax 1 \
         --runThreadN 8
    
    # Index the BAM file for downstream processing.
    # conda install -c bioconda samtools
    samtools index "${SAMPLE_ID}_Aligned.sortedByCoord.out.bam"
    
    # --- 5. Ribosome Footprint Counting (featureCounts or custom script) ---
    # Count reads mapping to coding sequences (CDS) or specific regions.
    # For detailed Ribo-seq analysis, custom scripts are often used to precisely define
    # ribosome P-site positions, assign reads to frames, and calculate translation efficiency.
    # Tools like riboviz, Ribo-seq-tools, or RiboPro provide more specialized functionalities.
    # For a simplified example, we'll use featureCounts to count reads over CDS regions.
    # conda install -c bioconda subread
    featureCounts -a "${GENOME_GTF}" \
                  -o "${SAMPLE_ID}_featureCounts.txt" \
                  -F GTF \
                  -t CDS \
                  -g gene_id \
                  -M \
                  -p \
                  "${SAMPLE_ID}_Aligned.sortedByCoord.out.bam"
    
    # Note: Further analysis typically involves identifying P-site offsets, 
    # calculating read periodicity, and quantifying translation efficiency.
    # These steps often require specialized R or Python scripts and packages.
    

Tools Used

Raw Source Text
Ribosomal footprints were analyzed as described by Ingolia et al. with these modifications: Trimgalore was used to trim off adapters and clip the first nucleotide off the 5’ end. Reads were then mapped to ribosomal RNA using bowtie2 (39) and unmapped reads were further mapped to the human transcriptome (v19) with STAR aligner (33).
Expected read length distribution was tested with the R package RiboProfiling. To center ribosomes and obtain a list of genes with P-sites in their 3’ UTR, we used functionalities within Ribowaltz (40) and a custom python script by Scott Adamson, UConn, and Jax Laboratories.
Observed/expected ratios were calculated using a custom script by Scott Adamson. Obs/exp indicates the number of observed codons in the A-site of the ribosome vs the calculated hypothetical expected number of observations. If ribosome pausing occurs, obs/exp > 1.
Assembly: hg19
Supplementary files format and content: Excel file
Library strategy: Ribo-Seq
← Back to Analysis