GSE230349 Processing Pipeline

OTHER code_examples 8 steps

Publication

Seryl-tRNA synthetase promotes translational readthrough by mRNA binding and involvement of the selenocysteine incorporation machinery.

Nucleic acids research (2023) — PMID 37739431

Dataset

GSE230349

Seryl-tRNA synthetase promotes translational readthrough via mRNA binding by involving the selenocysteine incorporation machinery

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Ribosomal footprints were analyzed as described by Ingolia et al. with these modifications: Trimgalore was used to trim off adapters and clip the first nucleotide off the 5â end.

Trim Galore v0.6.10 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Trim Galore (requires Cutadapt and FastQC)
# conda install -c bioconda trim-galore

# Define input and output file names
INPUT_FASTQ="ribo_footprints.fastq.gz"
OUTPUT_DIR="."

# Trim adapters and clip the first nucleotide from the 5' end
# --illumina: Trims standard Illumina adapters
# --clip_r1 1: Clips 1 bp from the 5' end of Read 1 (assuming single-end reads or primary read in paired-end)
# --output_dir: Specifies the output directory
# Trim Galore automatically appends '_trimmed.fq.gz' to the output file name.
trim_galore --illumina --clip_r1 1 --output_dir "${OUTPUT_DIR}" "${INPUT_FASTQ}"

View on GitHub

Reads were then mapped to ribosomal RNA using bowtie2 (39) and unmapped reads were further mapped to the human transcriptome (v19) with STAR aligner (33).

STAR vNot specified

$ Bash example

# Install STAR (e.g., using Bioconda)
# conda install -c bioconda star

# --- Reference Data Setup ---
# The description mentions "human transcriptome (v19)". This typically refers to
# the GRCh37/hg19 genome assembly and GENCODE v19 annotation.
# A STAR genome index needs to be pre-built using the genome FASTA and GTF files.

# Placeholder for STAR genome index directory
STAR_INDEX_DIR="/path/to/STAR_human_GRCh37_gencode_v19_index"

# Example commands to build the STAR index (uncomment and modify if needed):
# GENOME_FASTA="/path/to/Homo_sapiens.GRCh37.dna.primary_assembly.fa" # e.g., from Ensembl
# GTF_FILE="/path/to/gencode.v19.annotation.gtf" # e.g., from GENCODE
# 
# mkdir -p ${STAR_INDEX_DIR}
# STAR --runMode genomeGenerate \
#      --genomeDir ${STAR_INDEX_DIR} \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 # Recommended: read_length - 1

# --- Input and Output Setup ---
# The input is "unmapped reads" from a previous bowtie2 step. Assuming these are in FASTQ format.
# If the input is a BAM file of unmapped reads, it would first need to be converted to FASTQ.
INPUT_FASTQ="unmapped_reads_from_rRNA_mapping.fastq.gz" # Placeholder for input FASTQ file
OUTPUT_DIR="star_human_transcriptome_alignment"
OUTPUT_PREFIX="${OUTPUT_DIR}/human_transcriptome_aligned"

# Create output directory
mkdir -p ${OUTPUT_DIR}

# --- STAR Alignment Command ---
# Maps unmapped reads to the human transcriptome (genome + GTF) using STAR.
# The `quantMode TranscriptomeSAM` option outputs alignments in transcriptome coordinates,
# which is useful for downstream quantification tools.
STAR --genomeDir ${STAR_INDEX_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --runThreadN 8 \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.05 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --quantMode TranscriptomeSAM

# The output will include:
# - ${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam: Genome-aligned reads, sorted by coordinate.
# - ${OUTPUT_PREFIX}ReadsPerGene.out.tab: Gene counts (if quantMode GeneCounts was also used).
# - ${OUTPUT_PREFIX}Log.final.out: Summary of the alignment run.
# - ${OUTPUT_PREFIX}Transcriptome.out.bam: Reads aligned to transcriptome coordinates (due to quantMode TranscriptomeSAM).

Expected read length distribution was tested with the R package RiboProfiling.

R vNot specified

$ Bash example

# Install R and RiboProfiling if not already present
# conda install -c conda-forge r-base
# R -e "install.packages('BiocManager')"
# R -e "BiocManager::install('RiboProfiling')"

# Define input and output files
INPUT_BAM="input.bam" # Placeholder for the input BAM file
GENOME_GTF="genome.gtf" # Placeholder: RiboProfiling's read_bam typically requires a GTF annotation file
GENOME_FASTA="genome.fasta" # Placeholder: RiboProfiling's read_bam typically requires a FASTA genome file
OUTPUT_PLOT="read_length_distribution.pdf" # Output PDF file for the plot

# Create an R script to analyze read length distribution using RiboProfiling
cat <<EOF > analyze_read_lengths.R
library(RiboProfiling)
library(ggplot2) # Used for flexible plotting of general read length distribution

# Create a Ribo object from the BAM file, GTF, and FASTA.
# This step is central to RiboProfiling analysis. Parameters like min/max read length
# are often specified for ribosome profiling data, using typical values here.
ribo_obj <- read_bam(
  bam_file = "${INPUT_BAM}",
  gtf_file = "${GENOME_GTF}",
  fasta_file = "${GENOME_FASTA}",
  min_read_length = 15, # Typical minimum read length for ribosome profiling
  max_read_length = 40  # Typical maximum read length for ribosome profiling
)

# Get read lengths from the Ribo object.
# The `get_read_lengths` function extracts lengths, which can be P-site lengths
# or raw read lengths depending on the object's processing. For a general
# distribution, we combine all extracted lengths.
read_lengths_data <- get_read_lengths(ribo_obj)

# The result is a list of numeric vectors (lengths per transcript/region).
# Unlist to get a single vector of all read lengths for global distribution.
all_read_lengths <- unlist(read_lengths_data)

# Plot the distribution using ggplot2.
# While RiboProfiling has `plot_read_lengths`, it's often tailored for P-site offsets.
# For a general histogram of all read lengths, ggplot2 offers more flexibility
# and is a standard R plotting package commonly used in such workflows.
pdf("${OUTPUT_PLOT}")
ggplot(data.frame(Length = all_read_lengths), aes(x = Length)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black") +
  labs(title = "Read Length Distribution (from RiboProfiling object)", x = "Read Length (bp)", y = "Frequency") +
  theme_minimal()
dev.off()
EOF

# Execute the R script
Rscript analyze_read_lengths.R

To center ribosomes and obtain a list of genes with P-sites in their 3â UTR, we used functionalities within Ribowaltz (40) and a custom python script by Scott Adamson, UConn, and Jax Laboratories.

Python vNot specified (likely Python 3.x)

$ Bash example

# Installation instructions for Python and potential dependencies (if known).
# As this is a custom script, specific dependencies are not mentioned.
# A general Python environment is assumed.
# conda create -n ribo_env python=3.x
# conda activate ribo_env
# pip install pandas numpy # Common data handling libraries, might be used

# Execution of the custom Python script.
# This script was developed by Scott Adamson, UConn, and Jax Laboratories.
# Replace 'path/to/custom_script.py' with the actual path to this script.
#
# Input files:
#   - 'input_ribo_seq.bam': Aligned ribosome profiling reads (e.g., output from STAR aligner).
#   - 'genome_annotation.gtf': Genome annotation file (e.g., from Ensembl or UCSC, for the latest assembly like hg38 or mm10).
#
# Output file:
#   - 'output_psites_3UTR.tsv': A list of genes with P-sites in their 3' UTR.
#
# The exact parameters will depend on the custom script's implementation.
# These are illustrative placeholders based on the description's goals.

python path/to/custom_script.py \
    --ribo_bam input_ribo_seq.bam \
    --annotation genome_annotation.gtf \
    --output_file output_psites_3UTR.tsv \
    --center_ribosomes_method "ribowaltz_like_method" \
    --psite_3UTR_threshold 10 # Example parameter for 3' UTR P-site identification

Observed/expected ratios were calculated using a custom script by Scott Adamson.

custom script by Scott Adamson vN/A

$ Bash example

# This step uses a custom script by Scott Adamson to calculate observed/expected ratios.
# The exact script name, parameters, and dependencies are not specified in the description.
# Placeholder for custom script execution:
# python /path/to/scott_adamson_script.py --input_observed_counts observed.txt --output_ratios ratios.txt

Obs/exp indicates the number of observed codons in the A-site of the ribosome vs the calculated hypothetical expected number of observations.

riboWaltz (Inferred with models/gemini-2.5-flash) vNot specified

$ Bash example

# Install R and Bioconductor if not already present
# sudo apt-get update && sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("riboWaltz")'
# R -e 'BiocManager::install("GenomicFeatures")'
# R -e 'BiocManager::install("Biostrings")'
# R -e 'install.packages("data.table")'

# Define input files (placeholders - replace with actual paths)
# ALIGNED_BAM: Path to the aligned Ribo-seq BAM file.
# REFERENCE_GTF: Path to the gene annotation file (GTF/GFF format).
# REFERENCE_FASTA: Path to the reference genome or transcriptome FASTA file.
# For example, for human GRCh38:
# ALIGNED_BAM="/path/to/your/aligned_riboseq_reads.bam"
# REFERENCE_GTF="/path/to/Homo_sapiens.GRCh38.109.gtf"
# REFERENCE_FASTA="/path/to/Homo_sapiens.GRCh38.dna.primary_assembly.fa"

ALIGNED_BAM="aligned_riboseq_reads.bam"
REFERENCE_GTF="reference.gtf"
REFERENCE_FASTA="reference.fasta"

# Define output file
OUTPUT_FILE="codon_occupancy_obs_exp.tsv"

# Set environment variables for the R script
export ALIGNED_BAM
export REFERENCE_GTF
export REFERENCE_FASTA
export OUTPUT_FILE

# Create an R script to perform the Obs/Exp calculation
cat << 'EOF' > analyze_ribo_occupancy.R
library(riboWaltz)
library(GenomicFeatures)
library(Biostrings)
library(data.table) # For faster data manipulation

# --- Configuration --- 
aligned_bam_file <- Sys.getenv("ALIGNED_BAM")
reference_gtf_file <- Sys.getenv("REFERENCE_GTF")
reference_fasta_file <- Sys.getenv("REFERENCE_FASTA")
output_file <- Sys.getenv("OUTPUT_FILE")

# --- Load Annotation and Sequences ---
message("Loading GTF annotation...")
txdb <- makeTxDbFromGFF(reference_gtf_file)
message("Loading FASTA sequences...")
fasta_seq <- readDNAStringSet(reference_fasta_file)
names(fasta_seq) <- sub(" .*", "", names(fasta_seq)) # Clean names

# --- Process Ribo-seq Data ---
message("Reading BAM file and inferring P-sites...")
# Define read length range for Ribo-seq (adjust as per your data)
min_read_length <- 25
max_read_length <- 35

# Read BAM file into a list of data frames (one per sample)
# `read_bam` processes the BAM file and infers P-site offsets if not provided.
# This step can be computationally intensive for large files.
df_list <- read_bam(file = aligned_bam_file,
                    annotation = txdb,
                    fasta_sequences = fasta_seq,
                    min_read_length = min_read_length,
                    max_read_length = max_read_length)

# Infer P-site offsets (if `read_bam` didn't fully infer or for refinement)
message("Inferring P-site offsets...")
psite_mapping_list <- psite_mapping(df_list)

# Get P-site and A-site information for each read
message("Extracting P-site and A-site information...")
psite_info_list <- psite_info(df_list,
                              annotation = txdb,
                              fasta_sequences = fasta_seq,
                              p_shift = psite_mapping_list)

# --- Calculate Observed A-site Codon Counts ---
message("Calculating observed A-site codon counts...")
# `codon_usage_asite` calculates observed codon usage at the A-site.
# It returns a list of data frames, one per sample.
observed_codon_usage <- codon_usage_asite(psite_info_list,
                                          fasta_sequences = fasta_seq)

# For a single sample, extract the data frame
sample_name <- names(observed_codon_usage)[1]
observed_df <- observed_codon_usage[[sample_name]]

# --- Calculate Expected Codon Frequencies from Transcriptome ---
message("Calculating expected codon frequencies from transcriptome...")
# This involves extracting CDS sequences and calculating overall codon frequencies.
cds_seqs <- getCDS(txdb, fasta_seq)
# Concatenate all CDS sequences into a single string for overall codon frequency
all_cds_string <- paste(as.character(cds_seqs), collapse = "")

# Calculate codon frequencies from the entire CDS pool
codon_table <- oligonucleotideFrequency(DNAString(all_cds_string), width = 3, step = 3)
expected_codon_df <- data.frame(
  codon = names(codon_table),
  expected_count = as.numeric(codon_table),
  expected_frequency = as.numeric(codon_table) / sum(codon_table)
)

# Ensure all 64 codons are present, filling with 0 if not observed in CDS
all_possible_codons <- c(
  "AAA", "AAG", "AAC", "AAT", "ACA", "ACG", "ACC", "ACT", "AGA", "AGG", "AGC", "AGT",
  "ATA", "ATG", "ATC", "ATT", "CAA", "CAG", "CAC", "CAT", "CCA", "CCG", "CCC", "CCT",
  "CGA", "CGG", "CGC", "CGT", "CTA", "CTG", "CTC", "CTT", "GAA", "GAG", "GAC", "GAT",
  "GCA", "GCG", "GCC", "GCT", "GGA", "GGG", "GGC", "GGT", "GTA", "GTG", "GTC", "GTT",
  "TAA", "TAG", "TAC", "TAT", "TCA", "TCG", "TCC", "TCT", "TGA", "TGG", "TGC", "TGT",
  "TTA", "TTG", "TTC", "TTT"
)
expected_codon_df_full <- data.frame(codon = all_possible_codons)
expected_codon_df_full <- merge(expected_codon_df_full, expected_codon_df, by = "codon", all.x = TRUE)
expected_codon_df_full[is.na(expected_codon_df_full)] <- 0
expected_codon_df_full$expected_frequency <- expected_codon_df_full$expected_count / sum(expected_codon_df_full$expected_count)

# --- Combine Observed and Expected and Calculate Obs/Exp Ratio ---
message("Calculating Obs/Exp ratios...")
result_df <- merge(observed_df, expected_codon_df_full, by = "codon", all.x = TRUE)

# Rename columns for clarity
colnames(result_df)[colnames(result_df) == "count"] <- "observed_asite_count"
colnames(result_df)[colnames(result_df) == "frequency"] <- "observed_asite_frequency"

# Calculate Obs/Exp ratio
# Handle division by zero for codons with zero expected frequency
result_df$obs_exp_ratio <- ifelse(result_df$expected_frequency > 0,
                                  result_df$observed_asite_frequency / result_df$expected_frequency,
                                  NA) # Set to NA if expected frequency is zero

# Select and reorder columns for output
final_output <- result_df[, c("codon", "observed_asite_count", "observed_asite_frequency",
                              "expected_count", "expected_frequency", "obs_exp_ratio")]

# --- Write Output ---
message(paste("Writing results to", output_file))
fwrite(final_output, file = output_file, sep = "\t", quote = FALSE)
message("Analysis complete.")
EOF

Rscript analyze_ribo_occupancy.R

If ribosome pausing occurs, obs/exp > 1.

(Inferred with models/gemini-2.5-flash)

$ Bash example

# This step describes a condition for identifying ribosome pausing, not a specific tool execution.
# Assuming an upstream process has generated a file (e.g., 'ribosome_pausing_data.tsv')
# containing observed (obs) and expected (exp) ribosome occupancy counts, along with other data.
# This example uses 'awk' to filter for sites where obs/exp > 1.
# Replace 'ribosome_pausing_data.tsv' with your actual input file and adjust column numbers ($3, $4) as needed.
# Assuming column 3 contains observed counts and column 4 contains expected counts.

awk 'BEGIN {OFS="\t"} NR==1 {print; next} $3/$4 > 1 {print}' ribosome_pausing_data.tsv > pausing_sites.tsv

Library strategy: Ribo-Seq

Ribo-seq vN/A GitHub

$ Bash example

# --- Ribo-seq Data Processing Workflow ---
# This script outlines a typical workflow for Ribo-seq (Ribosome Profiling) data analysis.
# "Ribo-seq" refers to the sequencing strategy itself, and this workflow uses common
# bioinformatics tools for its processing, rather than a single software package named "Ribo-seq".

# --- Configuration ---
# Placeholder paths for reference genome and annotation. Replace with actual paths.
GENOME_FASTA="/path/to/your/reference/genome.fa" # e.g., /path/to/hg38.fa
GENOME_GTF="/path/to/your/reference/annotation.gtf" # e.g., /path/to/gencode.v38.annotation.gtf
STAR_INDEX_DIR="/path/to/your/star_index" # Directory for STAR genome index

# Common Illumina adapter sequence for Ribo-seq. Verify with your library prep protocol.
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Sample-specific identifiers and input files.
SAMPLE_ID="sample_name"
INPUT_FASTQ_R1="${SAMPLE_ID}_R1.fastq.gz"
# Ribo-seq is typically single-end. If paired-end, adjust INPUT_FASTQ_R2 and Trim Galore! command.
# INPUT_FASTQ_R2="${SAMPLE_ID}_R2.fastq.gz"
OUTPUT_DIR="${SAMPLE_ID}_riboseq_analysis"

mkdir -p "${OUTPUT_DIR}"
cd "${OUTPUT_DIR}"

# --- 1. Quality Control (FastQC) ---
# FastQC provides a general overview of read quality before and after trimming.
# conda install -c bioconda fastqc
fastqc "${INPUT_FASTQ_R1}" -o .
# If paired-end: fastqc "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}" -o .

# --- 2. Adapter Trimming and Quality Filtering (Trim Galore!) ---
# Trim Galore! is a wrapper around Cutadapt. It removes adapter sequences and performs quality trimming.
# Ribo-seq reads are typically short (e.g., 28-30 nt), so length filtering is crucial.
# Reads shorter than 25 nt are often discarded as they are unlikely to be ribosome footprints.
# conda install -c bioconda trim-galore
trim_galore --fastqc --output_dir . \
            --adapter "${ADAPTER_SEQUENCE}" \
            --length 25 \
            "${INPUT_FASTQ_R1}"
# If paired-end:
# trim_galore --fastqc --output_dir . \
#             --adapter "${ADAPTER_SEQUENCE}" \
#             --length 25 \
#             --paired "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}"

TRIMMED_FASTQ="${SAMPLE_ID}_R1_trimmed.fq.gz" # Adjust name if paired-end

# --- 3. Genome Indexing (STAR) ---
# This step is performed once per reference genome and annotation.
# conda install -c bioconda star
# STAR --runMode genomeGenerate \
#      --genomeDir "${STAR_INDEX_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GENOME_GTF}" \
#      --sjdbOverhang 100 # Recommended for RNA-seq, adjust for Ribo-seq if needed (e.g., read length - 1)

# --- 4. Alignment to Reference Genome (STAR) ---
# Align trimmed reads to the reference genome. Ribo-seq reads are typically short and expected to map uniquely.
# Key parameters for Ribo-seq: --outFilterMultimapNmax 1 (unique mapping), --alignIntronMax 1 (no splicing).
STAR --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${TRIMMED_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${SAMPLE_ID}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 2 \
     --outFilterMatchNmin 25 \
     --outFilterScoreMinOverLread 0 \
     --outFilterMatchNminOverLread 0 \
     --alignIntronMax 1 \
     --runThreadN 8

# Index the BAM file for downstream processing.
# conda install -c bioconda samtools
samtools index "${SAMPLE_ID}_Aligned.sortedByCoord.out.bam"

# --- 5. Ribosome Footprint Counting (featureCounts or custom script) ---
# Count reads mapping to coding sequences (CDS) or specific regions.
# For detailed Ribo-seq analysis, custom scripts are often used to precisely define
# ribosome P-site positions, assign reads to frames, and calculate translation efficiency.
# Tools like riboviz, Ribo-seq-tools, or RiboPro provide more specialized functionalities.
# For a simplified example, we'll use featureCounts to count reads over CDS regions.
# conda install -c bioconda subread
featureCounts -a "${GENOME_GTF}" \
              -o "${SAMPLE_ID}_featureCounts.txt" \
              -F GTF \
              -t CDS \
              -g gene_id \
              -M \
              -p \
              "${SAMPLE_ID}_Aligned.sortedByCoord.out.bam"

# Note: Further analysis typically involves identifying P-site offsets, 
# calculating read periodicity, and quantifying translation efficiency.
# These steps often require specialized R or Python scripts and packages.

View on GitHub

Tools Used

Trim Galore STAR R Python Ribo-seq

Raw Source Text

Ribosomal footprints were analyzed as described by Ingolia et al. with these modifications: Trimgalore was used to trim off adapters and clip the first nucleotide off the 5â end. Reads were then mapped to ribosomal RNA using bowtie2 (39) and unmapped reads were further mapped to the human transcriptome (v19) with STAR aligner (33).
Expected read length distribution was tested with the R package RiboProfiling. To center ribosomes and obtain a list of genes with P-sites in their 3â UTR, we used functionalities within Ribowaltz (40) and a custom python script by Scott Adamson, UConn, and Jax Laboratories.
Observed/expected ratios were calculated using a custom script by Scott Adamson. Obs/exp indicates the number of observed codons in the A-site of the ribosome vs the calculated hypothetical expected number of observations. If ribosome pausing occurs, obs/exp > 1.
Assembly: hg19
Supplementary files format and content: Excel file
Library strategy: Ribo-Seq

← Back to Analysis