GSE157050 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

AMPK regulation of Raptor and TSC2 mediate metformin effects on transcriptional control of anabolism and inflammation.

Genes & development (2020) — PMID 32912901

Dataset

Ribosome profiling of metformin treatment in primary murine hepatocytes in WT, Raptor Ser-Ala mutant, Tsc2-null, Raptor mutant;Tsc2-null, and Ampk-n…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

First, adapter trimming was performed withÂ cutadaptÂ (v1.14).

cutadapt v1.14 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.14

# Example command for adapter trimming with cutadapt.
# The specific adapter sequence(s) and input/output filenames were not provided in the description.
# Replace 'ADAPTER_SEQUENCE' with the actual adapter sequence(s) used (e.g., Illumina universal adapter).
# Replace 'input.fastq.gz' with your input FASTQ file.
# Replace 'trimmed.fastq.gz' with your desired output FASTQ file.
# For paired-end reads, use -A for the reverse complement adapter and -o/-p for paired outputs.
cutadapt -a ADAPTER_SEQUENCE -o trimmed.fastq.gz input.fastq.gz

View on GitHub

Next, reads were mapped with STAR (v2.4.0i) against a database of repetitive elements, with mapping reads discarded from further analysis.

STAR v2.4.0i

$ Bash example

# Install STAR (example, uncomment if needed)
# conda install -c bioconda star=2.4.0i

# Define variables
# Replace with your actual input FASTQ files
INPUT_R1="input_reads_R1.fastq.gz"
INPUT_R2="input_reads_R2.fastq.gz" # Set to empty string if single-end: INPUT_R2=""

# Replace with the path to your STAR-indexed repetitive elements database
STAR_INDEX="path/to/repetitive_elements_STAR_index"

OUTPUT_DIR="star_repetitive_mapping_output"

mkdir -p "${OUTPUT_DIR}"

# Run STAR to map against repetitive elements.
# The key here is that reads mapping to these elements are 'discarded from further analysis'.
# This implies we are either interested in the unmapped reads for downstream, or simply performing
# this mapping as a filtering step and the output BAM itself is not used for gene quantification.
# We will output both mapped BAM and unmapped FASTQ for flexibility.
# Parameters are set to be permissive for mapping to repetitive elements (e.g., allowing multiple alignments).
STAR --genomeDir "${STAR_INDEX}" \
     --readFilesIn "${INPUT_R1}" ${INPUT_R2:+"${INPUT_R2}"} \
     --runThreadN 8 \
     --outFileNamePrefix "${OUTPUT_DIR}/" \
     --outSAMtype BAM SortedByCoordinate \
     --outReadsUnmapped Fastx \
     --outFilterMultimapNmax 100 \
     --outFilterMismatchNmax 999 \
     --outFilterScoreMinOverLread 0 \
     --outFilterMatchNminOverLread 0 \
     --alignIntronMax 1 \
     --alignMatesGapMax 1000000 \
     --alignSJDBoverhangMin 1000000 \
     --alignSJoverhangMin 1000000 \
     --outSAMattributes All \
     --outSAMunmapped Within \
     --outSAMprimaryFlag AllBestScore \
     --outSAMmapqUnique 255

# After this step, the reads in ${OUTPUT_DIR}/unmapped_to_repetitive_R1.fastq.gz (and R2 if paired-end)
# would be considered for 'further analysis', while the mapped reads in ${OUTPUT_DIR}/Aligned.sortedByCoordinate.out.bam
# are effectively 'discarded' for the main analysis pipeline (e.g., gene expression quantification).

Remaining reads were then mapped to the mouse genome (mm10) with STAR (v2.4.0i)

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.4.0i

# Define variables
GENOME_DIR="/path/to/STAR_index/mm10"
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz" # Assuming paired-end reads; adjust if single-end
OUTPUT_PREFIX="sample_aligned_"
THREADS=8

# Create genome index if it doesn't exist (this is a one-time step per genome version)
# STAR --runThreadN ${THREADS} --runMode genomeGenerate --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/mm10.fa --sjdbGTFfile /path/to/mm10.gtf

# Map reads to the mouse genome (mm10)
STAR --runThreadN ${THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READS_R1} ${READS_R2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 10 \
     --alignIntronMin 20 --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --quantMode GeneCounts # Optional: output gene counts

View on GitHub

Stranded reads mapping to GENCODE (vM20) genes were quantified with featureCounts (subread-1.6.4).

featureCounts v1.6.4

$ Bash example

# Install featureCounts (part of subread package)
# conda install -c bioconda subread=1.6.4

# Download GENCODE vM20 GTF annotation file
# wget -O gencode.vM20.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.annotation.gtf.gz
# gunzip gencode.vM20.annotation.gtf.gz

# Run featureCounts to quantify stranded reads against GENCODE vM20 genes
# -p: Assume paired-end reads (common for RNA-seq, adjust if single-end)
# -s 2: Assume reverse stranded library (e.g., Illumina TruSeq stranded, adjust to 1 for forward or 0 for unstranded if different)
# -a: Specify the GTF annotation file
# -o: Specify the output counts file
featureCounts -p -s 2 -a gencode.vM20.annotation.gtf -o gene_counts.txt input.bam

Mitochondrial genes were discarded, and only genes with at least 0.5 reads per million across all datasets were kept for further analysis.

R (Inferred with models/gemini-2.5-flash) vCustom script GitHub

$ Bash example

# This script assumes 'gene_counts.tsv' is a tab-separated file with gene IDs as row names and sample IDs as column names.
# It also assumes 'gene_annotations.tsv' is a tab-separated file with gene IDs and chromosome information.
# The output will be 'filtered_gene_counts.tsv'.

# Example: Create dummy input files for demonstration purposes (uncomment to run with dummy data)
# echo -e "gene_id\tsample1\tsample2\tsample3" > gene_counts.tsv
# echo -e "gene1\t1000\t1200\t1100" >> gene_counts.tsv
# echo -e "gene2\t50\t60\t55" >> gene_counts.tsv
# echo -e "gene_mito1\t10000\t11000\t10500" >> gene_counts.tsv
# echo -e "gene3\t0\t1\t0" >> gene_counts.tsv
# echo -e "gene4\t50\t60\t70" >> gene_counts.tsv
# echo -e "gene_mito2\t20000\t21000\t20500" >> gene_counts.tsv
# echo -e "gene5\t100\t120\t110" >> gene_counts.tsv
# echo -e "gene6\t0\t0\t0" >> gene_counts.tsv

# echo -e "gene_id\tchromosome" > gene_annotations.tsv
# echo -e "gene1\tchr1" >> gene_annotations.tsv
# echo -e "gene2\tchr2" >> gene_annotations.tsv
# echo -e "gene_mito1\tchrM" >> gene_annotations.tsv
# echo -e "gene3\tchr3" >> gene_annotations.tsv
# echo -e "gene4\tchr4" >> gene_annotations.tsv
# echo -e "gene_mito2\tMT" >> gene_annotations.tsv
# echo -e "gene5\tchr5" >> gene_annotations.tsv
# echo -e "gene6\tchr6" >> gene_annotations.tsv

# R script for filtering gene expression data
cat << 'EOF' > filter_expression.R
# Load necessary packages (if not installed, uncomment and run BiocManager::install() or install.packages())
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("data.table") # For faster reading/writing if needed

# Read gene counts matrix
# Assumes gene IDs are in the first column and sample counts in subsequent columns
counts_df <- read.delim("gene_counts.tsv", row.names = 1, check.names = FALSE)

# Read gene annotations
# Assumes gene IDs are in the first column and chromosome information in a 'chromosome' column
gene_annotations <- read.delim("gene_annotations.tsv", row.names = 1, check.names = FALSE)

# Ensure annotations match counts matrix row names
gene_annotations <- gene_annotations[rownames(counts_df), , drop = FALSE]

# 1. Discard mitochondrial genes
# Identify mitochondrial genes (case-insensitive for 'chrM' or 'MT' in the chromosome column)
mito_gene_ids <- rownames(gene_annotations)[grepl("chrM|MT", gene_annotations$chromosome, ignore.case = TRUE)]
message(paste("Identified", length(mito_gene_ids), "mitochondrial genes."))

# Filter out mitochondrial genes from the counts matrix
counts_filtered_mito <- counts_df[!rownames(counts_df) %in% mito_gene_ids, , drop = FALSE]
message(paste("Genes after discarding mitochondrial:", nrow(counts_filtered_mito)))

# Check if any genes remain after mitochondrial filtering
if (nrow(counts_filtered_mito) == 0) {
    warning("No genes remaining after mitochondrial filtering. Outputting an empty file.")
    write.table(data.frame(), "filtered_gene_counts.tsv", sep = "\t", quote = FALSE, col.names = NA)
    quit(save = "no")
}

# 2. Calculate Reads Per Million (RPM) for remaining genes
# Calculate total reads per sample from the filtered counts
total_reads_per_sample <- colSums(counts_filtered_mito)

# Handle samples with zero total reads to avoid division by zero (set to 1 to make RPM 0 for all genes in that sample)
total_reads_per_sample[total_reads_per_sample == 0] <- 1

# Calculate RPM matrix
rpm_matrix <- sweep(counts_filtered_mito, 2, total_reads_per_sample, FUN = "/") * 1e6

# 3. Keep only genes with at least 0.5 RPM across ALL datasets
# Apply 'all' function to check if RPM is >= 0.5 for every sample for each gene
genes_to_keep_idx <- apply(rpm_matrix, 1, function(row) all(row >= 0.5))
final_counts_matrix <- counts_filtered_mito[genes_to_keep_idx, , drop = FALSE]
message(paste("Genes after RPM filtering:", nrow(final_counts_matrix)))

# Save the final filtered counts matrix
write.table(final_counts_matrix, "filtered_gene_counts.tsv", sep = "\t", quote = FALSE, col.names = NA)
EOF

# Execute the R script
Rscript filter_expression.R

View on GitHub

Differential expression in polysome versus input was quantified with edgeR (v3.26.8) with dispersion = 0.06 estimated from the subset of experiments with multiple replicates.

edgeR v3.26.8 GitHub

$ Bash example

# Install edgeR (if not already installed)
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("edgeR")'
# conda install -c bioconda bioconductor-edger

# Create a dummy count matrix for demonstration purposes.
# In a real scenario, this 'counts.tsv' would be generated by a quantification tool
# like featureCounts, HTSeq, or Salmon/Kallisto followed by tximport.
cat << 'EOF' > counts.tsv
Gene	Polysome_Rep1	Polysome_Rep2	Input_Rep1	Input_Rep2
GeneA	100	120	50	60
GeneB	20	25	80	90
GeneC	500	550	480	520
GeneD	10	12	5	6
GeneE	300	310	320	330
EOF

# R script for differential expression analysis with edgeR
cat << 'EOF' > run_edger.R
library(edgeR)

# Load count data
# Assuming counts.tsv has genes as rows and samples as columns, with the first column as gene names
counts_data <- read.delim("counts.tsv", row.names = 1)

# Define experimental groups
# Adjust sample names and groups based on your actual data structure
samples <- colnames(counts_data)
group <- factor(c("Polysome", "Polysome", "Input", "Input")) # Assuming 2 polysome and 2 input samples

# Create a DGEList object
dge <- DGEList(counts=counts_data, group=group)

# Filter out lowly expressed genes (optional but highly recommended)
# This removes genes that are unlikely to be differentially expressed and improves statistical power.
keep <- filterByExpr(dge)
dge <- dge[keep,,keep.lib.sizes=FALSE]

# Normalize library sizes using TMM method
dge <- calcNormFactors(dge)

# Estimate dispersion
# The description states "dispersion = 0.06 estimated from the subset of experiments with multiple replicates."
# This implies that the standard edgeR dispersion estimation process was performed, and 0.06 was the resulting
# common dispersion value (or a value derived from it). The estimateDisp function performs this estimation.
# If a fixed dispersion value was to be explicitly used, it could be set as: dge$common.dispersion <- 0.06
dge <- estimateDisp(dge)

# Print estimated common dispersion (for verification)
cat("Estimated common dispersion: ", dge$common.dispersion, "\n")

# Fit a GLM (Generalized Linear Model)
design <- model.matrix(~group)
fit <- glmFit(dge, design)

# Perform likelihood ratio test for differential expression (Polysome vs Input)
# 'coef=2' typically corresponds to the second coefficient in the design matrix,
# which represents the log-fold-change for the 'Polysome' group relative to the reference 'Input' group.
lrt <- glmLRT(fit, coef=2)

# Get top differentially expressed genes
results <- topTags(lrt, n=Inf)$table

# Save results to a CSV file
write.csv(results, "differential_expression_results.csv", row.names = TRUE)

cat("Differential expression analysis complete. Results saved to differential_expression_results.csv\n")
EOF

# Execute the R script
Rscript run_edger.R

View on GitHub

Tools Used

STAR

Raw Source Text

First, adapter trimming was performed withÂ cutadaptÂ (v1.14).
Next, reads were mapped with STAR (v2.4.0i) against a database of repetitive elements, with mapping reads discarded from further analysis.
Remaining reads were then mapped to the mouse genome (mm10) with STAR (v2.4.0i)
Stranded reads mapping to GENCODE (vM20) genes were quantified with featureCounts (subread-1.6.4). Mitochondrial genes were discarded, and only genes with at least 0.5 reads per million across all datasets were kept for further analysis.
Differential expression in polysome versus input was quantified with edgeR (v3.26.8) with dispersion = 0.06 estimated from the subset of experiments with multiple replicates.
Genome_build: mm10
Supplementary_files_format_and_content: EdgeR output files (*.txt.gz*) contain gene identifier, log(fold-change in polysome versus input), log(counts per million), and p-value (polysome versus input)

← Back to Analysis