GSE157048 Processing Pipeline

RNA-Seq code_examples 14 steps

Publication

AMPK regulation of Raptor and TSC2 mediate metformin effects on transcriptional control of anabolism and inflammation.

Genes & development (2020) — PMID 32912901

Dataset

RNA-seq of metformin treatment in primary murine hepatocytes in WT, Raptor Ser-Ala mutant, Tsc2-null, Raptor mutant;Tsc2-null, and Ampk-null cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw RNA sequencing reads in FASTQ files were quality-tested using FASTQC (v0.11.8) (Andrews 2010) and mapped to the mouse reference genome (mm10) with STAR (v2.5.3a) aligner with default parameters (Dobin et al.

STAR v2.5.3a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.5.3a

# Placeholder for STAR genome index directory (mm10)
# This directory should contain the pre-built STAR index files for the mouse mm10 reference genome.
STAR_GENOME_DIR="/path/to/star_index/mm10"

# Placeholder for input FASTQ files
# Assuming paired-end reads, adjust if single-end.
READ1_FASTQ="input_reads_R1.fastq.gz"
READ2_FASTQ="input_reads_R2.fastq.gz"

# Placeholder for output directory and prefix
OUTPUT_PREFIX="aligned_reads"
OUTPUT_DIR="./star_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run STAR alignment to the mouse reference genome (mm10) with default parameters
STAR \
  --genomeDir "${STAR_GENOME_DIR}" \
  --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
  --runThreadN 8 \
  --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}_" \
  --readFilesCommand zcat

View on GitHub

2

2012).

(Inferred with models/gemini-2.5-flash)

Raw or normalized gene expression levels were quantified across all the exons of RefSeq genes with analyzeRepeats.pl in HOMER (v4.11.1) (Heinz et al.

HOMER v4.11.1 GitHub

$ Bash example

# Install HOMER (if not already installed)
# Note: HOMER installation typically involves downloading and running a Perl script.
# Example (adjust path as needed):
# cd /path/to/install
# perl configureHomer.pl -install
# perl configureHomer.pl -install hg38 # To install human genome (hg38) annotation

# Define input BAM files (replace with actual file paths)
INPUT_BAM_FILES=("sample1.bam" "sample2.bam")

# Define output directory
OUTPUT_DIR="gene_expression_quantification"
mkdir -p "${OUTPUT_DIR}"

# Define genome and annotation type
GENOME="hg38" # Using hg38 as a common placeholder, adjust if different genome was used
ANNOTATION_TYPE="refseq" # As specified in the description

# Loop through each input BAM file and quantify gene expression
for bam_file in "${INPUT_BAM_FILES[@]}"; do
    sample_name=$(basename "${bam_file}" .bam)
    
    echo "Quantifying raw counts for ${sample_name}..."
    analyzeRepeats.pl "${ANNOTATION_TYPE}" "${GENOME}" -count "${bam_file}" -raw > "${OUTPUT_DIR}/${sample_name}_raw_counts.txt"
    
    echo "Quantifying RPKM for ${sample_name}..."
    analyzeRepeats.pl "${ANNOTATION_TYPE}" "${GENOME}" -count "${bam_file}" -rpkm > "${OUTPUT_DIR}/${sample_name}_rpkm.txt"
    
    echo "Finished quantification for ${sample_name}"
done

echo "All gene expression quantification complete."

View on GitHub

2010).

(Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# No specific command or parameters can be inferred from the description '2010)'.
# This description appears to be incomplete or a placeholder.

Differential expression analysis was performed with raw gene counts using R package, edgeR (v3.26.7) (Robinson et al.

edgeR v3.26.7 GitHub

$ Bash example

# Install R if not already installed (example for Ubuntu)
# sudo apt-get update
# sudo apt-get install r-base

# Install edgeR package in R if not already installed
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("edgeR")'

# Create dummy raw gene counts and sample metadata files for demonstration.
# In a real scenario, these would be provided as input.
# raw_gene_counts.tsv should contain gene IDs as row names and sample counts as columns.
# sample_metadata.tsv should contain sample IDs as row names and group information.

# Example: Create a dummy raw_gene_counts.tsv
echo -e "gene_id\tsample1_control\tsample2_control\tsample3_treatment\tsample4_treatment" > raw_gene_counts.tsv
echo -e "geneA\t100\t120\t500\t550" >> raw_gene_counts.tsv
echo -e "geneB\t50\t60\t20\t25" >> raw_gene_counts.tsv
echo -e "geneC\t200\t210\t220\t230" >> raw_gene_counts.tsv
echo -e "geneD\t10\t12\t100\t110" >> raw_gene_counts.tsv

# Example: Create a dummy sample_metadata.tsv
echo -e "sample_id\tgroup" > sample_metadata.tsv
echo -e "sample1_control\tcontrol" >> sample_metadata.tsv
echo -e "sample2_control\tcontrol" >> sample_metadata.tsv
echo -e "sample3_treatment\ttreatment" >> sample_metadata.tsv
echo -e "sample4_treatment\ttreatment" >> sample_metadata.tsv

# R script for edgeR differential expression analysis
Rscript -e '
  library(edgeR)

  # Load raw gene counts
  counts_data <- read.delim("raw_gene_counts.tsv", row.names = 1)
  counts_matrix <- as.matrix(counts_data)

  # Load sample metadata
  metadata <- read.delim("sample_metadata.tsv", row.names = 1)
  # Ensure metadata rows match count matrix columns
  group <- factor(metadata[colnames(counts_matrix), "group"])

  # Create a DGEList object
  dge <- DGEList(counts = counts_matrix, group = group)

  # Filter out lowly expressed genes (optional, but recommended)
  # keep <- filterByExpr(dge)
  # dge <- dge[keep,,keep.lib.sizes=FALSE]

  # Normalize library sizes (TMM normalization is default)
  dge <- calcNormFactors(dge)

  # Define the design matrix
  design <- model.matrix(~group)

  # Estimate dispersions
  dge <- estimateDisp(dge, design)

  # Fit the GLM
  fit <- glmFit(dge, design)

  # Perform likelihood ratio test for differential expression
  # coef=2 typically corresponds to the second factor level (e.g., "treatment" vs "control")
  lrt <- glmLRT(fit, coef = 2)

  # Get top differentially expressed genes
  results <- topTags(lrt, n = Inf, sort.by = "PValue")$table

  # Save results to a file
  write.csv(results, "edgeR_differential_expression_results.csv", row.names = TRUE)

  message("edgeR analysis complete. Results saved to edgeR_differential_expression_results.csv")
'

View on GitHub

2010), using replicates to compute within-group dispersion and correcting for batch effects.

DESeq (Inferred with models/gemini-2.5-flash) v1.0.0

$ Bash example

# This step describes a statistical analysis process, specifically using replicates to compute
# within-group dispersion and correcting for batch effects, as introduced around 2010.
# The original DESeq package (Anders and Huber, 2010) implemented these concepts for RNA-seq data.

# Installation of DESeq (if not already installed):
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq')"

# Example R script (e.g., analyze_data_deseq.R) that would perform these operations:
# library(DESeq)
#
# # Placeholder for loading count data and metadata
# # count_matrix <- read.csv("input_counts.csv", row.names = 1)
# # metadata <- read.csv("sample_metadata.csv", row.names = 1)
#
# # Create CountDataSet object
# # cds <- newCountDataSet(count_matrix, metadata$condition) # 'condition' is a placeholder column
#
# # Estimate size factors
# # cds <- estimateSizeFactors(cds)
#
# # Estimate dispersions (this computes within-group dispersion)
# # cds <- estimateDispersions(cds, method = "per-condition", sharingMode = "fit-only", fitType = "local")
#
# # For batch effects, the original DESeq package primarily focused on differential testing.
# # Batch effects could be addressed by including them in the experimental design or by using
# # methods like limma's removeBatchEffect on transformed data (e.g., variance-stabilized data).
#
# # Perform differential expression test (example)
# # res <- nbinomTest(cds, "control", "treated")
#
# # Save results
# # write.csv(res, "differential_results_deseq.csv")

# Execute a hypothetical R script that performs the described analysis using DESeq.
# Replace 'input_counts.csv', 'sample_metadata.csv', and 'analysis_results.csv'
# with actual file paths and parameters as per the specific pipeline.
Rscript analyze_data_deseq.R --counts input_counts.csv --metadata sample_metadata.csv --output analysis_results.csv

Genes with false discovery rate (FDR) < 0.05 and an absolute fold-change (FC) >= 1.5 were identified as significantly different when comparing two conditions.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 GitHub

$ Bash example

# This script assumes that a differential expression analysis has already been performed
# (e.g., using DESeq2, edgeR, or limma) and has produced a tab-separated file (TSV)
# containing gene-level statistics, including log2FoldChange and adjusted p-value (FDR).
#
# For this example, we assume the input file 'deseq2_results.tsv' has the following columns:
# Column 1: gene_id
# Column 2: log2FoldChange
# Column 6: padj (FDR)
# Please adjust column numbers ($2, $6) if your file format differs.

# Calculate the log2(1.5) threshold for fold-change. An absolute fold-change of 1.5
# corresponds to a log2FoldChange of approximately 0.585 or -0.585.
LOG2_FC_THRESHOLD=$(echo "l(1.5)/l(2)" | bc -l)

# Filter genes based on the specified criteria:
# 1. False Discovery Rate (FDR) < 0.05 (column 6)
# 2. Absolute Fold-Change (FC) >= 1.5, which translates to absolute log2FoldChange >= LOG2_FC_THRESHOLD (column 2)
# The 'awk' command prints the header (NR==1) and then filters subsequent rows.
awk -v THRESH="$LOG2_FC_THRESHOLD" 'BEGIN{FS="\t"; OFS="\t"}
    NR==1 {print; next} # Print header row
    $6 < 0.05 && ( $2 >= THRESH || $2 <= -THRESH ) {print}' deseq2_results.tsv > significantly_different_genes.tsv

View on GitHub

Normalized counts were log transformed, filtered, scaled, and centered prior to clustering and heatmap generation with Cluster3 (de Hoon et al.

Cluster 3.0 v3.0

$ Bash example

# Install Cluster 3.0 (C version)
# conda install -c bioconda cluster3

# Assuming 'normalized_counts.tsv' is the input file containing normalized counts.
# The file should be in a tab-separated format, with gene IDs as the first column
# and sample names as the header row.

# Log transform, center, scale, cluster, and prepare for heatmap generation.
# -l: Log transform data (log2 by default).
# -g: Center genes (rows) by mean.
# -n: Normalize genes (rows) by standard deviation.
# -e: Cluster genes (rows) using hierarchical clustering.
# -s: Cluster arrays (columns) using hierarchical clustering.
# -r 0: Use uncentered correlation as distance metric for genes.
# -c 0: Use uncentered correlation as distance metric for arrays.
# -k 2: Use average linkage for genes.
# -x 2: Use average linkage for arrays.
# -f clustered_data: Output file prefix for .cdt, .gtr, .atr files.
cluster3 -l -g -n -e -s -r 0 -c 0 -k 2 -x 2 -f clustered_data normalized_counts.tsv

# The generated .cdt, .gtr, and .atr files can then be visualized using Java TreeView
# or other compatible heatmap visualization tools.

2004), and visualized with JavaTreeView (Saldanha 2004).

JavaTreeView v2004

$ Bash example

# Java TreeView is a GUI application for visualizing phylogenetic trees.
# This command assumes Java is installed and the 'javatreeview.jar' file is accessible.
# Replace 'javatreeview.jar' with the actual path to the Java TreeView executable JAR file.
java -jar javatreeview.jar

GSEA was carried out with the GenePattern interface, https://genepattern.broadinstitute.org) using preranked lists generated from FDR values, setting gene set permutations to 1000 and using the Hallmark collection in MSigDB version5.0.

GSEA v2.2.4 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install GSEA (example, adjust for specific environment)
# GSEA is typically downloaded as a Java application from the Broad Institute website.
# Ensure Java Runtime Environment (JRE) is installed.
# Download GSEA_2.2.4.jar or the appropriate version from https://www.gsea-msigdb.org/gsea/downloads.jsp
# Place the gsea-cli.sh script (included in the GSEA download) in your PATH or specify its full path.

# Download MSigDB v5.0 Hallmark gene sets
# mkdir -p msigdb_v5.0
# wget -O msigdb_v5.0/h.all.v5.0.symbols.gmt https://data.broadinstitute.org/gsea-msigdb/msigdb_versions/GSEA_MSigDB_v5.0/h.all.v5.0.symbols.gmt

# Placeholder for the preranked list file
# Replace 'your_preranked_list.rnk' with the actual file generated from FDR values.
# This file should be a tab-separated file with two columns: GENE_SYMBOL and RANK_METRIC.
# Example content of your_preranked_list.rnk:
# GENE_SYMBOL\tRANK_METRIC
# TP53\t2.5
# MYC\t1.8
# ...

# Run GSEA Preranked analysis
# The GenePattern interface would execute a similar command internally.
gsea-cli.sh GSEA_Preranked \
    -rnk your_preranked_list.rnk \
    -gmx msigdb_v5.0/h.all.v5.0.symbols.gmt \
    -nperm 1000 \
    -out gsea_output_hallmark \
    -set_max 500 \
    -set_min 15 \
    -plot_top_x 20 \
    -rnd_type no_balance \
    -permute_type gene_set \
    -gui false

View on GitHub

GO Term and KEGG Pathway analysis were performed with David: http://david.ncifcrf.gov/.

DAVID v6.8 (Inferred from http://david.ncifcrf.gov/)

$ Bash example

# DAVID is a web-based bioinformatics resource and does not have a direct command-line execution.
# Analysis is performed by uploading gene lists to the web interface.
#
# Steps typically involve:
# 1. Navigating to http://david.ncifcrf.gov/
# 2. Uploading a gene list (e.g., Entrez Gene IDs, Official Gene Symbols, etc.)
# 3. Selecting the desired analysis options, such as 'GO Term' and 'KEGG Pathway'.
# 4. Running the analysis and interpreting the results through the web interface.
#
# No direct bash command is applicable for the execution of DAVID.

Area-proportional Venn diagrams were plotted using BioVenn (Hulsen et al.

BioVenn vN/A GitHub

$ Bash example

# Install biovenn Python package if not already installed
# pip install biovenn

# Create dummy input files for demonstration.
# In a real bioinformatics pipeline, these would be actual lists of identifiers
# (e.g., gene IDs, protein IDs, peak IDs) generated by upstream steps.
# Each file represents a set for the Venn diagram.
echo -e "geneA\ngeneB\ngeneC\ngeneF\ngeneI" > set1_items.txt
echo -e "geneB\ngeneC\ngeneD\ngeneG\ngeneJ" > set2_items.txt
echo -e "geneC\ngeneD\ngeneE\ngeneH\ngeneK" > set3_items.txt

# Python script to generate an area-proportional Venn diagram using the biovenn package.
# This package provides a programmatic way to achieve the functionality of the BioVenn web tool.
python -c "
import biovenn as bv

# Function to load items from a file
def load_items(filepath):
    with open(filepath, 'r') as f:
        return [line.strip() for line in f if line.strip()]

# Load the item lists from the created files
list1 = load_items('set1_items.txt')
list2 = load_items('set2_items.txt')
list3 = load_items('set3_items.txt')

# Define names for the sets, which will appear on the Venn diagram
set_names = ['Set_A_Condition1', 'Set_B_Condition2', 'Set_C_Condition3']

# Generate and save the 3-set area-proportional Venn diagram.
# The 'filename' parameter specifies the output image file (e.g., PNG, SVG).
# The biovenn package automatically calculates and renders the diagram with areas
# proportional to the sizes of the sets and their overlaps, as described.
# For more than 3 sets, biovenn offers venn2, venn3, venn4, venn5, venn6.
# For this example, we assume 3 sets based on common usage in pipelines.
output_filename = 'area_proportional_venn_diagram.png'
bv.venn3(list1, list2, list3, names=set_names, filename=output_filename)

print(f'Successfully generated area-proportional Venn diagram: {output_filename}')
"

View on GitHub

13

2008).

(Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash)

Enrichr: http://amp.pharm.mssm.edu/Enrichr.

Enrichr vLatest (Web-based service)

$ Bash example

# Install gseapy if not already installed. gseapy is a Python package that provides a command-line interface to interact with the Enrichr web service.
# conda install -c bioconda gseapy

# Create a dummy gene list file for demonstration purposes.
# In a real bioinformatics pipeline, this file would be generated by a previous step
# (e.g., differential expression analysis results, a list of significant genes).
cat <<EOF > gene_list.txt
TP53
MYC
EGFR
JUN
FOS
EOF

# Perform gene set enrichment analysis using Enrichr via the gseapy command-line tool.
# -i: Input gene list file (one gene symbol per line).
# -o: Output directory where Enrichr results will be saved.
# -d: Gene set libraries to use for enrichment analysis. These are the reference datasets
#     provided by Enrichr. Common libraries include KEGG, GO Biological Process, Reactome, etc.
#     Multiple libraries can be specified, separated by commas.
#     For a full list of available libraries, refer to the Enrichr website or gseapy documentation.
# --no-plot: Optional flag to prevent gseapy from generating plots, useful for automated pipelines.
gseapy enrichr -i gene_list.txt -o enrichr_results -d "KEGG_2021_Human,GO_Biological_Process_2023,Reactome_2023" --no-plot

Tools Used

STAR HOMER

Raw Source Text

Raw RNA sequencing reads in FASTQ files were quality-tested using FASTQC (v0.11.8) (Andrews 2010) and mapped to the mouse reference genome (mm10) with STAR (v2.5.3a) aligner with default parameters (Dobin et al. 2012).
Raw or normalized gene expression levels were quantified across all the exons of RefSeq genes with analyzeRepeats.pl in HOMER (v4.11.1) (Heinz et al. 2010).
Differential expression analysis was performed with raw gene counts using R package, edgeR (v3.26.7) (Robinson et al. 2010), using replicates to compute within-group dispersion and correcting for batch effects. Genes with false discovery rate (FDR) < 0.05 and an absolute fold-change (FC) >= 1.5 were identified as significantly different when comparing two conditions.
Normalized counts were log transformed, filtered, scaled, and centered prior to clustering and heatmap generation with Cluster3 (de Hoon et al. 2004), and visualized with JavaTreeView (Saldanha 2004).
GSEA was carried out with the GenePattern interface, https://genepattern.broadinstitute.org) using preranked lists generated from FDR values, setting gene set permutations to 1000 and using the Hallmark collection in MSigDB version5.0. GO Term and KEGG Pathway analysis were performed with David: http://david.ncifcrf.gov/.
Area-proportional Venn diagrams were plotted using BioVenn (Hulsen et al. 2008). Enrichr: http://amp.pharm.mssm.edu/Enrichr.
Genome_build: mm10
Supplementary_files_format_and_content: Normalized expression table generated with analyzeRepeats.pl. Please see header for specific arguments used.

← Back to Analysis