GSE157051 Processing Pipeline

GSE code_examples 14 steps

Publication

AMPK regulation of Raptor and TSC2 mediate metformin effects on transcriptional control of anabolism and inflammation.

Genes & development (2020) — PMID 32912901

Dataset

AMPK regulation of Raptor and TSC2 mediate Metformin effects on transcriptional control of anabolism and inflammation

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw RNA sequencing reads in FASTQ files were quality-tested using FASTQC (v0.11.8) (Andrews 2010) and mapped to the mouse reference genome (mm10) with STAR (v2.5.3a) aligner with default parameters (Dobin et al.

STAR v2.5.3a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star=2.5.3a

# Placeholder for STAR genome index directory.
# This directory should contain the STAR index files generated for mm10.
# Example command to generate index (run once):
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/mm10 \
#      --genomeFastaFiles /path/to/mm10.fa \
#      --sjdbGTFfile /path/to/mm10.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN 8

# Align raw RNA sequencing reads (FASTQ files) to the mouse reference genome (mm10) using STAR (v2.5.3a) with default parameters.
# Assuming paired-end reads and gzipped FASTQ files.
STAR --genomeDir /path/to/STAR_index/mm10 \
     --readFilesIn read1.fastq.gz read2.fastq.gz \
     --readFilesCommand zcat \
     --runThreadN 8 \
     --outFileNamePrefix sample_aligned_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard

View on GitHub

2012).

Unknown (Inferred with models/gemini-2.5-flash) vUnknown (Inferred with models/gemini-2.5-flash)

$ Bash example

# No specific tool or command could be inferred from the description "2012)".
# Please provide a more detailed step description to generate a relevant bash command.

Raw or normalized gene expression levels were quantified across all the exons of RefSeq genes with analyzeRepeats.pl in HOMER (v4.11.1) (Heinz et al.

HOMER v4.11.1 GitHub

$ Bash example

# Install HOMER (v4.11.1)
# wget http://homer.ucsd.edu/homer/configureHomer.pl
# perl configureHomer.pl -install
# perl configureHomer.pl -add hg38 # Add human hg38 genome and RefSeq annotations
# # Ensure HOMER scripts are in your PATH
# # export PATH=$PATH:/path/to/homer/bin

# Placeholder for input BAM file (aligned RNA-seq reads)
INPUT_BAM="input.bam"
OUTPUT_FILE="gene_expression_counts.txt"
GENOME_ASSEMBLY="hg38" # Assuming human hg38 for RefSeq genes

# Quantify raw or normalized gene expression levels across all exons of RefSeq genes
# using analyzeRepeats.pl from HOMER.
# -rna: Specifies RNA-seq analysis mode.
# -count exons: Counts reads mapping to exons.
# -gene <genome>: Uses gene annotations for the specified genome (e.g., RefSeq for hg38).
# -rpkm: Calculates RPKM values for normalization (as 'normalized gene expression' was mentioned).
# -d <input_bam_file>: Specifies the input alignment file(s).
analyzeRepeats.pl rna "${OUTPUT_FILE}" -count exons -gene "${GENOME_ASSEMBLY}" -rpkm -d "${INPUT_BAM}"

View on GitHub

4

2010).

(Inferred with models/gemini-2.5-flash)

Differential expression analysis was performed with raw gene counts using R package, edgeR (v3.26.7) (Robinson et al.

edgeR v3.26.7 GitHub

$ Bash example

# Install edgeR if not already installed
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("edgeR")

# Create dummy input files for demonstration
# raw_gene_counts.tsv: A tab-separated file with gene IDs and raw counts for samples
echo -e "GeneID\tSample1\tSample2\tSample3\tSample4" > raw_gene_counts.tsv
echo -e "GeneA\t100\t120\t50\t60" >> raw_gene_counts.tsv
echo -e "GeneB\t200\t210\t100\t110" >> raw_gene_counts.tsv
echo -e "GeneC\t50\t60\t150\t160" >> raw_gene_counts.tsv
echo -e "GeneD\t10\t15\t20\t25" >> raw_gene_counts.tsv

# sample_metadata.tsv: A tab-separated file with sample names and their group
echo -e "Sample\tGroup" > sample_metadata.tsv
echo -e "Sample1\tControl" >> sample_metadata.tsv
echo -e "Sample2\tControl" >> sample_metadata.tsv
echo -e "Sample3\tTreatment" >> sample_metadata.tsv
echo -e "Sample4\tTreatment" >> sample_metadata.tsv

# R script for edgeR differential expression analysis
Rscript -e '
library(edgeR)

# --- Configuration ---
counts_file <- "raw_gene_counts.tsv"
metadata_file <- "sample_metadata.tsv"
output_file <- "edgeR_DE_results.tsv"
group_column <- "Group" # Column in metadata_file defining experimental groups

# --- Load Data ---
counts_data <- read.delim(counts_file, row.names = 1, stringsAsFactors = FALSE)
metadata <- read.delim(metadata_file, row.names = 1, stringsAsFactors = FALSE)

# Ensure sample order matches
metadata <- metadata[colnames(counts_data), , drop = FALSE]

# Create DGEList object
dge <- DGEList(counts = counts_data)

# --- Filtering (optional but recommended) ---
# Keep genes with at least 10 counts per million (CPM) in at least 2 samples
keep <- filterByExpr(dge, group = metadata[[group_column]])
dge <- dge[keep, , keep.lib.sizes = FALSE]

# --- Normalization ---
dge <- calcNormFactors(dge)

# --- Design Matrix ---
group <- factor(metadata[[group_column]])
design <- model.matrix(~group)

# --- Estimate Dispersions ---
dge <- estimateDisp(dge, design)

# --- GLM Fitting and Testing ---
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, coef = 2) # Assuming "Treatment" is the second coefficient for comparison against "Control"

# --- Extract Results ---
results <- topTags(lrt, n = Inf, adjust.method = "BH")$table

# --- Save Results ---
write.table(results, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("Differential expression results saved to:", output_file))
'

View on GitHub

2010), using replicates to compute within-group dispersion and correcting for batch effects.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.42.0 GitHub

$ Bash example

# Install Bioconductor and DESeq2 if not already installed
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"

# The following R script (e.g., run_deseq2.R) performs the analysis:
# # run_deseq2.R
# library(DESeq2)
#
# # Load count data (e.g., from featureCounts or similar)
# # Assuming counts.tsv has genes as rows, samples as columns
# count_data <- read.table("counts.tsv", header=TRUE, row.names=1, sep="\t")
#
# # Load sample metadata
# # Assuming coldata.tsv has sample names as rows, metadata columns (e.g., condition, batch)
# # Ensure row names match column names of count_data
# coldata <- read.table("coldata.tsv", header=TRUE, row.names=1, sep="\t")
# coldata$condition <- factor(coldata$condition) # Example factor for experimental condition
# coldata$batch <- factor(coldata$batch) # Example factor for batch effect
#
# # Ensure order of samples in coldata matches count_data
# coldata <- coldata[colnames(count_data), ]
#
# # Create DESeqDataSet object
# # Design formula includes batch effect correction
# dds <- DESeqDataSetFromMatrix(countData = count_data,
#                               colData = coldata,
#                               design = ~ batch + condition) # Example design
#
# # Run DESeq2 analysis to estimate dispersion and differential expression
# dds <- DESeq(dds)
#
# # Get results (e.g., for condition comparison)
# res <- results(dds, contrast=c("condition", "treatment", "control")) # Example contrast
#
# # Save results
# write.csv(as.data.frame(res), file="deseq2_differential_expression_results.csv")
#
# # Optional: Save normalized counts
# norm_counts <- counts(dds, normalized=TRUE)
# write.csv(as.data.frame(norm_counts), file="deseq2_normalized_counts.csv")

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Genes with false discovery rate (FDR) < 0.05 and an absolute fold-change (FC) >= 1.5 were identified as significantly different when comparing two conditions.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 GitHub

$ Bash example

#!/bin/bash

# This script performs differential gene expression analysis using DESeq2
# based on the specified FDR and absolute Fold-Change thresholds.

# Prerequisites: R and Bioconductor's DESeq2 package installed.

# Example installation (uncomment to use):
# # Install R if not present (e.g., on Ubuntu/Debian)
# # sudo apt-get update
# # sudo apt-get install r-base
#
# # Install BiocManager and DESeq2 in R
# # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"

# --- Configuration ---
# Replace with your actual input files and column names
COUNT_MATRIX_FILE="counts.tsv" # Example: Tab-separated file with gene IDs as row names and sample counts as columns
COLDATA_FILE="coldata.tsv"     # Example: Tab-separated file with sample IDs as row names and metadata columns
OUTPUT_FILE="significant_genes.tsv"
CONDITION_COLUMN="condition"   # Column name in coldata.tsv that defines the conditions to compare
CONTROL_LEVEL="control"        # Name of the control group in the CONDITION_COLUMN
TREATMENT_LEVEL="treatment"    # Name of the treatment group in the CONDITION_COLUMN
FDR_THRESHOLD=0.05
ABS_FC_THRESHOLD=1.5

# Create dummy input files for demonstration if they don't exist
if [ ! -f "$COUNT_MATRIX_FILE" ]; then
    echo -e "gene\tsample1_ctrl\tsample2_ctrl\tsample3_treat\tsample4_treat" > "$COUNT_MATRIX_FILE"
    echo -e "geneA\t100\t120\t250\t280" >> "$COUNT_MATRIX_FILE"
    echo -e "geneB\t50\t60\t40\t35" >> "$COUNT_MATRIX_FILE"
    echo -e "geneC\t200\t210\t205\t215" >> "$COUNT_MATRIX_FILE"
    echo -e "geneD\t10\t12\t30\t35" >> "$COUNT_MATRIX_FILE"
    echo -e "geneE\t500\t520\t400\t380" >> "$COUNT_MATRIX_FILE"
    echo "Created dummy $COUNT_MATRIX_FILE"
fi

if [ ! -f "$COLDATA_FILE" ]; then
    echo -e "sample\tcondition" > "$COLDATA_FILE"
    echo -e "sample1_ctrl\tcontrol" >> "$COLDATA_FILE"
    echo -e "sample2_ctrl\tcontrol" >> "$COLDATA_FILE"
    echo -e "sample3_treat\ttreatment" >> "$COLDATA_FILE"
    echo -e "sample4_treat\ttreatment" >> "$COLDATA_FILE"
    echo "Created dummy $COLDATA_FILE"
fi

# Execute R script
Rscript -e "
    library(DESeq2)

    COUNT_MATRIX_FILE='${COUNT_MATRIX_FILE}'
    COLDATA_FILE='${COLDATA_FILE}'
    OUTPUT_FILE='${OUTPUT_FILE}'
    CONDITION_COLUMN='${CONDITION_COLUMN}'
    CONTROL_LEVEL='${CONTROL_LEVEL}'
    TREATMENT_LEVEL='${TREATMENT_LEVEL}'
    FDR_THRESHOLD=${FDR_THRESHOLD}
    ABS_FC_THRESHOLD=${ABS_FC_THRESHOLD}
    LOG2FC_THRESHOLD=log2(ABS_FC_THRESHOLD)

    count_data <- read.table(COUNT_MATRIX_FILE, header=TRUE, row.names=1, sep='\t', check.names=FALSE)
    count_data <- round(count_data)

    coldata <- read.table(COLDATA_FILE, header=TRUE, row.names=1, sep='\t', check.names=FALSE)
    coldata <- coldata[colnames(count_data), , drop=FALSE]
    coldata[[CONDITION_COLUMN]] <- factor(coldata[[CONDITION_COLUMN]], levels=c(CONTROL_LEVEL, TREATMENT_LEVEL))

    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = coldata,
                                  design = as.formula(paste('~', CONDITION_COLUMN)))

    dds <- DESeq(dds)

    res <- results(dds, contrast=c(CONDITION_COLUMN, TREATMENT_LEVEL, CONTROL_LEVEL))

    significant_res <- subset(res, padj < FDR_THRESHOLD)
    significant_res <- subset(significant_res, abs(log2FoldChange) >= LOG2FC_THRESHOLD)
    significant_res <- significant_res[order(significant_res$padj), ]

    write.table(as.data.frame(significant_res), file=OUTPUT_FILE, sep='\t', quote=FALSE, row.names=TRUE)

    message(paste('Significant genes (FDR <', FDR_THRESHOLD, 'and |FC| >=', ABS_FC_THRESHOLD, ') saved to', OUTPUT_FILE))
"

View on GitHub

Normalized counts were log transformed, filtered, scaled, and centered prior to clustering and heatmap generation with Cluster3 (de Hoon et al.

Cluster3 v3.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Cluster 3.0 (example using apt, actual installation might vary)
# sudo apt-get update
# sudo apt-get install cluster3

# Input file: Tab-separated file of normalized counts (e.g., genes x samples)
INPUT_FILE="input_normalized_counts.tsv"
OUTPUT_PREFIX="clustered_data"

# Perform log transformation, filtering, scaling, centering, and hierarchical clustering
# -l: Log transform data (base 2 by default)
# -sf 0.5: Filter genes with standard deviation below 0.5 (example threshold for filtering)
# -g 1: Mean-center genes (centering)
# -s g: Normalize genes by standard deviation (scaling)
# -a 1: Mean-center arrays/samples (centering)
# -s a: Normalize arrays/samples by standard deviation (scaling)
# -x: Cluster genes
# -y: Cluster arrays/samples
# -m a: Use average linkage method for clustering
# -e c: Use correlation as distance metric
# -o ${OUTPUT_PREFIX}: Specify output file prefix
cluster3 -l -sf 0.5 -g 1 -s g -a 1 -s a -x -y -m a -e c -o "${OUTPUT_PREFIX}" "${INPUT_FILE}"

# The output files (e.g., clustered_data.cdt, clustered_data.gtr, clustered_data.atr)
# can then be used with visualization tools like Java TreeView to generate heatmaps.

View on GitHub

2004), and visualized with JavaTreeView (Saldanha 2004).

JavaTreeView v2004

$ Bash example

# Ensure Java Runtime Environment (JRE) is installed
# For example, on Ubuntu:
# sudo apt update
# sudo apt install default-jre

# Download JavaTreeView (replace with actual download link if available)
# wget -O JavaTreeView.jar "http://jtreeview.sourceforge.net/download/JavaTreeView.jar" # Placeholder URL, check SourceForge for latest

# Run JavaTreeView to visualize data. JavaTreeView typically opens a GUI.
# If it supports command-line input for a specific file, it might look like:
# java -jar JavaTreeView.jar your_data_file.cdt
# Without specific input files mentioned, a generic launch command is provided:
java -jar JavaTreeView.jar

GSEA was carried out with the GenePattern interface, https://genepattern.broadinstitute.org) using preranked lists generated from FDR values, setting gene set permutations to 1000 and using the Hallmark collection in MSigDB version5.0.

GSEA v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Download GSEA command-line tool (replace with actual version if known)
# wget https://data.broadinstitute.org/gsea-msigdb/gsea_latest_versions/GSEA_4.3.2/gsea-cli.zip
# unzip gsea-cli.zip
# mv gsea-cli/gsea-cli.jar .

# Download MSigDB Hallmark collection v5.0
# Note: Direct link for v5.0 Hallmark might be difficult to find. Typically, you download the full MSigDB and extract the Hallmark collection.
# For demonstration, assuming 'h.all.v5.0.symbols.gmt' is available.
# wget https://data.broadinstitute.org/gsea-msigdb/msigdb/release/5.0/msigdb.v5.0.symbols.gmt
# (Then, you would filter this file to get only the Hallmark collection, or use a pre-filtered one if available)

# Example preranked list (replace with actual input file)
# echo -e "GENE1\t0.95\nGENE2\t0.88\n..." > input.rnk

# Run GSEA Preranked analysis
# Adjust -Xmx (memory) as needed
java -Xmx4000m -cp gsea-cli.jar xtools.gsea.GseaPreranked \
  -rnk input.rnk \
  -gmx h.all.v5.0.symbols.gmt \
  -o gsea_output \
  -nperm 1000 \
  -set_max 500 \
  -set_min 15 \
  -rnd_type no_balance \
  -plot_top_x 20 \
  -collapse false \
  -norm meandiv

View on GitHub

GO Term and KEGG Pathway analysis were performed with David: http://david.ncifcrf.gov/.

DAVID v6.8 (Inferred with models/gemini-2.5-flash)

$ Bash example

# DAVID is a web-based tool for functional annotation and enrichment analysis.
# The following conceptual steps outline its usage:

# 1. Prepare your gene list (e.g., Entrez Gene IDs, Official Gene Symbols)
#    Example: Create a dummy gene list file
# echo -e "TP53\nMYC\nJUN\nFOS\nEGR1" > gene_list.txt

# 2. Access the DAVID web interface: http://david.ncifcrf.gov/
# 3. Upload your gene list (e.g., gene_list.txt) to the DAVID website.
# 4. Select "Gene List" as the input type and choose the appropriate identifier.
# 5. Select "Functional Annotation Chart" or "Functional Annotation Clustering".
# 6. Ensure "GO Term" (Biological Process, Molecular Function, Cellular Component)
#    and "KEGG Pathway" categories are selected for analysis.
# 7. Run the analysis.
# 8. Download the results (e.g., as a tab-separated file).
#    The output would typically be a file containing enrichment scores, p-values,
#    and associated genes for each GO term and KEGG pathway.
#    Example output file name: david_enrichment_results.tsv

# Note: Direct command-line execution for DAVID is not standard as it's a web service.
# This block describes the typical workflow.

Area-proportional Venn diagrams were plotted using BioVenn (Hulsen et al.

BioVenn vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install biovenn Python package
# pip install biovenn matplotlib

# Create a Python script to generate the Venn diagram
cat << 'EOF' > generate_venn.py
import biovenn
import matplotlib.pyplot as plt

# Placeholder for actual data loading or generation from the pipeline.
# In a real pipeline, these lists would be generated from previous steps
# (e.g., from peak calling results, gene lists, etc.).
# For demonstration, let's assume we have text files with one item per line:
# with open('list1.txt', 'r') as f:
#     list1 = set(f.read().splitlines())
# with open('list2.txt', 'r') as f:
#     list2 = set(f.read().splitlines())
# with open('list3.txt', 'r') as f:
#     list3 = set(f.read().splitlines())

# For this example, we'll use hardcoded lists representing elements from different sets
list1 = set(['element_A', 'element_B', 'element_C', 'element_D', 'element_E'])
list2 = set(['element_C', 'element_D', 'element_F', 'element_G'])
list3 = set(['element_D', 'element_E', 'element_H', 'element_I'])

# Define labels for the sets (e.g., 'Replicate 1', 'Replicate 2', 'Replicate 3')
labels = ('Set 1', 'Set 2', 'Set 3')

# Output filename for the generated Venn diagram image
output_filename = 'area_proportional_venn_diagram.png'

# Generate the 3-set area-proportional Venn diagram.
# BioVenn uses matplotlib internally to save the figure.
# The 'filename' parameter saves the plot directly to a file.
biovenn.venn3(list1, list2, list3, labels=labels, filename=output_filename)

print(f"Area-proportional Venn diagram saved to {output_filename}")
EOF

# Execute the Python script to generate the Venn diagram
python generate_venn.py

View on GitHub

2008).

Unknown (Inferred with models/gemini-2.5-flash) vUnknown (Inferred with models/gemini-2.5-flash)

$ Bash example

No specific tool, parameters, or reference datasets can be inferred from the description '2008).'. Please provide more context about the bioinformatics step.

Enrichr: http://amp.pharm.mssm.edu/Enrichr.

Enrichr vAPI GitHub

$ Bash example

# Install gseapy, a Python wrapper for Enrichr API
# pip install gseapy

# Example: Create a dummy gene list file (replace with your actual gene list)
# echo "TP53" > gene_list.txt
# echo "MYC" >> gene_list.txt
# echo "EGFR" >> gene_list.txt

python -c "
import gseapy as gp
import pandas as pd
import os

# Define input gene list file path
gene_list_file = 'gene_list.txt'

# Load gene list from the file
try:
    with open(gene_list_file, 'r') as f:
        gene_list = [line.strip() for line in f if line.strip()]
except FileNotFoundError:
    print(f'Error: {gene_list_file} not found. Please ensure your gene list file exists.')
    exit(1)

if not gene_list:
    print(f'Error: {gene_list_file} is empty. Please provide a list of gene symbols.')
    exit(1)

# Define output directory
outdir = 'enrichr_results'
os.makedirs(outdir, exist_ok=True)

# Define gene set libraries to use (e.g., 'GO_Biological_Process_2023', 'KEGG_2021_Human')
# You can find available libraries on the Enrichr website or by using gp.get_library_name()
gene_set_libraries = ['GO_Biological_Process_2023', 'KEGG_2021_Human'] # Placeholder libraries

# Run Enrichr analysis
print(f'Running Enrichr analysis for {len(gene_list)} genes using libraries: {', '.join(gene_set_libraries)}...')
enr = gp.enrichr(gene_list=gene_list,
                 gene_sets=gene_set_libraries,
                 organism='human', # Specify organism (e.g., 'human', 'mouse', 'rat')
                 outdir=outdir,
                 cutoff=0.05, # p-value cutoff for significance
                 format='png', # Output format for plots (e.g., 'png', 'pdf', 'svg')
                 no_plot=False, # Set to True if you don't want plots
                 verbose=True)

# Save results to a CSV file
if enr.res is not None and not enr.res.empty:
    output_csv_path = os.path.join(outdir, 'enrichr_output.csv')
    enr.res.to_csv(output_csv_path, index=False)
    print(f'Enrichment results saved to {output_csv_path}')
else:
    print('No significant enrichment results found or an error occurred during analysis.')

print(f'Enrichr plots (if generated) saved to {outdir}')
"

View on GitHub

Tools Used

STAR HOMER

Raw Source Text

Raw RNA sequencing reads in FASTQ files were quality-tested using FASTQC (v0.11.8) (Andrews 2010) and mapped to the mouse reference genome (mm10) with STAR (v2.5.3a) aligner with default parameters (Dobin et al. 2012).
Raw or normalized gene expression levels were quantified across all the exons of RefSeq genes with analyzeRepeats.pl in HOMER (v4.11.1) (Heinz et al. 2010).
Differential expression analysis was performed with raw gene counts using R package, edgeR (v3.26.7) (Robinson et al. 2010), using replicates to compute within-group dispersion and correcting for batch effects. Genes with false discovery rate (FDR) < 0.05 and an absolute fold-change (FC) >= 1.5 were identified as significantly different when comparing two conditions.
Normalized counts were log transformed, filtered, scaled, and centered prior to clustering and heatmap generation with Cluster3 (de Hoon et al. 2004), and visualized with JavaTreeView (Saldanha 2004).
GSEA was carried out with the GenePattern interface, https://genepattern.broadinstitute.org) using preranked lists generated from FDR values, setting gene set permutations to 1000 and using the Hallmark collection in MSigDB version5.0. GO Term and KEGG Pathway analysis were performed with David: http://david.ncifcrf.gov/.
Area-proportional Venn diagrams were plotted using BioVenn (Hulsen et al. 2008). Enrichr: http://amp.pharm.mssm.edu/Enrichr.
Genome_build: mm10
Supplementary_files_format_and_content: Normalized expression table generated with analyzeRepeats.pl. Please see header for specific arguments used.

← Back to Analysis