GSE157051 Processing Pipeline
Publication
AMPK regulation of Raptor and TSC2 mediate metformin effects on transcriptional control of anabolism and inflammation.Genes & development (2020) — PMID 32912901
Dataset
GSE157051AMPK regulation of Raptor and TSC2 mediate Metformin effects on transcriptional control of anabolism and inflammation
Processing Steps
Generate Jupyter Notebook-
1
Raw RNA sequencing reads in FASTQ files were quality-tested using FASTQC (v0.11.8) (Andrews 2010) and mapped to the mouse reference genome (mm10) with STAR (v2.5.3a) aligner with default parameters (Dobin et al.
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star=2.5.3a # Placeholder for STAR genome index directory. # This directory should contain the STAR index files generated for mm10. # Example command to generate index (run once): # STAR --runMode genomeGenerate \ # --genomeDir /path/to/STAR_index/mm10 \ # --genomeFastaFiles /path/to/mm10.fa \ # --sjdbGTFfile /path/to/mm10.gtf \ # --sjdbOverhang 100 \ # --runThreadN 8 # Align raw RNA sequencing reads (FASTQ files) to the mouse reference genome (mm10) using STAR (v2.5.3a) with default parameters. # Assuming paired-end reads and gzipped FASTQ files. STAR --genomeDir /path/to/STAR_index/mm10 \ --readFilesIn read1.fastq.gz read2.fastq.gz \ --readFilesCommand zcat \ --runThreadN 8 \ --outFileNamePrefix sample_aligned_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard -
2
2012).
Unknown (Inferred with models/gemini-2.5-flash) vUnknown (Inferred with models/gemini-2.5-flash)$ Bash example
# No specific tool or command could be inferred from the description "2012)". # Please provide a more detailed step description to generate a relevant bash command.
-
3
Raw or normalized gene expression levels were quantified across all the exons of RefSeq genes with analyzeRepeats.pl in HOMER (v4.11.1) (Heinz et al.
$ Bash example
# Install HOMER (v4.11.1) # wget http://homer.ucsd.edu/homer/configureHomer.pl # perl configureHomer.pl -install # perl configureHomer.pl -add hg38 # Add human hg38 genome and RefSeq annotations # # Ensure HOMER scripts are in your PATH # # export PATH=$PATH:/path/to/homer/bin # Placeholder for input BAM file (aligned RNA-seq reads) INPUT_BAM="input.bam" OUTPUT_FILE="gene_expression_counts.txt" GENOME_ASSEMBLY="hg38" # Assuming human hg38 for RefSeq genes # Quantify raw or normalized gene expression levels across all exons of RefSeq genes # using analyzeRepeats.pl from HOMER. # -rna: Specifies RNA-seq analysis mode. # -count exons: Counts reads mapping to exons. # -gene <genome>: Uses gene annotations for the specified genome (e.g., RefSeq for hg38). # -rpkm: Calculates RPKM values for normalization (as 'normalized gene expression' was mentioned). # -d <input_bam_file>: Specifies the input alignment file(s). analyzeRepeats.pl rna "${OUTPUT_FILE}" -count exons -gene "${GENOME_ASSEMBLY}" -rpkm -d "${INPUT_BAM}" -
4
2010).
(Inferred with models/gemini-2.5-flash) -
5
Differential expression analysis was performed with raw gene counts using R package, edgeR (v3.26.7) (Robinson et al.
$ Bash example
# Install edgeR if not already installed # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("edgeR") # Create dummy input files for demonstration # raw_gene_counts.tsv: A tab-separated file with gene IDs and raw counts for samples echo -e "GeneID\tSample1\tSample2\tSample3\tSample4" > raw_gene_counts.tsv echo -e "GeneA\t100\t120\t50\t60" >> raw_gene_counts.tsv echo -e "GeneB\t200\t210\t100\t110" >> raw_gene_counts.tsv echo -e "GeneC\t50\t60\t150\t160" >> raw_gene_counts.tsv echo -e "GeneD\t10\t15\t20\t25" >> raw_gene_counts.tsv # sample_metadata.tsv: A tab-separated file with sample names and their group echo -e "Sample\tGroup" > sample_metadata.tsv echo -e "Sample1\tControl" >> sample_metadata.tsv echo -e "Sample2\tControl" >> sample_metadata.tsv echo -e "Sample3\tTreatment" >> sample_metadata.tsv echo -e "Sample4\tTreatment" >> sample_metadata.tsv # R script for edgeR differential expression analysis Rscript -e ' library(edgeR) # --- Configuration --- counts_file <- "raw_gene_counts.tsv" metadata_file <- "sample_metadata.tsv" output_file <- "edgeR_DE_results.tsv" group_column <- "Group" # Column in metadata_file defining experimental groups # --- Load Data --- counts_data <- read.delim(counts_file, row.names = 1, stringsAsFactors = FALSE) metadata <- read.delim(metadata_file, row.names = 1, stringsAsFactors = FALSE) # Ensure sample order matches metadata <- metadata[colnames(counts_data), , drop = FALSE] # Create DGEList object dge <- DGEList(counts = counts_data) # --- Filtering (optional but recommended) --- # Keep genes with at least 10 counts per million (CPM) in at least 2 samples keep <- filterByExpr(dge, group = metadata[[group_column]]) dge <- dge[keep, , keep.lib.sizes = FALSE] # --- Normalization --- dge <- calcNormFactors(dge) # --- Design Matrix --- group <- factor(metadata[[group_column]]) design <- model.matrix(~group) # --- Estimate Dispersions --- dge <- estimateDisp(dge, design) # --- GLM Fitting and Testing --- fit <- glmFit(dge, design) lrt <- glmLRT(fit, coef = 2) # Assuming "Treatment" is the second coefficient for comparison against "Control" # --- Extract Results --- results <- topTags(lrt, n = Inf, adjust.method = "BH")$table # --- Save Results --- write.table(results, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE) message(paste("Differential expression results saved to:", output_file)) ' -
6
2010), using replicates to compute within-group dispersion and correcting for batch effects.
$ Bash example
# Install Bioconductor and DESeq2 if not already installed # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # The following R script (e.g., run_deseq2.R) performs the analysis: # # run_deseq2.R # library(DESeq2) # # # Load count data (e.g., from featureCounts or similar) # # Assuming counts.tsv has genes as rows, samples as columns # count_data <- read.table("counts.tsv", header=TRUE, row.names=1, sep="\t") # # # Load sample metadata # # Assuming coldata.tsv has sample names as rows, metadata columns (e.g., condition, batch) # # Ensure row names match column names of count_data # coldata <- read.table("coldata.tsv", header=TRUE, row.names=1, sep="\t") # coldata$condition <- factor(coldata$condition) # Example factor for experimental condition # coldata$batch <- factor(coldata$batch) # Example factor for batch effect # # # Ensure order of samples in coldata matches count_data # coldata <- coldata[colnames(count_data), ] # # # Create DESeqDataSet object # # Design formula includes batch effect correction # dds <- DESeqDataSetFromMatrix(countData = count_data, # colData = coldata, # design = ~ batch + condition) # Example design # # # Run DESeq2 analysis to estimate dispersion and differential expression # dds <- DESeq(dds) # # # Get results (e.g., for condition comparison) # res <- results(dds, contrast=c("condition", "treatment", "control")) # Example contrast # # # Save results # write.csv(as.data.frame(res), file="deseq2_differential_expression_results.csv") # # # Optional: Save normalized counts # norm_counts <- counts(dds, normalized=TRUE) # write.csv(as.data.frame(norm_counts), file="deseq2_normalized_counts.csv") # Execute the R script Rscript run_deseq2.R -
7
Genes with false discovery rate (FDR) < 0.05 and an absolute fold-change (FC) >= 1.5 were identified as significantly different when comparing two conditions.
$ Bash example
#!/bin/bash # This script performs differential gene expression analysis using DESeq2 # based on the specified FDR and absolute Fold-Change thresholds. # Prerequisites: R and Bioconductor's DESeq2 package installed. # Example installation (uncomment to use): # # Install R if not present (e.g., on Ubuntu/Debian) # # sudo apt-get update # # sudo apt-get install r-base # # # Install BiocManager and DESeq2 in R # # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # --- Configuration --- # Replace with your actual input files and column names COUNT_MATRIX_FILE="counts.tsv" # Example: Tab-separated file with gene IDs as row names and sample counts as columns COLDATA_FILE="coldata.tsv" # Example: Tab-separated file with sample IDs as row names and metadata columns OUTPUT_FILE="significant_genes.tsv" CONDITION_COLUMN="condition" # Column name in coldata.tsv that defines the conditions to compare CONTROL_LEVEL="control" # Name of the control group in the CONDITION_COLUMN TREATMENT_LEVEL="treatment" # Name of the treatment group in the CONDITION_COLUMN FDR_THRESHOLD=0.05 ABS_FC_THRESHOLD=1.5 # Create dummy input files for demonstration if they don't exist if [ ! -f "$COUNT_MATRIX_FILE" ]; then echo -e "gene\tsample1_ctrl\tsample2_ctrl\tsample3_treat\tsample4_treat" > "$COUNT_MATRIX_FILE" echo -e "geneA\t100\t120\t250\t280" >> "$COUNT_MATRIX_FILE" echo -e "geneB\t50\t60\t40\t35" >> "$COUNT_MATRIX_FILE" echo -e "geneC\t200\t210\t205\t215" >> "$COUNT_MATRIX_FILE" echo -e "geneD\t10\t12\t30\t35" >> "$COUNT_MATRIX_FILE" echo -e "geneE\t500\t520\t400\t380" >> "$COUNT_MATRIX_FILE" echo "Created dummy $COUNT_MATRIX_FILE" fi if [ ! -f "$COLDATA_FILE" ]; then echo -e "sample\tcondition" > "$COLDATA_FILE" echo -e "sample1_ctrl\tcontrol" >> "$COLDATA_FILE" echo -e "sample2_ctrl\tcontrol" >> "$COLDATA_FILE" echo -e "sample3_treat\ttreatment" >> "$COLDATA_FILE" echo -e "sample4_treat\ttreatment" >> "$COLDATA_FILE" echo "Created dummy $COLDATA_FILE" fi # Execute R script Rscript -e " library(DESeq2) COUNT_MATRIX_FILE='${COUNT_MATRIX_FILE}' COLDATA_FILE='${COLDATA_FILE}' OUTPUT_FILE='${OUTPUT_FILE}' CONDITION_COLUMN='${CONDITION_COLUMN}' CONTROL_LEVEL='${CONTROL_LEVEL}' TREATMENT_LEVEL='${TREATMENT_LEVEL}' FDR_THRESHOLD=${FDR_THRESHOLD} ABS_FC_THRESHOLD=${ABS_FC_THRESHOLD} LOG2FC_THRESHOLD=log2(ABS_FC_THRESHOLD) count_data <- read.table(COUNT_MATRIX_FILE, header=TRUE, row.names=1, sep='\t', check.names=FALSE) count_data <- round(count_data) coldata <- read.table(COLDATA_FILE, header=TRUE, row.names=1, sep='\t', check.names=FALSE) coldata <- coldata[colnames(count_data), , drop=FALSE] coldata[[CONDITION_COLUMN]] <- factor(coldata[[CONDITION_COLUMN]], levels=c(CONTROL_LEVEL, TREATMENT_LEVEL)) dds <- DESeqDataSetFromMatrix(countData = count_data, colData = coldata, design = as.formula(paste('~', CONDITION_COLUMN))) dds <- DESeq(dds) res <- results(dds, contrast=c(CONDITION_COLUMN, TREATMENT_LEVEL, CONTROL_LEVEL)) significant_res <- subset(res, padj < FDR_THRESHOLD) significant_res <- subset(significant_res, abs(log2FoldChange) >= LOG2FC_THRESHOLD) significant_res <- significant_res[order(significant_res$padj), ] write.table(as.data.frame(significant_res), file=OUTPUT_FILE, sep='\t', quote=FALSE, row.names=TRUE) message(paste('Significant genes (FDR <', FDR_THRESHOLD, 'and |FC| >=', ABS_FC_THRESHOLD, ') saved to', OUTPUT_FILE)) " -
8
Normalized counts were log transformed, filtered, scaled, and centered prior to clustering and heatmap generation with Cluster3 (de Hoon et al.
$ Bash example
# Install Cluster 3.0 (example using apt, actual installation might vary) # sudo apt-get update # sudo apt-get install cluster3 # Input file: Tab-separated file of normalized counts (e.g., genes x samples) INPUT_FILE="input_normalized_counts.tsv" OUTPUT_PREFIX="clustered_data" # Perform log transformation, filtering, scaling, centering, and hierarchical clustering # -l: Log transform data (base 2 by default) # -sf 0.5: Filter genes with standard deviation below 0.5 (example threshold for filtering) # -g 1: Mean-center genes (centering) # -s g: Normalize genes by standard deviation (scaling) # -a 1: Mean-center arrays/samples (centering) # -s a: Normalize arrays/samples by standard deviation (scaling) # -x: Cluster genes # -y: Cluster arrays/samples # -m a: Use average linkage method for clustering # -e c: Use correlation as distance metric # -o ${OUTPUT_PREFIX}: Specify output file prefix cluster3 -l -sf 0.5 -g 1 -s g -a 1 -s a -x -y -m a -e c -o "${OUTPUT_PREFIX}" "${INPUT_FILE}" # The output files (e.g., clustered_data.cdt, clustered_data.gtr, clustered_data.atr) # can then be used with visualization tools like Java TreeView to generate heatmaps. -
9
2004), and visualized with JavaTreeView (Saldanha 2004).
JavaTreeView v2004$ Bash example
# Ensure Java Runtime Environment (JRE) is installed # For example, on Ubuntu: # sudo apt update # sudo apt install default-jre # Download JavaTreeView (replace with actual download link if available) # wget -O JavaTreeView.jar "http://jtreeview.sourceforge.net/download/JavaTreeView.jar" # Placeholder URL, check SourceForge for latest # Run JavaTreeView to visualize data. JavaTreeView typically opens a GUI. # If it supports command-line input for a specific file, it might look like: # java -jar JavaTreeView.jar your_data_file.cdt # Without specific input files mentioned, a generic launch command is provided: java -jar JavaTreeView.jar
-
10
GSEA was carried out with the GenePattern interface, https://genepattern.broadinstitute.org) using preranked lists generated from FDR values, setting gene set permutations to 1000 and using the Hallmark collection in MSigDB version5.0.
$ Bash example
# Download GSEA command-line tool (replace with actual version if known) # wget https://data.broadinstitute.org/gsea-msigdb/gsea_latest_versions/GSEA_4.3.2/gsea-cli.zip # unzip gsea-cli.zip # mv gsea-cli/gsea-cli.jar . # Download MSigDB Hallmark collection v5.0 # Note: Direct link for v5.0 Hallmark might be difficult to find. Typically, you download the full MSigDB and extract the Hallmark collection. # For demonstration, assuming 'h.all.v5.0.symbols.gmt' is available. # wget https://data.broadinstitute.org/gsea-msigdb/msigdb/release/5.0/msigdb.v5.0.symbols.gmt # (Then, you would filter this file to get only the Hallmark collection, or use a pre-filtered one if available) # Example preranked list (replace with actual input file) # echo -e "GENE1\t0.95\nGENE2\t0.88\n..." > input.rnk # Run GSEA Preranked analysis # Adjust -Xmx (memory) as needed java -Xmx4000m -cp gsea-cli.jar xtools.gsea.GseaPreranked \ -rnk input.rnk \ -gmx h.all.v5.0.symbols.gmt \ -o gsea_output \ -nperm 1000 \ -set_max 500 \ -set_min 15 \ -rnd_type no_balance \ -plot_top_x 20 \ -collapse false \ -norm meandiv
-
11
GO Term and KEGG Pathway analysis were performed with David: http://david.ncifcrf.gov/.
DAVID v6.8 (Inferred with models/gemini-2.5-flash)$ Bash example
# DAVID is a web-based tool for functional annotation and enrichment analysis. # The following conceptual steps outline its usage: # 1. Prepare your gene list (e.g., Entrez Gene IDs, Official Gene Symbols) # Example: Create a dummy gene list file # echo -e "TP53\nMYC\nJUN\nFOS\nEGR1" > gene_list.txt # 2. Access the DAVID web interface: http://david.ncifcrf.gov/ # 3. Upload your gene list (e.g., gene_list.txt) to the DAVID website. # 4. Select "Gene List" as the input type and choose the appropriate identifier. # 5. Select "Functional Annotation Chart" or "Functional Annotation Clustering". # 6. Ensure "GO Term" (Biological Process, Molecular Function, Cellular Component) # and "KEGG Pathway" categories are selected for analysis. # 7. Run the analysis. # 8. Download the results (e.g., as a tab-separated file). # The output would typically be a file containing enrichment scores, p-values, # and associated genes for each GO term and KEGG pathway. # Example output file name: david_enrichment_results.tsv # Note: Direct command-line execution for DAVID is not standard as it's a web service. # This block describes the typical workflow.
-
12
Area-proportional Venn diagrams were plotted using BioVenn (Hulsen et al.
$ Bash example
# Install biovenn Python package # pip install biovenn matplotlib # Create a Python script to generate the Venn diagram cat << 'EOF' > generate_venn.py import biovenn import matplotlib.pyplot as plt # Placeholder for actual data loading or generation from the pipeline. # In a real pipeline, these lists would be generated from previous steps # (e.g., from peak calling results, gene lists, etc.). # For demonstration, let's assume we have text files with one item per line: # with open('list1.txt', 'r') as f: # list1 = set(f.read().splitlines()) # with open('list2.txt', 'r') as f: # list2 = set(f.read().splitlines()) # with open('list3.txt', 'r') as f: # list3 = set(f.read().splitlines()) # For this example, we'll use hardcoded lists representing elements from different sets list1 = set(['element_A', 'element_B', 'element_C', 'element_D', 'element_E']) list2 = set(['element_C', 'element_D', 'element_F', 'element_G']) list3 = set(['element_D', 'element_E', 'element_H', 'element_I']) # Define labels for the sets (e.g., 'Replicate 1', 'Replicate 2', 'Replicate 3') labels = ('Set 1', 'Set 2', 'Set 3') # Output filename for the generated Venn diagram image output_filename = 'area_proportional_venn_diagram.png' # Generate the 3-set area-proportional Venn diagram. # BioVenn uses matplotlib internally to save the figure. # The 'filename' parameter saves the plot directly to a file. biovenn.venn3(list1, list2, list3, labels=labels, filename=output_filename) print(f"Area-proportional Venn diagram saved to {output_filename}") EOF # Execute the Python script to generate the Venn diagram python generate_venn.py -
13
2008).
Unknown (Inferred with models/gemini-2.5-flash) vUnknown (Inferred with models/gemini-2.5-flash)$ Bash example
No specific tool, parameters, or reference datasets can be inferred from the description '2008).'. Please provide more context about the bioinformatics step.
-
14
Enrichr: http://amp.pharm.mssm.edu/Enrichr.
$ Bash example
# Install gseapy, a Python wrapper for Enrichr API # pip install gseapy # Example: Create a dummy gene list file (replace with your actual gene list) # echo "TP53" > gene_list.txt # echo "MYC" >> gene_list.txt # echo "EGFR" >> gene_list.txt python -c " import gseapy as gp import pandas as pd import os # Define input gene list file path gene_list_file = 'gene_list.txt' # Load gene list from the file try: with open(gene_list_file, 'r') as f: gene_list = [line.strip() for line in f if line.strip()] except FileNotFoundError: print(f'Error: {gene_list_file} not found. Please ensure your gene list file exists.') exit(1) if not gene_list: print(f'Error: {gene_list_file} is empty. Please provide a list of gene symbols.') exit(1) # Define output directory outdir = 'enrichr_results' os.makedirs(outdir, exist_ok=True) # Define gene set libraries to use (e.g., 'GO_Biological_Process_2023', 'KEGG_2021_Human') # You can find available libraries on the Enrichr website or by using gp.get_library_name() gene_set_libraries = ['GO_Biological_Process_2023', 'KEGG_2021_Human'] # Placeholder libraries # Run Enrichr analysis print(f'Running Enrichr analysis for {len(gene_list)} genes using libraries: {', '.join(gene_set_libraries)}...') enr = gp.enrichr(gene_list=gene_list, gene_sets=gene_set_libraries, organism='human', # Specify organism (e.g., 'human', 'mouse', 'rat') outdir=outdir, cutoff=0.05, # p-value cutoff for significance format='png', # Output format for plots (e.g., 'png', 'pdf', 'svg') no_plot=False, # Set to True if you don't want plots verbose=True) # Save results to a CSV file if enr.res is not None and not enr.res.empty: output_csv_path = os.path.join(outdir, 'enrichr_output.csv') enr.res.to_csv(output_csv_path, index=False) print(f'Enrichment results saved to {output_csv_path}') else: print('No significant enrichment results found or an error occurred during analysis.') print(f'Enrichr plots (if generated) saved to {outdir}') "
Raw Source Text
Raw RNA sequencing reads in FASTQ files were quality-tested using FASTQC (v0.11.8) (Andrews 2010) and mapped to the mouse reference genome (mm10) with STAR (v2.5.3a) aligner with default parameters (Dobin et al. 2012). Raw or normalized gene expression levels were quantified across all the exons of RefSeq genes with analyzeRepeats.pl in HOMER (v4.11.1) (Heinz et al. 2010). Differential expression analysis was performed with raw gene counts using R package, edgeR (v3.26.7) (Robinson et al. 2010), using replicates to compute within-group dispersion and correcting for batch effects. Genes with false discovery rate (FDR) < 0.05 and an absolute fold-change (FC) >= 1.5 were identified as significantly different when comparing two conditions. Normalized counts were log transformed, filtered, scaled, and centered prior to clustering and heatmap generation with Cluster3 (de Hoon et al. 2004), and visualized with JavaTreeView (Saldanha 2004). GSEA was carried out with the GenePattern interface, https://genepattern.broadinstitute.org) using preranked lists generated from FDR values, setting gene set permutations to 1000 and using the Hallmark collection in MSigDB version5.0. GO Term and KEGG Pathway analysis were performed with David: http://david.ncifcrf.gov/. Area-proportional Venn diagrams were plotted using BioVenn (Hulsen et al. 2008). Enrichr: http://amp.pharm.mssm.edu/Enrichr. Genome_build: mm10 Supplementary_files_format_and_content: Normalized expression table generated with analyzeRepeats.pl. Please see header for specific arguments used.