GSE157049 Processing Pipeline
Publication
AMPK regulation of Raptor and TSC2 mediate metformin effects on transcriptional control of anabolism and inflammation.Genes & development (2020) — PMID 32912901
Dataset
GSE157049RNA-seq of metformin treatment in liver in WT, Raptor Ser-Ala mutant, Tsc2-null, Raptor mutant;Tsc2-null, and Ampk-null
Processing Steps
Generate Jupyter Notebook-
1
Raw RNA sequencing reads in FASTQ files were quality-tested using FASTQC (v0.11.8) (Andrews 2010) and mapped to the mouse reference genome (mm10) with STAR (v2.5.3a) aligner with default parameters (Dobin et al.
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star=2.5.3a # Placeholder for input FASTQ file(s) # input_reads.fastq (for single-end reads) # input_reads_R1.fastq input_reads_R2.fastq (for paired-end reads) # Placeholder for STAR genome index directory for mm10 # Ensure the mm10 genome index is pre-built using STAR --runMode genomeGenerate ... # Example: /path/to/STAR_genome_index/mm10 # Align RNA-seq reads to the mouse reference genome (mm10) using STAR with default parameters # This command assumes single-end reads and outputs a sorted BAM file. # Adjust --readFilesIn for paired-end reads (e.g., --readFilesIn input_reads_R1.fastq input_reads_R2.fastq) STAR --genomeDir /path/to/STAR_genome_index/mm10 \ --readFilesIn input_reads.fastq \ --outFileNamePrefix aligned_reads_ \ --runThreadN 8 \ --outSAMtype BAM SortedByCoordinate -
2
2012).
Not specified (Inferred with models/gemini-2.5-flash) vNot specified$ Bash example
# No specific tool or command can be inferred from the description '2012).' # Please provide more context or a tool name to generate a relevant command.
-
3
Raw or normalized gene expression levels were quantified across all the exons of RefSeq genes with analyzeRepeats.pl in HOMER (v4.11.1) (Heinz et al.
$ Bash example
# Install HOMER (if not already installed) # For example, using conda: # conda install -c bioconda homer=4.11.1 # Or manual installation: # wget http://homer.ucsd.edu/homer/configureHomer.pl # perl configureHomer.pl -install # perl configureHomer.pl -install hg38 # Install human genome (replace hg38 with your desired genome, e.g., mm10) # Quantify raw gene expression levels across RefSeq exons using analyzeRepeats.pl # -gene: Quantify expression across RefSeq genes (exons). # -raw: Output raw read counts. For RPKM normalized values, use -rpkm instead of -raw. # -count: Specify input BAM file(s) or a HOMER tag directory. # Replace 'hg38' with the genome you have installed in HOMER (e.g., hg38, mm10). # Replace 'gene_expression.txt' with your desired output file name. # Replace 'input.bam' with your actual aligned BAM file(s) (e.g., from STAR or HISAT2 alignment). # If you have multiple BAM files, list them separated by spaces (e.g., input1.bam input2.bam). # Alternatively, you can provide a HOMER tag directory created with makeTagDirectory. analyzeRepeats.pl hg38 gene_expression.txt -gene -raw -count input.bam
-
4
2010).
No specific tool inferred (Inferred with models/gemini-2.5-flash)$ Bash example
# The description '2010)' is insufficient to infer a specific bioinformatics tool, version, or command. # Please provide a more detailed step description.
-
5
Differential expression analysis was performed with raw gene counts using R package, edgeR (v3.26.7) (Robinson et al.
$ Bash example
# Install R if not already installed (example for Debian/Ubuntu) # sudo apt-get update && sudo apt-get install -y r-base # # Install Bioconductor and edgeR package # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("edgeR")' # Example raw gene counts file (replace with actual path) # counts.tsv should have genes as rows and samples as columns, with the first column being gene IDs. # Example sample_info.csv should have sample IDs and a 'group' column for differential expression. # # Example counts.tsv: # GeneID\tSample1\tSample2\tSample3\tSample4 # GeneA\t100\t120\t50\t60 # GeneB\t20\t25\t100\t110 # ... # # Example sample_info.csv: # SampleID,Group # Sample1,Control # Sample2,Control # Sample3,Treated # Sample4,Treated Rscript -e ' library(edgeR) # --- Configuration --- counts_file <- "counts.tsv" # Path to raw gene counts file sample_info_file <- "sample_info.csv" # Path to sample information file output_file <- "de_results.csv" # Output file for differential expression results group_column <- "Group" # Column in sample_info_file indicating experimental groups control_group <- "Control" # Name of the control group treated_group <- "Treated" # Name of the treated group # --- End Configuration --- # Load data counts_data <- read.delim(counts_file, row.names = 1, check.names = FALSE) sample_info <- read.csv(sample_info_file, row.names = 1) # Ensure sample order matches between counts and sample info sample_info <- sample_info[colnames(counts_data), , drop = FALSE] # Create DGEList object dge <- DGEList(counts = counts_data, group = factor(sample_info[[group_column]])) # Filter out lowly expressed genes (optional but recommended) # Keeping genes with at least 10 counts per million (CPM) in at least 2 samples keep <- filterByExpr(dge) dge <- dge[keep, , keep.lib.sizes = FALSE] # Normalize library sizes (TMM normalization is default) dge <- calcNormFactors(dge) # Create design matrix design <- model.matrix(~0 + group, data = dge$samples) colnames(design) <- levels(dge$samples$group) # Estimate dispersions dge <- estimateDisp(dge, design) # Fit GLM fit <- glmFit(dge, design) # Define contrast for differential expression (e.g., Treated vs Control) contrast_matrix <- makeContrasts( paste0(treated_group, "-", control_group), levels = design ) # Perform likelihood ratio test lrt <- glmLRT(fit, contrast = contrast_matrix) # Get differential expression results results <- topTags(lrt, n = Inf, sort.by = "PValue")$table # Add adjusted p-value (FDR) if not already present if (!"FDR" %in% colnames(results)) { results$FDR <- p.adjust(results$PValue, method = "BH") } # Write results to file write.csv(results, file = output_file, row.names = TRUE) cat(paste0("Differential expression analysis completed. Results saved to ", output_file, "\n")) ' -
6
2010), using replicates to compute within-group dispersion and correcting for batch effects.
$ Bash example
# Install R and Bioconductor if not already present # sudo apt-get update # sudo apt-get install r-base # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("DESeq2")' # DESeq2 is the modern successor to DESeq (2010) and implements these concepts. # Placeholder for count data and sample information # Replace 'counts.tsv' with your actual count matrix (rows: genes, columns: samples) # Replace 'sample_info.tsv' with your actual sample metadata (columns: sample_id, condition, batch, etc.) # Example: # counts.tsv: # gene_id sample1 sample2 sample3 sample4 # geneA 100 120 90 110 # geneB 50 60 45 55 # ... # # sample_info.tsv: # sample_id condition batch # sample1 control batch1 # sample2 treated batch1 # sample3 control batch2 # sample4 treated batch2 Rscript -e ' library(DESeq2) # --- Placeholder: Load your actual count data and sample information --- # For demonstration, creating dummy data set.seed(123) num_genes <- 100 num_samples <- 6 counts <- matrix(sample(1:1000, num_genes * num_samples, replace = TRUE), ncol = num_samples) colnames(counts) <- paste0("sample", 1:num_samples) rownames(counts) <- paste0("gene", 1:num_genes) sample_info <- data.frame( sample_id = colnames(counts), condition = factor(rep(c("control", "treated"), num_samples / 2)), batch = factor(c("batch1", "batch1", "batch2", "batch2", "batch1", "batch2")) # Example batch effect ) rownames(sample_info) <- sample_info$sample_id # Ensure sample order matches sample_info <- sample_info[colnames(counts), ] # -------------------------------------------------------------------- # Create DESeqDataSet object # The design formula specifies factors to consider. # Here, "batch" is included to correct for batch effects. # "condition" is included as the primary biological variable. # DESeq2 internally uses replicates to estimate within-group dispersion. dds <- DESeqDataSetFromMatrix(countData = counts, colData = sample_info, design = ~ batch + condition) # Run the DESeq pipeline to estimate size factors, dispersions, and fit the model. # This step computes within-group dispersion and incorporates the design formula # to account for batch effects. dds <- DESeq(dds) # The estimated dispersions can be accessed and plotted: # plotDispEsts(dds) message("DESeq2 analysis completed, including dispersion estimation and batch effect correction.") message("The dds object now contains normalized counts, dispersion estimates, and model fits.") ' -
7
Genes with false discovery rate (FDR) < 0.05 and an absolute fold-change (FC) >= 1.5 were identified as significantly different when comparing two conditions.
$ Bash example
# Install R and DESeq2 if not already installed # sudo apt-get update && sudo apt-get install -y r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # Create dummy input files for demonstration purposes. # In a real pipeline, 'gene_counts.tsv' would be the output from a quantification tool (e.g., featureCounts, Salmon, Kallisto) # and 'sample_info.tsv' would be a metadata file describing your experimental design. # Dummy gene count matrix (rows: genes, columns: samples) echo -e "gene\tsampleA1\tsampleA2\tsampleB1\tsampleB2" > gene_counts.tsv echo -e "gene1\t100\t120\t200\t210" >> gene_counts.tsv echo -e "gene2\t50\t60\t30\t35" >> gene_counts.tsv echo -e "gene3\t10\t12\t15\t18" >> gene_counts.tsv echo -e "gene4\t500\t550\t200\t220" >> gene_counts.tsv echo -e "gene5\t20\t25\t40\t45" >> gene_counts.tsv echo -e "gene6\t300\t310\t305\t315" >> gene_counts.tsv echo -e "gene7\t80\t85\t150\t160" >> gene_counts.tsv # Dummy sample metadata (rows: samples, columns: experimental factors) echo -e "sample\tcondition" > sample_info.tsv echo -e "sampleA1\tA" >> sample_info.tsv echo -e "sampleA2\tA" >> sample_info.tsv echo -e "sampleB1\tB" >> sample_info.tsv echo -e "sampleB2\tB" >> sample_info.tsv # Execute R script for DESeq2 analysis Rscript -e ' library(DESeq2) # Define input and output file paths count_matrix_file <- "gene_counts.tsv" sample_metadata_file <- "sample_info.tsv" output_file <- "significant_genes.tsv" # Read count data and sample metadata count_data <- read.table(count_matrix_file, header=TRUE, row.names=1, sep="\t") sample_info <- read.table(sample_metadata_file, header=TRUE, row.names=1, sep="\t") # Ensure sample order in count_data matches sample_info count_data <- count_data[, rownames(sample_info)] # DESeq2 expects integer counts, so round if necessary (e.g., from Salmon/Kallisto) dds <- DESeqDataSetFromMatrix(countData = round(count_data), colData = sample_info, design = ~ condition) # Assuming "condition" is the primary experimental variable # Run DESeq2 differential expression analysis dds <- DESeq(dds) # Extract results. The description specifies FDR < 0.05, which corresponds to alpha=0.05. res <- results(dds, alpha=0.05) # Filter for significant genes based on the criteria: # 1. False Discovery Rate (FDR) < 0.05 (padj < 0.05) # 2. Absolute Fold-Change (FC) >= 1.5 (abs(log2FoldChange) >= log2(1.5)) significant_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) >= log2(1.5)) # Order results by adjusted p-value for easier interpretation significant_genes <- significant_genes[order(significant_genes$padj), ] # Save the table of significantly different genes write.table(as.data.frame(significant_genes), file=output_file, sep="\t", quote=FALSE, row.names=TRUE) message(paste("Identified", nrow(significant_genes), "significantly different genes. Results saved to", output_file)) ' -
8
Normalized counts were log transformed, filtered, scaled, and centered prior to clustering and heatmap generation with Cluster3 (de Hoon et al.
Cluster3 v3.0$ Bash example
# Install C Clustering Library (which provides the 'cluster' executable, the core algorithm for Cluster 3.0) # For Debian/Ubuntu: # sudo apt-get update # sudo apt-get install c-cluster-programs # Input file: normalized_counts.tsv (assuming a tab-separated matrix with gene IDs as first column and sample IDs as header) INPUT_FILE="normalized_counts.tsv" OUTPUT_PREFIX="clustered_data" # Perform log transformation, filtering, scaling, centering, clustering, and generate files for heatmap visualization. # The 'cluster' executable from the C Clustering Library directly supports these operations via command-line parameters. # -l: Log transform data (e.g., log2 transformation) # -g: Normalize genes (adjusts genes to have mean 0 and variance 1, effectively scaling and centering gene-wise) # -a: Normalize arrays (adjusts arrays/samples to have mean 0 and variance 1, effectively scaling and centering sample-wise) # -f 0 0: No filtering (adjust parameters if specific filtering criteria are needed, e.g., -f 1 10 for 10% missing values) # -m a: Hierarchical clustering (agglomerative method) # -s c: Similarity measure: Pearson correlation (a common choice for gene expression data) # -u: Output gene tree (.gtr file) # -e: Output array tree (.atr file) # -o: Output prefix for the generated .cdt, .gtr, and .atr files cluster -l -g -a -f 0 0 -m a -s c -u -e -o "${OUTPUT_PREFIX}" "${INPUT_FILE}" # Heatmap visualization is typically performed using a companion viewer like Java TreeView or similar, # which reads the .cdt, .gtr, and .atr files generated by the 'cluster' command. # For example, to open the generated data with Java TreeView (if installed and associated with .cdt files): # java -jar javatreeview.jar "${OUTPUT_PREFIX}.cdt" -
9
2004), and visualized with JavaTreeView (Saldanha 2004).
JavaTreeView v2004$ Bash example
No specific command-line execution details are provided in the description for JavaTreeView, as it is primarily a graphical visualization tool.
-
10
GSEA was carried out with the GenePattern interface, https://genepattern.broadinstitute.org) using preranked lists generated from FDR values, setting gene set permutations to 1000 and using the Hallmark collection in MSigDB version5.0.
$ Bash example
# Download GSEA software (e.g., GSEA_2.2.4.jar for command-line execution) # The GSEA software can be downloaded from the Broad Institute's GSEA website. # For example, assuming GSEA_2.2.4.jar is downloaded and placed in a 'gsea_home' directory: # mkdir -p gsea_home # # Place GSEA_2.2.4.jar here # export GSEA_HOME=$(pwd)/gsea_home # Download MSigDB Hallmark collection v5.0 # mkdir -p msigdb_v5.0 # wget -O msigdb_v5.0/h.all.v5.0.symbols.gmt https://data.broadinstitute.org/gsea-msigdb/msigdb/release/5.0/h.all.v5.0.symbols.gmt # Prepare the preranked list file (e.g., 'input_preranked.rnk') # This file should contain two columns: Gene Symbol and Rank Value (e.g., FDR values or a derived score). # Example content of input_preranked.rnk: # GENE_SYMBOL RANK_VALUE # TP53 0.95 # MYC 0.88 # ... # Run GSEA with the preranked list, Hallmark gene sets, and specified permutations # Note: The GenePattern interface provides a GUI for GSEA. This command reflects the underlying command-line execution. # The '-Xmx' parameter sets the maximum Java heap size, adjust as needed. java -Xmx4000m -cp "$GSEA_HOME/gsea-cli.jar" xtools.gsea.GseaPreranked \ -rnk input_preranked.rnk \ -gmx msigdb_v5.0/h.all.v5.0.symbols.gmt \ -out gsea_output_preranked \ -perm 1000 \ -plot_top_x 20 \ -collapse false \ -mode gene_set \ -norm meandiv \ -rnd_seed timestamp \ -set_max 500 \ -set_min 15 -
11
GO Term and KEGG Pathway analysis were performed with David: http://david.ncifcrf.gov/.
DAVID v6.8$ Bash example
# Create a placeholder gene list file (e.g., Entrez Gene IDs) echo "10000" > my_gene_list.txt echo "10001" >> my_gene_list.txt echo "10002" >> my_gene_list.txt echo "10003" >> my_gene_list.txt echo "10004" >> my_gene_list.txt # Install the suds-py3 library for SOAP client interaction if not already installed # pip install suds-py3 # Python script to interact with DAVID Web Service API # This script performs GO Term and KEGG Pathway analysis. python3 - <<EOF from suds.client import Client import os # DAVID Web Service URL DAVID_WS_URL = "https://david.ncifcrf.gov/webservice/services/DAVIDWebService?wsdl" # Your DAVID email (required for some API calls, though often optional for basic queries) # Replace with your actual email if needed for authentication # EMAIL = "your_email@example.com" # Input gene list file GENE_LIST_FILE = "my_gene_list.txt" LIST_NAME = "My_Experiment_Genes" LIST_TYPE = "ENTREZ_GENE_ID" # Or "GENE_SYMBOL", "ENSEMBL_GENE_ID", etc. # Annotation categories for enrichment analysis ANNOTATION_CATEGORIES = ["GOTERM_BP_ALL", "KEGG_PATHWAY"] # GO Biological Process and KEGG Pathways # Output file names GO_KEGG_CHART_FILE = "david_go_kegg_chart.txt" GO_KEGG_TABLE_FILE = "david_go_kegg_table.txt" try: # Initialize SOAP client client = Client(DAVID_WS_URL) # Login (optional, depending on DAVID server configuration and specific API calls) # if EMAIL: # client.service.authenticate(EMAIL) # print(f"Authenticated with DAVID Web Service (email: {EMAIL}).") # Read gene list from file with open(GENE_LIST_FILE, 'r') as f: gene_list_content = f.read().strip() # Add gene list to DAVID print(f"Adding gene list from {GENE_LIST_FILE} to DAVID...") # The '0' indicates it's a gene list, '1' would be for a background list client.service.addList(gene_list_content, LIST_TYPE, LIST_NAME, 0) print(f"Gene list '{LIST_NAME}' added successfully.") # Set annotation categories print(f"Setting annotation categories: {', '.join(ANNOTATION_CATEGORIES)}...") client.service.setCategories(ANNOTATION_CATEGORIES) print("Categories set.") # Get Functional Annotation Chart (summary of enrichment) print(f"Retrieving Functional Annotation Chart and saving to {GO_KEGG_CHART_FILE}...") chart_results = client.service.getFunctionalAnnotationChart() with open(GO_KEGG_CHART_FILE, 'w') as f: # Write header f.write("Term\tCount\t%\tPValue\tBenjamini\tFDR\tGenes\tListTotal\tPopHits\tPopTotal\tFoldEnrichment\tBonferroni\n") # Write results for row in chart_results: f.write(f"{row.Term}\t{row.Count}\t{row.Percentage}\t{row.PValue}\t{row.Benjamini}\t{row.FDR}\t{row.Genes}\t{row.ListHits}\t{row.PopHits}\t{row.PopTotal}\t{row.FoldEnrichment}\t{row.Bonferroni}\n") print("Functional Annotation Chart saved.") # Get Functional Annotation Table (detailed gene-term mapping) print(f"Retrieving Functional Annotation Table and saving to {GO_KEGG_TABLE_FILE}...") table_results = client.service.getFunctionalAnnotationTable() with open(GO_KEGG_TABLE_FILE, 'w') as f: # Write header f.write("GeneID\tGeneName\tTerm\tCategory\tPValue\tBenjamini\tFDR\n") # Write results for row in table_results: f.write(f"{row.GeneID}\t{row.GeneName}\t{row.Term}\t{row.Category}\t{row.PValue}\t{row.Benjamini}\t{row.FDR}\n") print("Functional Annotation Table saved.") except Exception as e: print(f"An error occurred: {e}") print("Please ensure 'suds-py3' is installed (`pip install suds-py3`) and your gene list file is correctly formatted.") EOF -
12
Area-proportional Venn diagrams were plotted using BioVenn (Hulsen et al.
BioVenn v1.1.0 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install biovenn Python package and matplotlib for plotting # pip install biovenn matplotlib # Example usage: # Assuming set1.txt, set2.txt, set3.txt contain lists of items (one per line) # and venn_diagram.png is the desired output file. # Create dummy input files for demonstration if needed: # echo -e "geneA\ngeneB\ngeneC\ngeneD" > set1.txt # echo -e "geneC\ngeneD\ngeneE\ngeneF" > set2.txt # echo -e "geneD\ngeneF\ngeneG\ngeneH" > set3.txt python -c ' import biovenn as bv import matplotlib.pyplot as plt import sys import os def read_list_from_file(filepath): with open(filepath, "r") as f: return [line.strip() for line in f if line.strip()] if __name__ == "__main__": # Placeholder for actual input files and output image. # In a real pipeline, these would be passed as arguments or configured. list_files = ["set1.txt", "set2.txt", "set3.txt"] output_image = "venn_diagram.png" # Check if input files exist for f in list_files: if not os.path.exists(f): print(f"Error: Input file not found: {f}", file=sys.stderr) sys.exit(1) lists_data = [read_list_from_file(f) for f in list_files] names = [f"Set {i+1}" for i in range(len(lists_data))] # BioVenn Python package supports 2, 3, or 4 sets for area-proportional diagrams if len(lists_data) == 2: venn = bv.venn2(lists_data[0], lists_data[1], names=names) elif len(lists_data) == 3: venn = bv.venn3(lists_data[0], lists_data[1], lists_data[2], names=names) elif len(lists_data) == 4: venn = bv.venn4(lists_data[0], lists_data[1], lists_data[2], lists_data[3], names=names) else: print("Error: BioVenn Python package currently supports 2, 3, or 4 sets for area-proportional diagrams.", file=sys.stderr) sys.exit(1) plt.title("Area-Proportional Venn Diagram") plt.savefig(output_image, dpi=300, bbox_inches="tight") plt.close() ' -
13
2008).
Unknown (Inferred with models/gemini-2.5-flash) vUnknown -
14
Enrichr: http://amp.pharm.mssm.edu/Enrichr.
$ Bash example
# Enrichr is a web-based tool for gene set enrichment analysis. # In automated pipelines, it's typically accessed via its API using a client library (e.g., in Python or R). # This step assumes a gene list (e.g., differentially expressed genes) has been generated upstream. # Example: Using a Python script (e.g., 'run_enrichr.py') that leverages the gseapy library # to interact with the Enrichr API. # The 'run_enrichr.py' script would take a gene list file and an output prefix as arguments. # Installation of gseapy (if using Python API client): # pip install gseapy # Example execution command: # Replace 'your_gene_list.txt' with the path to your gene list file (one gene symbol per line). # Replace 'enrichr_results' with your desired output file prefix. # The 'run_enrichr.py' script would need to be created separately to handle the API calls. python run_enrichr.py your_gene_list.txt enrichr_results # Enrichr uses its own internal databases of gene sets (e.g., GO, KEGG, WikiPathways, etc.). # The specific gene set libraries and organism (e.g., 'human', 'mouse') would be configured # within the 'run_enrichr.py' script based on the experimental context.
Raw Source Text
Raw RNA sequencing reads in FASTQ files were quality-tested using FASTQC (v0.11.8) (Andrews 2010) and mapped to the mouse reference genome (mm10) with STAR (v2.5.3a) aligner with default parameters (Dobin et al. 2012). Raw or normalized gene expression levels were quantified across all the exons of RefSeq genes with analyzeRepeats.pl in HOMER (v4.11.1) (Heinz et al. 2010). Differential expression analysis was performed with raw gene counts using R package, edgeR (v3.26.7) (Robinson et al. 2010), using replicates to compute within-group dispersion and correcting for batch effects. Genes with false discovery rate (FDR) < 0.05 and an absolute fold-change (FC) >= 1.5 were identified as significantly different when comparing two conditions. Normalized counts were log transformed, filtered, scaled, and centered prior to clustering and heatmap generation with Cluster3 (de Hoon et al. 2004), and visualized with JavaTreeView (Saldanha 2004). GSEA was carried out with the GenePattern interface, https://genepattern.broadinstitute.org) using preranked lists generated from FDR values, setting gene set permutations to 1000 and using the Hallmark collection in MSigDB version5.0. GO Term and KEGG Pathway analysis were performed with David: http://david.ncifcrf.gov/. Area-proportional Venn diagrams were plotted using BioVenn (Hulsen et al. 2008). Enrichr: http://amp.pharm.mssm.edu/Enrichr. Genome_build: mm10 Supplementary_files_format_and_content: Normalized expression table generated with analyzeRepeats.pl. Please see header for specific arguments used.