GSE157050 Processing Pipeline
Publication
AMPK regulation of Raptor and TSC2 mediate metformin effects on transcriptional control of anabolism and inflammation.Genes & development (2020) — PMID 32912901
Dataset
GSE157050Ribosome profiling of metformin treatment in primary murine hepatocytes in WT, Raptor Ser-Ala mutant, Tsc2-null, Raptor mutant;Tsc2-null, and Ampk-n…
Processing Steps
Generate Jupyter Notebook-
1
First, adapter trimming was performed with cutadapt (v1.14).
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=1.14 # Example command for adapter trimming with cutadapt. # The specific adapter sequence(s) and input/output filenames were not provided in the description. # Replace 'ADAPTER_SEQUENCE' with the actual adapter sequence(s) used (e.g., Illumina universal adapter). # Replace 'input.fastq.gz' with your input FASTQ file. # Replace 'trimmed.fastq.gz' with your desired output FASTQ file. # For paired-end reads, use -A for the reverse complement adapter and -o/-p for paired outputs. cutadapt -a ADAPTER_SEQUENCE -o trimmed.fastq.gz input.fastq.gz
-
2
Next, reads were mapped with STAR (v2.4.0i) against a database of repetitive elements, with mapping reads discarded from further analysis.
STAR v2.4.0i$ Bash example
# Install STAR (example, uncomment if needed) # conda install -c bioconda star=2.4.0i # Define variables # Replace with your actual input FASTQ files INPUT_R1="input_reads_R1.fastq.gz" INPUT_R2="input_reads_R2.fastq.gz" # Set to empty string if single-end: INPUT_R2="" # Replace with the path to your STAR-indexed repetitive elements database STAR_INDEX="path/to/repetitive_elements_STAR_index" OUTPUT_DIR="star_repetitive_mapping_output" mkdir -p "${OUTPUT_DIR}" # Run STAR to map against repetitive elements. # The key here is that reads mapping to these elements are 'discarded from further analysis'. # This implies we are either interested in the unmapped reads for downstream, or simply performing # this mapping as a filtering step and the output BAM itself is not used for gene quantification. # We will output both mapped BAM and unmapped FASTQ for flexibility. # Parameters are set to be permissive for mapping to repetitive elements (e.g., allowing multiple alignments). STAR --genomeDir "${STAR_INDEX}" \ --readFilesIn "${INPUT_R1}" ${INPUT_R2:+"${INPUT_R2}"} \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_DIR}/" \ --outSAMtype BAM SortedByCoordinate \ --outReadsUnmapped Fastx \ --outFilterMultimapNmax 100 \ --outFilterMismatchNmax 999 \ --outFilterScoreMinOverLread 0 \ --outFilterMatchNminOverLread 0 \ --alignIntronMax 1 \ --alignMatesGapMax 1000000 \ --alignSJDBoverhangMin 1000000 \ --alignSJoverhangMin 1000000 \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMprimaryFlag AllBestScore \ --outSAMmapqUnique 255 # After this step, the reads in ${OUTPUT_DIR}/unmapped_to_repetitive_R1.fastq.gz (and R2 if paired-end) # would be considered for 'further analysis', while the mapped reads in ${OUTPUT_DIR}/Aligned.sortedByCoordinate.out.bam # are effectively 'discarded' for the main analysis pipeline (e.g., gene expression quantification). -
3
Remaining reads were then mapped to the mouse genome (mm10) with STAR (v2.4.0i)
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.4.0i # Define variables GENOME_DIR="/path/to/STAR_index/mm10" READS_R1="sample_R1.fastq.gz" READS_R2="sample_R2.fastq.gz" # Assuming paired-end reads; adjust if single-end OUTPUT_PREFIX="sample_aligned_" THREADS=8 # Create genome index if it doesn't exist (this is a one-time step per genome version) # STAR --runThreadN ${THREADS} --runMode genomeGenerate --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/mm10.fa --sjdbGTFfile /path/to/mm10.gtf # Map reads to the mouse genome (mm10) STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READS_R1} ${READS_R2} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 10 \ --alignIntronMin 20 --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --quantMode GeneCounts # Optional: output gene counts -
4
Stranded reads mapping to GENCODE (vM20) genes were quantified with featureCounts (subread-1.6.4).
featureCounts v1.6.4$ Bash example
# Install featureCounts (part of subread package) # conda install -c bioconda subread=1.6.4 # Download GENCODE vM20 GTF annotation file # wget -O gencode.vM20.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.annotation.gtf.gz # gunzip gencode.vM20.annotation.gtf.gz # Run featureCounts to quantify stranded reads against GENCODE vM20 genes # -p: Assume paired-end reads (common for RNA-seq, adjust if single-end) # -s 2: Assume reverse stranded library (e.g., Illumina TruSeq stranded, adjust to 1 for forward or 0 for unstranded if different) # -a: Specify the GTF annotation file # -o: Specify the output counts file featureCounts -p -s 2 -a gencode.vM20.annotation.gtf -o gene_counts.txt input.bam
-
5
Mitochondrial genes were discarded, and only genes with at least 0.5 reads per million across all datasets were kept for further analysis.
$ Bash example
# This script assumes 'gene_counts.tsv' is a tab-separated file with gene IDs as row names and sample IDs as column names. # It also assumes 'gene_annotations.tsv' is a tab-separated file with gene IDs and chromosome information. # The output will be 'filtered_gene_counts.tsv'. # Example: Create dummy input files for demonstration purposes (uncomment to run with dummy data) # echo -e "gene_id\tsample1\tsample2\tsample3" > gene_counts.tsv # echo -e "gene1\t1000\t1200\t1100" >> gene_counts.tsv # echo -e "gene2\t50\t60\t55" >> gene_counts.tsv # echo -e "gene_mito1\t10000\t11000\t10500" >> gene_counts.tsv # echo -e "gene3\t0\t1\t0" >> gene_counts.tsv # echo -e "gene4\t50\t60\t70" >> gene_counts.tsv # echo -e "gene_mito2\t20000\t21000\t20500" >> gene_counts.tsv # echo -e "gene5\t100\t120\t110" >> gene_counts.tsv # echo -e "gene6\t0\t0\t0" >> gene_counts.tsv # echo -e "gene_id\tchromosome" > gene_annotations.tsv # echo -e "gene1\tchr1" >> gene_annotations.tsv # echo -e "gene2\tchr2" >> gene_annotations.tsv # echo -e "gene_mito1\tchrM" >> gene_annotations.tsv # echo -e "gene3\tchr3" >> gene_annotations.tsv # echo -e "gene4\tchr4" >> gene_annotations.tsv # echo -e "gene_mito2\tMT" >> gene_annotations.tsv # echo -e "gene5\tchr5" >> gene_annotations.tsv # echo -e "gene6\tchr6" >> gene_annotations.tsv # R script for filtering gene expression data cat << 'EOF' > filter_expression.R # Load necessary packages (if not installed, uncomment and run BiocManager::install() or install.packages()) # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("data.table") # For faster reading/writing if needed # Read gene counts matrix # Assumes gene IDs are in the first column and sample counts in subsequent columns counts_df <- read.delim("gene_counts.tsv", row.names = 1, check.names = FALSE) # Read gene annotations # Assumes gene IDs are in the first column and chromosome information in a 'chromosome' column gene_annotations <- read.delim("gene_annotations.tsv", row.names = 1, check.names = FALSE) # Ensure annotations match counts matrix row names gene_annotations <- gene_annotations[rownames(counts_df), , drop = FALSE] # 1. Discard mitochondrial genes # Identify mitochondrial genes (case-insensitive for 'chrM' or 'MT' in the chromosome column) mito_gene_ids <- rownames(gene_annotations)[grepl("chrM|MT", gene_annotations$chromosome, ignore.case = TRUE)] message(paste("Identified", length(mito_gene_ids), "mitochondrial genes.")) # Filter out mitochondrial genes from the counts matrix counts_filtered_mito <- counts_df[!rownames(counts_df) %in% mito_gene_ids, , drop = FALSE] message(paste("Genes after discarding mitochondrial:", nrow(counts_filtered_mito))) # Check if any genes remain after mitochondrial filtering if (nrow(counts_filtered_mito) == 0) { warning("No genes remaining after mitochondrial filtering. Outputting an empty file.") write.table(data.frame(), "filtered_gene_counts.tsv", sep = "\t", quote = FALSE, col.names = NA) quit(save = "no") } # 2. Calculate Reads Per Million (RPM) for remaining genes # Calculate total reads per sample from the filtered counts total_reads_per_sample <- colSums(counts_filtered_mito) # Handle samples with zero total reads to avoid division by zero (set to 1 to make RPM 0 for all genes in that sample) total_reads_per_sample[total_reads_per_sample == 0] <- 1 # Calculate RPM matrix rpm_matrix <- sweep(counts_filtered_mito, 2, total_reads_per_sample, FUN = "/") * 1e6 # 3. Keep only genes with at least 0.5 RPM across ALL datasets # Apply 'all' function to check if RPM is >= 0.5 for every sample for each gene genes_to_keep_idx <- apply(rpm_matrix, 1, function(row) all(row >= 0.5)) final_counts_matrix <- counts_filtered_mito[genes_to_keep_idx, , drop = FALSE] message(paste("Genes after RPM filtering:", nrow(final_counts_matrix))) # Save the final filtered counts matrix write.table(final_counts_matrix, "filtered_gene_counts.tsv", sep = "\t", quote = FALSE, col.names = NA) EOF # Execute the R script Rscript filter_expression.R -
6
Differential expression in polysome versus input was quantified with edgeR (v3.26.8) with dispersion = 0.06 estimated from the subset of experiments with multiple replicates.
$ Bash example
# Install edgeR (if not already installed) # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("edgeR")' # conda install -c bioconda bioconductor-edger # Create a dummy count matrix for demonstration purposes. # In a real scenario, this 'counts.tsv' would be generated by a quantification tool # like featureCounts, HTSeq, or Salmon/Kallisto followed by tximport. cat << 'EOF' > counts.tsv Gene Polysome_Rep1 Polysome_Rep2 Input_Rep1 Input_Rep2 GeneA 100 120 50 60 GeneB 20 25 80 90 GeneC 500 550 480 520 GeneD 10 12 5 6 GeneE 300 310 320 330 EOF # R script for differential expression analysis with edgeR cat << 'EOF' > run_edger.R library(edgeR) # Load count data # Assuming counts.tsv has genes as rows and samples as columns, with the first column as gene names counts_data <- read.delim("counts.tsv", row.names = 1) # Define experimental groups # Adjust sample names and groups based on your actual data structure samples <- colnames(counts_data) group <- factor(c("Polysome", "Polysome", "Input", "Input")) # Assuming 2 polysome and 2 input samples # Create a DGEList object dge <- DGEList(counts=counts_data, group=group) # Filter out lowly expressed genes (optional but highly recommended) # This removes genes that are unlikely to be differentially expressed and improves statistical power. keep <- filterByExpr(dge) dge <- dge[keep,,keep.lib.sizes=FALSE] # Normalize library sizes using TMM method dge <- calcNormFactors(dge) # Estimate dispersion # The description states "dispersion = 0.06 estimated from the subset of experiments with multiple replicates." # This implies that the standard edgeR dispersion estimation process was performed, and 0.06 was the resulting # common dispersion value (or a value derived from it). The estimateDisp function performs this estimation. # If a fixed dispersion value was to be explicitly used, it could be set as: dge$common.dispersion <- 0.06 dge <- estimateDisp(dge) # Print estimated common dispersion (for verification) cat("Estimated common dispersion: ", dge$common.dispersion, "\n") # Fit a GLM (Generalized Linear Model) design <- model.matrix(~group) fit <- glmFit(dge, design) # Perform likelihood ratio test for differential expression (Polysome vs Input) # 'coef=2' typically corresponds to the second coefficient in the design matrix, # which represents the log-fold-change for the 'Polysome' group relative to the reference 'Input' group. lrt <- glmLRT(fit, coef=2) # Get top differentially expressed genes results <- topTags(lrt, n=Inf)$table # Save results to a CSV file write.csv(results, "differential_expression_results.csv", row.names = TRUE) cat("Differential expression analysis complete. Results saved to differential_expression_results.csv\n") EOF # Execute the R script Rscript run_edger.R
Tools Used
Raw Source Text
First, adapter trimming was performed with cutadapt (v1.14). Next, reads were mapped with STAR (v2.4.0i) against a database of repetitive elements, with mapping reads discarded from further analysis. Remaining reads were then mapped to the mouse genome (mm10) with STAR (v2.4.0i) Stranded reads mapping to GENCODE (vM20) genes were quantified with featureCounts (subread-1.6.4). Mitochondrial genes were discarded, and only genes with at least 0.5 reads per million across all datasets were kept for further analysis. Differential expression in polysome versus input was quantified with edgeR (v3.26.8) with dispersion = 0.06 estimated from the subset of experiments with multiple replicates. Genome_build: mm10 Supplementary_files_format_and_content: EdgeR output files (*.txt.gz*) contain gene identifier, log(fold-change in polysome versus input), log(counts per million), and p-value (polysome versus input)