GSE237099 Processing Pipeline
Publication
Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348
Dataset
GSE237099Analysis of skeletal muscle gene expression upon mouse hindlimb unloading and reloading
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra parameters --outFilterScoreMinOverLread 0 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --seedSearchStartLmax 12 --alignSJoverhangMin 8
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables for clarity GENOME_DIR="/path/to/GRCm38/STAR_index" # Placeholder for GRCm38 STAR index READS_R1="input_reads_R1.fastq.gz" # Placeholder for input forward reads READS_R2="input_reads_R2.fastq.gz" # Placeholder for input reverse reads (assuming paired-end) OUTPUT_PREFIX="aligned_reads" # Prefix for output files THREADS=8 # Number of threads to use # Run STAR alignment STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --runThreadN "${THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outFilterScoreMinOverLread 0 \ --outFilterMatchNmin 0 \ --outFilterMatchNminOverLread 0 \ --outFilterMismatchNmax 33 \ --seedSearchStartLmax 12 \ --alignSJoverhangMin 8 \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes All \ --readFilesCommand zcat -
2
Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level with extra parameters -s 0 --countReadPairs.
$ Bash example
# Install Subread package (which includes featureCounts) # conda install -c bioconda subread # Define input and output files, and reference annotation INPUT_BAM="input.bam" # Placeholder for your aligned BAM file(s) GTF_FILE="/path/to/Mus_musculus.GRCm38.102.gtf" # Placeholder for the Mus_musculus GRCm38.102 GTF file OUTPUT_COUNTS="gene_counts.txt" # Run featureCounts featureCounts -a "${GTF_FILE}" -o "${OUTPUT_COUNTS}" -s 0 --countReadPairs "${INPUT_BAM}" -
3
Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis
$ Bash example
# Install R and Bioconductor if not already present # For Ubuntu/Debian: # sudo apt-get update # sudo apt-get install r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # Create dummy input files for demonstration purposes. # In a real scenario, replace these with your actual count matrix and sample metadata. # counts_matrix.csv: A CSV file where rows are genes and columns are samples, containing raw counts. # The first column should be gene IDs. echo "gene_id,sample1,sample2,sample3,sample4" > counts_matrix.csv echo "geneA,100,120,110,130" >> counts_matrix.csv echo "geneB,50,60,55,65" >> counts_matrix.csv echo "geneC,0,0,0,0" >> counts_matrix.csv # This gene will be filtered out as its total raw count is 0 echo "geneD,200,210,190,220" >> counts_matrix.csv echo "geneE,5,0,3,2" >> counts_matrix.csv # This gene will NOT be filtered (sum = 10 > 0) # sample_metadata.csv: A CSV file where rows are samples and columns are experimental factors. # The first column should be sample names, matching the column names in counts_matrix.csv. echo "sample,condition" > sample_metadata.csv echo "sample1,control" >> sample_metadata.csv echo "sample2,treatment" >> sample_metadata.csv echo "sample3,control" >> sample_metadata.csv echo "sample4,treatment" >> sample_metadata.csv # R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R library(DESeq2) # Load count data # Assuming the first column is gene_id and subsequent columns are sample counts count_data <- read.csv("counts_matrix.csv", row.names = 1, check.names = FALSE) # Load sample metadata # Assuming the first column is sample name and subsequent columns are experimental factors coldata <- read.csv("sample_metadata.csv", row.names = 1) # Ensure sample names match and are in the same order coldata <- coldata[colnames(count_data), , drop = FALSE] stopifnot(all(rownames(coldata) == colnames(count_data))) # Filter out genes with total raw counts less than 1 across all samples (i.e., sum of counts is 0) # This implements the description: "genes with total raw counts less then 1 across all samples were excluded from analysis" keep <- rowSums(count_data) >= 1 count_data_filtered <- count_data[keep, ] # Create DESeqDataSet object # The design formula should reflect the experimental design, e.g., ~ condition dds <- DESeqDataSetFromMatrix(countData = count_data_filtered, colData = coldata, design = ~ condition) # Run DESeq2 analysis dds <- DESeq(dds) # Get results with FDR level 0.05 (alpha = 0.05) # This implements the description: "Differential gene expression at FDR level 0.05 was obtained" res <- results(dds, alpha = 0.05) # Order results by adjusted p-value res_ordered <- res[order(res$padj), ] # Save results to a CSV file write.csv(as.data.frame(res_ordered), "deseq2_results_fdr0.05.csv") message("DESeq2 analysis complete. Results saved to deseq2_results_fdr0.05.csv") EOF # Execute the R script Rscript run_deseq2.R
Raw Source Text
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra parameters --outFilterScoreMinOverLread 0 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --seedSearchStartLmax 12 --alignSJoverhangMin 8 Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level with extra parameters -s 0 --countReadPairs. Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis Assembly: GRCm38.p6 Supplementary files format and content: tab-delimited text file includes raw read counts for all samples Supplementary files format and content: tab-delimited text files includes differetial gene expression analysis output from DESeq2