GSE237099 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

GSE237099

Analysis of skeletal muscle gene expression upon mouse hindlimb unloading and reloading

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra parameters --outFilterScoreMinOverLread 0 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --seedSearchStartLmax 12 --alignSJoverhangMin 8

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables for clarity
    GENOME_DIR="/path/to/GRCm38/STAR_index" # Placeholder for GRCm38 STAR index
    READS_R1="input_reads_R1.fastq.gz" # Placeholder for input forward reads
    READS_R2="input_reads_R2.fastq.gz" # Placeholder for input reverse reads (assuming paired-end)
    OUTPUT_PREFIX="aligned_reads" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Run STAR alignment
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_R1}" "${READS_R2}" \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outFilterScoreMinOverLread 0 \
         --outFilterMatchNmin 0 \
         --outFilterMatchNminOverLread 0 \
         --outFilterMismatchNmax 33 \
         --seedSearchStartLmax 12 \
         --alignSJoverhangMin 8 \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outSAMattributes All \
         --readFilesCommand zcat
  2. 2

    Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level with extra parameters -s 0 --countReadPairs.

    featureCounts v2.0.3 GitHub
    $ Bash example
    # Install Subread package (which includes featureCounts)
    # conda install -c bioconda subread
    
    # Define input and output files, and reference annotation
    INPUT_BAM="input.bam" # Placeholder for your aligned BAM file(s)
    GTF_FILE="/path/to/Mus_musculus.GRCm38.102.gtf" # Placeholder for the Mus_musculus GRCm38.102 GTF file
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Run featureCounts
    featureCounts -a "${GTF_FILE}" -o "${OUTPUT_COUNTS}" -s 0 --countReadPairs "${INPUT_BAM}"
    
  3. 3

    Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis

    DESeq2 v1.40.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R and Bioconductor if not already present
    # For Ubuntu/Debian:
    # sudo apt-get update
    # sudo apt-get install r-base
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"
    
    # Create dummy input files for demonstration purposes.
    # In a real scenario, replace these with your actual count matrix and sample metadata.
    
    # counts_matrix.csv: A CSV file where rows are genes and columns are samples, containing raw counts.
    # The first column should be gene IDs.
    echo "gene_id,sample1,sample2,sample3,sample4" > counts_matrix.csv
    echo "geneA,100,120,110,130" >> counts_matrix.csv
    echo "geneB,50,60,55,65" >> counts_matrix.csv
    echo "geneC,0,0,0,0" >> counts_matrix.csv # This gene will be filtered out as its total raw count is 0
    echo "geneD,200,210,190,220" >> counts_matrix.csv
    echo "geneE,5,0,3,2" >> counts_matrix.csv # This gene will NOT be filtered (sum = 10 > 0)
    
    # sample_metadata.csv: A CSV file where rows are samples and columns are experimental factors.
    # The first column should be sample names, matching the column names in counts_matrix.csv.
    echo "sample,condition" > sample_metadata.csv
    echo "sample1,control" >> sample_metadata.csv
    echo "sample2,treatment" >> sample_metadata.csv
    echo "sample3,control" >> sample_metadata.csv
    echo "sample4,treatment" >> sample_metadata.csv
    
    # R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    library(DESeq2)
    
    # Load count data
    # Assuming the first column is gene_id and subsequent columns are sample counts
    count_data <- read.csv("counts_matrix.csv", row.names = 1, check.names = FALSE)
    
    # Load sample metadata
    # Assuming the first column is sample name and subsequent columns are experimental factors
    coldata <- read.csv("sample_metadata.csv", row.names = 1)
    
    # Ensure sample names match and are in the same order
    coldata <- coldata[colnames(count_data), , drop = FALSE]
    stopifnot(all(rownames(coldata) == colnames(count_data)))
    
    # Filter out genes with total raw counts less than 1 across all samples (i.e., sum of counts is 0)
    # This implements the description: "genes with total raw counts less then 1 across all samples were excluded from analysis"
    keep <- rowSums(count_data) >= 1
    count_data_filtered <- count_data[keep, ]
    
    # Create DESeqDataSet object
    # The design formula should reflect the experimental design, e.g., ~ condition
    dds <- DESeqDataSetFromMatrix(countData = count_data_filtered,
                                  colData = coldata,
                                  design = ~ condition)
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Get results with FDR level 0.05 (alpha = 0.05)
    # This implements the description: "Differential gene expression at FDR level 0.05 was obtained"
    res <- results(dds, alpha = 0.05)
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj), ]
    
    # Save results to a CSV file
    write.csv(as.data.frame(res_ordered), "deseq2_results_fdr0.05.csv")
    
    message("DESeq2 analysis complete. Results saved to deseq2_results_fdr0.05.csv")
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R

Tools Used

Raw Source Text
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra parameters --outFilterScoreMinOverLread 0   --outFilterMatchNmin 0   --outFilterMatchNminOverLread 0   --outFilterMismatchNmax 33   --seedSearchStartLmax 12   --alignSJoverhangMin 8
Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level with extra parameters -s 0 --countReadPairs.
Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis
Assembly: GRCm38.p6
Supplementary files format and content: tab-delimited text file includes raw read counts for all samples
Supplementary files format and content: tab-delimited text files includes differetial gene expression analysis output from DESeq2
← Back to Analysis