GSE273093 Processing Pipeline

RNA-Seq, 3 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

GSE273093

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling (Zfp697 transduced primary mouse myotubes RNA-Seq)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Sequenced reads were trimmed of adaptor sequence and masked for low-complexity or low-quality sequence, then mapped to the GRCm38 whole genome using STAR 2.7.3a with the extra arguments --outFilterMultimapNmax 50 --outFilterMultimapScoreRange 3 --outFilterScoreMinOverLread 0.7 --outFilterMatchNminOverLread 0.7 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2

    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star=2.7.3a
    
    # Create STAR genome index (if not already done)
    # This step needs to be run once for a given genome assembly.
    # Replace /path/to/GRCm38_fasta with the actual path to your GRCm38 FASTA file(s).
    # Replace /path/to/GRCm38_gtf with the actual path to your GRCm38 GTF annotation file.
    # Replace /path/to/STAR_index/GRCm38 with your desired output directory for the STAR index.
    # STAR --runThreadN 8 \
    #      --runMode genomeGenerate \
    #      --genomeDir /path/to/STAR_index/GRCm38 \
    #      --genomeFastaFiles /path/to/GRCm38_fasta/GRCm38.fa \
    #      --sjdbGTFfile /path/to/GRCm38_gtf/GRCm38.gtf \
    #      --sjdbOverhang 100 # Recommended: (ReadLength - 1)
    
    # Align reads using STAR
    # Input reads are assumed to be already adaptor-trimmed and masked (the source does not name the tool used).
    # Replace /path/to/STAR_index/GRCm38 with the actual path to your STAR genome index.
    # Replace read1.fastq.gz and read2.fastq.gz with your input FASTQ file(s).
    # If single-end reads, remove 'read2.fastq.gz'.
    # Replace /path/to/output_dir with your desired output directory.
    # Replace sample_prefix with a meaningful prefix for your output files.
    # --readFilesCommand zcat is required because the inputs are gzip-compressed (or use "gunzip -c").
    # The --outSAM* options are not stated in the source; they are common choices, adjust as needed.
    mkdir -p /path/to/output_dir
    STAR --runThreadN 8 \
         --genomeDir /path/to/STAR_index/GRCm38 \
         --readFilesIn read1.fastq.gz read2.fastq.gz \
         --readFilesCommand zcat \
         --outFileNamePrefix /path/to/output_dir/sample_prefix_ \
         --outFilterMultimapNmax 50 \
         --outFilterMultimapScoreRange 3 \
         --outFilterScoreMinOverLread 0.7 \
         --outFilterMatchNminOverLread 0.7 \
         --outFilterMismatchNmax 10 \
         --alignIntronMax 500000 \
         --alignMatesGapMax 1000000 \
         --sjdbScore 2 \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --outSAMunmapped Within
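
After alignment, it can help to check how many reads mapped uniquely versus to multiple loci, since --outFilterMultimapNmax 50 is fairly permissive. The sketch below reads STAR's Log.final.out summary (written as <outFileNamePrefix>Log.final.out). The file name, the helper function star_pct, and the mock numbers are illustrative assumptions, not values from this dataset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name; point this at a real <prefix>Log.final.out instead.
LOG_FILE="mock_Log.final.out"

# Mock log with made-up numbers so the sketch runs without STAR.
# Delete this block when using a real log file.
{
  printf '                          Number of input reads |\t1000000\n'
  printf '                        Uniquely mapped reads %% |\t85.00%%\n'
  printf '             %% of reads mapped to multiple loci |\t10.00%%\n'
  printf '             %% of reads mapped to too many loci |\t0.50%%\n'
} > "$LOG_FILE"

# Print the percentage from the first line whose label contains $1.
star_pct() {
  awk -F'|' -v label="$1" '
    index($1, label) { gsub(/[ \t%]/, "", $2); print $2; exit }
  ' "$LOG_FILE"
}

unique_pct=$(star_pct "Uniquely mapped reads %")
multi_pct=$(star_pct "% of reads mapped to multiple loci")
toomany_pct=$(star_pct "% of reads mapped to too many loci")

echo "unique=${unique_pct} multi=${multi_pct} too_many=${toomany_pct}"
```

Reads reported as mapped to "too many loci" are those exceeding the --outFilterMultimapNmax limit, so a high value there may indicate that the limit is discarding a noticeable share of reads.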
  2.

    Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted with featureCounts v2.0.3 from the Subread package and summarized at the gene level, with the extra parameter -s 0 (unstranded counting)

    featureCounts v2.0.3
    $ Bash example
    # Install featureCounts (part of Subread package), pinned to the version named in the source
    # conda install -c bioconda subread=2.0.3
    
    # Reference dataset: Mus_musculus.GRCm38.102.gtf
    # This GTF file can be downloaded from Ensembl's FTP server.
    # For Ensembl release 102, the path would be similar to:
    # wget http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
    # gunzip Mus_musculus.GRCm38.102.gtf.gz
    
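    GTF_FILE="Mus_musculus.GRCm38.102.gtf"
    # STAR names its sorted BAM <prefix>Aligned.sortedByCoord.out.bam; list several BAMs to get one count column per sample.
    INPUT_BAM="sample_prefix_Aligned.sortedByCoord.out.bam"
    OUTPUT_COUNTS="gene_counts.txt"
    
    # -t exon -g gene_id are the featureCounts defaults, written out to match "gene exons ... summarized at gene level".
    # -s 0 counts reads regardless of strand (unstranded).
    # For paired-end data, add -p --countReadPairs to count fragments rather than reads
    # (the source does not say whether the library was paired-end).
    featureCounts -a "${GTF_FILE}" \
                  -o "${OUTPUT_COUNTS}" \
                  -t exon \
                  -g gene_id \
                  -s 0 \
                  "${INPUT_BAM}"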
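
The featureCounts output is not yet a plain count matrix: it starts with a comment line recording the command, and each row carries Chr, Start, End, Strand and Length columns before the per-sample counts. A minimal sketch of converting it to the gene-by-sample CSV that the DESeq2 step reads is shown below. The file names and the mock rows are illustrative assumptions, and stripping the directory and ".bam" suffix from the column names is a convenience choice, not something the source describes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; use your real featureCounts output instead of the mock.
FC_OUT="mock_gene_counts.txt"
MATRIX_CSV="count_matrix_from_fc.csv"

# Mock featureCounts output with made-up counts so the sketch runs on its own.
{
  printf '# Program:featureCounts v2.0.3; Command:"featureCounts" "-a" "Mus_musculus.GRCm38.102.gtf"\n'
  printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\t/path/to/sample1.bam\tsample2.bam\n'
  printf 'geneA\t1\t100\t200\t+\t101\t10\t12\n'
  printf 'geneB\t1\t300\t400\t-\t101\t0\t0\n'
} > "$FC_OUT"

# Keep column 1 (gene ID) and columns 7+ (counts); drop the comment line.
awk -F'\t' 'BEGIN { OFS = "," }
  /^#/ { next }
  {
    row = $1
    for (i = 7; i <= NF; i++) {
      val = $i
      # On the header row only, reduce BAM paths to bare sample names.
      if (!header_done) { sub(/.*\//, "", val); sub(/\.bam$/, "", val) }
      row = row OFS val
    }
    if (!header_done) { sub(/^Geneid/, "gene_id", row); header_done = 1 }
    print row
  }' "$FC_OUT" > "$MATRIX_CSV"

cat "$MATRIX_CSV"
```

This assumes gene identifiers contain no commas, which holds for Ensembl gene IDs such as those in Mus_musculus.GRCm38.102.gtf.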
  3.

    Differential gene expression at an FDR of 0.05 was determined with the DESeq2 R package. Genes with total raw counts less than 1 across all samples (that is, genes with zero counts in every sample) were excluded from the analysis

    DESeq2 v1.40.2 (version inferred with models/gemini-2.5-flash; not stated in the source)
    $ Bash example
    # Install R and Bioconductor if not already installed
    # For R:
    # sudo apt-get update
    # sudo apt-get install r-base
    #
    # For Bioconductor and DESeq2:
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("DESeq2")'
    
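    # Create a placeholder count matrix and sample metadata to illustrate the expected file formats
    # In a real scenario, these files would already exist. With only a handful of genes, DESeq2
    # dispersion estimation may fail or be meaningless, so use a full count matrix for real runs.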
    echo "gene_id,sample1,sample2,sample3,sample4" > count_matrix.csv
    echo "geneA,10,12,5,7" >> count_matrix.csv
    echo "geneB,0,0,0,0" >> count_matrix.csv
    echo "geneC,20,25,10,15" >> count_matrix.csv
    echo "geneD,5,6,1,2" >> count_matrix.csv
    echo "geneE,0,1,0,0" >> count_matrix.csv # This gene will be kept due to count >= 1
    
    echo "sample_id,condition" > sample_metadata.csv
    echo "sample1,control" >> sample_metadata.csv
    echo "sample2,control" >> sample_metadata.csv
    echo "sample3,treated" >> sample_metadata.csv
    echo "sample4,treated" >> sample_metadata.csv
    
    # R script for DESeq2 analysis
    Rscript -e '
    library(DESeq2)
    
    # --- Configuration ---
    # Define input and output file paths
    count_matrix_file <- "count_matrix.csv" # Replace with your actual count matrix file
    sample_metadata_file <- "sample_metadata.csv" # Replace with your actual sample metadata file
    output_results_file <- "deseq2_results.csv"
    design_formula <- "~ condition" # Replace "condition" with your actual primary experimental variable
    
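    # --- Data Loading ---
    # Load count data
    # Assuming the first column is gene names and subsequent columns are sample counts
    # (raw featureCounts output also has Chr/Start/End/Strand/Length columns, which must be dropped first)
    count_data <- read.csv(count_matrix_file, row.names = 1, check.names = FALSE)
    # DESeq2 expects a matrix of non-negative integers
    count_data <- round(as.matrix(count_data))
    
    # Load sample metadata
    # Assuming the first column is sample names matching count_data columns
    sample_metadata <- read.csv(sample_metadata_file, row.names = 1)
    
    # Ensure every sample has metadata, then put the rows in the same order as the count columns
    stopifnot(all(colnames(count_data) %in% rownames(sample_metadata)))
    sample_metadata <- sample_metadata[colnames(count_data), , drop = FALSE]
    
    # Make the design variable a factor with "control" as the reference level
    # (adjust the column name and levels to match your metadata)
    sample_metadata$condition <- factor(sample_metadata$condition, levels = c("control", "treated"))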
    
    # --- DESeq2 Analysis ---
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_metadata,
                                  design = as.formula(design_formula))
    
    # Filter out genes with total raw counts less than 1 across all samples
    # This means genes with a sum of 0 counts across all samples are removed.
    dds <- dds[rowSums(counts(dds)) >= 1, ]
    
    # Run DESeq2 differential expression analysis
    dds <- DESeq(dds)
    
    # Extract results with FDR level 0.05 (alpha=0.05)
    # Adjust contrast if needed, e.g., contrast=c("condition", "treated", "control")
    res <- results(dds, alpha = 0.05)
    
    # Order results by adjusted p-value
    res <- res[order(res$padj), ]
    
    # Save the results
    write.csv(as.data.frame(res), file = output_results_file)
    
    # Display summary of results
    summary(res)
    '
    
    # Clean up placeholder files (optional)
    # rm count_matrix.csv sample_metadata.csv
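
The results table written by write.csv contains every tested gene, not only the significant ones. A minimal sketch of selecting genes at the FDR threshold of 0.05 is shown below. It assumes the column layout produced by write.csv(as.data.frame(res)), with quoted headers and a padj column; the file names and the mock rows are made up for illustration and are not results from this dataset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; point RESULTS_CSV at your real deseq2_results.csv.
RESULTS_CSV="mock_deseq2_results.csv"
SIG_CSV="deseq2_significant.csv"
ALPHA="0.05"

# Mock results with made-up values so the sketch runs without R.
{
  printf '"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"\n'
  printf '"geneA",8.5,-1.2,0.3,-4,6e-05,2e-04\n'
  printf '"geneC",17.5,-0.9,0.4,-2.25,0.024,0.048\n'
  printf '"geneD",3.5,-1.5,1,-1.5,0.13,0.2\n'
  printf '"geneE",0.25,0.8,3,0.27,0.79,NA\n'
} > "$RESULTS_CSV"

# Locate the padj column by name, then keep rows with a numeric padj below ALPHA.
awk -F',' -v alpha="$ALPHA" '
  NR == 1 {
    for (i = 1; i <= NF; i++) { h = $i; gsub(/"/, "", h); if (h == "padj") col = i }
    if (!col) { print "padj column not found" > "/dev/stderr"; exit 1 }
    print; next
  }
  $col != "NA" && ($col + 0) < (alpha + 0) { print }
' "$RESULTS_CSV" > "$SIG_CSV"

# Number of significant genes (file lines minus the header).
n_sig=$(( $(wc -l < "$SIG_CSV") - 1 ))
echo "genes with padj < ${ALPHA}: ${n_sig}"
```

Rows with padj equal to NA are skipped rather than treated as zero. DESeq2 sets padj to NA for genes removed by its independent filtering step, so such genes are neither significant nor non-significant.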

Tools Used

Raw Source Text
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterMultimapNmax 50   --outFilterMultimapScoreRange 3   --outFilterScoreMinOverLread 0.7   --outFilterMatchNminOverLread 0.7   --outFilterMismatchNmax 10   --alignIntronMax 500000   --alignMatesGapMax 1000000   --sjdbScore 2
Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level with extra parameters -s 0
Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less than 1 across all samples were excluded from analysis
Assembly: GRCm38.p6
Supplementary files format and content: tab-delimited text file includes raw read counts for all samples
Supplementary files format and content: tab-delimited text file includes differential gene expression analysis output from DESeq2