GSE273093 Processing Pipeline

RNA-Seq, 3 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling (Zfp697 transduced primary mouse myotubes RNA-Seq)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Sequenced reads were trimmed for adaptor sequence and masked for low-complexity or low-quality sequence, then mapped to the GRCm38 whole genome using STAR 2.7.3a with the extra arguments --outFilterMultimapNmax 50 --outFilterMultimapScoreRange 3 --outFilterScoreMinOverLread 0.7 --outFilterMatchNminOverLread 0.7 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2

STAR v2.7.3a

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star=2.7.3a

# Create STAR genome index (if not already done)
# This step needs to be run once for a given genome assembly.
# Replace /path/to/GRCm38_fasta with the actual path to your GRCm38 FASTA file(s).
# Replace /path/to/GRCm38_gtf with the actual path to your GRCm38 GTF annotation file.
# Replace /path/to/STAR_index/GRCm38 with your desired output directory for the STAR index.
# STAR --runThreadN 8 \
#      --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/GRCm38 \
#      --genomeFastaFiles /path/to/GRCm38_fasta/GRCm38.fa \
#      --sjdbGTFfile /path/to/GRCm38_gtf/GRCm38.gtf \
#      --sjdbOverhang 100 # Recommended: (ReadLength - 1)

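# Align reads using STAR
# Input reads are assumed to be already adapter-trimmed and masked, as described above;
# the source does not name the tools used for that step.
# Replace /path/to/STAR_index/GRCm38 with the actual path to your STAR genome index.
# Replace read1.fastq.gz and read2.fastq.gz with your input FASTQ file(s).
# If single-end reads, remove 'read2.fastq.gz' (the source does not state the library layout).
# --readFilesCommand zcat is required because the inputs are gzip-compressed;
# drop it for uncompressed FASTQ (on macOS, use "gunzip -c" instead of zcat).
# The --outSAM* options at the end are not part of the source description; they are common choices.
# Replace /path/to/output_dir with your desired output directory.
# Replace sample_prefix with a meaningful prefix for your output files.
STAR --runThreadN 8 \
     --genomeDir /path/to/STAR_index/GRCm38 \
     --readFilesIn read1.fastq.gz read2.fastq.gz \
     --readFilesCommand zcat \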
     --outFileNamePrefix /path/to/output_dir/sample_prefix_ \
     --outFilterMultimapNmax 50 \
     --outFilterMultimapScoreRange 3 \
     --outFilterScoreMinOverLread 0.7 \
     --outFilterMatchNminOverLread 0.7 \
     --outFilterMismatchNmax 10 \
     --alignIntronMax 500000 \
     --alignMatesGapMax 1000000 \
     --sjdbScore 2 \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outSAMunmapped Within
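The index-building comment above recommends --sjdbOverhang = ReadLength - 1. Because the reads in this pipeline were trimmed before mapping, their lengths may vary, and the STAR manual suggests using the maximum read length minus 1 in that case. The following is a minimal sketch of deriving that value from a FASTQ file. The file name mock_read1.fastq.gz and the two mock reads are made up for illustration and are not part of the dataset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name; point this at one of your own trimmed FASTQ files.
FASTQ="mock_read1.fastq.gz"

# Mock FASTQ with two reads of length 8 and 12 (illustration only).
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' \
  | gzip -c > "${FASTQ}"

# In a FASTQ file the sequence is line 2 of every 4-line record.
# The whole file is scanned, which is simple but can be slow for large inputs.
MAX_LEN=$(gzip -dc "${FASTQ}" \
  | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }')

# Value to pass to STAR --runMode genomeGenerate via --sjdbOverhang.
SJDB_OVERHANG=$(( MAX_LEN - 1 ))

echo "max read length: ${MAX_LEN}; suggested --sjdbOverhang: ${SJDB_OVERHANG}"
```

If the suggested value differs from the 100 used in the index example, the index would need to be regenerated with the new value; whether that matters in practice depends on how far the read lengths deviate.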

Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using the featureCounts v2.0.3 program from the Subread package and summarized at the gene level with the extra parameter -s 0

featureCounts v2.0.3

$ Bash example

# Install featureCounts (part of the Subread package); the version is pinned to match the source
# conda install -c bioconda subread=2.0.3

# Reference dataset: Mus_musculus.GRCm38.102.gtf
# This GTF file can be downloaded from Ensembl's FTP server.
# For Ensembl release 102, the path would be similar to:
# wget http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
# gunzip Mus_musculus.GRCm38.102.gtf.gz

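GTF_FILE="Mus_musculus.GRCm38.102.gtf"
# Placeholder for your alignment file; with the STAR options shown earlier this would be
# <outFileNamePrefix>Aligned.sortedByCoord.out.bam
INPUT_BAM="input.bam"
OUTPUT_COUNTS="gene_counts.txt"

# -s 0 counts reads regardless of strand (unstranded), as stated in the source.
# The featureCounts defaults (-t exon -g gene_id) already count exon-overlapping reads per gene.
# For paired-end libraries, add -p --countReadPairs to count fragments instead of reads
# (the source does not state the library layout).
featureCounts -a "${GTF_FILE}" \
              -o "${OUTPUT_COUNTS}" \
              -s 0 \
              "${INPUT_BAM}"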
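featureCounts writes a tab-delimited table whose first line is a comment recording the command. It is followed by a header row, then six annotation columns (Geneid, Chr, Start, End, Strand, Length) before the per-sample counts. A plain gene-by-sample matrix is usually easier to hand to downstream tools. Below is a minimal sketch of that conversion, assuming the standard featureCounts layout. The mock table, its values, and the output name count_matrix.csv are illustrative assumptions, not data from this dataset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; adjust to your own layout.
FC_TABLE="gene_counts.txt"
COUNT_CSV="count_matrix.csv"

# Tiny mock table in the featureCounts layout (values are made up).
printf '# Program:featureCounts (mock comment line)\n' > "${FC_TABLE}"
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tout/sample1.bam\tout/sample2.bam\n' >> "${FC_TABLE}"
printf 'geneA\t1\t100\t500\t+\t401\t10\t12\n' >> "${FC_TABLE}"
printf 'geneB\t2\t200\t900\t-\t701\t0\t0\n' >> "${FC_TABLE}"

# Skip the leading comment line, keep Geneid (column 1) and the count
# columns (7 onward), and write comma-separated output.
awk -F'\t' -v OFS=',' '
  /^#/ { next }
  {
    line = $1
    for (i = 7; i <= NF; i++) {
      field = $i
      if (!header_done) {          # header row: tidy the sample names
        sub(/^.*\//, "", field)    # drop any directory prefix
        sub(/\.bam$/, "", field)   # drop the .bam suffix
      }
      line = line OFS field
    }
    if (!header_done) { sub(/^Geneid/, "gene_id", line); header_done = 1 }
    print line
  }
' "${FC_TABLE}" > "${COUNT_CSV}"

cat "${COUNT_CSV}"
```

The sample names are taken from the BAM paths in the header row, so they need to match the sample identifiers used in any sample metadata table.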

Differential gene expression at an FDR level of 0.05 was obtained using the DESeq2 R package; genes with total raw counts less than 1 across all samples were excluded from the analysis

DESeq2 v1.40.2 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor if not already installed
# For R:
# sudo apt-get update
# sudo apt-get install r-base
#
# For Bioconductor and DESeq2:
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

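# Create a placeholder count matrix and sample metadata for demonstration
# In a real scenario, these files would already exist.
# The toy matrix only illustrates the expected file layout: with this few genes,
# DESeq2 dispersion estimation may fail, so use real counts to run the analysis.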
echo "gene_id,sample1,sample2,sample3,sample4" > count_matrix.csv
echo "geneA,10,12,5,7" >> count_matrix.csv
echo "geneB,0,0,0,0" >> count_matrix.csv
echo "geneC,20,25,10,15" >> count_matrix.csv
echo "geneD,5,6,1,2" >> count_matrix.csv
echo "geneE,0,1,0,0" >> count_matrix.csv # This gene will be kept due to count >= 1

echo "sample_id,condition" > sample_metadata.csv
echo "sample1,control" >> sample_metadata.csv
echo "sample2,control" >> sample_metadata.csv
echo "sample3,treated" >> sample_metadata.csv
echo "sample4,treated" >> sample_metadata.csv

# R script for DESeq2 analysis
Rscript -e '
library(DESeq2)

# --- Configuration ---
# Define input and output file paths
count_matrix_file <- "count_matrix.csv" # Replace with your actual count matrix file
sample_metadata_file <- "sample_metadata.csv" # Replace with your actual sample metadata file
output_results_file <- "deseq2_results.csv"
design_formula <- "~ condition" # Replace "condition" with your actual primary experimental variable

# --- Data Loading ---
# Load count data
# Assuming the first column is gene names and subsequent columns are sample counts
count_data <- read.csv(count_matrix_file, row.names = 1, check.names = FALSE)
# Ensure counts are integers
count_data <- round(count_data)

# Load sample metadata
# Assuming the first column is sample names matching count_data columns
sample_metadata <- read.csv(sample_metadata_file, row.names = 1)

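# Ensure every sample has metadata, then put metadata rows in count-column order
stopifnot(all(colnames(count_data) %in% rownames(sample_metadata)))
sample_metadata <- sample_metadata[colnames(count_data), , drop = FALSE]
# Make the condition a factor with an explicit reference level ("control" is a placeholder)
sample_metadata$condition <- relevel(factor(sample_metadata$condition), ref = "control")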

# --- DESeq2 Analysis ---
# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_metadata,
                              design = as.formula(design_formula))

# Filter out genes with total raw counts less than 1 across all samples
# This means genes with a sum of 0 counts across all samples are removed.
dds <- dds[rowSums(counts(dds)) >= 1, ]

# Run DESeq2 differential expression analysis
dds <- DESeq(dds)

# Extract results; alpha = 0.05 sets the FDR target used for independent filtering.
# Genes significant at FDR 0.05 are those with padj < 0.05.
# Adjust contrast if needed, e.g., contrast = c("condition", "treated", "control")
res <- results(dds, alpha = 0.05)

# Order results by adjusted p-value
res <- res[order(res$padj), ]

# Save the results
write.csv(as.data.frame(res), file = output_results_file)

# Display summary of results
summary(res)
'

# Clean up placeholder files (optional)
# rm count_matrix.csv sample_metadata.csv
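The results table written above contains every tested gene, not only the significant ones. The set reported "at FDR level 0.05" corresponds to rows with padj below 0.05, and rows where padj is NA (for example, genes removed by independent filtering) are not significant. The following is a minimal sketch of that selection, assuming the default column order produced by write.csv on a DESeq2 results object, with padj as the seventh column. The mock file names and all numeric values are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names, kept separate from any real results file.
RESULTS_CSV="mock_deseq2_results.csv"
SIG_CSV="mock_deseq2_significant.csv"
FDR=0.05

# Mock results in the write.csv layout (values are made up).
cat > "${RESULTS_CSV}" <<'EOF'
"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
"geneA",150.2,2.1,0.40,5.25,1.5e-07,4.5e-07
"geneC",80.0,-0.3,0.35,-0.86,0.39,0.58
"geneD",3.1,1.2,1.10,1.09,0.27,NA
EOF

# Keep the header, then keep rows whose padj (column 7) is numeric and below the threshold.
awk -F',' -v fdr="${FDR}" '
  NR == 1 { print; next }
  $7 != "NA" && ($7 + 0) < (fdr + 0) { print }
' "${RESULTS_CSV}" > "${SIG_CSV}"

echo "Genes with padj < ${FDR}: $(( $(wc -l < "${SIG_CSV}") - 1 ))"
```

This simple comma split assumes gene identifiers contain no commas, which holds for Ensembl gene IDs. A CSV-aware tool would be safer for other identifier schemes.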

Tools Used

STAR, featureCounts, DESeq2

Raw Source Text

Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterMultimapNmax 50   --outFilterMultimapScoreRange 3   --outFilterScoreMinOverLread 0.7   --outFilterMatchNminOverLread 0.7   --outFilterMismatchNmax 10   --alignIntronMax 500000   --alignMatesGapMax 1000000   --sjdbScore 2
Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level with extra parameters -s 0
Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less than 1 across all samples were excluded from analysis
Assembly: GRCm38.p6
Supplementary files format and content: tab-delimited text file includes raw read counts for all samples
Supplementary files format and content: tab-delimited text file includes differential gene expression analysis output from DESeq2
