GSE273092 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling (Zfp697 skeletal muscle knockout RNA-Seq of unloading-rel…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterScoreMinOverLread 0 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --seedSearchStartLmax 12 --alignSJoverhangMin 8 --alignEndsType Local

STAR v2.7.3a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
READS_R1="reads_R1.fastq.gz" # Placeholder for forward reads
READS_R2="reads_R2.fastq.gz" # Placeholder for reverse reads (assuming paired-end)
GENOME_DIR="/path/to/STAR_index/GRCm38" # Path to the STAR genome index for GRCm38
OUTPUT_BAM="aligned.bam"
OUTPUT_PREFIX="STAR_output_" # Prefix for STAR output files

# Note: The description mentions trimming adaptor sequences and masking low-complexity/low-quality sequence.
# This is typically done as a pre-processing step using tools like fastp or Trimmomatic before STAR alignment.
# The STAR command below focuses on the alignment step with the specified parameters.

# Run STAR alignment
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --runThreadN 8 \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outFilterScoreMinOverLread 0 \
     --outFilterMatchNmin 0 \
     --outFilterMatchNminOverLread 0 \
     --outFilterMismatchNmax 33 \
     --seedSearchStartLmax 12 \
     --alignSJoverhangMin 8 \
     --alignEndsType Local \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outSAMunmapped Within \
     --outSAMmapqUnique 60

# Rename the output BAM file to the desired name
mv "${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" "${OUTPUT_BAM}"

View on GitHub

Reads aligning to gene exons (Mus_musculus.GRCm38.94.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level.

featureCounts v2.0.3

$ Bash example

# Install Subread (which includes featureCounts)
# conda install -c bioconda subread

# Define input and output files
INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned BAM file
GTF_FILE="annotation/Mus_musculus.GRCm38.94.gtf"
OUTPUT_COUNTS="gene_counts.txt"

# Download reference GTF file (Mus_musculus.GRCm38.94.gtf from Ensembl release 94)
# mkdir -p annotation
# wget -O annotation/Mus_musculus.GRCm38.94.gtf.gz ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz
# gunzip annotation/Mus_musculus.GRCm38.94.gtf.gz

# Run featureCounts to count reads aligning to gene exons, summarized at gene level
featureCounts -a "${GTF_FILE}" \
              -o "${OUTPUT_COUNTS}" \
              -F GTF \
              -t exon \
              -g gene_id \
              -T 4 \
              "${INPUT_BAM}"

Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis

DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Bioconductor and DESeq2 (if not already installed)
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("DESeq2")

# Create a placeholder R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# --- Configuration --- #
# Input count matrix file (e.g., from featureCounts, HTSeq-count, RSEM)
# Assumes genes in rows, samples in columns, first column is gene ID
counts_file <- "counts.tsv"

# Input sample metadata file
# Assumes samples in rows, metadata in columns, first column is sample ID
# Must contain a 'condition' column for differential expression
coldata_file <- "coldata.tsv"

# Output results file
output_file <- "deseq2_results.tsv"

# Design formula (e.g., ~ condition). Adjust if more complex designs are needed.
design_formula <- ~ condition

# FDR threshold for results filtering
fdr_threshold <- 0.05

# --- Load Data --- #
# Load count data
count_data <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)
count_data <- as.matrix(count_data)

# Load colData
coldata <- read.table(coldata_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

# Ensure sample names match and are in the same order
stopifnot(all(rownames(coldata) == colnames(count_data)))

# --- Create DESeqDataSet --- #
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = coldata,
                              design = design_formula)

# --- Pre-filtering: Exclude genes with total raw counts less than 1 across all samples ---
# This means excluding genes with zero total counts across all samples
keep <- rowSums(counts(dds)) > 0
dds <- dds[keep,]

# --- Run DESeq2 analysis --- #
dds <- DESeq(dds)

# --- Extract results with specified FDR (alpha) --- #
res <- results(dds, alpha = fdr_threshold)

# --- Save results --- #
write.table(as.data.frame(res), file = output_file, sep = "\t", quote = FALSE, col.names = NA)

message(paste("DESeq2 analysis complete. Results saved to:", output_file))
EOF

# Placeholder for creating dummy input files
# In a real pipeline, these would be generated by upstream steps

# Create dummy counts.tsv
cat << 'EOF' > counts.tsv	Sample1	Sample2	Sample3	Sample4
GeneA	100	120	50	60
GeneB	50	60	100	110
GeneC	10	15	5	8
GeneD	0	0	0	0
GeneE	200	210	180	190
EOF

# Create dummy coldata.tsv
cat << 'EOF' > coldata.tsv	condition
Sample1	Control
Sample2	Control
Sample3	Treated
Sample4	Treated
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterScoreMinOverLread 0   --outFilterMatchNmin 0   --outFilterMatchNminOverLread 0   --outFilterMismatchNmax 33   --seedSearchStartLmax 12   --alignSJoverhangMin 8   --alignEndsType Local
Reads aligning to gene exons (Mus_musculus.GRCm38.94.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level.
Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis
Assembly: GRCm38.p6
Supplementary files format and content: tab-delimited text file includes raw read counts for all samples
Supplementary files format and content: tab-delimited text files includes differetial gene expression analysis output from DESeq2

← Back to Analysis