GSE145430 Processing Pipeline

RNA-Seq, 2 steps

Publication

Zmat3 Is a Key Splicing Regulator in the p53 Tumor Suppression Program.

Molecular Cell (2020), PMID 33157015

Dataset

Identifying differentially expressed genes in Zmat3 knockout cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

RNA-seq reads were aligned to the mouse genome (mm10) and analyzed using the public server Galaxy (usegalaxy.org), which employs the STAR aligner.

STAR v2.7.10a

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Reference Data Setup (mm10 genome and GTF) ---
# Create a directory for reference files
mkdir -p star_index_mm10
cd star_index_mm10

# Download the mouse genome (mm10) FASTA from UCSC
# This single file holds all mm10 sequences, including random and unplaced scaffolds, with "chr"-prefixed names
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip mm10.fa.gz

# Download mouse (mm10, GRCm38) GTF annotation from Ensembl
# Using a common release, e.g., release 102
wget http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
gunzip Mus_musculus.GRCm38.102.gtf.gz

# Define paths
GENOME_FASTA="mm10.fa"
GTF_FILE="Mus_musculus.GRCm38.102.gtf"
STAR_INDEX_DIR="star_index_mm10_dir" # Directory for STAR index

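# Build STAR genome index
# Adjust --runThreadN to the available cores and --sjdbOverhang to read length - 1
# The UCSC FASTA names chromosomes "chr1" while the Ensembl GTF uses "1";
# --sjdbGTFchrPrefix chr adds the prefix so the annotation matches the genome.
# Ensembl "MT" then becomes "chrMT" (UCSC uses "chrM"), so mitochondrial genes
# stay unannotated unless that name is changed in the GTF.
mkdir -p "${STAR_INDEX_DIR}"
STAR --runMode genomeGenerate \
     --genomeDir "${STAR_INDEX_DIR}" \
     --genomeFastaFiles "${GENOME_FASTA}" \
     --sjdbGTFfile "${GTF_FILE}" \
     --sjdbGTFchrPrefix chr \
     --sjdbOverhang 100 \
     --runThreadN 8 # Use appropriate number of threads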

cd .. # Go back to the working directory

# --- RNA-seq Alignment ---
# Define input FASTQ files (assuming paired-end reads)
READS_R1="reads_R1.fastq.gz" # Replace with your actual R1 file
READS_R2="reads_R2.fastq.gz" # Replace with your actual R2 file
OUTPUT_PREFIX="aligned_reads"
STAR_INDEX_PATH="star_index_mm10/${STAR_INDEX_DIR}" # Path to the generated STAR index

# Perform alignment with STAR
# Adjust --runThreadN based on available cores
# Adjust --limitBAMsortRAM (given in bytes) based on available RAM (e.g., 30000000000 is about 30 GB)
STAR --genomeDir ${STAR_INDEX_PATH} \
     --readFilesIn ${READS_R1} ${READS_R2} \
     --readFilesCommand zcat \
     --runThreadN 8 \
     --outFileNamePrefix ${OUTPUT_PREFIX}. \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --quantMode GeneCounts \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # Adjust based on available RAM
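
STAR can only attach annotation to sequences whose names appear in both the FASTA and the GTF, and the UCSC ("chr1") and Ensembl ("1") naming conventions differ. The following is a minimal sketch, not part of the original pipeline, that lists the sequence names shared by the two reference files. The helper name shared_chroms is hypothetical, and the file paths assume the directory layout of the example above.

```shell
# shared_chroms FASTA GTF
# Print the sequence names that occur both as FASTA headers and in GTF column 1.
# (Hypothetical helper, for illustration only.)
shared_chroms() {
    awk 'FNR == NR {
             # First file (FASTA): remember each header name without the ">"
             if ($0 ~ /^>/) fa[substr($1, 2)] = 1
             next
         }
         # Second file (GTF): keep column 1 if it is also a FASTA name
         $0 !~ /^#/ && ($1 in fa) { shared[$1] = 1 }
         END { for (c in shared) print c }' "$1" "$2" | sort
}

# Run the check only if the reference files from the example are present.
FASTA="star_index_mm10/mm10.fa"
GTF="star_index_mm10/Mus_musculus.GRCm38.102.gtf"
if [ -f "$FASTA" ] && [ -f "$GTF" ]; then
    n_shared=$(shared_chroms "$FASTA" "$GTF" | wc -l)
    echo "Sequence names shared by FASTA and GTF: ${n_shared}"
fi
```

A count of zero means the names do not line up. In that case, either pass STAR's --sjdbGTFchrPrefix option when generating the index or use an annotation that follows the same naming as the FASTA.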

DESeq2 was used for differential expression analysis.
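
DESeq2 works on a single genes-by-samples matrix of raw counts, whereas STAR run with --quantMode GeneCounts writes one ReadsPerGene.out.tab file per sample. The sketch below shows one way these per-sample tables could be combined into a CSV. It is an illustration under assumptions, not the authors' method: the helper name merge_star_counts is hypothetical, sample names are taken from the file names, and column 2 (unstranded counts) is chosen only because the source does not state the library strandedness.

```shell
# merge_star_counts COLUMN FILE...
# Combine STAR ReadsPerGene.out.tab files into one CSV (genes x samples).
# COLUMN selects the count column: 2 = unstranded, 3 = forward-stranded,
# 4 = reverse-stranded. (Hypothetical helper, for illustration only.)
merge_star_counts() {
    col="$1"
    shift
    awk -F'\t' -v col="$col" '
        FNR == 1 {
            # New input file: derive the sample name from the file name
            name = FILENAME
            sub(/.*\//, "", name)
            sub(/\.ReadsPerGene\.out\.tab$/, "", name)
            samples[++n] = name
        }
        $1 ~ /^N_/ { next }   # skip the summary rows (N_unmapped, ...)
        {
            if (!($1 in seen)) { seen[$1] = 1; genes[++g] = $1 }
            count[$1, n] = $col
        }
        END {
            header = "gene_id"
            for (j = 1; j <= n; j++) header = header "," samples[j]
            print header
            for (i = 1; i <= g; i++) {
                row = genes[i]
                for (j = 1; j <= n; j++) row = row "," (count[genes[i], j] + 0)
                print row
            }
        }' "$@"
}

# Build counts.csv only if STAR outputs are present in the working directory.
if ls ./*.ReadsPerGene.out.tab >/dev/null 2>&1; then
    merge_star_counts 2 ./*.ReadsPerGene.out.tab > counts.csv
fi
```

The column order of the resulting file follows the order of the input files, so the rows of the sample metadata table need to be listed in the same order.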

DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor if not already present
# For Ubuntu/Debian:
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'

# Create a placeholder R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
library(DESeq2)

# --- Configuration ---
# Define input files (replace with actual paths and ensure they exist)
# counts.csv: Your raw, un-normalized integer count matrix (genes x samples), with gene IDs as the first column and sample names as column headers.
#   DESeq2 needs raw counts, so the normalized counts in the GEO supplementary Excel file are not suitable input.
# sample_info.csv: Your sample metadata (samples x variables), with sample names as the first column and variables as column headers.
counts_file <- "counts.csv" 
sample_info_file <- "sample_info.csv" 
design_formula <- "~ condition" # Your experimental design formula (e.g., ~ condition + batch)
output_prefix <- "deseq2_results" # Prefix for output files

# --- Load Data ---
# Load count data
# Assuming counts.csv is comma-separated. Adjust read.csv parameters as needed.
count_data <- read.csv(counts_file, row.names = 1, check.names = FALSE)
count_data <- as.matrix(count_data) # DESeq2 expects a matrix

# Load sample information
# Assuming sample_info.csv is comma-separated. Adjust read.csv parameters as needed.
sample_info <- read.csv(sample_info_file, row.names = 1)

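# Ensure sample names match and are in the same order
# identical() also catches a differing number of samples, which all(... == ...) can miss through recycling
if (!identical(colnames(count_data), rownames(sample_info))) {
  stop("Sample names in count data and sample information do not match or are not in the same order.")
}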

# Ensure the 'condition' variable (or whatever is in your design_formula) is a factor
# For example, if your design is ~ condition, make sure sample_info$condition is a factor.
# sample_info$condition <- factor(sample_info$condition)

# --- Create DESeqDataSet object ---
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = as.formula(design_formula))

# --- Pre-filtering (optional but recommended) ---
# Remove genes with very low counts across all samples (e.g., sum of counts < 10)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# --- Run DESeq2 analysis ---
dds <- DESeq(dds)

# --- Extract Results ---
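# Get results for the primary contrast (e.g., comparing levels of 'condition')
# Without arguments, results() reports the last level of the last design variable versus the reference level.
# Factor levels are alphabetical by default, so pass 'contrast' to fix the direction of the comparison,
# e.g., res <- results(dds, contrast=c("condition", "treated", "untreated"))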
res <- results(dds)

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# --- Save Results ---
# Save full results table
write.csv(as.data.frame(res_ordered), file = paste0(output_prefix, "_full_results.csv"))

# Save significant results (e.g., padj < 0.05)
res_sig <- subset(res_ordered, padj < 0.05)
write.csv(as.data.frame(res_sig), file = paste0(output_prefix, "_significant_results.csv"))

# --- Optional: Generate plots ---
# MA plot
# png(paste0(output_prefix, "_MA_plot.png"))
# plotMA(res, main="MA-plot")
# dev.off()

# Dispersion plot
# png(paste0(output_prefix, "_dispersion_plot.png"))
# plotDispEsts(dds, main="Dispersion Estimates")
# dev.off()

# PCA plot (requires 'vst' or 'rlog' transformation)
# vsd <- vst(dds, blind=FALSE)
# png(paste0(output_prefix, "_PCA_plot.png"))
# plotPCA(vsd, intgroup="condition")
# dev.off()

message("DESeq2 analysis complete. Results saved to ", output_prefix, "_full_results.csv and ", output_prefix, "_significant_results.csv")
EOF

# Execute the R script
Rscript run_deseq2.R
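
Once the script has finished, a quick tally of significant genes helps confirm that the output looks plausible. The sketch below is a hedged illustration: the helper name summarize_deseq2 is hypothetical, and it assumes the default column layout written by the script above (row names, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) and gene IDs without embedded commas.

```shell
# summarize_deseq2 RESULTS_CSV [ALPHA]
# Count genes with padj below ALPHA (default 0.05), split by the sign of
# log2FoldChange. (Hypothetical helper, for illustration only.)
summarize_deseq2() {
    awk -F',' -v alpha="${2:-0.05}" '
        NR == 1 { next }                     # skip the header row
        $7 == "NA" || $3 == "NA" { next }    # rows without a usable estimate
        ($7 + 0) < (alpha + 0) {
            if (($3 + 0) > 0) up++; else down++
        }
        END { printf "up=%d down=%d\n", up, down }' "$1"
}

# Summarize the full results table if it exists in the working directory.
if [ -f deseq2_results_full_results.csv ]; then
    summarize_deseq2 deseq2_results_full_results.csv
fi
```

Which group counts as "up" depends on the contrast used in the R script, so the direction should be read against the factor levels of the design variable.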

Tools Used

STAR DESeq2

Raw Source Text

RNA-seq reads were aligned to the mouse genome (mm10) and analyzed using the public server Galaxy (usegalaxy.org), which employs the STAR aligner.
DESeq2 was used for differential expression analysis
Genome_build: mm10
Supplementary_files_format_and_content: Excel file contains the normalized counts for each sample
