GSE145430 Processing Pipeline

RNA-Seq, 2 steps

Publication

Zmat3 Is a Key Splicing Regulator in the p53 Tumor Suppression Program.

Molecular Cell (2020), PMID 33157015

Dataset

GSE145430

Identifying differentially expressed genes in Zmat3 knockout cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    RNA-seq reads were aligned to the mouse genome (mm10) and analyzed using the public server Galaxy (usegalaxy.org), which employs the STAR aligner.
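
    The index build in this step sets --sjdbOverhang to the read length minus 1, and the source does not state the read length. A minimal sketch for checking it from a gzipped FASTQ file is given here. The function name max_read_length and the file name reads_R1.fastq.gz are hypothetical, and the whole file is scanned, which can take a while for large runs.

```shell
# max_read_length FASTQ_GZ
# Prints the longest read length found in a gzipped FASTQ file.
# Assumes the standard 4-line FASTQ record layout (sequence on line 2 of each record).
max_read_length() {
  gzip -dc "$1" | awk '
    NR % 4 == 2 { if (length($0) > max) max = length($0) }
    END {
      if (NR == 0) exit 1   # empty input: nothing to report
      print max
    }'
}

# Hypothetical usage:
#   read_len=$(max_read_length reads_R1.fastq.gz)
#   echo "--sjdbOverhang $((read_len - 1))"
```

    The maximum is used rather than the most common length because, for reads of varying length (for example after trimming), the STAR manual recommends max(ReadLength) - 1.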

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # --- Reference Data Setup (mm10 genome and GTF) ---
    # Create a directory for reference files
    mkdir -p star_index_mm10
    cd star_index_mm10
    
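    # Download mouse genome (mm10) FASTA from UCSC
    # Note: mm10.fa.gz is the full assembly, including unplaced and "random" contigs
    wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
    gunzip mm10.fa.gz
    
    # Download a GTF annotation whose chromosome names match the FASTA.
    # UCSC mm10 uses "chr"-prefixed names (chr1, chrM), whereas Ensembl GTFs use "1", "MT",
    # so an Ensembl GTF would not match this FASTA. GENCODE vM25 (the last GRCm38/mm10
    # release) uses UCSC-style names; this choice is an assumption, not stated in the source.
    wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
    gunzip gencode.vM25.annotation.gtf.gz
    
    # Define paths
    GENOME_FASTA="mm10.fa"
    GTF_FILE="gencode.vM25.annotation.gtf"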
    STAR_INDEX_DIR="star_index_mm10_dir" # Directory for STAR index
    mkdir -p "${STAR_INDEX_DIR}" # Older STAR versions require the index directory to exist
    
    # Build STAR genome index
    # Adjust --runThreadN based on available cores and --sjdbOverhang based on read length (read_length - 1)
    STAR --runMode genomeGenerate \
         --genomeDir ${STAR_INDEX_DIR} \
         --genomeFastaFiles ${GENOME_FASTA} \
         --sjdbGTFfile ${GTF_FILE} \
         --sjdbOverhang 100 \
         --runThreadN 8 # Use appropriate number of threads
    
    cd .. # Go back to the working directory
    
    # --- RNA-seq Alignment ---
    # Define input FASTQ files (assuming paired-end reads)
    READS_R1="reads_R1.fastq.gz" # Replace with your actual R1 file
    READS_R2="reads_R2.fastq.gz" # Replace with your actual R2 file
    OUTPUT_PREFIX="aligned_reads"
    STAR_INDEX_PATH="star_index_mm10/${STAR_INDEX_DIR}" # Path to the generated STAR index
    
    # Perform alignment with STAR
    # Adjust --runThreadN based on available cores
    # Adjust --limitBAMsortRAM based on available RAM (value is in bytes; 30000000000 is about 30 GB)
    STAR --genomeDir ${STAR_INDEX_PATH} \
         --readFilesIn ${READS_R1} ${READS_R2} \
         --readFilesCommand zcat \
         --runThreadN 8 \
         --outFileNamePrefix ${OUTPUT_PREFIX}. \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --quantMode GeneCounts \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.1 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 30000000000 # Adjust based on available RAM
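
    With --quantMode GeneCounts, STAR writes a per-sample file named <prefix>ReadsPerGene.out.tab: four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) followed by one row per gene, with unstranded counts in column 2 and the two stranded counts in columns 3 and 4. The sketch below shows one way to combine such files into the counts.csv matrix that the DESeq2 step reads. The function name merge_star_counts and the per-sample file naming (<sample>.ReadsPerGene.out.tab) are assumptions, and the source does not state the library strandedness, so the column to use has to be chosen per experiment.

```shell
# merge_star_counts COLUMN FILE...
# Writes a CSV (gene_id,sample1,sample2,...) to stdout.
# COLUMN is 2 (unstranded), 3 (forward-stranded) or 4 (reverse-stranded).
# Sample names are taken from the file names (hypothetical naming scheme).
merge_star_counts() {
  local col="$1"; shift
  awk -v col="$col" '
    BEGIN { FS = "\t"; OFS = "," }
    FNR == 1 {
      # New input file: derive the sample name from its base name
      n = split(FILENAME, parts, "/")
      name = parts[n]
      sub(/\.ReadsPerGene\.out\.tab$/, "", name)
      samples[++ns] = name
    }
    FNR <= 4 { next }   # skip the four N_* summary rows
    {
      if (!($1 in seen)) { seen[$1] = 1; genes[++ng] = $1 }
      count[$1, ns] = $col
    }
    END {
      header = "gene_id"
      for (s = 1; s <= ns; s++) header = header OFS samples[s]
      print header
      for (g = 1; g <= ng; g++) {
        row = genes[g]
        for (s = 1; s <= ns; s++) {
          v = ((genes[g], s) in count) ? count[genes[g], s] : 0
          row = row OFS v
        }
        print row
      }
    }
  ' "$@"
}

# Hypothetical usage:
#   merge_star_counts 2 WT_1.ReadsPerGene.out.tab KO_1.ReadsPerGene.out.tab > counts.csv
```

    Comparing the column totals within one file is a common way to guess strandedness: if columns 3 and 4 are similar, the library is likely unstranded; if one is far larger, that column is likely the right one.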
    
  2.

    DESeq2 was used for differential expression analysis.

    DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install R and Bioconductor if not already present
    # For Ubuntu/Debian:
    # sudo apt-get update
    # sudo apt-get install r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'
    
    # Create a placeholder R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    # Load DESeq2 library
    library(DESeq2)
    
    # --- Configuration ---
    # Define input files (replace with actual paths and ensure they exist)
    # counts.csv: raw, un-normalized integer counts (genes x samples), with gene IDs as the first column and sample names as column headers.
    #   DESeq2 expects raw counts, so the normalized counts in the GEO supplementary Excel file are not suitable input.
    # sample_info.csv: Your sample metadata (samples x variables), with sample names as the first column and variables as column headers.
    counts_file <- "counts.csv" 
    sample_info_file <- "sample_info.csv" 
    design_formula <- "~ condition" # Your experimental design formula (e.g., ~ condition + batch)
    output_prefix <- "deseq2_results" # Prefix for output files
    
    # --- Load Data ---
    # Load count data
    # Assuming counts.csv is comma-separated. Adjust read.csv parameters as needed.
    count_data <- read.csv(counts_file, row.names = 1, check.names = FALSE)
    count_data <- as.matrix(count_data) # DESeq2 expects a matrix
    
    # Load sample information
    # Assuming sample_info.csv is comma-separated. Adjust read.csv parameters as needed.
    sample_info <- read.csv(sample_info_file, row.names = 1)
    
    # Ensure sample names match and are in the same order
    # identical() also catches a differing number of samples, which `==` does not handle cleanly
    if (!identical(colnames(count_data), rownames(sample_info))) {
      stop("Sample names in count data and sample information do not match or are not in the same order.")
    }
    
    # Ensure the 'condition' variable (or whatever is in your design_formula) is a factor
    # For example, if your design is ~ condition, make sure sample_info$condition is a factor.
    # sample_info$condition <- factor(sample_info$condition)
    
    # --- Create DESeqDataSet object ---
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = as.formula(design_formula))
    
    # --- Pre-filtering (optional but recommended) ---
    # Remove genes with very low counts across all samples (e.g., sum of counts < 10)
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # --- Run DESeq2 analysis ---
    dds <- DESeq(dds)
    
    # --- Extract Results ---
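    # Get results for the primary contrast
    # With no arguments, results() returns the comparison for the last variable in the design:
    # its last factor level versus the reference (first) level, which is alphabetical by default.
    # Specify 'contrast' to make the comparison explicit,
    # e.g., res <- results(dds, contrast=c("condition", "treated", "untreated"))
    res <- results(dds)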
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj),]
    
    # --- Save Results ---
    # Save full results table
    write.csv(as.data.frame(res_ordered), file = paste0(output_prefix, "_full_results.csv"))
    
    # Save significant results (e.g., padj < 0.05)
    res_sig <- subset(res_ordered, padj < 0.05)
    write.csv(as.data.frame(res_sig), file = paste0(output_prefix, "_significant_results.csv"))
    
    # --- Optional: Generate plots ---
    # MA plot
    # png(paste0(output_prefix, "_MA_plot.png"))
    # plotMA(res, main="MA-plot")
    # dev.off()
    
    # Dispersion plot
    # png(paste0(output_prefix, "_dispersion_plot.png"))
    # plotDispEsts(dds, main="Dispersion Estimates")
    # dev.off()
    
    # PCA plot (requires 'vst' or 'rlog' transformation)
    # vsd <- vst(dds, blind=FALSE)
    # png(paste0(output_prefix, "_PCA_plot.png"))
    # plotPCA(vsd, intgroup="condition")
    # dev.off()
    
    message("DESeq2 analysis complete. Results saved to ", output_prefix, "_full_results.csv and ", output_prefix, "_significant_results.csv")
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
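
    Once the script has run, a quick tally of significant genes by direction of change can serve as a sanity check. The sketch below assumes the CSV layout produced by write.csv on a DESeq2 results table (a header row containing a log2FoldChange column) and gene IDs without embedded commas. The function name count_by_direction is hypothetical.

```shell
# count_by_direction RESULTS_CSV
# Prints the number of rows with positive and negative log2FoldChange.
# The column is located by header name, so column order does not matter.
count_by_direction() {
  awk -F',' '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        name = $i
        gsub(/"/, "", name)          # write.csv quotes header names
        if (name == "log2FoldChange") lfc = i
      }
      if (!lfc) {
        print "log2FoldChange column not found" > "/dev/stderr"
        exit 1
      }
      next
    }
    $lfc != "NA" && $lfc + 0 > 0 { up++ }
    $lfc != "NA" && $lfc + 0 < 0 { down++ }
    END { if (lfc) printf "up\t%d\ndown\t%d\n", up, down }
  ' "$1"
}

# Hypothetical usage:
#   count_by_direction deseq2_results_significant_results.csv
```

    Here "up" means higher expression in the non-reference level of the contrast, so its interpretation depends on which condition is set as the reference.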

Tools Used

Raw Source Text
RNA-seq reads were aligned to the mouse genome (mm10) and analyzed using the public server Galaxy (usegalaxy.org), which employs the STAR aligner.
DESeq2 was used for differential expression analysis
Genome_build: mm10
Supplementary_files_format_and_content: Excel file contains the normalized counts for each sample