GSE145430 Processing Pipeline
RNA-Seq
code_examples
2 steps
Publication
Zmat3 Is a Key Splicing Regulator in the p53 Tumor Suppression Program. Molecular Cell (2020), PMID 33157015
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
RNA-seq reads were aligned to the mouse genome (mm10) and analyzed using the public server Galaxy (usegalaxy.org), which employs the STAR aligner.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Reference Data Setup (mm10 genome and GTF) ---
# Create a directory for reference files
mkdir -p star_index_mm10
cd star_index_mm10

# Download the mouse genome FASTA (GRCm38, the assembly UCSC calls mm10) from Ensembl.
# The Ensembl primary assembly is used so that chromosome names ("1", not "chr1")
# match the Ensembl GTF below; STAR cannot use a GTF whose names differ from the FASTA.
wget http://ftp.ensembl.org/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

# Download mouse (mm10, GRCm38) GTF annotation from Ensembl
# Using a common release, e.g., release 102
wget http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
gunzip Mus_musculus.GRCm38.102.gtf.gz

# Define paths
GENOME_FASTA="Mus_musculus.GRCm38.dna.primary_assembly.fa"
GTF_FILE="Mus_musculus.GRCm38.102.gtf"
STAR_INDEX_DIR="star_index_mm10_dir"  # Directory for STAR index

# Build STAR genome index
# Adjust --runThreadN based on available cores and --sjdbOverhang based on read length (read_length - 1)
mkdir -p "${STAR_INDEX_DIR}"
STAR --runMode genomeGenerate \
  --genomeDir "${STAR_INDEX_DIR}" \
  --genomeFastaFiles "${GENOME_FASTA}" \
  --sjdbGTFfile "${GTF_FILE}" \
  --sjdbOverhang 100 \
  --runThreadN 8  # Use appropriate number of threads

cd ..  # Go back to the working directory

# --- RNA-seq Alignment ---
# Define input FASTQ files (assuming paired-end reads)
READS_R1="reads_R1.fastq.gz"  # Replace with your actual R1 file
READS_R2="reads_R2.fastq.gz"  # Replace with your actual R2 file
OUTPUT_PREFIX="aligned_reads"
STAR_INDEX_PATH="star_index_mm10/${STAR_INDEX_DIR}"  # Path to the generated STAR index

# Perform alignment with STAR
# Adjust --runThreadN based on available cores
# Adjust --limitBAMsortRAM based on available RAM (e.g., 30000000000 bytes for 30 GB)
STAR --genomeDir "${STAR_INDEX_PATH}" \
  --readFilesIn "${READS_R1}" "${READS_R2}" \
  --readFilesCommand zcat \
  --runThreadN 8 \
  --outFileNamePrefix "${OUTPUT_PREFIX}." \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --quantMode GeneCounts \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.1 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM 30000000000  # Adjust based on available RAM
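With --quantMode GeneCounts, STAR writes a per-sample ReadsPerGene.out.tab file: four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) followed by one row per gene, with unstranded, forward-stranded and reverse-stranded counts in columns 2 to 4. The sketch below is one possible way to merge such files into the counts.csv matrix that the DESeq2 step expects. It is not taken from the publication; the function name merge_star_counts, the file naming scheme and the choice of count column are assumptions, and the library strandedness is not stated in the source, so pick the column that matches your protocol.

```bash
#!/usr/bin/env bash
# Hypothetical helper: merge STAR ReadsPerGene.out.tab files into one CSV matrix.
# Usage: merge_star_counts COLUMN OUT.csv FILE1.ReadsPerGene.out.tab [FILE2 ...]
#   COLUMN: 2 = unstranded, 3 = forward-stranded, 4 = reverse-stranded counts
merge_star_counts() {
  local col="$1" out="$2"
  shift 2
  awk -F'\t' -v col="$col" -v OFS="," '
    FNR == 1 {
      # New input file: derive a sample name from the file name
      # (assumes names of the form <sample>.ReadsPerGene.out.tab).
      name = FILENAME
      sub(/^.*\//, "", name)
      sub(/\.?ReadsPerGene\.out\.tab$/, "", name)
      samples[++n] = name
    }
    # Skip the four summary rows at the top of each file.
    $1 ~ /^N_(unmapped|multimapping|noFeature|ambiguous)$/ { next }
    {
      # Remember gene order from first appearance; store the chosen column.
      if (!($1 in seen)) { seen[$1] = 1; genes[++g] = $1 }
      counts[$1 SUBSEP n] = $col
    }
    END {
      header = "gene_id"
      for (i = 1; i <= n; i++) header = header OFS samples[i]
      print header
      for (j = 1; j <= g; j++) {
        row = genes[j]
        for (i = 1; i <= n; i++) {
          key = genes[j] SUBSEP i
          row = row OFS ((key in counts) ? counts[key] : 0)
        }
        print row
      }
    }
  ' "$@" > "$out"
}

# Example call (file names are placeholders):
# merge_star_counts 2 counts.csv ctrl_1.ReadsPerGene.out.tab ko_1.ReadsPerGene.out.tab
```

The column order of the resulting CSV follows the order of the input files, so list them in the same order as the rows of sample_info.csv.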
2
DESeq2 was used for differential expression analysis.
$ Bash example
# Install R and Bioconductor if not already present
# For Ubuntu/Debian:
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'

# Create a placeholder R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
library(DESeq2)

# --- Configuration ---
# Define input files (replace with actual paths and ensure they exist)
# counts.csv: Your raw count matrix (genes x samples), with gene IDs as the first column and sample names as column headers.
# sample_info.csv: Your sample metadata (samples x variables), with sample names as the first column and variables as column headers.
counts_file <- "counts.csv"
sample_info_file <- "sample_info.csv"
design_formula <- "~ condition"  # Your experimental design formula (e.g., ~ condition + batch)
output_prefix <- "deseq2_results"  # Prefix for output files

# --- Load Data ---
# Load count data
# Assuming counts.csv is comma-separated. Adjust read.csv parameters as needed.
count_data <- read.csv(counts_file, row.names = 1, check.names = FALSE)
count_data <- as.matrix(count_data)  # DESeq2 expects a matrix

# Load sample information
# Assuming sample_info.csv is comma-separated. Adjust read.csv parameters as needed.
sample_info <- read.csv(sample_info_file, row.names = 1)

# Ensure sample names match and are in the same order
if (!all(colnames(count_data) == rownames(sample_info))) {
  stop("Sample names in count data and sample information do not match or are not in the same order.")
}

# Ensure the 'condition' variable (or whatever is in your design_formula) is a factor
# For example, if your design is ~ condition, make sure sample_info$condition is a factor.
# sample_info$condition <- factor(sample_info$condition)

# --- Create DESeqDataSet object ---
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = as.formula(design_formula))

# --- Pre-filtering (optional but recommended) ---
# Remove genes with very low counts across all samples (e.g., sum of counts < 10)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# --- Run DESeq2 analysis ---
dds <- DESeq(dds)

# --- Extract Results ---
# Get results for the primary contrast (e.g., comparing levels of 'condition')
# You might need to specify 'contrast' argument for specific comparisons
# e.g., res <- results(dds, contrast=c("condition", "treated", "untreated"))
res <- results(dds)

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# --- Save Results ---
# Save full results table
write.csv(as.data.frame(res_ordered), file = paste0(output_prefix, "_full_results.csv"))

# Save significant results (e.g., padj < 0.05)
res_sig <- subset(res_ordered, padj < 0.05)
write.csv(as.data.frame(res_sig), file = paste0(output_prefix, "_significant_results.csv"))

# --- Optional: Generate plots ---
# MA plot
# png(paste0(output_prefix, "_MA_plot.png"))
# plotMA(res, main="MA-plot")
# dev.off()

# Dispersion plot
# png(paste0(output_prefix, "_dispersion_plot.png"))
# plotDispEsts(dds, main="Dispersion Estimates")
# dev.off()

# PCA plot (requires 'vst' or 'rlog' transformation)
# vsd <- vst(dds, blind=FALSE)
# png(paste0(output_prefix, "_PCA_plot.png"))
# plotPCA(vsd, intgroup="condition")
# dev.off()

message("DESeq2 analysis complete. Results saved to ", output_prefix, "_full_results.csv and ", output_prefix, "_significant_results.csv")
EOF

# Execute the R script
Rscript run_deseq2.R
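As a quick sanity check on the DESeq2 output, the sketch below counts how many genes in a results CSV are significantly up- or down-regulated. It is an illustrative helper, not part of the published pipeline: the function name summarize_deseq2 is hypothetical, the default cutoff of 0.05 simply mirrors the padj < 0.05 filter used in the R script, and the parser assumes the layout produced by write.csv (quoted header names, unquoted numbers, no commas inside gene IDs).

```bash
#!/usr/bin/env bash
# Hypothetical helper: count significant up/down genes in a DESeq2 results CSV.
# Usage: summarize_deseq2 RESULTS.csv [PADJ_CUTOFF]
summarize_deseq2() {
  local csv="$1" cutoff="${2:-0.05}"
  awk -F',' -v cutoff="$cutoff" '
    NR == 1 {
      # Locate columns by header name; write.csv wraps the names in quotes.
      for (i = 1; i <= NF; i++) {
        gsub(/"/, "", $i)
        if ($i == "log2FoldChange") lfc = i
        if ($i == "padj") padj = i
      }
      if (!lfc || !padj) {
        print "missing log2FoldChange or padj column" > "/dev/stderr"
        bad = 1
        exit 1
      }
      next
    }
    # DESeq2 reports NA for genes removed by independent filtering or outlier flags.
    $padj == "NA" || $lfc == "NA" { next }
    ($padj + 0) < (cutoff + 0) {
      if (($lfc + 0) > 0) up++
      else if (($lfc + 0) < 0) down++
    }
    END {
      if (bad) exit 1
      printf "up=%d down=%d\n", up, down
    }
  ' "$csv"
}

# Example call (file name matches the output_prefix used in the R script):
# summarize_deseq2 deseq2_results_full_results.csv
```

Direction is relative to the reference level of the design factor, so confirm which condition DESeq2 treated as the baseline before interpreting "up" and "down".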
Raw Source Text
RNA-seq reads were aligned to the mouse genome (mm10) and analyzed using the public server Galaxy (usegalaxy.org), which employs the STAR aligner.
DESeq2 was used for differential expression analysis
Genome_build: mm10
Supplementary_files_format_and_content: Excel file contains the normalized counts for each sample