GSE198419 Processing Pipeline
RNA-Seq
code_examples
7 steps
Publication
Remodeling oncogenic transcriptomes by small molecules targeting NONO.Nature chemical biology (2023) — PMID 36864190
Dataset
GSE198419Transcriptome changes in 22Rv1 and MCF7 cells after treatment with NONO ligands and controls
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
For differential expression:
$ Bash example
#!/bin/bash # This script demonstrates a basic differential expression analysis using DESeq2. # It assumes that count data (e.g., from Salmon, Kallisto, featureCounts) and sample metadata are available. # --- Installation (commented out) --- # To install R and DESeq2, you can use Bioconda: # conda create -n deseq2_env r-base bioconductor-deseq2 r-tximport r-data.table r-dplyr r-ggplot2 # conda activate deseq2_env # --- Placeholder Input Files --- # In a real pipeline, 'counts.tsv' would be the output of a quantification step (e.g., Salmon, Kallisto, featureCounts). # 'metadata.tsv' would be a user-provided file describing samples and experimental conditions. # Create a dummy counts.tsv file (replace with your actual count matrix) cat << EOF > counts.tsv gene sample1 sample2 sample3 sample4 geneA 100 120 50 60 geneB 50 60 100 110 geneC 200 210 220 230 geneD 10 12 15 18 geneE 30 35 28 32 geneF 5 7 12 15 EOF # Create a dummy metadata.tsv file (replace with your actual sample metadata) cat << EOF > metadata.tsv sample condition sample1 control sample2 control sample3 treated sample4 treated EOF # --- R Script for DESeq2 Analysis --- # This R script performs the differential expression analysis. # It loads count data and metadata, creates a DESeqDataSet, runs DESeq, and saves the results. cat << 'EOF_R' > run_deseq2.R library(DESeq2) # Load count data # Assumes the first column is gene names and subsequent columns are sample counts. # Ensure counts are integers, as DESeq2 expects integer counts. count_data <- read.table("counts.tsv", header=TRUE, row.names=1, sep="\t") # Load sample metadata # Assumes the first column is sample names and subsequent columns are metadata (e.g., 'condition'). sample_metadata <- read.table("metadata.tsv", header=TRUE, row.names=1, sep="\t") # Ensure sample order matches between count data and metadata # And ensure all samples in count_data are present in metadata. sample_metadata <- sample_metadata[colnames(count_data), , drop=FALSE] # Create DESeqDataSet object # The 'design' formula specifies the experimental design. Here, we compare 'condition'. dds <- DESeqDataSetFromMatrix(countData = round(count_data), # DESeq2 expects integer counts colData = sample_metadata, design = ~ condition) # Pre-filtering: remove genes with very low counts across all samples # This helps reduce the number of tests and improves statistical power. # Here, we keep genes that have at least 10 counts in total across all samples. keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # Run DESeq analysis # This performs normalization, dispersion estimation, and differential expression testing. dds <- DESeq(dds) # Get results for a specific contrast (e.g., 'treated' vs 'control') # The 'contrast' argument specifies the comparison. The last element is the reference level. res <- results(dds, contrast=c("condition", "treated", "control")) # Order results by adjusted p-value (padj) res_ordered <- res[order(res$padj),] # Save the differential expression results to a CSV file write.csv(as.data.frame(res_ordered), file="differential_expression_results.csv") # Optional: Generate an MA plot for visualization # png("MA_plot.png") # plotMA(res, main="MA plot of Differential Expression") # dev.off() # Optional: Generate a volcano plot (requires ggplot2 if not already loaded) # library(ggplot2) # res_df <- as.data.frame(res_ordered) # res_df$gene <- rownames(res_df) # ggplot(res_df, aes(x=log2FoldChange, y=-log10(pvalue))) + # geom_point(aes(color=padj<0.05 & abs(log2FoldChange)>1), alpha=0.7) + # scale_color_manual(values=c("black", "red")) + # labs(title="Volcano Plot", x="log2 Fold Change", y="-log10 p-value") + # theme_minimal() # ggsave("volcano_plot.png") EOF_R # Execute the R script Rscript run_deseq2.R # --- Reference Datasets --- # For differential expression analysis with DESeq2, the primary inputs are count matrices and sample metadata. # The reference genome and gene annotation (e.g., GRCh38/hg38 and GENCODE GTF) are typically used in upstream steps # (e.g., alignment and quantification) to generate the count data, but not directly by DESeq2 itself. # Therefore, no specific genome assembly placeholder is provided for this step. -
2
Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.
$ Bash example
# Install Salmon (example using conda) # conda create -n salmon_env salmon=1.3.0 -c bioconda -c conda-forge # conda activate salmon_env # Define variables GENCODE_VERSION="v37" GENCODE_RELEASE="37" # Corresponding GENCODE release number GENCODE_FTP="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_${GENCODE_RELEASE}" TRANSCRIPTOME_FASTA="gencode.${GENCODE_VERSION}.transcripts.fa.gz" ANNOTATION_GTF="gencode.${GENCODE_VERSION}.annotation.gtf.gz" SALMON_INDEX_DIR="salmon_index_${GENCODE_VERSION}" READ1="sample_R1.fastq.gz" # Placeholder for input reads (replace with actual R1 file) READ2="sample_R2.fastq.gz" # Placeholder for input reads (replace with actual R2 file) OUTPUT_DIR="salmon_quant_output" THREADS=8 # Placeholder for number of threads # Create directories mkdir -p "${SALMON_INDEX_DIR}" mkdir -p "${OUTPUT_DIR}" # Download GENCODE v37 annotation files (if not already present) # wget -nc "${GENCODE_FTP}/${TRANSCRIPTOME_FASTA}" # wget -nc "${GENCODE_FTP}/${ANNOTATION_GTF}" # GTF is not strictly required for salmon index, but is part of the annotation set # Build Salmon index using the GENCODE v37 transcriptome FASTA salmon index -t "${TRANSCRIPTOME_FASTA}" -i "${SALMON_INDEX_DIR}" --gencode # Quantify transcript abundance using Salmon # Assuming paired-end reads and automatic library type detection (-l A) salmon quant -i "${SALMON_INDEX_DIR}" \ -l A \ -1 "${READ1}" \ -2 "${READ2}" \ --validateMappings \ -p "${THREADS}" \ -o "${OUTPUT_DIR}" -
3
Gene level quantification was performed using tximeta [v1.8.4].
$ Bash example
# Install R and Bioconductor if not already present # Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("tximeta")' # Create an R script for tximeta execution cat << 'EOF' > run_tximeta_gene_quant.R #!/usr/bin/env Rscript # Load necessary libraries library(tximeta) library(SummarizedExperiment) # Get input and output paths from environment variables SALMON_OUTPUT_DIR <- Sys.getenv("SALMON_OUTPUT_DIR", stop("SALMON_OUTPUT_DIR environment variable not set.")) SAMPLE_SHEET <- Sys.getenv("SAMPLE_SHEET", stop("SAMPLE_SHEET environment variable not set.")) OUTPUT_RDS_FILE <- Sys.getenv("OUTPUT_RDS_FILE", "gene_level_quant.rds") # Read sample sheet coldata_df <- read.csv(SAMPLE_SHEET, stringsAsFactors = FALSE) # Ensure 'sample_name' and 'condition' columns exist if (!"sample_name" %in% colnames(coldata_df)) { stop("Sample sheet must contain a 'sample_name' column.") } if (!"condition" %in% colnames(coldata_df)) { warning("Sample sheet does not contain a 'condition' column. Using 'unknown' as placeholder.") coldata_df$condition <- "unknown" } # Construct 'files' column pointing to quant.sf coldata_df$files <- file.path(SALMON_OUTPUT_DIR, coldata_df$sample_name, "quant.sf") # Set 'names' for tximeta coldata_df$names <- coldata_df$sample_name rownames(coldata_df) <- coldata_df$names # Check if all quant.sf files exist if (!all(file.exists(coldata_df$files))) { missing_files <- coldata_df$files[!file.exists(coldata_df$files)] stop(paste("Missing quantification files:", paste(missing_files, collapse = ", "))) } # Perform tximeta import message("Importing quantification data with tximeta...") se <- tximeta(coldata_df) # Summarize to gene level message("Summarizing to gene level...") # tximeta automatically links to appropriate TxDb/EnsDb objects based on the Salmon index metadata. # For human, this typically means GRCh38 and a recent Ensembl annotation. gse <- summarizeToGene(se) # Save the gene-level SummarizedExperiment object message(paste("Saving gene-level SummarizedExperiment to", OUTPUT_RDS_FILE)) saveRDS(gse, file = OUTPUT_RDS_FILE) message("Gene-level quantification complete.") EOF # Make the R script executable chmod +x run_tximeta_gene_quant.R # Set environment variables for the R script # Replace 'path/to/salmon_output_directory' with the actual path to your Salmon quantification results. # Each sample's Salmon output should be in a subdirectory (e.g., salmon_output_directory/sample1/quant.sf). export SALMON_OUTPUT_DIR="path/to/salmon_output_directory" # Replace 'path/to/sample_sheet.csv' with the actual path to your sample metadata file. # The sample sheet should be a CSV with at least 'sample_name' and 'condition' columns. # Example sample_sheet.csv: # sample_name,condition # sample1,control # sample2,treated export SAMPLE_SHEET="path/to/sample_sheet.csv" # Define the output file name for the gene-level quantification results (R RDS format) export OUTPUT_RDS_FILE="gene_level_quantification.rds" # Execute the R script ./run_tximeta_gene_quant.R -
4
Differential gene expression was analyzed by DESeq2 [v1.30.1].
$ Bash example
# Install R and DESeq2 if not already installed. # Note: v1.30.1 is an older version of DESeq2 (released with Bioconductor 3.10 / R 3.6). The following command installs the latest compatible version with your R environment. # conda install -c bioconda bioconductor-deseq2 r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # Assuming count matrix (e.g., from featureCounts, Salmon, Kallisto) and colData are prepared. # Placeholder: 'counts.tsv' should be a tab-separated file with genes as rows and samples as columns. # Placeholder: 'coldata.tsv' should be a tab-separated file with samples as rows and experimental conditions/metadata as columns. # Create an R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R library(DESeq2) # Load count data # Replace 'counts.tsv' with the actual path to your count matrix file. count_data <- read.table("counts.tsv", header = TRUE, row.names = 1, sep = "\t") # Load sample information (colData) # Replace 'coldata.tsv' with the actual path to your sample metadata file. # Ensure row names of colData match column names of count_data. col_data <- read.table("coldata.tsv", header = TRUE, row.names = 1, sep = "\t") # Ensure the order of samples in col_data matches the order of columns in count_data col_data <- col_data[colnames(count_data), , drop = FALSE] # Create DESeqDataSet object # 'condition' is a placeholder for your experimental design variable. # Replace 'condition' with the actual column name in your col_data that defines the groups for comparison. # Example: design = ~ treatment + genotype dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ condition) # Pre-filtering: remove genes with very low counts (e.g., sum of counts across all samples < 10) # This helps to reduce the memory footprint and speed up calculations. keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # Run DESeq2 analysis dds <- DESeq(dds) # Get results # Replace 'condition', 'treated', and 'control' with your actual column name and factor levels for the contrast. # Example: res <- results(dds, contrast = c("treatment", "drugA", "placebo")) res <- results(dds, contrast = c("condition", "treated", "control")) # Order results by adjusted p-value res_ordered <- res[order(res$padj),] # Save results to a CSV file write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv") # Optional: Save normalized counts norm_counts <- counts(dds, normalized = TRUE) write.csv(as.data.frame(norm_counts), file = "deseq2_normalized_counts.csv") # Optional: Generate a MA plot (log fold change vs. mean of normalized counts) pdf("MA_plot.pdf") plotMA(res, main="DESeq2 MA-plot") dev.off() # Optional: Generate a histogram of p-values pdf("pvalue_histogram.pdf") hist(res$pvalue[res$baseMean > 1], breaks = 20, col = "grey", border = "white", main = "Histogram of p-values", xlab = "p-value") dev.off() EOF # Execute the R script Rscript run_deseq2.R -
5
For splicing analysis, 22Rv1 data only:
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define reference genome and annotation paths (placeholders for GRCh38) # Replace with actual paths to your indexed genome and GTF file GENOME_DIR="/path/to/STAR_genome_index/GRCh38" GENOME_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" GTF_FILE="/path/to/gencode.v44.annotation.gtf" # Example GTF for GRCh38 # Create STAR genome index (if not already created) # This step is usually performed once per genome and requires significant RAM. # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles ${GENOME_FASTA} \ # --sjdbGTFfile ${GTF_FILE} \ # --sjdbOverhang 100 \ # --runThreadN 8 # Adjust thread count as needed # Define input FASTQ files (placeholders for 22Rv1 data) # Assuming paired-end reads, adjust if single-end READ1="22Rv1_sample_R1.fastq.gz" READ2="22Rv1_sample_R2.fastq.gz" # Define output directory and prefix OUTPUT_DIR="star_alignment_22Rv1" SAMPLE_PREFIX="22Rv1_sample" mkdir -p ${OUTPUT_DIR} # Run STAR alignment for splicing analysis # This command performs splice-aware alignment and outputs a sorted BAM file. # Parameters are set to optimize for splice junction detection. STAR --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1} ${READ2} \ --readFilesCommand zcat \ --runThreadN 8 \ --outFileNamePrefix ${OUTPUT_DIR}/${SAMPLE_PREFIX}_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 1 \ --alignSJoverhangMin 8 \ --alignSplicedMateMapLminOverLmate 0.5 \ --alignWindowsPerReadNmax 200000 \ --alignSJstitchMismatchNmax 5 -1 5 5 \ --chimSegmentMin 15 \ --chimJunctionOverhangMin 15 \ --outSAMstrandField intronMotif # Use for strand-specific RNA-seq, remove if not -
6
RNA-seq reads were mapped to hg38 using STAR Aligner.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # --- Reference Genome Preparation (hg38) --- # This section is commented out as it's for index generation, not mapping. # It assumes you have a STAR index for hg38 already built. # # 1. Download hg38 primary assembly FASTA and GTF files (e.g., from Ensembl or GENCODE) # mkdir -p ref # wget -P ./ref/ ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # wget -P ./ref/ ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz # # 2. Create STAR genome index # mkdir -p STAR_index_hg38 # STAR --runMode genomeGenerate \ # --genomeDir STAR_index_hg38 \ # --genomeFastaFiles ./ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \ # --sjdbGTFfile ./ref/Homo_sapiens.GRCh38.109.gtf.gz \ # --sjdbOverhang 100 \ # --runThreadN 8 # Adjust threads as needed # --- RNA-seq Read Mapping with STAR --- # Placeholder for input RNA-seq reads (replace with actual file names) # For paired-end reads: # READS_R1="your_rnaseq_reads_R1.fastq.gz" # READS_R2="your_rnaseq_reads_R2.fastq.gz" # For single-end reads: # READS_SE="your_rnaseq_reads.fastq.gz" # Placeholder for STAR genome index directory GENOME_DIR="STAR_index_hg38" # Output prefix for alignment files OUTPUT_PREFIX="sample_output_" # Number of threads to use for mapping NUM_THREADS=8 # Adjust as per available CPU cores STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outSAMunmapped Within \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 10 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM 30000000000 # ~30GB RAM for sorting, adjust based on available memory -
7
Alternative splicing (AS) analysis was completed using rMATs (v3.2.5).
$ Bash example
# Install rMATS (example using conda) # Note: rMATS v3.2.5 is an older version of the original rMATS. Installation might involve downloading source code or using specific channels. # conda create -n rmats_env python=3.8 # conda activate rmats_env # pip install rmats-turbo==3.2.5 # This installs rmats-turbo, not the original 3.2.5. For original 3.2.5, manual download/setup might be needed. # Ensure 'rmats.py' is in your PATH or specify its full path. # Placeholder for input BAM files (replace with actual paths to your treated and control replicates) # Create text files listing BAM files for each condition echo "/path/to/treated_replicate1.bam" > b1.txt echo "/path/to/treated_replicate2.bam" >> b1.txt echo "/path/to/control_replicate1.bam" > b2.txt echo "/path/to/control_replicate2.bam" >> b2.txt # Placeholder for GTF annotation file (replace with actual path, e.g., from GENCODE or Ensembl) # Example: gencode.v44.annotation.gtf for human hg38 assembly GTF_FILE="/path/to/gencode.v44.annotation.gtf" # Output directory for rMATS results OUTPUT_DIR="rmats_output" mkdir -p "${OUTPUT_DIR}" # Temporary directory for intermediate files TMP_DIR="rmats_tmp" mkdir -p "${TMP_DIR}" # Execute rMATS analysis # Common parameters are used as the description does not specify them. # Adjust --readLength, -t (read type), --nthread, and --libType as per your experimental setup. rmats.py \ --b1 b1.txt \ --b2 b2.txt \ --gtf "${GTF_FILE}" \ --od "${OUTPUT_DIR}" \ --tmp "${TMP_DIR}" \ -t paired \ # Assuming paired-end reads; use 'single' for single-end --readLength 100 \ # Placeholder read length; adjust to your actual read length --nthread 8 \ # Number of threads to use --libType fr-unstranded # Example library type; adjust (e.g., fr-firststrand, fr-secondstrand)
Raw Source Text
For differential expression: Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation. Gene level quantification was performed using tximeta [v1.8.4]. Differential gene expression was analyzed by DESeq2 [v1.30.1]. For splicing analysis, 22Rv1 data only: RNA-seq reads were mapped to hg38 using STAR Aligner. Alternative splicing (AS) analysis was completed using rMATs (v3.2.5). Assembly: hg19 Supplementary files format and content: *.csv: Gene counts were used for differential expression. Supplementary files format and content: *.xlsx: rMATs table was used for splicing analysis.