GSE198419 Processing Pipeline

RNA-Seq code_examples 7 steps

Publication

Remodeling oncogenic transcriptomes by small molecules targeting NONO.

Nature chemical biology (2023) — PMID 36864190

Dataset

GSE198419

Transcriptome changes in 22Rv1 and MCF7 cells after treatment with NONO ligands and controls

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

For differential expression:

DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 GitHub

$ Bash example

#!/bin/bash

# This script demonstrates a basic differential expression analysis using DESeq2.
# It assumes that count data (e.g., from Salmon, Kallisto, featureCounts) and sample metadata are available.

# --- Installation (commented out) ---
# To install R and DESeq2, you can use Bioconda:
# conda create -n deseq2_env r-base bioconductor-deseq2 r-tximport r-data.table r-dplyr r-ggplot2
# conda activate deseq2_env

# --- Placeholder Input Files ---
# In a real pipeline, 'counts.tsv' would be the output of a quantification step (e.g., Salmon, Kallisto, featureCounts).
# 'metadata.tsv' would be a user-provided file describing samples and experimental conditions.

# Create a dummy counts.tsv file (replace with your actual count matrix)
cat << EOF > counts.tsv
gene	sample1	sample2	sample3	sample4
geneA	100	120	50	60
geneB	50	60	100	110
geneC	200	210	220	230
geneD	10	12	15	18
geneE	30	35	28	32
geneF	5	7	12	15
EOF

# Create a dummy metadata.tsv file (replace with your actual sample metadata)
cat << EOF > metadata.tsv
sample	condition
sample1	control
sample2	control
sample3	treated
sample4	treated
EOF

# --- R Script for DESeq2 Analysis ---
# This R script performs the differential expression analysis.
# It loads count data and metadata, creates a DESeqDataSet, runs DESeq, and saves the results.
cat << 'EOF_R' > run_deseq2.R
library(DESeq2)

# Load count data
# Assumes the first column is gene names and subsequent columns are sample counts.
# Ensure counts are integers, as DESeq2 expects integer counts.
count_data <- read.table("counts.tsv", header=TRUE, row.names=1, sep="\t")

# Load sample metadata
# Assumes the first column is sample names and subsequent columns are metadata (e.g., 'condition').
sample_metadata <- read.table("metadata.tsv", header=TRUE, row.names=1, sep="\t")

# Ensure sample order matches between count data and metadata
# And ensure all samples in count_data are present in metadata.
sample_metadata <- sample_metadata[colnames(count_data), , drop=FALSE]

# Create DESeqDataSet object
# The 'design' formula specifies the experimental design. Here, we compare 'condition'.
dds <- DESeqDataSetFromMatrix(countData = round(count_data), # DESeq2 expects integer counts
                              colData = sample_metadata,
                              design = ~ condition)

# Pre-filtering: remove genes with very low counts across all samples
# This helps reduce the number of tests and improves statistical power.
# Here, we keep genes that have at least 10 counts in total across all samples.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq analysis
# This performs normalization, dispersion estimation, and differential expression testing.
dds <- DESeq(dds)

# Get results for a specific contrast (e.g., 'treated' vs 'control')
# The 'contrast' argument specifies the comparison. The last element is the reference level.
res <- results(dds, contrast=c("condition", "treated", "control"))

# Order results by adjusted p-value (padj)
res_ordered <- res[order(res$padj),]

# Save the differential expression results to a CSV file
write.csv(as.data.frame(res_ordered), file="differential_expression_results.csv")

# Optional: Generate an MA plot for visualization
# png("MA_plot.png")
# plotMA(res, main="MA plot of Differential Expression")
# dev.off()

# Optional: Generate a volcano plot (requires ggplot2 if not already loaded)
# library(ggplot2)
# res_df <- as.data.frame(res_ordered)
# res_df$gene <- rownames(res_df)
# ggplot(res_df, aes(x=log2FoldChange, y=-log10(pvalue))) +
#   geom_point(aes(color=padj<0.05 & abs(log2FoldChange)>1), alpha=0.7) +
#   scale_color_manual(values=c("black", "red")) +
#   labs(title="Volcano Plot", x="log2 Fold Change", y="-log10 p-value") +
#   theme_minimal()
# ggsave("volcano_plot.png")

EOF_R

# Execute the R script
Rscript run_deseq2.R

# --- Reference Datasets ---
# For differential expression analysis with DESeq2, the primary inputs are count matrices and sample metadata.
# The reference genome and gene annotation (e.g., GRCh38/hg38 and GENCODE GTF) are typically used in upstream steps
# (e.g., alignment and quantification) to generate the count data, but not directly by DESeq2 itself.
# Therefore, no specific genome assembly placeholder is provided for this step.

View on GitHub

Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.

Salmon v1.3.0 GitHub

$ Bash example

# Install Salmon (example using conda)
# conda create -n salmon_env salmon=1.3.0 -c bioconda -c conda-forge
# conda activate salmon_env

# Define variables
GENCODE_VERSION="v37"
GENCODE_RELEASE="37" # Corresponding GENCODE release number
GENCODE_FTP="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_${GENCODE_RELEASE}"
TRANSCRIPTOME_FASTA="gencode.${GENCODE_VERSION}.transcripts.fa.gz"
ANNOTATION_GTF="gencode.${GENCODE_VERSION}.annotation.gtf.gz"
SALMON_INDEX_DIR="salmon_index_${GENCODE_VERSION}"
READ1="sample_R1.fastq.gz" # Placeholder for input reads (replace with actual R1 file)
READ2="sample_R2.fastq.gz" # Placeholder for input reads (replace with actual R2 file)
OUTPUT_DIR="salmon_quant_output"
THREADS=8 # Placeholder for number of threads

# Create directories
mkdir -p "${SALMON_INDEX_DIR}"
mkdir -p "${OUTPUT_DIR}"

# Download GENCODE v37 annotation files (if not already present)
# wget -nc "${GENCODE_FTP}/${TRANSCRIPTOME_FASTA}"
# wget -nc "${GENCODE_FTP}/${ANNOTATION_GTF}" # GTF is not strictly required for salmon index, but is part of the annotation set

# Build Salmon index using the GENCODE v37 transcriptome FASTA
salmon index -t "${TRANSCRIPTOME_FASTA}" -i "${SALMON_INDEX_DIR}" --gencode

# Quantify transcript abundance using Salmon
# Assuming paired-end reads and automatic library type detection (-l A)
salmon quant -i "${SALMON_INDEX_DIR}" \
             -l A \
             -1 "${READ1}" \
             -2 "${READ2}" \
             --validateMappings \
             -p "${THREADS}" \
             -o "${OUTPUT_DIR}"

View on GitHub

Gene level quantification was performed using tximeta [v1.8.4].

tximeta v1.8.4 GitHub

$ Bash example

# Install R and Bioconductor if not already present
# Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("tximeta")'

# Create an R script for tximeta execution
cat << 'EOF' > run_tximeta_gene_quant.R
#!/usr/bin/env Rscript

# Load necessary libraries
library(tximeta)
library(SummarizedExperiment)

# Get input and output paths from environment variables
SALMON_OUTPUT_DIR <- Sys.getenv("SALMON_OUTPUT_DIR", stop("SALMON_OUTPUT_DIR environment variable not set."))
SAMPLE_SHEET <- Sys.getenv("SAMPLE_SHEET", stop("SAMPLE_SHEET environment variable not set."))
OUTPUT_RDS_FILE <- Sys.getenv("OUTPUT_RDS_FILE", "gene_level_quant.rds")

# Read sample sheet
coldata_df <- read.csv(SAMPLE_SHEET, stringsAsFactors = FALSE)

# Ensure 'sample_name' and 'condition' columns exist
if (!"sample_name" %in% colnames(coldata_df)) {
  stop("Sample sheet must contain a 'sample_name' column.")
}
if (!"condition" %in% colnames(coldata_df)) {
  warning("Sample sheet does not contain a 'condition' column. Using 'unknown' as placeholder.")
  coldata_df$condition <- "unknown"
}

# Construct 'files' column pointing to quant.sf
coldata_df$files <- file.path(SALMON_OUTPUT_DIR, coldata_df$sample_name, "quant.sf")

# Set 'names' for tximeta
coldata_df$names <- coldata_df$sample_name
rownames(coldata_df) <- coldata_df$names

# Check if all quant.sf files exist
if (!all(file.exists(coldata_df$files))) {
  missing_files <- coldata_df$files[!file.exists(coldata_df$files)]
  stop(paste("Missing quantification files:", paste(missing_files, collapse = ", ")))
}

# Perform tximeta import
message("Importing quantification data with tximeta...")
se <- tximeta(coldata_df)

# Summarize to gene level
message("Summarizing to gene level...")
# tximeta automatically links to appropriate TxDb/EnsDb objects based on the Salmon index metadata.
# For human, this typically means GRCh38 and a recent Ensembl annotation.
gse <- summarizeToGene(se)

# Save the gene-level SummarizedExperiment object
message(paste("Saving gene-level SummarizedExperiment to", OUTPUT_RDS_FILE))
saveRDS(gse, file = OUTPUT_RDS_FILE)

message("Gene-level quantification complete.")
EOF

# Make the R script executable
chmod +x run_tximeta_gene_quant.R

# Set environment variables for the R script
# Replace 'path/to/salmon_output_directory' with the actual path to your Salmon quantification results.
# Each sample's Salmon output should be in a subdirectory (e.g., salmon_output_directory/sample1/quant.sf).
export SALMON_OUTPUT_DIR="path/to/salmon_output_directory"

# Replace 'path/to/sample_sheet.csv' with the actual path to your sample metadata file.
# The sample sheet should be a CSV with at least 'sample_name' and 'condition' columns.
# Example sample_sheet.csv:
# sample_name,condition
# sample1,control
# sample2,treated
export SAMPLE_SHEET="path/to/sample_sheet.csv"

# Define the output file name for the gene-level quantification results (R RDS format)
export OUTPUT_RDS_FILE="gene_level_quantification.rds"

# Execute the R script
./run_tximeta_gene_quant.R

View on GitHub

Differential gene expression was analyzed by DESeq2 [v1.30.1].

DESeq2 v1.30.1 GitHub

$ Bash example

# Install R and DESeq2 if not already installed.
# Note: v1.30.1 is an older version of DESeq2 (released with Bioconductor 3.10 / R 3.6). The following command installs the latest compatible version with your R environment.
# conda install -c bioconda bioconductor-deseq2 r-base
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"

# Assuming count matrix (e.g., from featureCounts, Salmon, Kallisto) and colData are prepared.
# Placeholder: 'counts.tsv' should be a tab-separated file with genes as rows and samples as columns.
# Placeholder: 'coldata.tsv' should be a tab-separated file with samples as rows and experimental conditions/metadata as columns.

# Create an R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# Load count data
# Replace 'counts.tsv' with the actual path to your count matrix file.
count_data <- read.table("counts.tsv", header = TRUE, row.names = 1, sep = "\t")

# Load sample information (colData)
# Replace 'coldata.tsv' with the actual path to your sample metadata file.
# Ensure row names of colData match column names of count_data.
col_data <- read.table("coldata.tsv", header = TRUE, row.names = 1, sep = "\t")

# Ensure the order of samples in col_data matches the order of columns in count_data
col_data <- col_data[colnames(count_data), , drop = FALSE]

# Create DESeqDataSet object
# 'condition' is a placeholder for your experimental design variable.
# Replace 'condition' with the actual column name in your col_data that defines the groups for comparison.
# Example: design = ~ treatment + genotype
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = col_data,
                              design = ~ condition)

# Pre-filtering: remove genes with very low counts (e.g., sum of counts across all samples < 10)
# This helps to reduce the memory footprint and speed up calculations.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results
# Replace 'condition', 'treated', and 'control' with your actual column name and factor levels for the contrast.
# Example: res <- results(dds, contrast = c("treatment", "drugA", "placebo"))
res <- results(dds, contrast = c("condition", "treated", "control"))

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# Save results to a CSV file
write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv")

# Optional: Save normalized counts
norm_counts <- counts(dds, normalized = TRUE)
write.csv(as.data.frame(norm_counts), file = "deseq2_normalized_counts.csv")

# Optional: Generate a MA plot (log fold change vs. mean of normalized counts)
pdf("MA_plot.pdf")
plotMA(res, main="DESeq2 MA-plot")
dev.off()

# Optional: Generate a histogram of p-values
pdf("pvalue_histogram.pdf")
hist(res$pvalue[res$baseMean > 1], breaks = 20, col = "grey", border = "white",
     main = "Histogram of p-values", xlab = "p-value")
dev.off()

EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

For splicing analysis, 22Rv1 data only:

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define reference genome and annotation paths (placeholders for GRCh38)
# Replace with actual paths to your indexed genome and GTF file
GENOME_DIR="/path/to/STAR_genome_index/GRCh38"
GENOME_FASTA="/path/to/GRCh38.primary_assembly.genome.fa"
GTF_FILE="/path/to/gencode.v44.annotation.gtf" # Example GTF for GRCh38

# Create STAR genome index (if not already created)
# This step is usually performed once per genome and requires significant RAM.
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 \
#      --runThreadN 8 # Adjust thread count as needed

# Define input FASTQ files (placeholders for 22Rv1 data)
# Assuming paired-end reads, adjust if single-end
READ1="22Rv1_sample_R1.fastq.gz"
READ2="22Rv1_sample_R2.fastq.gz"

# Define output directory and prefix
OUTPUT_DIR="star_alignment_22Rv1"
SAMPLE_PREFIX="22Rv1_sample"

mkdir -p ${OUTPUT_DIR}

# Run STAR alignment for splicing analysis
# This command performs splice-aware alignment and outputs a sorted BAM file.
# Parameters are set to optimize for splice junction detection.
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --readFilesCommand zcat \
     --runThreadN 8 \
     --outFileNamePrefix ${OUTPUT_DIR}/${SAMPLE_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignSplicedMateMapLminOverLmate 0.5 \
     --alignWindowsPerReadNmax 200000 \
     --alignSJstitchMismatchNmax 5 -1 5 5 \
     --chimSegmentMin 15 \
     --chimJunctionOverhangMin 15 \
     --outSAMstrandField intronMotif # Use for strand-specific RNA-seq, remove if not

View on GitHub

RNA-seq reads were mapped to hg38 using STAR Aligner.

STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Reference Genome Preparation (hg38) ---
# This section is commented out as it's for index generation, not mapping.
# It assumes you have a STAR index for hg38 already built.
#
# 1. Download hg38 primary assembly FASTA and GTF files (e.g., from Ensembl or GENCODE)
#    mkdir -p ref
#    wget -P ./ref/ ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
#    wget -P ./ref/ ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
#
# 2. Create STAR genome index
#    mkdir -p STAR_index_hg38
#    STAR --runMode genomeGenerate \
#         --genomeDir STAR_index_hg38 \
#         --genomeFastaFiles ./ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
#         --sjdbGTFfile ./ref/Homo_sapiens.GRCh38.109.gtf.gz \
#         --sjdbOverhang 100 \
#         --runThreadN 8 # Adjust threads as needed

# --- RNA-seq Read Mapping with STAR ---
# Placeholder for input RNA-seq reads (replace with actual file names)
# For paired-end reads:
# READS_R1="your_rnaseq_reads_R1.fastq.gz"
# READS_R2="your_rnaseq_reads_R2.fastq.gz"
# For single-end reads:
# READS_SE="your_rnaseq_reads.fastq.gz"

# Placeholder for STAR genome index directory
GENOME_DIR="STAR_index_hg38"

# Output prefix for alignment files
OUTPUT_PREFIX="sample_output_"

# Number of threads to use for mapping
NUM_THREADS=8 # Adjust as per available CPU cores

STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 10 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # ~30GB RAM for sorting, adjust based on available memory

View on GitHub

Alternative splicing (AS) analysis was completed using rMATs (v3.2.5).

rMATS v3.2.5 GitHub

$ Bash example

# Install rMATS (example using conda)
# Note: rMATS v3.2.5 is an older version of the original rMATS. Installation might involve downloading source code or using specific channels.
# conda create -n rmats_env python=3.8
# conda activate rmats_env
# pip install rmats-turbo==3.2.5 # This installs rmats-turbo, not the original 3.2.5. For original 3.2.5, manual download/setup might be needed.
# Ensure 'rmats.py' is in your PATH or specify its full path.

# Placeholder for input BAM files (replace with actual paths to your treated and control replicates)
# Create text files listing BAM files for each condition
echo "/path/to/treated_replicate1.bam" > b1.txt
echo "/path/to/treated_replicate2.bam" >> b1.txt
echo "/path/to/control_replicate1.bam" > b2.txt
echo "/path/to/control_replicate2.bam" >> b2.txt

# Placeholder for GTF annotation file (replace with actual path, e.g., from GENCODE or Ensembl)
# Example: gencode.v44.annotation.gtf for human hg38 assembly
GTF_FILE="/path/to/gencode.v44.annotation.gtf"

# Output directory for rMATS results
OUTPUT_DIR="rmats_output"
mkdir -p "${OUTPUT_DIR}"

# Temporary directory for intermediate files
TMP_DIR="rmats_tmp"
mkdir -p "${TMP_DIR}"

# Execute rMATS analysis
# Common parameters are used as the description does not specify them.
# Adjust --readLength, -t (read type), --nthread, and --libType as per your experimental setup.
rmats.py \
  --b1 b1.txt \
  --b2 b2.txt \
  --gtf "${GTF_FILE}" \
  --od "${OUTPUT_DIR}" \
  --tmp "${TMP_DIR}" \
  -t paired \ # Assuming paired-end reads; use 'single' for single-end
  --readLength 100 \ # Placeholder read length; adjust to your actual read length
  --nthread 8 \ # Number of threads to use
  --libType fr-unstranded # Example library type; adjust (e.g., fr-firststrand, fr-secondstrand)

View on GitHub

Tools Used

DESeq2 STAR

Raw Source Text

For differential expression:
Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.
Gene level quantification was performed using tximeta [v1.8.4].
Differential gene expression was analyzed by DESeq2 [v1.30.1].
For splicing analysis, 22Rv1 data only:
RNA-seq reads were mapped to hg38 using STAR Aligner.
Alternative splicing (AS) analysis was completed using rMATs (v3.2.5).
Assembly: hg19
Supplementary files format and content: *.csv: Gene counts were used for differential expression.
Supplementary files format and content: *.xlsx: rMATs table was used for splicing analysis.

← Back to Analysis