GSE222217 Processing Pipeline

GSE code_examples 7 steps

Publication

Remodeling oncogenic transcriptomes by small molecules targeting NONO.

Nature chemical biology (2023) — PMID 36864190

Dataset

Remodeling oncogenic transcriptomes by small molecules targeting NONO

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

For differential expression:

DESeq2 (Inferred with models/gemini-2.5-flash) v1.42.0 GitHub

$ Bash example

# Install R and DESeq2 (if not already installed)
# sudo apt-get update
# sudo apt-get install r-base
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"

# Placeholder for input files:
# counts.tsv: A tab-separated file with gene IDs as row names and sample names as column names, containing raw counts.
# design.tsv: A tab-separated file with sample names as row names and experimental factors (e.g., 'condition') as columns.

# Create dummy counts.tsv and design.tsv for demonstration purposes.
# In a real scenario, these would be generated by upstream quantification tools (e.g., Salmon, STAR+featureCounts).
echo -e "gene\tsampleA_rep1\tsampleA_rep2\tsampleB_rep1\tsampleB_rep2" > counts.tsv
echo -e "gene1\t100\t120\t50\t60" >> counts.tsv
echo -e "gene2\t20\t25\t80\t90" >> counts.tsv
echo -e "gene3\t500\t550\t480\t510" >> counts.tsv

echo -e "sample\tcondition" > design.tsv
echo -e "sampleA_rep1\tcontrol" >> design.tsv
echo -e "sampleA_rep2\tcontrol" >> design.tsv
echo -e "sampleB_rep1\ttreated" >> design.tsv
echo -e "sampleB_rep2\ttreated" >> design.tsv

# Create an R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
library(DESeq2)

# --- Configuration Parameters ---
counts_file <- "counts.tsv"
design_file <- "design.tsv"
output_dir <- "deseq2_results"
contrast_numerator <- "treated" # The level to compare against the denominator
contrast_denominator <- "control" # The baseline level
design_formula <- "~ condition" # The design formula for DESeq2

# Reference genome assembly for context (not directly used by DESeq2, but for data origin)
# This analysis assumes counts were generated against a reference like GRCh38.
reference_assembly <- "GRCh38" 

# Create output directory if it doesn't exist
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# Read count data
# Assuming the first column is gene ID and subsequent columns are sample counts
count_data <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t")

# Read sample metadata (design file)
# Assuming the first column is sample ID and subsequent columns are factors
design_data <- read.table(design_file, header = TRUE, row.names = 1, sep = "\t")

# Ensure sample names match and are in the same order
# This is crucial for DESeq2
design_data <- design_data[colnames(count_data), , drop = FALSE]

# Create DESeqDataSet object
# The 'design' argument specifies the experimental design formula
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = design_data,
                              design = as.formula(design_formula))

# Set factor levels for proper contrast
dds$condition <- factor(dds$condition, levels = c(contrast_denominator, contrast_numerator))

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results for the specified contrast
# Here, 'condition' is the factor, 'treated' is the numerator, 'control' is the denominator
res <- results(dds, contrast = c("condition", contrast_numerator, contrast_denominator))

# Order results by adjusted p-value
res <- res[order(res$padj), ]

# Save results to a CSV file
write.csv(as.data.frame(res), file = file.path(output_dir, "deseq2_differential_expression_results.csv"))

# Save normalized counts
norm_counts <- counts(dds, normalized = TRUE)
write.csv(as.data.frame(norm_counts), file = file.path(output_dir, "deseq2_normalized_counts.csv"))

message(paste0("DESeq2 analysis complete. Results saved to '", output_dir, "' directory."))
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.

Salmon v1.3.0 GitHub

$ Bash example

# Install Salmon (e.g., using Conda)
# conda create -n salmon_env salmon=1.3.0 -c bioconda -c conda-forge
# conda activate salmon_env

# --- Reference Data Preparation (GENCODE v37) ---
# Download GENCODE v37 transcriptome FASTA for indexing
# mkdir -p references
# wget -O references/gencode.v37.transcripts.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz
# gunzip references/gencode.v37.transcripts.fa.gz

# (Optional but recommended for better accuracy) Download primary assembly FASTA for decoy-aware indexing
# wget -O references/GRCh38.primary_assembly.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.primary_assembly.fa.gz
# gunzip references/GRCh38.primary_assembly.fa.gz

# Build Salmon index (assuming a simple transcriptome index for this example)
# For a more robust decoy-aware index (recommended for RNA-seq):
# cat references/gencode.v37.transcripts.fa references/GRCh38.primary_assembly.fa > references/gencode_v37_gentrome.fa
# salmon index -t references/gencode_v37_gentrome.fa -d references/GRCh38.primary_assembly.fa -i gencode_v37_salmon_index --gencode

# If only using transcriptome FASTA (less robust):
# salmon index -t references/gencode.v37.transcripts.fa -i gencode_v37_salmon_index --gencode

# --- Transcript Abundance Quantification ---
# Assume input reads are in current directory or specified path
# Replace 'sample_1_R1.fastq.gz' and 'sample_1_R2.fastq.gz' with actual input files
# Replace 'gencode_v37_salmon_index' with the path to your Salmon index
# Replace 'salmon_quant_output' with your desired output directory

salmon quant -i gencode_v37_salmon_index -l A \
    -1 sample_1_R1.fastq.gz \
    -2 sample_1_R2.fastq.gz \
    -o salmon_quant_output \
    --validateMappings \
    --gcBias \
    --seqBias

View on GitHub

Gene level quantification was performed using tximeta [v1.8.4].

tximeta v1.8.4 GitHub

$ Bash example

# Install tximeta if not already installed
# R -q -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -q -e 'BiocManager::install("tximeta")'
# R -q -e 'BiocManager::install("tximport")' # tximeta depends on tximport

# Create an R script for gene-level quantification with tximeta
cat << 'EOF' > run_tximeta.R
library(tximeta)
library(tximport)
library(SummarizedExperiment)

# --- User-defined parameters (placeholders) ---
# Path to directory containing quantification files (e.g., from Salmon, Kallisto)
# Each sample should have its own subdirectory containing 'quant.sf' or 'abundance.h5'
quant_dir <- "path/to/quantification_output_directory"

# Sample names (corresponding to subdirectories in quant_dir)
sample_names <- c("sample1", "sample2", "sample3")

# Optional: Define experimental conditions or other metadata for samples
conditions <- c("control", "control", "treated")

# Output file name for gene-level counts
output_counts_file <- "gene_counts_tximeta.tsv"

# Output file name for the SummarizedExperiment object (for further R analysis)
output_se_file <- "gene_se_tximeta.rds"

# --- Prepare sample information ---
# Construct full paths to quantification files
files <- file.path(quant_dir, sample_names, "quant.sf") # Assuming Salmon 'quant.sf' files
names(files) <- sample_names

# Create a data frame with sample metadata (coldata)
coldata <- data.frame(
  names = sample_names,
  files = files,
  condition = conditions, # Add other metadata columns as needed
  stringsAsFactors = FALSE
)

# --- Import quantification data with tximeta ---
# tximeta automatically infers the transcriptome and genome from the quantification files
# and links to appropriate annotation resources (e.g., Ensembl, Gencode).
# This step returns a SummarizedExperiment object at the transcript level.
se <- tximeta(coldata)

# --- Summarize to gene level ---
# The summarizeToGene function aggregates transcript-level data to gene-level.
# It uses tximport internally. By default, it produces 'lengthScaledTPM' counts
# which are suitable for many differential expression analyses.
# Other options for 'countsFromAbundance' can be passed if needed (e.g., "no" for raw counts).
gse <- summarizeToGene(se)

# Extract gene-level counts (e.g., lengthScaledTPM) from the SummarizedExperiment object
gene_counts <- assay(gse)

# Write gene-level counts to a TSV file
write.table(gene_counts, file = output_counts_file, sep = "\t", quote = FALSE, col.names = NA)

# Save the SummarizedExperiment object for further R analysis
saveRDS(gse, file = output_se_file)

message(paste("Gene-level quantification complete. Counts saved to:", output_counts_file))
message(paste("SummarizedExperiment object saved to:", output_se_file))

EOF

# Execute the R script
Rscript run_tximeta.R

# Clean up the R script (optional)
# rm run_tximeta.R

View on GitHub

Differential gene expression was analyzed by DESeq2 [v1.30.1].

DESeq2 v1.30.1 GitHub

$ Bash example

# Install R and DESeq2 if not already installed
# conda create -n deseq2_env r-base r-essentials bioconductor-deseq2=1.30.1 -c bioconda -c conda-forge
# conda activate deseq2_env

# Create a placeholder R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
#!/usr/bin/env Rscript

# Load necessary libraries
library(DESeq2)
library(data.table) # For fread/fwrite, install with install.packages("data.table") if not present

# --- Configuration ---
# Input count matrix file (e.g., from featureCounts, htseq-count, Salmon/Kallisto tximport)
# This file should have genes/features as rows and samples as columns, with raw counts.
count_matrix_file <- "counts.tsv"

# Input sample information file
# This file should have sample IDs as rows (or a column named 'sample_id') and
# experimental design variables (e.g., 'condition', 'treatment') as columns.
# The row names (or 'sample_id' column) should match the column names in the count_matrix_file.
sample_info_file <- "sample_info.tsv"

# Output file for differential expression results
output_results_file <- "deseq2_results.tsv"

# Define the design formula (e.g., ~ condition)
# Replace 'condition' with your actual experimental variable from sample_info_file
design_formula_str <- "~ condition"

# --- Main Analysis ---

# Read count matrix
# Assuming the first column is gene ID and subsequent columns are sample counts
count_data <- as.matrix(fread(count_matrix_file, header = TRUE, sep = "\t", data.table = FALSE), rownames = 1)

# Read sample information
# Assuming the first column is sample ID and subsequent columns are design variables
sample_info <- fread(sample_info_file, header = TRUE, sep = "\t", data.table = FALSE)
rownames(sample_info) <- sample_info$sample_id # Ensure sample_id column exists and is used for row names

# Ensure sample order matches between count data and sample info
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = as.formula(design_formula_str))

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results (e.g., comparing the last level of 'condition' to the first)
# For specific contrasts, use: results(dds, contrast=c("condition","treated","untreated"))
res <- results(dds)

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# Write results to file
fwrite(as.data.frame(res_ordered), file = output_results_file, sep = "\t", row.names = TRUE)

message(paste("DESeq2 analysis complete. Results saved to:", output_results_file))

EOF

# Example placeholder files (uncomment and create these for a runnable example):
# echo -e "gene_id\tsampleA_rep1\tsampleA_rep2\tsampleB_rep1\tsampleB_rep2" > counts.tsv
# echo -e "gene1\t100\t120\t50\t60" >> counts.tsv
# echo -e "gene2\t50\t60\t100\t110" >> counts.tsv
# echo -e "gene3\t200\t210\t200\t220" >> counts.tsv
# echo -e "sample_id\tcondition" > sample_info.tsv
# echo -e "sampleA_rep1\tcontrol" >> sample_info.tsv
# echo -e "sampleA_rep2\tcontrol" >> sample_info.tsv
# echo -e "sampleB_rep1\ttreatment" >> sample_info.tsv
# echo -e "sampleB_rep2\ttreatment" >> sample_info.tsv

# Execute the R script
Rscript run_deseq2.R

View on GitHub

For splicing analysis, 22Rv1 data only:

rMATS (Inferred with models/gemini-2.5-flash) v4.1.2 GitHub

$ Bash example

# Install rMATS (example using conda)
# conda create -n rmats_env python=3.8
# conda activate rmats_env
# pip install rmats-turbo

# Define input and reference files
# Placeholder BAM files for 22Rv1 control and experimental groups.
# Replace with actual paths to your 22Rv1 RNA-seq alignment BAM files.
# For example, if comparing treated vs. untreated 22Rv1 cells.
CONTROL_BAM_FILES="/path/to/22Rv1_control_rep1.bam,/path/to/22Rv1_control_rep2.bam"
EXPERIMENTAL_BAM_FILES="/path/to/22Rv1_experimental_rep1.bam,/path/to/22Rv1_experimental_rep2.bam"

# Reference genome annotation (GTF) - using hg38 as a placeholder.
# Download from GENCODE or Ensembl (e.g., gencode.v38.annotation.gtf).
GTF_FILE="/path/to/gencode.v38.annotation.gtf"

# Output directory
OUTPUT_DIR="rmats_splicing_analysis_22Rv1"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run rMATS for differential splicing analysis
# Assuming paired-end reads and RNA-seq data.
# Adjust --nthread based on available CPU cores.
rmats.py \
    --b1 "${CONTROL_BAM_FILES}" \
    --b2 "${EXPERIMENTAL_BAM_FILES}" \
    --gtf "${GTF_FILE}" \
    --readType paired \
    --seqType rnaseq \
    --od "${OUTPUT_DIR}" \
    --tmp "${OUTPUT_DIR}/tmp" \
    --nthread 8

View on GitHub

RNA-seq reads were mapped to hg38 using STAR Aligner.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# --- Reference Genome Preparation (run once) ---
# Download hg38 primary assembly FASTA and a comprehensive GTF file (e.g., from GENCODE or Ensembl)
# Example for GRCh38.p14 (hg38) and GENCODE v45:
# mkdir -p ~/references/hg38/fasta
# wget -P ~/references/hg38/fasta/ https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz
# mkdir -p ~/references/hg38/gtf
# wget -P ~/references/hg38/gtf/ https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz
# gunzip ~/references/hg38/fasta/GRCh38.primary_assembly.genome.fa.gz
# gunzip ~/references/hg38/gtf/gencode.v45.annotation.gtf.gz

# Define paths for genome generation
# GENOME_FASTA="~/references/hg38/fasta/GRCh38.primary_assembly.genome.fa"
# GENOME_GTF="~/references/hg38/gtf/gencode.v45.annotation.gtf"
# STAR_INDEX_DIR="~/STAR_genome_indices/hg38_gencode_v45" # Example index directory
# NUM_THREADS_INDEX=16 # Number of threads for index generation

# Generate STAR genome index (uncomment and run once)
# STAR --runMode genomeGenerate \
#      --genomeDir "${STAR_INDEX_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GENOME_GTF}" \
#      --sjdbOverhang 100 \
#      --runThreadN "${NUM_THREADS_INDEX}"

# --- RNA-seq Read Mapping ---
# Define variables for mapping
STAR_INDEX_DIR="~/STAR_genome_indices/hg38_gencode_v45" # Path to the pre-built STAR genome index for hg38
READS_R1="sample_R1.fastq.gz" # Placeholder for forward reads
READS_R2="sample_R2.fastq.gz" # Placeholder for reverse reads (remove if single-end)
OUTPUT_DIR="star_output"
OUTPUT_PREFIX="${OUTPUT_DIR}/sample"
NUM_THREADS_MAP=8 # Number of threads for mapping

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute STAR mapping command
# For paired-end reads:
STAR --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --sjdbScore 1 \
     --runThreadN "${NUM_THREADS_MAP}"

# For single-end reads, modify the --readFilesIn parameter:
# STAR --genomeDir "${STAR_INDEX_DIR}" \
#      --readFilesIn "${READS_R1}" \
#      --readFilesCommand zcat \
#      --outFileNamePrefix "${OUTPUT_PREFIX}_" \
#      --outSAMtype BAM SortedByCoordinate \
#      --outSAMattributes Standard \
#      --outFilterType BySJout \
#      --outFilterMultimapNmax 20 \
#      --outFilterMismatchNmax 999 \
#      --outFilterMismatchNoverLmax 0.1 \
#      --alignIntronMin 20 \
#      --alignIntronMax 1000000 \
#      --alignMatesGapMax 1000000 \
#      --sjdbScore 1 \
#      --runThreadN "${NUM_THREADS_MAP}"

View on GitHub

Alternative splicing (AS) analysis was completed using rMATs (v3.2.5).

rMATS v3.2.5 GitHub

$ Bash example

# Installation (example using conda)
# conda create -n rmats_env python=3.8
# conda activate rmats_env
# pip install rmats-turbo==3.2.5

# Define input files and parameters (placeholders as none specified in description)
# Replace with actual paths to your BAM files, GTF, and desired output directory.
CONDITION1_BAM_FILES="condition1_rep1.bam,condition1_rep2.bam" # Comma-separated BAMs for condition 1
CONDITION2_BAM_FILES="condition2_rep1.bam,condition2_rep2.bam" # Comma-separated BAMs for condition 2
GTF_FILE="/path/to/gencode.v44.annotation.gtf" # Placeholder for GTF annotation file (e.g., GENCODE for GRCh38/hg38)
OUTPUT_DIR="rmats_output"
TMP_DIR="rmats_tmp"
READ_LENGTH=100 # Placeholder, adjust based on actual data's read length
LIB_TYPE="fr-firststrand" # Common library type, adjust if known (e.g., fr-unstranded, fr-secondstrand)
N_THREADS=8 # Number of threads

mkdir -p "${OUTPUT_DIR}"
mkdir -p "${TMP_DIR}"

# Run rMATS analysis for alternative splicing
rmats.py \
    --b1 "${CONDITION1_BAM_FILES}" \
    --b2 "${CONDITION2_BAM_FILES}" \
    --gtf "${GTF_FILE}" \
    --od "${OUTPUT_DIR}" \
    --tmp "${TMP_DIR}" \
    --readLength "${READ_LENGTH}" \
    --libType "${LIB_TYPE}" \
    --nthread "${N_THREADS}" \
    --task "as" # "as" for alternative splicing analysis

View on GitHub

Tools Used

DESeq2 STAR

Raw Source Text

For differential expression:
Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.
Gene level quantification was performed using tximeta [v1.8.4].
Differential gene expression was analyzed by DESeq2 [v1.30.1].
For splicing analysis, 22Rv1 data only:
RNA-seq reads were mapped to hg38 using STAR Aligner.
Alternative splicing (AS) analysis was completed using rMATs (v3.2.5).
Assembly: hg19
Supplementary files format and content: *.csv: Gene counts were used for differential expression.
Supplementary files format and content: *.xlsx: rMATs table was used for splicing analysis.

← Back to Analysis