GSE222216 Processing Pipeline — Yeo Lab Publications

Publication

Remodeling oncogenic transcriptomes by small molecules targeting NONO.

Nature chemical biology (2023) — PMID 36864190

Dataset

Transcriptome changes in 22Rv1 cells with a double knockout or over expression of C145S, Empty vector, or Wild type NONO.

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

1

Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.

Salmon v1.3.0 GitHub

$ Bash example

# Install Salmon (if not already installed)
# conda install -c bioconda salmon=1.3.0

# --- Reference Data Setup ---
# Create a directory for reference data
mkdir -p reference_data

# Define paths for GENCODE v37 annotation
GENCODE_TRANSCRIPTS_FASTA="reference_data/gencode.v37.transcripts.fa"
SALMON_INDEX_DIR="gencode_v37_salmon_index"

# Download GENCODE v37 human transcripts FASTA (required for Salmon indexing)
# Source: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz
if [ ! -f "${GENCODE_TRANSCRIPTS_FASTA}" ]; then
    echo "Downloading GENCODE v37 transcripts FASTA..."
    wget -O "${GENCODE_TRANSCRIPTS_FASTA}.gz" ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz
    gunzip -f "${GENCODE_TRANSCRIPTS_FASTA}.gz"
fi

# --- Salmon Indexing ---
# Build Salmon index using GENCODE v37 transcripts
# The --gencode flag is recommended for GENCODE annotations
if [ ! -d "${SALMON_INDEX_DIR}" ]; then
    echo "Building Salmon index..."
    salmon index -t "${GENCODE_TRANSCRIPTS_FASTA}" -i "${SALMON_INDEX_DIR}" --gencode
fi

# --- Salmon Quantification ---
# Define input and output paths
# Replace these with your actual input FASTQ files and desired output directory
READS_R1="raw_data/sample_R1.fastq.gz" # Placeholder for forward reads
READS_R2="raw_data/sample_R2.fastq.gz" # Placeholder for reverse reads (if paired-end)
OUTPUT_DIR="salmon_quant_output"

# Create a directory for raw data (placeholder)
mkdir -p raw_data
# Example: touch raw_data/sample_R1.fastq.gz raw_data/sample_R2.fastq.gz

echo "Running Salmon quantification..."
# Salmon quantification command
# -i: path to the Salmon index
# -l A: automatically detect library type (e.g., strandedness)
# -1: path to forward reads (gzipped FASTQ)
# -2: path to reverse reads (gzipped FASTQ) - omit for single-end
# --gcBias: enable GC bias correction (recommended for RNA-seq)
# --seqBias: enable sequence-specific bias correction (recommended for RNA-seq)
# -o: output directory
salmon quant -i "${SALMON_INDEX_DIR}" \
             -l A \
             -1 "${READS_R1}" \
             -2 "${READS_R2}" \
             --gcBias \
             --seqBias \
             -o "${OUTPUT_DIR}"

echo "Salmon quantification complete. Results are in ${OUTPUT_DIR}"

View on GitHub

2

Gene level quantification was performed using tximeta [v1.8.4].

tximeta v1.8.4 GitHub

$ Bash example

# Install BiocManager if not already installed
# R -q -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"

# Install tximeta and other necessary packages
# R -q -e "BiocManager::install(c('tximeta', 'SummarizedExperiment'))"

# --- Prepare dummy input data for demonstration ---
# Create dummy quantification directories and files (e.g., from Salmon)
mkdir -p quant_results/sample1 quant_results/sample2 quant_results/sample3
# Create empty quant.sf files as placeholders
touch quant_results/sample1/quant.sf
touch quant_results/sample2/quant.sf
touch quant_results/sample3/quant.sf

# Placeholder for the transcriptome annotation (GTF/GFF)
# For human GRCh38, Ensembl release 109 (or latest) is a common choice.
# You would typically have this file pre-downloaded or generated.
# Example command to download a GTF (uncomment and run if needed):
# wget -nc ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# Ensure this file exists in the current directory or provide its full path.
GTF_FILE="Homo_sapiens.GRCh38.109.gtf.gz" # Placeholder, replace with actual path

# Create an R script to perform gene-level quantification using tximeta
cat << 'EOF' > run_tximeta.R
library(tximeta)
library(SummarizedExperiment)

# --- Configuration ---
# Paths to quantification directories (e.g., Salmon output)
quant_dirs <- c(
  "quant_results/sample1",
  "quant_results/sample2",
  "quant_results/sample3"
)
files <- file.path(quant_dirs, "quant.sf")

# Sample metadata
coldata <- data.frame(
  names = basename(quant_dirs),
  files = files,
  stringsAsFactors = FALSE
)

# Path to the transcriptome annotation GTF/GFF file
# This should match the GTF used for building the Salmon/Kallisto index.
gtf_file <- Sys.getenv("GTF_FILE") # Get GTF path from environment variable

# --- Import and Summarize ---
# Import quantification data using tximeta
# tximeta automatically tries to link to the correct annotation based on the index.
# If it cannot, you might need to manually specify the GTF or ensure the index
# was built with a recognized annotation source.
se <- tximeta(coldata)

# Summarize to gene level
# tximeta often provides the necessary 'tx2gene' mapping automatically.
gene_se <- summarizeToGene(se)

# Access gene-level counts
gene_counts <- assay(gene_se, "counts")

# Save results
write.csv(gene_counts, "gene_level_counts.csv")

message("Gene level quantification completed and saved to gene_level_counts.csv")
EOF

# Set the GTF_FILE environment variable for the R script
export GTF_FILE="${GTF_FILE}"

# Execute the R script
Rscript run_tximeta.R

# Clean up dummy files (optional)
# rm -rf quant_results run_tximeta.R gene_level_counts.csv

View on GitHub

3

Differential gene expression was analyzed by DESeq2 [v1.30.1]

DESeq2 v1.30.1 GitHub

$ Bash example

# Installation:
# conda install -c bioconda bioconductor-deseq2=1.30.1

# --- Placeholder for input data ---
# In a real scenario, replace these with actual paths to your count matrix and sample information.
# Example: counts.csv should have genes as rows and samples as columns, with the first column being gene IDs.
# Example: sample_info.csv should have sample names as rows and experimental conditions as columns (e.g., 'condition', 'batch').
# Make sure row names of colData match column names of countData.

# Create dummy counts.csv and sample_info.csv for demonstration purposes.
# In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto).
echo "gene_id,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10" > counts.csv
echo "geneA,100,120,110,90,130,500,550,520,480,600" >> counts.csv
echo "geneB,50,60,55,45,65,200,220,210,190,230" >> counts.csv
echo "geneC,200,210,190,220,180,100,110,90,120,80" >> counts.csv
echo "geneD,300,320,310,290,330,300,320,310,290,330" >> counts.csv
echo "geneE,10,12,11,9,13,50,55,52,48,60" >> counts.csv

echo "sample,condition" > sample_info.csv
echo "sample1,control" >> sample_info.csv
echo "sample2,control" >> sample_info.csv
echo "sample3,control" >> sample_info.csv
echo "sample4,control" >> sample_info.csv
echo "sample5,control" >> sample_info.csv
echo "sample6,treated" >> sample_info.csv
echo "sample7,treated" >> sample_info.csv
echo "sample8,treated" >> sample_info.csv
echo "sample9,treated" >> sample_info.csv
echo "sample10,treated" >> sample_info.csv

# Create the R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
library(DESeq2)

# Load count data
# Assuming counts.csv has gene IDs in the first column and samples in subsequent columns
count_data_raw <- read.csv("counts.csv", row.names = 1)
# Ensure count data is integer matrix
count_data <- as.matrix(round(count_data_raw))

# Load sample information
# Assuming sample_info.csv has sample names and condition information
sample_info <- read.csv("sample_info.csv", row.names = 1)

# Ensure sample names in sample_info match column names in count_data
if (!all(rownames(sample_info) == colnames(count_data))) {
  # Attempt to reorder sample_info to match count_data columns
  sample_info <- sample_info[colnames(count_data), , drop = FALSE]
  if (!all(rownames(sample_info) == colnames(count_data))) {
    stop("Sample names in 'sample_info.csv' do not match column names in 'counts.csv'.")
  }
}

# Create DESeqDataSet object
# The design formula should reflect your experimental setup.
# Here, we assume a simple comparison between 'treated' and 'control' conditions.
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ condition) # Example design formula

# Pre-filtering: remove genes with very low counts
# This helps to reduce the number of tests and improve statistical power.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Extract results for a specific contrast (e.g., treated vs control)
# The levels for contrast should match the levels of your 'condition' factor.
# You might need to adjust 'treated' and 'control' based on your actual data.
res <- results(dds, contrast = c("condition", "treated", "control"))

# Order results by adjusted p-value
res <- res[order(res$padj),]

# Save results
write.csv(as.data.frame(res), "deseq2_results.csv")

# Optional: Save normalized counts
norm_counts <- counts(dds, normalized=TRUE)
write.csv(as.data.frame(norm_counts), "deseq2_normalized_counts.csv")
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

4

processed data files format and content: gene counts

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# conda install -c bioconda rsem

# Define variables
INPUT_BAM="aligned_reads.bam" # Path to the input BAM file (e.g., from STAR alignment)
RSEM_REFERENCE_INDEX="path/to/rsem_reference_index/GRCh38" # Path to the RSEM reference index (prepared using rsem-prepare-reference for human GRCh38)
OUTPUT_PREFIX="sample_gene_counts" # Prefix for output files

# Execute RSEM to calculate gene and isoform expression
# --bam: Specifies that the input is a BAM file
# --paired-end: Specifies that the input reads are paired-end
# --no-qualities: (Optional) Use if the BAM file does not contain quality scores (e.g., from pseudo-alignment)
# --output-genome-bam: (Optional) Output a genome-aligned BAM file sorted by coordinate
# --seed 12345: (Optional) Set a random seed for reproducibility
rsem-calculate-expression \
    --bam \
    --paired-end \
    "${INPUT_BAM}" \
    "${RSEM_REFERENCE_INDEX}" \
    "${OUTPUT_PREFIX}"

# The main output files will be:
# ${OUTPUT_PREFIX}.genes.results (contains gene-level counts and FPKM/TPM)
# ${OUTPUT_PREFIX}.isoforms.results (contains isoform-level counts and FPKM/TPM)

View on GitHub

Tools Used

DESeq2