GSE222217 Processing Pipeline

GSE code_examples 7 steps

Publication

Remodeling oncogenic transcriptomes by small molecules targeting NONO.

Nature chemical biology (2023) — PMID 36864190

Dataset

GSE222217

Remodeling oncogenic transcriptomes by small molecules targeting NONO

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    For differential expression:

    DESeq2 (Inferred with models/gemini-2.5-flash) v1.42.0 GitHub
    $ Bash example
    # Install R and DESeq2 (if not already installed)
    # sudo apt-get update
    # sudo apt-get install r-base
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"
    
    # Placeholder for input files:
    # counts.tsv: A tab-separated file with gene IDs as row names and sample names as column names, containing raw counts.
    # design.tsv: A tab-separated file with sample names as row names and experimental factors (e.g., 'condition') as columns.
    
    # Create dummy counts.tsv and design.tsv for demonstration purposes.
    # In a real scenario, these would be generated by upstream quantification tools (e.g., Salmon, STAR+featureCounts).
    echo -e "gene\tsampleA_rep1\tsampleA_rep2\tsampleB_rep1\tsampleB_rep2" > counts.tsv
    echo -e "gene1\t100\t120\t50\t60" >> counts.tsv
    echo -e "gene2\t20\t25\t80\t90" >> counts.tsv
    echo -e "gene3\t500\t550\t480\t510" >> counts.tsv
    
    echo -e "sample\tcondition" > design.tsv
    echo -e "sampleA_rep1\tcontrol" >> design.tsv
    echo -e "sampleA_rep2\tcontrol" >> design.tsv
    echo -e "sampleB_rep1\ttreated" >> design.tsv
    echo -e "sampleB_rep2\ttreated" >> design.tsv
    
    # Create an R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    # Load DESeq2 library
    library(DESeq2)
    
    # --- Configuration Parameters ---
    counts_file <- "counts.tsv"
    design_file <- "design.tsv"
    output_dir <- "deseq2_results"
    contrast_numerator <- "treated" # The level to compare against the denominator
    contrast_denominator <- "control" # The baseline level
    design_formula <- "~ condition" # The design formula for DESeq2
    
    # Reference genome assembly for context (not directly used by DESeq2, but for data origin)
    # This analysis assumes counts were generated against a reference like GRCh38.
    reference_assembly <- "GRCh38" 
    
    # Create output directory if it doesn't exist
    if (!dir.exists(output_dir)) {
      dir.create(output_dir)
    }
    
    # Read count data
    # Assuming the first column is gene ID and subsequent columns are sample counts
    count_data <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t")
    
    # Read sample metadata (design file)
    # Assuming the first column is sample ID and subsequent columns are factors
    design_data <- read.table(design_file, header = TRUE, row.names = 1, sep = "\t")
    
    # Ensure sample names match and are in the same order
    # This is crucial for DESeq2
    design_data <- design_data[colnames(count_data), , drop = FALSE]
    
    # Create DESeqDataSet object
    # The 'design' argument specifies the experimental design formula
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = design_data,
                                  design = as.formula(design_formula))
    
    # Set factor levels for proper contrast
    dds$condition <- factor(dds$condition, levels = c(contrast_denominator, contrast_numerator))
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Get results for the specified contrast
    # Here, 'condition' is the factor, 'treated' is the numerator, 'control' is the denominator
    res <- results(dds, contrast = c("condition", contrast_numerator, contrast_denominator))
    
    # Order results by adjusted p-value
    res <- res[order(res$padj), ]
    
    # Save results to a CSV file
    write.csv(as.data.frame(res), file = file.path(output_dir, "deseq2_differential_expression_results.csv"))
    
    # Save normalized counts
    norm_counts <- counts(dds, normalized = TRUE)
    write.csv(as.data.frame(norm_counts), file = file.path(output_dir, "deseq2_normalized_counts.csv"))
    
    message(paste0("DESeq2 analysis complete. Results saved to '", output_dir, "' directory."))
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
  2. 2

    Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.

    Salmon v1.3.0 GitHub
    $ Bash example
    # Install Salmon (e.g., using Conda)
    # conda create -n salmon_env salmon=1.3.0 -c bioconda -c conda-forge
    # conda activate salmon_env
    
    # --- Reference Data Preparation (GENCODE v37) ---
    # Download GENCODE v37 transcriptome FASTA for indexing
    # mkdir -p references
    # wget -O references/gencode.v37.transcripts.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz
    # gunzip references/gencode.v37.transcripts.fa.gz
    
    # (Optional but recommended for better accuracy) Download primary assembly FASTA for decoy-aware indexing
    # wget -O references/GRCh38.primary_assembly.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.primary_assembly.fa.gz
    # gunzip references/GRCh38.primary_assembly.fa.gz
    
    # Build Salmon index (assuming a simple transcriptome index for this example)
    # For a more robust decoy-aware index (recommended for RNA-seq):
    # cat references/gencode.v37.transcripts.fa references/GRCh38.primary_assembly.fa > references/gencode_v37_gentrome.fa
    # salmon index -t references/gencode_v37_gentrome.fa -d references/GRCh38.primary_assembly.fa -i gencode_v37_salmon_index --gencode
    
    # If only using transcriptome FASTA (less robust):
    # salmon index -t references/gencode.v37.transcripts.fa -i gencode_v37_salmon_index --gencode
    
    # --- Transcript Abundance Quantification ---
    # Assume input reads are in current directory or specified path
    # Replace 'sample_1_R1.fastq.gz' and 'sample_1_R2.fastq.gz' with actual input files
    # Replace 'gencode_v37_salmon_index' with the path to your Salmon index
    # Replace 'salmon_quant_output' with your desired output directory
    
    salmon quant -i gencode_v37_salmon_index -l A \
        -1 sample_1_R1.fastq.gz \
        -2 sample_1_R2.fastq.gz \
        -o salmon_quant_output \
        --validateMappings \
        --gcBias \
        --seqBias
  3. 3

    Gene level quantification was performed using tximeta [v1.8.4].

    tximeta v1.8.4 GitHub
    $ Bash example
    # Install tximeta if not already installed
    # R -q -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -q -e 'BiocManager::install("tximeta")'
    # R -q -e 'BiocManager::install("tximport")' # tximeta depends on tximport
    
    # Create an R script for gene-level quantification with tximeta
    cat << 'EOF' > run_tximeta.R
    library(tximeta)
    library(tximport)
    library(SummarizedExperiment)
    
    # --- User-defined parameters (placeholders) ---
    # Path to directory containing quantification files (e.g., from Salmon, Kallisto)
    # Each sample should have its own subdirectory containing 'quant.sf' or 'abundance.h5'
    quant_dir <- "path/to/quantification_output_directory"
    
    # Sample names (corresponding to subdirectories in quant_dir)
    sample_names <- c("sample1", "sample2", "sample3")
    
    # Optional: Define experimental conditions or other metadata for samples
    conditions <- c("control", "control", "treated")
    
    # Output file name for gene-level counts
    output_counts_file <- "gene_counts_tximeta.tsv"
    
    # Output file name for the SummarizedExperiment object (for further R analysis)
    output_se_file <- "gene_se_tximeta.rds"
    
    # --- Prepare sample information ---
    # Construct full paths to quantification files
    files <- file.path(quant_dir, sample_names, "quant.sf") # Assuming Salmon 'quant.sf' files
    names(files) <- sample_names
    
    # Create a data frame with sample metadata (coldata)
    coldata <- data.frame(
      names = sample_names,
      files = files,
      condition = conditions, # Add other metadata columns as needed
      stringsAsFactors = FALSE
    )
    
    # --- Import quantification data with tximeta ---
    # tximeta automatically infers the transcriptome and genome from the quantification files
    # and links to appropriate annotation resources (e.g., Ensembl, Gencode).
    # This step returns a SummarizedExperiment object at the transcript level.
    se <- tximeta(coldata)
    
    # --- Summarize to gene level ---
    # The summarizeToGene function aggregates transcript-level data to gene-level.
    # It uses tximport internally. By default, it produces 'lengthScaledTPM' counts
    # which are suitable for many differential expression analyses.
    # Other options for 'countsFromAbundance' can be passed if needed (e.g., "no" for raw counts).
    gse <- summarizeToGene(se)
    
    # Extract gene-level counts (e.g., lengthScaledTPM) from the SummarizedExperiment object
    gene_counts <- assay(gse)
    
    # Write gene-level counts to a TSV file
    write.table(gene_counts, file = output_counts_file, sep = "\t", quote = FALSE, col.names = NA)
    
    # Save the SummarizedExperiment object for further R analysis
    saveRDS(gse, file = output_se_file)
    
    message(paste("Gene-level quantification complete. Counts saved to:", output_counts_file))
    message(paste("SummarizedExperiment object saved to:", output_se_file))
    
    EOF
    
    # Execute the R script
    Rscript run_tximeta.R
    
    # Clean up the R script (optional)
    # rm run_tximeta.R
  4. 4

    Differential gene expression was analyzed by DESeq2 [v1.30.1].

    $ Bash example
    # Install R and DESeq2 if not already installed
    # conda create -n deseq2_env r-base r-essentials bioconductor-deseq2=1.30.1 -c bioconda -c conda-forge
    # conda activate deseq2_env
    
    # Create a placeholder R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    #!/usr/bin/env Rscript
    
    # Load necessary libraries
    library(DESeq2)
    library(data.table) # For fread/fwrite, install with install.packages("data.table") if not present
    
    # --- Configuration ---
    # Input count matrix file (e.g., from featureCounts, htseq-count, Salmon/Kallisto tximport)
    # This file should have genes/features as rows and samples as columns, with raw counts.
    count_matrix_file <- "counts.tsv"
    
    # Input sample information file
    # This file should have sample IDs as rows (or a column named 'sample_id') and
    # experimental design variables (e.g., 'condition', 'treatment') as columns.
    # The row names (or 'sample_id' column) should match the column names in the count_matrix_file.
    sample_info_file <- "sample_info.tsv"
    
    # Output file for differential expression results
    output_results_file <- "deseq2_results.tsv"
    
    # Define the design formula (e.g., ~ condition)
    # Replace 'condition' with your actual experimental variable from sample_info_file
    design_formula_str <- "~ condition"
    
    # --- Main Analysis ---
    
    # Read count matrix
    # Assuming the first column is gene ID and subsequent columns are sample counts
    count_data <- as.matrix(fread(count_matrix_file, header = TRUE, sep = "\t", data.table = FALSE), rownames = 1)
    
    # Read sample information
    # Assuming the first column is sample ID and subsequent columns are design variables
    sample_info <- fread(sample_info_file, header = TRUE, sep = "\t", data.table = FALSE)
    rownames(sample_info) <- sample_info$sample_id # Ensure sample_id column exists and is used for row names
    
    # Ensure sample order matches between count data and sample info
    sample_info <- sample_info[colnames(count_data), , drop = FALSE]
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = as.formula(design_formula_str))
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Get results (e.g., comparing the last level of 'condition' to the first)
    # For specific contrasts, use: results(dds, contrast=c("condition","treated","untreated"))
    res <- results(dds)
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj),]
    
    # Write results to file
    fwrite(as.data.frame(res_ordered), file = output_results_file, sep = "\t", row.names = TRUE)
    
    message(paste("DESeq2 analysis complete. Results saved to:", output_results_file))
    
    EOF
    
    # Example placeholder files (uncomment and create these for a runnable example):
    # echo -e "gene_id\tsampleA_rep1\tsampleA_rep2\tsampleB_rep1\tsampleB_rep2" > counts.tsv
    # echo -e "gene1\t100\t120\t50\t60" >> counts.tsv
    # echo -e "gene2\t50\t60\t100\t110" >> counts.tsv
    # echo -e "gene3\t200\t210\t200\t220" >> counts.tsv
    # echo -e "sample_id\tcondition" > sample_info.tsv
    # echo -e "sampleA_rep1\tcontrol" >> sample_info.tsv
    # echo -e "sampleA_rep2\tcontrol" >> sample_info.tsv
    # echo -e "sampleB_rep1\ttreatment" >> sample_info.tsv
    # echo -e "sampleB_rep2\ttreatment" >> sample_info.tsv
    
    # Execute the R script
    Rscript run_deseq2.R
  5. 5

    For splicing analysis, 22Rv1 data only:

    rMATS (Inferred with models/gemini-2.5-flash) v4.1.2 GitHub
    $ Bash example
    # Install rMATS (example using conda)
    # conda create -n rmats_env python=3.8
    # conda activate rmats_env
    # pip install rmats-turbo
    
    # Define input and reference files
    # Placeholder BAM files for 22Rv1 control and experimental groups.
    # Replace with actual paths to your 22Rv1 RNA-seq alignment BAM files.
    # For example, if comparing treated vs. untreated 22Rv1 cells.
    CONTROL_BAM_FILES="/path/to/22Rv1_control_rep1.bam,/path/to/22Rv1_control_rep2.bam"
    EXPERIMENTAL_BAM_FILES="/path/to/22Rv1_experimental_rep1.bam,/path/to/22Rv1_experimental_rep2.bam"
    
    # Reference genome annotation (GTF) - using hg38 as a placeholder.
    # Download from GENCODE or Ensembl (e.g., gencode.v38.annotation.gtf).
    GTF_FILE="/path/to/gencode.v38.annotation.gtf"
    
    # Output directory
    OUTPUT_DIR="rmats_splicing_analysis_22Rv1"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run rMATS for differential splicing analysis
    # Assuming paired-end reads and RNA-seq data.
    # Adjust --nthread based on available CPU cores.
    rmats.py \
        --b1 "${CONTROL_BAM_FILES}" \
        --b2 "${EXPERIMENTAL_BAM_FILES}" \
        --gtf "${GTF_FILE}" \
        --readType paired \
        --seqType rnaseq \
        --od "${OUTPUT_DIR}" \
        --tmp "${OUTPUT_DIR}/tmp" \
        --nthread 8
  6. 6

    RNA-seq reads were mapped to hg38 using STAR Aligner.

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # --- Reference Genome Preparation (run once) ---
    # Download hg38 primary assembly FASTA and a comprehensive GTF file (e.g., from GENCODE or Ensembl)
    # Example for GRCh38.p14 (hg38) and GENCODE v45:
    # mkdir -p ~/references/hg38/fasta
    # wget -P ~/references/hg38/fasta/ https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz
    # mkdir -p ~/references/hg38/gtf
    # wget -P ~/references/hg38/gtf/ https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz
    # gunzip ~/references/hg38/fasta/GRCh38.primary_assembly.genome.fa.gz
    # gunzip ~/references/hg38/gtf/gencode.v45.annotation.gtf.gz
    
    # Define paths for genome generation
    # GENOME_FASTA="~/references/hg38/fasta/GRCh38.primary_assembly.genome.fa"
    # GENOME_GTF="~/references/hg38/gtf/gencode.v45.annotation.gtf"
    # STAR_INDEX_DIR="~/STAR_genome_indices/hg38_gencode_v45" # Example index directory
    # NUM_THREADS_INDEX=16 # Number of threads for index generation
    
    # Generate STAR genome index (uncomment and run once)
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${STAR_INDEX_DIR}" \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GENOME_GTF}" \
    #      --sjdbOverhang 100 \
    #      --runThreadN "${NUM_THREADS_INDEX}"
    
    # --- RNA-seq Read Mapping ---
    # Define variables for mapping
    STAR_INDEX_DIR="~/STAR_genome_indices/hg38_gencode_v45" # Path to the pre-built STAR genome index for hg38
    READS_R1="sample_R1.fastq.gz" # Placeholder for forward reads
    READS_R2="sample_R2.fastq.gz" # Placeholder for reverse reads (remove if single-end)
    OUTPUT_DIR="star_output"
    OUTPUT_PREFIX="${OUTPUT_DIR}/sample"
    NUM_THREADS_MAP=8 # Number of threads for mapping
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute STAR mapping command
    # For paired-end reads:
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${READS_R1}" "${READS_R2}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.1 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --sjdbScore 1 \
         --runThreadN "${NUM_THREADS_MAP}"
    
    # For single-end reads, modify the --readFilesIn parameter:
    # STAR --genomeDir "${STAR_INDEX_DIR}" \
    #      --readFilesIn "${READS_R1}" \
    #      --readFilesCommand zcat \
    #      --outFileNamePrefix "${OUTPUT_PREFIX}_" \
    #      --outSAMtype BAM SortedByCoordinate \
    #      --outSAMattributes Standard \
    #      --outFilterType BySJout \
    #      --outFilterMultimapNmax 20 \
    #      --outFilterMismatchNmax 999 \
    #      --outFilterMismatchNoverLmax 0.1 \
    #      --alignIntronMin 20 \
    #      --alignIntronMax 1000000 \
    #      --alignMatesGapMax 1000000 \
    #      --sjdbScore 1 \
    #      --runThreadN "${NUM_THREADS_MAP}"
  7. 7

    Alternative splicing (AS) analysis was completed using rMATs (v3.2.5).

    rMATS v3.2.5 GitHub
    $ Bash example
    # Installation (example using conda)
    # conda create -n rmats_env python=3.8
    # conda activate rmats_env
    # pip install rmats-turbo==3.2.5
    
    # Define input files and parameters (placeholders as none specified in description)
    # Replace with actual paths to your BAM files, GTF, and desired output directory.
    CONDITION1_BAM_FILES="condition1_rep1.bam,condition1_rep2.bam" # Comma-separated BAMs for condition 1
    CONDITION2_BAM_FILES="condition2_rep1.bam,condition2_rep2.bam" # Comma-separated BAMs for condition 2
    GTF_FILE="/path/to/gencode.v44.annotation.gtf" # Placeholder for GTF annotation file (e.g., GENCODE for GRCh38/hg38)
    OUTPUT_DIR="rmats_output"
    TMP_DIR="rmats_tmp"
    READ_LENGTH=100 # Placeholder, adjust based on actual data's read length
    LIB_TYPE="fr-firststrand" # Common library type, adjust if known (e.g., fr-unstranded, fr-secondstrand)
    N_THREADS=8 # Number of threads
    
    mkdir -p "${OUTPUT_DIR}"
    mkdir -p "${TMP_DIR}"
    
    # Run rMATS analysis for alternative splicing
    rmats.py \
        --b1 "${CONDITION1_BAM_FILES}" \
        --b2 "${CONDITION2_BAM_FILES}" \
        --gtf "${GTF_FILE}" \
        --od "${OUTPUT_DIR}" \
        --tmp "${TMP_DIR}" \
        --readLength "${READ_LENGTH}" \
        --libType "${LIB_TYPE}" \
        --nthread "${N_THREADS}" \
        --task "as" # "as" for alternative splicing analysis
    

Tools Used

Raw Source Text
For differential expression:
Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.
Gene level quantification was performed using tximeta [v1.8.4].
Differential gene expression was analyzed by DESeq2 [v1.30.1].
For splicing analysis, 22Rv1 data only:
RNA-seq reads were mapped to hg38 using STAR Aligner.
Alternative splicing (AS) analysis was completed using rMATs (v3.2.5).
Assembly: hg19
Supplementary files format and content: *.csv: Gene counts were used for differential expression.
Supplementary files format and content: *.xlsx: rMATs table was used for splicing analysis.
← Back to Analysis