GSE222216 Processing Pipeline

RNA-Seq code_examples 4 steps

Publication

Remodeling oncogenic transcriptomes by small molecules targeting NONO.

Nature chemical biology (2023) — PMID 36864190

Dataset

GSE222216

Transcriptome changes in 22Rv1 cells with a double knockout or over expression of C145S, Empty vector, or Wild type NONO.

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.

    Salmon v1.3.0 GitHub
    $ Bash example
    # Install Salmon (if not already installed)
    # conda install -c bioconda salmon=1.3.0
    
    # --- Reference Data Setup ---
    # Create a directory for reference data
    mkdir -p reference_data
    
    # Define paths for GENCODE v37 annotation
    GENCODE_TRANSCRIPTS_FASTA="reference_data/gencode.v37.transcripts.fa"
    SALMON_INDEX_DIR="gencode_v37_salmon_index"
    
    # Download GENCODE v37 human transcripts FASTA (required for Salmon indexing)
    # Source: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz
    if [ ! -f "${GENCODE_TRANSCRIPTS_FASTA}" ]; then
        echo "Downloading GENCODE v37 transcripts FASTA..."
        wget -O "${GENCODE_TRANSCRIPTS_FASTA}.gz" ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz
        gunzip -f "${GENCODE_TRANSCRIPTS_FASTA}.gz"
    fi
    
    # --- Salmon Indexing ---
    # Build Salmon index using GENCODE v37 transcripts
    # The --gencode flag is recommended for GENCODE annotations
    if [ ! -d "${SALMON_INDEX_DIR}" ]; then
        echo "Building Salmon index..."
        salmon index -t "${GENCODE_TRANSCRIPTS_FASTA}" -i "${SALMON_INDEX_DIR}" --gencode
    fi
    
    # --- Salmon Quantification ---
    # Define input and output paths
    # Replace these with your actual input FASTQ files and desired output directory
    READS_R1="raw_data/sample_R1.fastq.gz" # Placeholder for forward reads
    READS_R2="raw_data/sample_R2.fastq.gz" # Placeholder for reverse reads (if paired-end)
    OUTPUT_DIR="salmon_quant_output"
    
    # Create a directory for raw data (placeholder)
    mkdir -p raw_data
    # Example: touch raw_data/sample_R1.fastq.gz raw_data/sample_R2.fastq.gz
    
    echo "Running Salmon quantification..."
    # Salmon quantification command
    # -i: path to the Salmon index
    # -l A: automatically detect library type (e.g., strandedness)
    # -1: path to forward reads (gzipped FASTQ)
    # -2: path to reverse reads (gzipped FASTQ) - omit for single-end
    # --gcBias: enable GC bias correction (recommended for RNA-seq)
    # --seqBias: enable sequence-specific bias correction (recommended for RNA-seq)
    # -o: output directory
    salmon quant -i "${SALMON_INDEX_DIR}" \
                 -l A \
                 -1 "${READS_R1}" \
                 -2 "${READS_R2}" \
                 --gcBias \
                 --seqBias \
                 -o "${OUTPUT_DIR}"
    
    echo "Salmon quantification complete. Results are in ${OUTPUT_DIR}"
  2. 2

    Gene level quantification was performed using tximeta [v1.8.4].

    tximeta v1.8.4 GitHub
    $ Bash example
    # Install BiocManager if not already installed
    # R -q -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
    
    # Install tximeta and other necessary packages
    # R -q -e "BiocManager::install(c('tximeta', 'SummarizedExperiment'))"
    
    # --- Prepare dummy input data for demonstration ---
    # Create dummy quantification directories and files (e.g., from Salmon)
    mkdir -p quant_results/sample1 quant_results/sample2 quant_results/sample3
    # Create empty quant.sf files as placeholders
    touch quant_results/sample1/quant.sf
    touch quant_results/sample2/quant.sf
    touch quant_results/sample3/quant.sf
    
    # Placeholder for the transcriptome annotation (GTF/GFF)
    # For human GRCh38, Ensembl release 109 (or latest) is a common choice.
    # You would typically have this file pre-downloaded or generated.
    # Example command to download a GTF (uncomment and run if needed):
    # wget -nc ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
    # Ensure this file exists in the current directory or provide its full path.
    GTF_FILE="Homo_sapiens.GRCh38.109.gtf.gz" # Placeholder, replace with actual path
    
    # Create an R script to perform gene-level quantification using tximeta
    cat << 'EOF' > run_tximeta.R
    library(tximeta)
    library(SummarizedExperiment)
    
    # --- Configuration ---
    # Paths to quantification directories (e.g., Salmon output)
    quant_dirs <- c(
      "quant_results/sample1",
      "quant_results/sample2",
      "quant_results/sample3"
    )
    files <- file.path(quant_dirs, "quant.sf")
    
    # Sample metadata
    coldata <- data.frame(
      names = basename(quant_dirs),
      files = files,
      stringsAsFactors = FALSE
    )
    
    # Path to the transcriptome annotation GTF/GFF file
    # This should match the GTF used for building the Salmon/Kallisto index.
    gtf_file <- Sys.getenv("GTF_FILE") # Get GTF path from environment variable
    
    # --- Import and Summarize ---
    # Import quantification data using tximeta
    # tximeta automatically tries to link to the correct annotation based on the index.
    # If it cannot, you might need to manually specify the GTF or ensure the index
    # was built with a recognized annotation source.
    se <- tximeta(coldata)
    
    # Summarize to gene level
    # tximeta often provides the necessary 'tx2gene' mapping automatically.
    gene_se <- summarizeToGene(se)
    
    # Access gene-level counts
    gene_counts <- assay(gene_se, "counts")
    
    # Save results
    write.csv(gene_counts, "gene_level_counts.csv")
    
    message("Gene level quantification completed and saved to gene_level_counts.csv")
    EOF
    
    # Set the GTF_FILE environment variable for the R script
    export GTF_FILE="${GTF_FILE}"
    
    # Execute the R script
    Rscript run_tximeta.R
    
    # Clean up dummy files (optional)
    # rm -rf quant_results run_tximeta.R gene_level_counts.csv
  3. 3

    Differential gene expression was analyzed by DESeq2 [v1.30.1]

    $ Bash example
    # Installation:
    # conda install -c bioconda bioconductor-deseq2=1.30.1
    
    # --- Placeholder for input data ---
    # In a real scenario, replace these with actual paths to your count matrix and sample information.
    # Example: counts.csv should have genes as rows and samples as columns, with the first column being gene IDs.
    # Example: sample_info.csv should have sample names as rows and experimental conditions as columns (e.g., 'condition', 'batch').
    # Make sure row names of colData match column names of countData.
    
    # Create dummy counts.csv and sample_info.csv for demonstration purposes.
    # In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto).
    echo "gene_id,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10" > counts.csv
    echo "geneA,100,120,110,90,130,500,550,520,480,600" >> counts.csv
    echo "geneB,50,60,55,45,65,200,220,210,190,230" >> counts.csv
    echo "geneC,200,210,190,220,180,100,110,90,120,80" >> counts.csv
    echo "geneD,300,320,310,290,330,300,320,310,290,330" >> counts.csv
    echo "geneE,10,12,11,9,13,50,55,52,48,60" >> counts.csv
    
    echo "sample,condition" > sample_info.csv
    echo "sample1,control" >> sample_info.csv
    echo "sample2,control" >> sample_info.csv
    echo "sample3,control" >> sample_info.csv
    echo "sample4,control" >> sample_info.csv
    echo "sample5,control" >> sample_info.csv
    echo "sample6,treated" >> sample_info.csv
    echo "sample7,treated" >> sample_info.csv
    echo "sample8,treated" >> sample_info.csv
    echo "sample9,treated" >> sample_info.csv
    echo "sample10,treated" >> sample_info.csv
    
    # Create the R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    # Load DESeq2 library
    library(DESeq2)
    
    # Load count data
    # Assuming counts.csv has gene IDs in the first column and samples in subsequent columns
    count_data_raw <- read.csv("counts.csv", row.names = 1)
    # Ensure count data is integer matrix
    count_data <- as.matrix(round(count_data_raw))
    
    # Load sample information
    # Assuming sample_info.csv has sample names and condition information
    sample_info <- read.csv("sample_info.csv", row.names = 1)
    
    # Ensure sample names in sample_info match column names in count_data
    if (!all(rownames(sample_info) == colnames(count_data))) {
      # Attempt to reorder sample_info to match count_data columns
      sample_info <- sample_info[colnames(count_data), , drop = FALSE]
      if (!all(rownames(sample_info) == colnames(count_data))) {
        stop("Sample names in 'sample_info.csv' do not match column names in 'counts.csv'.")
      }
    }
    
    # Create DESeqDataSet object
    # The design formula should reflect your experimental setup.
    # Here, we assume a simple comparison between 'treated' and 'control' conditions.
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = ~ condition) # Example design formula
    
    # Pre-filtering: remove genes with very low counts
    # This helps to reduce the number of tests and improve statistical power.
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Extract results for a specific contrast (e.g., treated vs control)
    # The levels for contrast should match the levels of your 'condition' factor.
    # You might need to adjust 'treated' and 'control' based on your actual data.
    res <- results(dds, contrast = c("condition", "treated", "control"))
    
    # Order results by adjusted p-value
    res <- res[order(res$padj),]
    
    # Save results
    write.csv(as.data.frame(res), "deseq2_results.csv")
    
    # Optional: Save normalized counts
    norm_counts <- counts(dds, normalized=TRUE)
    write.csv(as.data.frame(norm_counts), "deseq2_normalized_counts.csv")
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
  4. 4

    processed data files format and content: gene counts

    RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub
    $ Bash example
    # conda install -c bioconda rsem
    
    # Define variables
    INPUT_BAM="aligned_reads.bam" # Path to the input BAM file (e.g., from STAR alignment)
    RSEM_REFERENCE_INDEX="path/to/rsem_reference_index/GRCh38" # Path to the RSEM reference index (prepared using rsem-prepare-reference for human GRCh38)
    OUTPUT_PREFIX="sample_gene_counts" # Prefix for output files
    
    # Execute RSEM to calculate gene and isoform expression
    # --bam: Specifies that the input is a BAM file
    # --paired-end: Specifies that the input reads are paired-end
    # --no-qualities: (Optional) Use if the BAM file does not contain quality scores (e.g., from pseudo-alignment)
    # --output-genome-bam: (Optional) Output a genome-aligned BAM file sorted by coordinate
    # --seed 12345: (Optional) Set a random seed for reproducibility
    rsem-calculate-expression \
        --bam \
        --paired-end \
        "${INPUT_BAM}" \
        "${RSEM_REFERENCE_INDEX}" \
        "${OUTPUT_PREFIX}"
    
    # The main output files will be:
    # ${OUTPUT_PREFIX}.genes.results (contains gene-level counts and FPKM/TPM)
    # ${OUTPUT_PREFIX}.isoforms.results (contains isoform-level counts and FPKM/TPM)

Tools Used

Raw Source Text
Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.
Gene level quantification was performed using tximeta [v1.8.4].
Differential gene expression was analyzed by DESeq2 [v1.30.1]
genome build: HG19
processed data files format and content: gene counts
← Back to Analysis