GSE222217 Processing Pipeline
GSE
code_examples
7 steps
Publication
Remodeling oncogenic transcriptomes by small molecules targeting NONO.Nature chemical biology (2023) — PMID 36864190
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
For differential expression:
$ Bash example
# Install R and DESeq2 (if not already installed) # sudo apt-get update # sudo apt-get install r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # Placeholder for input files: # counts.tsv: A tab-separated file with gene IDs as row names and sample names as column names, containing raw counts. # design.tsv: A tab-separated file with sample names as row names and experimental factors (e.g., 'condition') as columns. # Create dummy counts.tsv and design.tsv for demonstration purposes. # In a real scenario, these would be generated by upstream quantification tools (e.g., Salmon, STAR+featureCounts). echo -e "gene\tsampleA_rep1\tsampleA_rep2\tsampleB_rep1\tsampleB_rep2" > counts.tsv echo -e "gene1\t100\t120\t50\t60" >> counts.tsv echo -e "gene2\t20\t25\t80\t90" >> counts.tsv echo -e "gene3\t500\t550\t480\t510" >> counts.tsv echo -e "sample\tcondition" > design.tsv echo -e "sampleA_rep1\tcontrol" >> design.tsv echo -e "sampleA_rep2\tcontrol" >> design.tsv echo -e "sampleB_rep1\ttreated" >> design.tsv echo -e "sampleB_rep2\ttreated" >> design.tsv # Create an R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R # Load DESeq2 library library(DESeq2) # --- Configuration Parameters --- counts_file <- "counts.tsv" design_file <- "design.tsv" output_dir <- "deseq2_results" contrast_numerator <- "treated" # The level to compare against the denominator contrast_denominator <- "control" # The baseline level design_formula <- "~ condition" # The design formula for DESeq2 # Reference genome assembly for context (not directly used by DESeq2, but for data origin) # This analysis assumes counts were generated against a reference like GRCh38. reference_assembly <- "GRCh38" # Create output directory if it doesn't exist if (!dir.exists(output_dir)) { dir.create(output_dir) } # Read count data # Assuming the first column is gene ID and subsequent columns are sample counts count_data <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t") # Read sample metadata (design file) # Assuming the first column is sample ID and subsequent columns are factors design_data <- read.table(design_file, header = TRUE, row.names = 1, sep = "\t") # Ensure sample names match and are in the same order # This is crucial for DESeq2 design_data <- design_data[colnames(count_data), , drop = FALSE] # Create DESeqDataSet object # The 'design' argument specifies the experimental design formula dds <- DESeqDataSetFromMatrix(countData = count_data, colData = design_data, design = as.formula(design_formula)) # Set factor levels for proper contrast dds$condition <- factor(dds$condition, levels = c(contrast_denominator, contrast_numerator)) # Run DESeq2 analysis dds <- DESeq(dds) # Get results for the specified contrast # Here, 'condition' is the factor, 'treated' is the numerator, 'control' is the denominator res <- results(dds, contrast = c("condition", contrast_numerator, contrast_denominator)) # Order results by adjusted p-value res <- res[order(res$padj), ] # Save results to a CSV file write.csv(as.data.frame(res), file = file.path(output_dir, "deseq2_differential_expression_results.csv")) # Save normalized counts norm_counts <- counts(dds, normalized = TRUE) write.csv(as.data.frame(norm_counts), file = file.path(output_dir, "deseq2_normalized_counts.csv")) message(paste0("DESeq2 analysis complete. Results saved to '", output_dir, "' directory.")) EOF # Execute the R script Rscript run_deseq2.R -
2
Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.
$ Bash example
# Install Salmon (e.g., using Conda) # conda create -n salmon_env salmon=1.3.0 -c bioconda -c conda-forge # conda activate salmon_env # --- Reference Data Preparation (GENCODE v37) --- # Download GENCODE v37 transcriptome FASTA for indexing # mkdir -p references # wget -O references/gencode.v37.transcripts.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz # gunzip references/gencode.v37.transcripts.fa.gz # (Optional but recommended for better accuracy) Download primary assembly FASTA for decoy-aware indexing # wget -O references/GRCh38.primary_assembly.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.primary_assembly.fa.gz # gunzip references/GRCh38.primary_assembly.fa.gz # Build Salmon index (assuming a simple transcriptome index for this example) # For a more robust decoy-aware index (recommended for RNA-seq): # cat references/gencode.v37.transcripts.fa references/GRCh38.primary_assembly.fa > references/gencode_v37_gentrome.fa # salmon index -t references/gencode_v37_gentrome.fa -d references/GRCh38.primary_assembly.fa -i gencode_v37_salmon_index --gencode # If only using transcriptome FASTA (less robust): # salmon index -t references/gencode.v37.transcripts.fa -i gencode_v37_salmon_index --gencode # --- Transcript Abundance Quantification --- # Assume input reads are in current directory or specified path # Replace 'sample_1_R1.fastq.gz' and 'sample_1_R2.fastq.gz' with actual input files # Replace 'gencode_v37_salmon_index' with the path to your Salmon index # Replace 'salmon_quant_output' with your desired output directory salmon quant -i gencode_v37_salmon_index -l A \ -1 sample_1_R1.fastq.gz \ -2 sample_1_R2.fastq.gz \ -o salmon_quant_output \ --validateMappings \ --gcBias \ --seqBias -
3
Gene level quantification was performed using tximeta [v1.8.4].
$ Bash example
# Install tximeta if not already installed # R -q -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -q -e 'BiocManager::install("tximeta")' # R -q -e 'BiocManager::install("tximport")' # tximeta depends on tximport # Create an R script for gene-level quantification with tximeta cat << 'EOF' > run_tximeta.R library(tximeta) library(tximport) library(SummarizedExperiment) # --- User-defined parameters (placeholders) --- # Path to directory containing quantification files (e.g., from Salmon, Kallisto) # Each sample should have its own subdirectory containing 'quant.sf' or 'abundance.h5' quant_dir <- "path/to/quantification_output_directory" # Sample names (corresponding to subdirectories in quant_dir) sample_names <- c("sample1", "sample2", "sample3") # Optional: Define experimental conditions or other metadata for samples conditions <- c("control", "control", "treated") # Output file name for gene-level counts output_counts_file <- "gene_counts_tximeta.tsv" # Output file name for the SummarizedExperiment object (for further R analysis) output_se_file <- "gene_se_tximeta.rds" # --- Prepare sample information --- # Construct full paths to quantification files files <- file.path(quant_dir, sample_names, "quant.sf") # Assuming Salmon 'quant.sf' files names(files) <- sample_names # Create a data frame with sample metadata (coldata) coldata <- data.frame( names = sample_names, files = files, condition = conditions, # Add other metadata columns as needed stringsAsFactors = FALSE ) # --- Import quantification data with tximeta --- # tximeta automatically infers the transcriptome and genome from the quantification files # and links to appropriate annotation resources (e.g., Ensembl, Gencode). # This step returns a SummarizedExperiment object at the transcript level. se <- tximeta(coldata) # --- Summarize to gene level --- # The summarizeToGene function aggregates transcript-level data to gene-level. # It uses tximport internally. By default, it produces 'lengthScaledTPM' counts # which are suitable for many differential expression analyses. # Other options for 'countsFromAbundance' can be passed if needed (e.g., "no" for raw counts). gse <- summarizeToGene(se) # Extract gene-level counts (e.g., lengthScaledTPM) from the SummarizedExperiment object gene_counts <- assay(gse) # Write gene-level counts to a TSV file write.table(gene_counts, file = output_counts_file, sep = "\t", quote = FALSE, col.names = NA) # Save the SummarizedExperiment object for further R analysis saveRDS(gse, file = output_se_file) message(paste("Gene-level quantification complete. Counts saved to:", output_counts_file)) message(paste("SummarizedExperiment object saved to:", output_se_file)) EOF # Execute the R script Rscript run_tximeta.R # Clean up the R script (optional) # rm run_tximeta.R -
4
Differential gene expression was analyzed by DESeq2 [v1.30.1].
$ Bash example
# Install R and DESeq2 if not already installed # conda create -n deseq2_env r-base r-essentials bioconductor-deseq2=1.30.1 -c bioconda -c conda-forge # conda activate deseq2_env # Create a placeholder R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R #!/usr/bin/env Rscript # Load necessary libraries library(DESeq2) library(data.table) # For fread/fwrite, install with install.packages("data.table") if not present # --- Configuration --- # Input count matrix file (e.g., from featureCounts, htseq-count, Salmon/Kallisto tximport) # This file should have genes/features as rows and samples as columns, with raw counts. count_matrix_file <- "counts.tsv" # Input sample information file # This file should have sample IDs as rows (or a column named 'sample_id') and # experimental design variables (e.g., 'condition', 'treatment') as columns. # The row names (or 'sample_id' column) should match the column names in the count_matrix_file. sample_info_file <- "sample_info.tsv" # Output file for differential expression results output_results_file <- "deseq2_results.tsv" # Define the design formula (e.g., ~ condition) # Replace 'condition' with your actual experimental variable from sample_info_file design_formula_str <- "~ condition" # --- Main Analysis --- # Read count matrix # Assuming the first column is gene ID and subsequent columns are sample counts count_data <- as.matrix(fread(count_matrix_file, header = TRUE, sep = "\t", data.table = FALSE), rownames = 1) # Read sample information # Assuming the first column is sample ID and subsequent columns are design variables sample_info <- fread(sample_info_file, header = TRUE, sep = "\t", data.table = FALSE) rownames(sample_info) <- sample_info$sample_id # Ensure sample_id column exists and is used for row names # Ensure sample order matches between count data and sample info sample_info <- sample_info[colnames(count_data), , drop = FALSE] # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = as.formula(design_formula_str)) # Run DESeq2 analysis dds <- DESeq(dds) # Get results (e.g., comparing the last level of 'condition' to the first) # For specific contrasts, use: results(dds, contrast=c("condition","treated","untreated")) res <- results(dds) # Order results by adjusted p-value res_ordered <- res[order(res$padj),] # Write results to file fwrite(as.data.frame(res_ordered), file = output_results_file, sep = "\t", row.names = TRUE) message(paste("DESeq2 analysis complete. Results saved to:", output_results_file)) EOF # Example placeholder files (uncomment and create these for a runnable example): # echo -e "gene_id\tsampleA_rep1\tsampleA_rep2\tsampleB_rep1\tsampleB_rep2" > counts.tsv # echo -e "gene1\t100\t120\t50\t60" >> counts.tsv # echo -e "gene2\t50\t60\t100\t110" >> counts.tsv # echo -e "gene3\t200\t210\t200\t220" >> counts.tsv # echo -e "sample_id\tcondition" > sample_info.tsv # echo -e "sampleA_rep1\tcontrol" >> sample_info.tsv # echo -e "sampleA_rep2\tcontrol" >> sample_info.tsv # echo -e "sampleB_rep1\ttreatment" >> sample_info.tsv # echo -e "sampleB_rep2\ttreatment" >> sample_info.tsv # Execute the R script Rscript run_deseq2.R -
5
For splicing analysis, 22Rv1 data only:
$ Bash example
# Install rMATS (example using conda) # conda create -n rmats_env python=3.8 # conda activate rmats_env # pip install rmats-turbo # Define input and reference files # Placeholder BAM files for 22Rv1 control and experimental groups. # Replace with actual paths to your 22Rv1 RNA-seq alignment BAM files. # For example, if comparing treated vs. untreated 22Rv1 cells. CONTROL_BAM_FILES="/path/to/22Rv1_control_rep1.bam,/path/to/22Rv1_control_rep2.bam" EXPERIMENTAL_BAM_FILES="/path/to/22Rv1_experimental_rep1.bam,/path/to/22Rv1_experimental_rep2.bam" # Reference genome annotation (GTF) - using hg38 as a placeholder. # Download from GENCODE or Ensembl (e.g., gencode.v38.annotation.gtf). GTF_FILE="/path/to/gencode.v38.annotation.gtf" # Output directory OUTPUT_DIR="rmats_splicing_analysis_22Rv1" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run rMATS for differential splicing analysis # Assuming paired-end reads and RNA-seq data. # Adjust --nthread based on available CPU cores. rmats.py \ --b1 "${CONTROL_BAM_FILES}" \ --b2 "${EXPERIMENTAL_BAM_FILES}" \ --gtf "${GTF_FILE}" \ --readType paired \ --seqType rnaseq \ --od "${OUTPUT_DIR}" \ --tmp "${OUTPUT_DIR}/tmp" \ --nthread 8 -
6
RNA-seq reads were mapped to hg38 using STAR Aligner.
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # --- Reference Genome Preparation (run once) --- # Download hg38 primary assembly FASTA and a comprehensive GTF file (e.g., from GENCODE or Ensembl) # Example for GRCh38.p14 (hg38) and GENCODE v45: # mkdir -p ~/references/hg38/fasta # wget -P ~/references/hg38/fasta/ https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz # mkdir -p ~/references/hg38/gtf # wget -P ~/references/hg38/gtf/ https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz # gunzip ~/references/hg38/fasta/GRCh38.primary_assembly.genome.fa.gz # gunzip ~/references/hg38/gtf/gencode.v45.annotation.gtf.gz # Define paths for genome generation # GENOME_FASTA="~/references/hg38/fasta/GRCh38.primary_assembly.genome.fa" # GENOME_GTF="~/references/hg38/gtf/gencode.v45.annotation.gtf" # STAR_INDEX_DIR="~/STAR_genome_indices/hg38_gencode_v45" # Example index directory # NUM_THREADS_INDEX=16 # Number of threads for index generation # Generate STAR genome index (uncomment and run once) # STAR --runMode genomeGenerate \ # --genomeDir "${STAR_INDEX_DIR}" \ # --genomeFastaFiles "${GENOME_FASTA}" \ # --sjdbGTFfile "${GENOME_GTF}" \ # --sjdbOverhang 100 \ # --runThreadN "${NUM_THREADS_INDEX}" # --- RNA-seq Read Mapping --- # Define variables for mapping STAR_INDEX_DIR="~/STAR_genome_indices/hg38_gencode_v45" # Path to the pre-built STAR genome index for hg38 READS_R1="sample_R1.fastq.gz" # Placeholder for forward reads READS_R2="sample_R2.fastq.gz" # Placeholder for reverse reads (remove if single-end) OUTPUT_DIR="star_output" OUTPUT_PREFIX="${OUTPUT_DIR}/sample" NUM_THREADS_MAP=8 # Number of threads for mapping # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Execute STAR mapping command # For paired-end reads: STAR --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.1 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --sjdbScore 1 \ --runThreadN "${NUM_THREADS_MAP}" # For single-end reads, modify the --readFilesIn parameter: # STAR --genomeDir "${STAR_INDEX_DIR}" \ # --readFilesIn "${READS_R1}" \ # --readFilesCommand zcat \ # --outFileNamePrefix "${OUTPUT_PREFIX}_" \ # --outSAMtype BAM SortedByCoordinate \ # --outSAMattributes Standard \ # --outFilterType BySJout \ # --outFilterMultimapNmax 20 \ # --outFilterMismatchNmax 999 \ # --outFilterMismatchNoverLmax 0.1 \ # --alignIntronMin 20 \ # --alignIntronMax 1000000 \ # --alignMatesGapMax 1000000 \ # --sjdbScore 1 \ # --runThreadN "${NUM_THREADS_MAP}" -
7
Alternative splicing (AS) analysis was completed using rMATs (v3.2.5).
$ Bash example
# Installation (example using conda) # conda create -n rmats_env python=3.8 # conda activate rmats_env # pip install rmats-turbo==3.2.5 # Define input files and parameters (placeholders as none specified in description) # Replace with actual paths to your BAM files, GTF, and desired output directory. CONDITION1_BAM_FILES="condition1_rep1.bam,condition1_rep2.bam" # Comma-separated BAMs for condition 1 CONDITION2_BAM_FILES="condition2_rep1.bam,condition2_rep2.bam" # Comma-separated BAMs for condition 2 GTF_FILE="/path/to/gencode.v44.annotation.gtf" # Placeholder for GTF annotation file (e.g., GENCODE for GRCh38/hg38) OUTPUT_DIR="rmats_output" TMP_DIR="rmats_tmp" READ_LENGTH=100 # Placeholder, adjust based on actual data's read length LIB_TYPE="fr-firststrand" # Common library type, adjust if known (e.g., fr-unstranded, fr-secondstrand) N_THREADS=8 # Number of threads mkdir -p "${OUTPUT_DIR}" mkdir -p "${TMP_DIR}" # Run rMATS analysis for alternative splicing rmats.py \ --b1 "${CONDITION1_BAM_FILES}" \ --b2 "${CONDITION2_BAM_FILES}" \ --gtf "${GTF_FILE}" \ --od "${OUTPUT_DIR}" \ --tmp "${TMP_DIR}" \ --readLength "${READ_LENGTH}" \ --libType "${LIB_TYPE}" \ --nthread "${N_THREADS}" \ --task "as" # "as" for alternative splicing analysis
Raw Source Text
For differential expression: Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation. Gene level quantification was performed using tximeta [v1.8.4]. Differential gene expression was analyzed by DESeq2 [v1.30.1]. For splicing analysis, 22Rv1 data only: RNA-seq reads were mapped to hg38 using STAR Aligner. Alternative splicing (AS) analysis was completed using rMATs (v3.2.5). Assembly: hg19 Supplementary files format and content: *.csv: Gene counts were used for differential expression. Supplementary files format and content: *.xlsx: rMATs table was used for splicing analysis.