GSE222216 Processing Pipeline
RNA-Seq
code_examples
4 steps
Publication
Remodeling oncogenic transcriptomes by small molecules targeting NONO.Nature chemical biology (2023) — PMID 36864190
Dataset
GSE222216Transcriptome changes in 22Rv1 cells with a double knockout or over expression of C145S, Empty vector, or Wild type NONO.
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation.
$ Bash example
# Install Salmon (if not already installed) # conda install -c bioconda salmon=1.3.0 # --- Reference Data Setup --- # Create a directory for reference data mkdir -p reference_data # Define paths for GENCODE v37 annotation GENCODE_TRANSCRIPTS_FASTA="reference_data/gencode.v37.transcripts.fa" SALMON_INDEX_DIR="gencode_v37_salmon_index" # Download GENCODE v37 human transcripts FASTA (required for Salmon indexing) # Source: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz if [ ! -f "${GENCODE_TRANSCRIPTS_FASTA}" ]; then echo "Downloading GENCODE v37 transcripts FASTA..." wget -O "${GENCODE_TRANSCRIPTS_FASTA}.gz" ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.transcripts.fa.gz gunzip -f "${GENCODE_TRANSCRIPTS_FASTA}.gz" fi # --- Salmon Indexing --- # Build Salmon index using GENCODE v37 transcripts # The --gencode flag is recommended for GENCODE annotations if [ ! -d "${SALMON_INDEX_DIR}" ]; then echo "Building Salmon index..." salmon index -t "${GENCODE_TRANSCRIPTS_FASTA}" -i "${SALMON_INDEX_DIR}" --gencode fi # --- Salmon Quantification --- # Define input and output paths # Replace these with your actual input FASTQ files and desired output directory READS_R1="raw_data/sample_R1.fastq.gz" # Placeholder for forward reads READS_R2="raw_data/sample_R2.fastq.gz" # Placeholder for reverse reads (if paired-end) OUTPUT_DIR="salmon_quant_output" # Create a directory for raw data (placeholder) mkdir -p raw_data # Example: touch raw_data/sample_R1.fastq.gz raw_data/sample_R2.fastq.gz echo "Running Salmon quantification..." # Salmon quantification command # -i: path to the Salmon index # -l A: automatically detect library type (e.g., strandedness) # -1: path to forward reads (gzipped FASTQ) # -2: path to reverse reads (gzipped FASTQ) - omit for single-end # --gcBias: enable GC bias correction (recommended for RNA-seq) # --seqBias: enable sequence-specific bias correction (recommended for RNA-seq) # -o: output directory salmon quant -i "${SALMON_INDEX_DIR}" \ -l A \ -1 "${READS_R1}" \ -2 "${READS_R2}" \ --gcBias \ --seqBias \ -o "${OUTPUT_DIR}" echo "Salmon quantification complete. Results are in ${OUTPUT_DIR}" -
2
Gene level quantification was performed using tximeta [v1.8.4].
$ Bash example
# Install BiocManager if not already installed # R -q -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')" # Install tximeta and other necessary packages # R -q -e "BiocManager::install(c('tximeta', 'SummarizedExperiment'))" # --- Prepare dummy input data for demonstration --- # Create dummy quantification directories and files (e.g., from Salmon) mkdir -p quant_results/sample1 quant_results/sample2 quant_results/sample3 # Create empty quant.sf files as placeholders touch quant_results/sample1/quant.sf touch quant_results/sample2/quant.sf touch quant_results/sample3/quant.sf # Placeholder for the transcriptome annotation (GTF/GFF) # For human GRCh38, Ensembl release 109 (or latest) is a common choice. # You would typically have this file pre-downloaded or generated. # Example command to download a GTF (uncomment and run if needed): # wget -nc ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz # Ensure this file exists in the current directory or provide its full path. GTF_FILE="Homo_sapiens.GRCh38.109.gtf.gz" # Placeholder, replace with actual path # Create an R script to perform gene-level quantification using tximeta cat << 'EOF' > run_tximeta.R library(tximeta) library(SummarizedExperiment) # --- Configuration --- # Paths to quantification directories (e.g., Salmon output) quant_dirs <- c( "quant_results/sample1", "quant_results/sample2", "quant_results/sample3" ) files <- file.path(quant_dirs, "quant.sf") # Sample metadata coldata <- data.frame( names = basename(quant_dirs), files = files, stringsAsFactors = FALSE ) # Path to the transcriptome annotation GTF/GFF file # This should match the GTF used for building the Salmon/Kallisto index. gtf_file <- Sys.getenv("GTF_FILE") # Get GTF path from environment variable # --- Import and Summarize --- # Import quantification data using tximeta # tximeta automatically tries to link to the correct annotation based on the index. # If it cannot, you might need to manually specify the GTF or ensure the index # was built with a recognized annotation source. se <- tximeta(coldata) # Summarize to gene level # tximeta often provides the necessary 'tx2gene' mapping automatically. gene_se <- summarizeToGene(se) # Access gene-level counts gene_counts <- assay(gene_se, "counts") # Save results write.csv(gene_counts, "gene_level_counts.csv") message("Gene level quantification completed and saved to gene_level_counts.csv") EOF # Set the GTF_FILE environment variable for the R script export GTF_FILE="${GTF_FILE}" # Execute the R script Rscript run_tximeta.R # Clean up dummy files (optional) # rm -rf quant_results run_tximeta.R gene_level_counts.csv -
3
Differential gene expression was analyzed by DESeq2 [v1.30.1]
$ Bash example
# Installation: # conda install -c bioconda bioconductor-deseq2=1.30.1 # --- Placeholder for input data --- # In a real scenario, replace these with actual paths to your count matrix and sample information. # Example: counts.csv should have genes as rows and samples as columns, with the first column being gene IDs. # Example: sample_info.csv should have sample names as rows and experimental conditions as columns (e.g., 'condition', 'batch'). # Make sure row names of colData match column names of countData. # Create dummy counts.csv and sample_info.csv for demonstration purposes. # In a real pipeline, these files would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto). echo "gene_id,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10" > counts.csv echo "geneA,100,120,110,90,130,500,550,520,480,600" >> counts.csv echo "geneB,50,60,55,45,65,200,220,210,190,230" >> counts.csv echo "geneC,200,210,190,220,180,100,110,90,120,80" >> counts.csv echo "geneD,300,320,310,290,330,300,320,310,290,330" >> counts.csv echo "geneE,10,12,11,9,13,50,55,52,48,60" >> counts.csv echo "sample,condition" > sample_info.csv echo "sample1,control" >> sample_info.csv echo "sample2,control" >> sample_info.csv echo "sample3,control" >> sample_info.csv echo "sample4,control" >> sample_info.csv echo "sample5,control" >> sample_info.csv echo "sample6,treated" >> sample_info.csv echo "sample7,treated" >> sample_info.csv echo "sample8,treated" >> sample_info.csv echo "sample9,treated" >> sample_info.csv echo "sample10,treated" >> sample_info.csv # Create the R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R # Load DESeq2 library library(DESeq2) # Load count data # Assuming counts.csv has gene IDs in the first column and samples in subsequent columns count_data_raw <- read.csv("counts.csv", row.names = 1) # Ensure count data is integer matrix count_data <- as.matrix(round(count_data_raw)) # Load sample information # Assuming sample_info.csv has sample names and condition information sample_info <- read.csv("sample_info.csv", row.names = 1) # Ensure sample names in sample_info match column names in count_data if (!all(rownames(sample_info) == colnames(count_data))) { # Attempt to reorder sample_info to match count_data columns sample_info <- sample_info[colnames(count_data), , drop = FALSE] if (!all(rownames(sample_info) == colnames(count_data))) { stop("Sample names in 'sample_info.csv' do not match column names in 'counts.csv'.") } } # Create DESeqDataSet object # The design formula should reflect your experimental setup. # Here, we assume a simple comparison between 'treated' and 'control' conditions. dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = ~ condition) # Example design formula # Pre-filtering: remove genes with very low counts # This helps to reduce the number of tests and improve statistical power. keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # Run DESeq2 analysis dds <- DESeq(dds) # Extract results for a specific contrast (e.g., treated vs control) # The levels for contrast should match the levels of your 'condition' factor. # You might need to adjust 'treated' and 'control' based on your actual data. res <- results(dds, contrast = c("condition", "treated", "control")) # Order results by adjusted p-value res <- res[order(res$padj),] # Save results write.csv(as.data.frame(res), "deseq2_results.csv") # Optional: Save normalized counts norm_counts <- counts(dds, normalized=TRUE) write.csv(as.data.frame(norm_counts), "deseq2_normalized_counts.csv") EOF # Execute the R script Rscript run_deseq2.R -
4
processed data files format and content: gene counts
$ Bash example
# conda install -c bioconda rsem # Define variables INPUT_BAM="aligned_reads.bam" # Path to the input BAM file (e.g., from STAR alignment) RSEM_REFERENCE_INDEX="path/to/rsem_reference_index/GRCh38" # Path to the RSEM reference index (prepared using rsem-prepare-reference for human GRCh38) OUTPUT_PREFIX="sample_gene_counts" # Prefix for output files # Execute RSEM to calculate gene and isoform expression # --bam: Specifies that the input is a BAM file # --paired-end: Specifies that the input reads are paired-end # --no-qualities: (Optional) Use if the BAM file does not contain quality scores (e.g., from pseudo-alignment) # --output-genome-bam: (Optional) Output a genome-aligned BAM file sorted by coordinate # --seed 12345: (Optional) Set a random seed for reproducibility rsem-calculate-expression \ --bam \ --paired-end \ "${INPUT_BAM}" \ "${RSEM_REFERENCE_INDEX}" \ "${OUTPUT_PREFIX}" # The main output files will be: # ${OUTPUT_PREFIX}.genes.results (contains gene-level counts and FPKM/TPM) # ${OUTPUT_PREFIX}.isoforms.results (contains isoform-level counts and FPKM/TPM)
Tools Used
Raw Source Text
Transcript abundance was quantified using Salmon [v1.3.0] with GENCODE v37 annotation. Gene level quantification was performed using tximeta [v1.8.4]. Differential gene expression was analyzed by DESeq2 [v1.30.1] genome build: HG19 processed data files format and content: gene counts