GSE147080 Processing Pipeline
RNA-Seq
code_examples
3 steps
Publication
Heterogenous Populations of Tissue-Resident CD8<sup>+</sup> T Cells Are Generated in Response to Infection and Malignancy.Immunity (2020) — PMID 32433949
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Tophat
TopHat v2.1.1 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install TopHat (example using conda) # conda install -c bioconda tophat=2.1.1 # Define variables for input and reference files # Replace with actual paths to your data and reference files # For human (Homo sapiens), GRCh38/hg38 is the latest assembly. GENOME_INDEX="/path/to/your/genome/index/hg38_index" # Bowtie2 index for GRCh38 GTF_FILE="/path/to/your/annotation/hg38.gtf" # GTF file for GRCh38 (e.g., from GENCODE or Ensembl) READ1_FASTQ="sample_R1.fastq.gz" READ2_FASTQ="sample_R2.fastq.gz" # Omit if single-end OUTPUT_DIR="tophat_output" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run TopHat for paired-end RNA-seq alignment # -p: number of threads # -G: GTF annotation file for splice junction mapping # -o: output directory # The last arguments are the Bowtie2 index prefix and the input FASTQ files tophat -p 8 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${GENOME_INDEX}" "${READ1_FASTQ}" "${READ2_FASTQ}" # For single-end reads, the command would be: # tophat -p 8 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${GENOME_INDEX}" "${READ1_FASTQ}" -
2
HT-Seq Count
$ Bash example
# Install HTSeq (if not already installed) # conda install -c bioconda htseq # Example usage of htseq-count for gene quantification. # This command assumes single-end reads, 'no' strandedness, and default 'union' mode. # Adjust the '-s' (strandedness) parameter based on your library preparation protocol (e.g., 'yes', 'reverse', 'no'). # Adjust the '-r' (read order) parameter if using paired-end data and reads are sorted by name (e.g., 'name' instead of 'pos'). # Adjust the '-f' (input format) parameter if using SAM instead of BAM. # The annotation GTF file (Homo_sapiens.GRCh38.109.gtf) is a placeholder; replace with your specific annotation file. htseq-count \ -f bam \ -r pos \ -s no \ -a 10 \ aligned_reads.bam \ Homo_sapiens.GRCh38.109.gtf \ > gene_counts.txt -
3
DESeq from BioConductor
$ Bash example
# Install R and Bioconductor if not already present # For Ubuntu/Debian: # sudo apt-get update # sudo apt-get install r-base # # For Bioconductor packages: # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("DESeq2")' # Create dummy input files for demonstration # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, htseq-count) echo "gene_id,sample1,sample2,sample3,sample4" > counts.csv echo "geneA,100,120,50,60" >> counts.csv echo "geneB,20,25,10,12" >> counts.csv echo "geneC,500,550,200,220" >> counts.csv echo "geneD,5,6,2,3" >> counts.csv echo "geneE,10,12,80,90" >> counts.csv echo "sample_id,condition" > coldata.csv echo "sample1,untreated" >> coldata.csv echo "sample2,untreated" >> coldata.csv echo "sample3,treated" >> coldata.csv echo "sample4,treated" >> coldata.csv # Create the R script for DESeq2 analysis cat << 'EOF' > deseq2_analysis.R # Load DESeq2 package library(DESeq2) # --- Placeholder for input data --- # Replace with your actual count matrix and sample metadata files # Example: counts.csv should have genes as rows and samples as columns # Example: coldata.csv should have sample names as rows and experimental factors as columns (e.g., condition, batch) # Load count data (replace 'counts.csv' with your file) # Assuming counts are integers and first column is gene names count_data <- read.csv("counts.csv", row.names = 1, header = TRUE) # Ensure counts are integers count_data <- round(count_data) # Load sample metadata (replace 'coldata.csv' with your file) # Assuming first column is sample names, matching column names in count_data coldata <- read.csv("coldata.csv", row.names = 1, header = TRUE) # Ensure sample names match between count_data and coldata # And ensure coldata rows are in the same order as count_data columns coldata <- coldata[colnames(count_data), , drop = FALSE] # --- Define your experimental design formula --- # Example: ~ condition (if you have a 'condition' column in coldata) # Example: ~ batch + condition (if you want to account for batch effects) design_formula <- ~ condition # Placeholder: Adjust this based on your experiment # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = count_data, colData = coldata, design = design_formula) # --- Pre-filtering (optional but recommended) --- # Remove genes with very low counts across all samples keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # --- Run DESeq2 analysis --- dds <- DESeq(dds) # --- Extract results --- # Example: comparing 'treated' vs 'untreated' for the 'condition' factor # Replace 'condition', 'treated', 'untreated' with your actual factor levels res <- results(dds, contrast = c("condition", "treated", "untreated")) # Order results by adjusted p-value res <- res[order(res$padj),] # --- Save results --- write.csv(as.data.frame(res), file = "deseq2_results.csv") # --- Optional: Save normalized counts --- normalized_counts <- counts(dds, normalized = TRUE) write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv") message("DESeq2 analysis complete. Results saved to deseq2_results.csv") EOF # Execute the R script Rscript deseq2_analysis.R
Raw Source Text
Tophat HT-Seq Count DESeq from BioConductor Genome_build: UCSC mm10 Supplementary_files_format_and_content: gct