GSE147021 Processing Pipeline — Yeo Lab Publications

Publication

Heterogenous Populations of Tissue-Resident CD8<sup>+</sup> T Cells Are Generated in Response to Infection and Malignancy.

Immunity (2020) — PMID 32433949

Dataset

GSE147021

RNA-Seq of CD8+ T cell subsets during LCMV infection I

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

1

Tophat

TopHat v2.1.1 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install TopHat (example using conda)
# conda install -c bioconda tophat=2.1.1

# Define variables for input and reference files
# Replace with actual paths to your data and reference files
# For human (Homo sapiens), GRCh38/hg38 is the latest assembly.
GENOME_INDEX="/path/to/your/genome/index/hg38_index" # Bowtie2 index for GRCh38
GTF_FILE="/path/to/your/annotation/hg38.gtf" # GTF file for GRCh38 (e.g., from GENCODE or Ensembl)
READ1_FASTQ="sample_R1.fastq.gz"
READ2_FASTQ="sample_R2.fastq.gz" # Omit if single-end
OUTPUT_DIR="tophat_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run TopHat for paired-end RNA-seq alignment
# -p: number of threads
# -G: GTF annotation file for splice junction mapping
# -o: output directory
# The last arguments are the Bowtie2 index prefix and the input FASTQ files
tophat -p 8 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${GENOME_INDEX}" "${READ1_FASTQ}" "${READ2_FASTQ}"

# For single-end reads, the command would be:
# tophat -p 8 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${GENOME_INDEX}" "${READ1_FASTQ}"

2

HT-Seq Count

htseq-count v0.11.2 GitHub

$ Bash example

# Install HTSeq (if not already installed)
# conda install -c bioconda htseq

# Example usage of htseq-count for gene quantification.
# This command assumes single-end reads, 'no' strandedness, and default 'union' mode.
# Adjust the '-s' (strandedness) parameter based on your library preparation protocol (e.g., 'yes', 'reverse', 'no').
# Adjust the '-r' (read order) parameter if using paired-end data and reads are sorted by name (e.g., 'name' instead of 'pos').
# Adjust the '-f' (input format) parameter if using SAM instead of BAM.
# The annotation GTF file (Homo_sapiens.GRCh38.109.gtf) is a placeholder; replace with your specific annotation file.

htseq-count \
    -f bam \
    -r pos \
    -s no \
    -a 10 \
    aligned_reads.bam \
    Homo_sapiens.GRCh38.109.gtf \
    > gene_counts.txt

View on GitHub

3

DESeq from BioConductor

DESeq2 vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install R and Bioconductor if not already present
# For Ubuntu/Debian:
# sudo apt-get update
# sudo apt-get install r-base
#
# For Bioconductor packages:
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Create dummy input files for demonstration
# In a real scenario, these would be generated by upstream steps (e.g., featureCounts, htseq-count)
echo "gene_id,sample1,sample2,sample3,sample4" > counts.csv
echo "geneA,100,120,50,60" >> counts.csv
echo "geneB,20,25,10,12" >> counts.csv
echo "geneC,500,550,200,220" >> counts.csv
echo "geneD,5,6,2,3" >> counts.csv
echo "geneE,10,12,80,90" >> counts.csv

echo "sample_id,condition" > coldata.csv
echo "sample1,untreated" >> coldata.csv
echo "sample2,untreated" >> coldata.csv
echo "sample3,treated" >> coldata.csv
echo "sample4,treated" >> coldata.csv

# Create the R script for DESeq2 analysis
cat << 'EOF' > deseq2_analysis.R
# Load DESeq2 package
library(DESeq2)

# --- Placeholder for input data ---
# Replace with your actual count matrix and sample metadata files
# Example: counts.csv should have genes as rows and samples as columns
# Example: coldata.csv should have sample names as rows and experimental factors as columns (e.g., condition, batch)

# Load count data (replace 'counts.csv' with your file)
# Assuming counts are integers and first column is gene names
count_data <- read.csv("counts.csv", row.names = 1, header = TRUE)
# Ensure counts are integers
count_data <- round(count_data)

# Load sample metadata (replace 'coldata.csv' with your file)
# Assuming first column is sample names, matching column names in count_data
coldata <- read.csv("coldata.csv", row.names = 1, header = TRUE)

# Ensure sample names match between count_data and coldata
# And ensure coldata rows are in the same order as count_data columns
coldata <- coldata[colnames(count_data), , drop = FALSE]

# --- Define your experimental design formula ---
# Example: ~ condition (if you have a 'condition' column in coldata)
# Example: ~ batch + condition (if you want to account for batch effects)
design_formula <- ~ condition # Placeholder: Adjust this based on your experiment

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = coldata,
                              design = design_formula)

# --- Pre-filtering (optional but recommended) ---
# Remove genes with very low counts across all samples
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# --- Run DESeq2 analysis ---
dds <- DESeq(dds)

# --- Extract results ---
# Example: comparing 'treated' vs 'untreated' for the 'condition' factor
# Replace 'condition', 'treated', 'untreated' with your actual factor levels
res <- results(dds, contrast = c("condition", "treated", "untreated"))

# Order results by adjusted p-value
res <- res[order(res$padj),]

# --- Save results ---
write.csv(as.data.frame(res), file = "deseq2_results.csv")

# --- Optional: Save normalized counts ---
normalized_counts <- counts(dds, normalized = TRUE)
write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv")

message("DESeq2 analysis complete. Results saved to deseq2_results.csv")
EOF

# Execute the R script
Rscript deseq2_analysis.R

View on GitHub

Tools Used

TopHat DESeq2