GSE147021 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Heterogenous Populations of Tissue-Resident CD8<sup>+</sup> T Cells Are Generated in Response to Infection and Malignancy.

Immunity (2020) — PMID 32433949

Dataset

GSE147021

RNA-Seq of CD8+ T cell subsets during LCMV infection I

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Tophat

    TopHat v2.1.1 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install TopHat (example using conda)
    # conda install -c bioconda tophat=2.1.1
    
    # Define variables for input and reference files
    # Replace with actual paths to your data and reference files
    # For human (Homo sapiens), GRCh38/hg38 is the latest assembly.
    GENOME_INDEX="/path/to/your/genome/index/hg38_index" # Bowtie2 index for GRCh38
    GTF_FILE="/path/to/your/annotation/hg38.gtf" # GTF file for GRCh38 (e.g., from GENCODE or Ensembl)
    READ1_FASTQ="sample_R1.fastq.gz"
    READ2_FASTQ="sample_R2.fastq.gz" # Omit if single-end
    OUTPUT_DIR="tophat_output"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run TopHat for paired-end RNA-seq alignment
    # -p: number of threads
    # -G: GTF annotation file for splice junction mapping
    # -o: output directory
    # The last arguments are the Bowtie2 index prefix and the input FASTQ files
    tophat -p 8 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${GENOME_INDEX}" "${READ1_FASTQ}" "${READ2_FASTQ}"
    
    # For single-end reads, the command would be:
    # tophat -p 8 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${GENOME_INDEX}" "${READ1_FASTQ}"
    
  2. 2

    HT-Seq Count

    htseq-count v0.11.2 GitHub
    $ Bash example
    # Install HTSeq (if not already installed)
    # conda install -c bioconda htseq
    
    # Example usage of htseq-count for gene quantification.
    # This command assumes single-end reads, 'no' strandedness, and default 'union' mode.
    # Adjust the '-s' (strandedness) parameter based on your library preparation protocol (e.g., 'yes', 'reverse', 'no').
    # Adjust the '-r' (read order) parameter if using paired-end data and reads are sorted by name (e.g., 'name' instead of 'pos').
    # Adjust the '-f' (input format) parameter if using SAM instead of BAM.
    # The annotation GTF file (Homo_sapiens.GRCh38.109.gtf) is a placeholder; replace with your specific annotation file.
    
    htseq-count \
        -f bam \
        -r pos \
        -s no \
        -a 10 \
        aligned_reads.bam \
        Homo_sapiens.GRCh38.109.gtf \
        > gene_counts.txt
  3. 3

    DESeq from BioConductor

    DESeq2 vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install R and Bioconductor if not already present
    # For Ubuntu/Debian:
    # sudo apt-get update
    # sudo apt-get install r-base
    #
    # For Bioconductor packages:
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("DESeq2")'
    
    # Create dummy input files for demonstration
    # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, htseq-count)
    echo "gene_id,sample1,sample2,sample3,sample4" > counts.csv
    echo "geneA,100,120,50,60" >> counts.csv
    echo "geneB,20,25,10,12" >> counts.csv
    echo "geneC,500,550,200,220" >> counts.csv
    echo "geneD,5,6,2,3" >> counts.csv
    echo "geneE,10,12,80,90" >> counts.csv
    
    echo "sample_id,condition" > coldata.csv
    echo "sample1,untreated" >> coldata.csv
    echo "sample2,untreated" >> coldata.csv
    echo "sample3,treated" >> coldata.csv
    echo "sample4,treated" >> coldata.csv
    
    # Create the R script for DESeq2 analysis
    cat << 'EOF' > deseq2_analysis.R
    # Load DESeq2 package
    library(DESeq2)
    
    # --- Placeholder for input data ---
    # Replace with your actual count matrix and sample metadata files
    # Example: counts.csv should have genes as rows and samples as columns
    # Example: coldata.csv should have sample names as rows and experimental factors as columns (e.g., condition, batch)
    
    # Load count data (replace 'counts.csv' with your file)
    # Assuming counts are integers and first column is gene names
    count_data <- read.csv("counts.csv", row.names = 1, header = TRUE)
    # Ensure counts are integers
    count_data <- round(count_data)
    
    # Load sample metadata (replace 'coldata.csv' with your file)
    # Assuming first column is sample names, matching column names in count_data
    coldata <- read.csv("coldata.csv", row.names = 1, header = TRUE)
    
    # Ensure sample names match between count_data and coldata
    # And ensure coldata rows are in the same order as count_data columns
    coldata <- coldata[colnames(count_data), , drop = FALSE]
    
    # --- Define your experimental design formula ---
    # Example: ~ condition (if you have a 'condition' column in coldata)
    # Example: ~ batch + condition (if you want to account for batch effects)
    design_formula <- ~ condition # Placeholder: Adjust this based on your experiment
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = coldata,
                                  design = design_formula)
    
    # --- Pre-filtering (optional but recommended) ---
    # Remove genes with very low counts across all samples
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # --- Run DESeq2 analysis ---
    dds <- DESeq(dds)
    
    # --- Extract results ---
    # Example: comparing 'treated' vs 'untreated' for the 'condition' factor
    # Replace 'condition', 'treated', 'untreated' with your actual factor levels
    res <- results(dds, contrast = c("condition", "treated", "untreated"))
    
    # Order results by adjusted p-value
    res <- res[order(res$padj),]
    
    # --- Save results ---
    write.csv(as.data.frame(res), file = "deseq2_results.csv")
    
    # --- Optional: Save normalized counts ---
    normalized_counts <- counts(dds, normalized = TRUE)
    write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv")
    
    message("DESeq2 analysis complete. Results saved to deseq2_results.csv")
    EOF
    
    # Execute the R script
    Rscript deseq2_analysis.R

Tools Used

Raw Source Text
Tophat
HT-Seq Count
DESeq from BioConductor
Genome_build: UCSC mm10
Supplementary_files_format_and_content: gct
← Back to Analysis