GSE89405 Processing Pipeline

RNA-Seq code_examples 39 steps

Publication

Early transcriptional and epigenetic regulation of CD8<sup>+</sup> T cell differentiation revealed by single-cell RNA sequencing.

Nature immunology (2017) — PMID 28218746

Dataset

GSE89405

Early transcriptional and epigenetic regulation of CD8+ T cell differentiation revealed by single-cell RNA-seq

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Basecalls performed using CASAVA version 1.4

CASAVA v1.4

$ Bash example

# CASAVA (Consensus Assessment of Sequence And Variation) is an older, proprietary Illumina software suite for base calling and initial data processing.
# It is not typically installed via public package managers like conda.
# Installation would involve obtaining the software package directly from Illumina.

# Example command for base calling and demultiplexing using CASAVA 1.4's configureBclToFastq.pl script.
# This script processes BCL files from an Illumina run directory into FASTQ files,
# using a sample sheet for demultiplexing.
# Replace <path_to_illumina_run_directory>, <path_to_output_fastq_directory>, and <path_to_sample_sheet.csv> with actual paths.

# The exact path to configureBclToFastq.pl might vary based on CASAVA installation.
# This command assumes the script is in the PATH or specified with its full path.
configureBclToFastq.pl \
  --input-dir <path_to_illumina_run_directory>/Data/Intensities/BaseCalls \
  --output-dir <path_to_output_fastq_directory> \
  --sample-sheet <path_to_sample_sheet.csv>

ChIP-seq reads were aligned to the mm10 genome assembly using STAR version 2.4.1b with the following options --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --limitOutSJoneRead --outFilterScoreMin 3 --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 --outFilterMismatchNoverLmax 0.3 and the other parameter kept as default

STAR v2.4.1b GitHub

$ Bash example

# Install STAR if not already available
# conda install -c bioconda star=2.4.1b

# Define variables for clarity
STAR_GENOME_DIR="/path/to/genome/mm10/STAR_index" # Placeholder for STAR genome index directory. This directory should contain genome.fa, SA, SAindex, etc.
# Input reads: Assuming paired-end. For single-end, remove READ2 and adjust --readFilesIn to only "${READ1}"
READ1="input_reads_R1.fastq.gz" # Placeholder for input ChIP-seq reads R1
READ2="input_reads_R2.fastq.gz" # Placeholder for input ChIP-seq reads R2
OUTPUT_PREFIX="star_alignment_output/" # Prefix for output files (e.g., star_alignment_output/Aligned.sortedByCoordinate.out.bam)

# Run STAR alignment
# Note: --runThreadN is often specified for performance but is not explicitly mentioned in the description.
#       It's omitted here to strictly adhere to "other parameter kept as default".
STAR \
  --genomeDir "${STAR_GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMultimapNmax 10 \
  --outFilterMultimapScoreRange 1 \
  --limitOutSJoneRead \
  --outFilterScoreMin 3 \
  --outFilterScoreMinOverLread 0.66 \
  --outFilterMatchNminOverLread 0.66 \
  --outFilterMismatchNoverLmax 0.3

View on GitHub

BAM files of sample1-11 are downsampled, if needed, using samtools to ensure sample 1-4, sample 5-8, sample 9-11 have same read numbers.

samtools v1.15.1 GitHub

$ Bash example

# Install samtools if not available
# conda install -c bioconda samtools

# Define input BAM files (replace with actual file paths if different)
BAM_FILES=(
    "sample1.bam" "sample2.bam" "sample3.bam" "sample4.bam"
    "sample5.bam" "sample6.bam" "sample7.bam" "sample8.bam"
    "sample9.bam" "sample10.bam" "sample11.bam"
)

# Create an associative array to store read counts
declare -A READ_COUNTS

# Calculate read counts for each BAM file and find the global minimum
MIN_READS=-1
for bam_file in "${BAM_FILES[@]}"; do
    if [ ! -f "$bam_file" ]; then
        echo "Error: Input BAM file '$bam_file' not found. Please ensure all input files exist." >&2
        exit 1
    fi
    current_reads=$(samtools view -c "$bam_file")
    READ_COUNTS["$bam_file"]=$current_reads
    echo "Reads in $bam_file: $current_reads"

    if (( MIN_READS == -1 || current_reads < MIN_READS )); then
        MIN_READS=$current_reads
    fi
done

if (( MIN_READS == -1 )); then
    echo "Error: No valid BAM files processed to determine minimum read count." >&2
    exit 1
fi

echo "Target read count for downsampling (global minimum): $MIN_READS"

# Downsample BAM files if their read count is above the target minimum
# A fixed random seed (e.g., 42) is used for reproducibility.
for bam_file in "${BAM_FILES[@]}"; do
    current_reads=${READ_COUNTS["$bam_file"]}
    output_bam="${bam_file%.bam}_downsampled.bam"

    if (( current_reads > MIN_READS )); then
        # Calculate the fraction for samtools -s. samtools -s takes a seed and a fraction (e.g., 42.0.5 for 50% with seed 42).
        fraction=$(awk "BEGIN {printf \"%.6f\", $MIN_READS / $current_reads}")
        seed_fraction="42.$fraction" # Using a fixed seed 42

        echo "Downsampling $bam_file from $current_reads to $MIN_READS reads (fraction: $fraction) -> $output_bam"
        samtools view -b -s "$seed_fraction" "$bam_file" > "$output_bam"
        samtools index "$output_bam"
    else
        echo "$bam_file already has $current_reads reads, which is <= target $MIN_READS. Copying to $output_bam."
        cp "$bam_file" "$output_bam"
        samtools index "$output_bam" # Ensure index exists for consistency
    fi
done

View on GitHub

for sample 1-4: peaks were called using MACS version 1.4.2 using sample 5-8 as backgournd, with the following setting -g mm --nomodel --nolambda and others as default.

MACS2 v1.4.2 GitHub

$ Bash example

# Install MACS2 (if not already installed)
# conda install -c bioconda macs2

# Define input and control BAM files
TREATMENT_BAMS="sample1.bam sample2.bam sample3.bam sample4.bam"
CONTROL_BAMS="sample5.bam sample6.bam sample7.bam sample8.bam"

# Define output prefix
OUTPUT_PREFIX="peaks_sample1-4_vs_sample5-8"

# Define genome size (e.g., mm10 for mouse)
GENOME_SIZE="mm"

# Run MACS2 callpeak
macs2 callpeak \
    -t ${TREATMENT_BAMS} \
    -c ${CONTROL_BAMS} \
    -g ${GENOME_SIZE} \
    --nomodel \
    --nolambda \
    -n ${OUTPUT_PREFIX} \
    --outdir .

View on GitHub

for sample 9-11: tag directories were first generated by HOMER makeTagDirectory with -keepOne -tbp 1 and other options as default.

HOMER vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install HOMER (if not already installed)
# For example, using conda:
# conda install -c bioconda homer

# Generate tag directories for sample 9
makeTagDirectory sample_9_tag_dir sample_9.bam -keepOne -tbp 1

# Generate tag directories for sample 10
makeTagDirectory sample_10_tag_dir sample_10.bam -keepOne -tbp 1

# Generate tag directories for sample 11
makeTagDirectory sample_11_tag_dir sample_11.bam -keepOne -tbp 1

# Note: Replace 'sample_X.bam' with the actual input BAM files for each sample
# and 'sample_X_tag_dir' with the desired output tag directory names.

View on GitHub

for sample 9 and 11: peakd were call by HOMER findpeak using sample 12 as background and -style factor -size 100 -fragLength 50 with other options as default

HOMER v4.12 GitHub

$ Bash example

# Install HOMER (if not already installed)
# conda install -c bioconda homer

# Assuming input BAM files are available in the current directory
# and are named sample9.bam, sample11.bam, and sample12.bam

# Peak calling for sample 9
findPeaks sample9.bam \
    -o sample9_peaks.txt \
    -i sample12.bam \
    -style factor \
    -size 100 \
    -fragLength 50

# Peak calling for sample 11
findPeaks sample11.bam \
    -o sample11_peaks.txt \
    -i sample12.bam \
    -style factor \
    -size 100 \
    -fragLength 50

View on GitHub

for sample 10: peakd were call by HOMER findpeak using sample 12 as background and -style histone -fragLength 50 -size 200 -minDist 1000 -L 0 with other options as default

HOMER vInfer from description GitHub

$ Bash example

# Install HOMER (if not already installed)
# For example, using conda:
# conda install -c bioconda homer
# Or from source:
# wget http://homer.ucsd.edu/homer/configureHomer.pl
# perl configureHomer.pl -install

# Assuming sample10.bam is the treatment BAM file and sample12.bam is the background BAM file
# And these files are located in the current directory or specified with full paths
findPeaks sample10.bam -o sample10_peaks.txt -i sample12.bam -style histone -fragLength 50 -size 200 -minDist 1000 -L 0

View on GitHub

for sample 13-16: transcript count were generated by kallisto version 0.42.4 using GENCODE GRCm38.p4 transcriptome as reference, with the following parameter: -l 200 -s 20 --single and other setting as default

kallisto v0.42.4 GitHub

$ Bash example

# Install kallisto if not already installed
# conda install -c bioconda kallisto

# Placeholder for kallisto index generation (if not already done)
# The GENCODE GRCm38.p4 transcriptome FASTA file would be needed here.
# For example, if the FASTA file is gencode.vM4.transcripts.fa.gz:
# kallisto index -i gencode.vM4.transcripts.idx gencode.vM4.transcripts.fa.gz

# Run kallisto quantification for a single sample (e.g., sample 13)
# Assuming input fastq is sample_13.fastq.gz and output directory is kallisto_output_sample_13
kallisto quant -i gencode.vM4.transcripts.idx -o kallisto_output_sample_13 --single -l 200 -s 20 sample_13.fastq.gz

View on GitHub

For all Single Cell Samples the following processing steps apply:

(Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# This is a placeholder script for single-cell sample processing.
# The actual tools and commands depend heavily on the specific single-cell assay (e.g., scRNA-seq, scATAC-seq)
# and the technology used (e.g., 10x Genomics, Smart-seq2).

# Define reference genome (placeholder for human hg38)
# Replace with the actual path to your reference genome and annotation files.
REFERENCE_GENOME_FASTA="/path/to/human/hg38/GRCh38.primary_assembly.genome.fa"
REFERENCE_GENOME_GTF="/path/to/human/hg38/gencode.v38.annotation.gtf"

# For 10x Genomics data, a Cell Ranger reference transcriptome is typically required:
# CELLRANGER_TRANSCRIPTOME_REF="/path/to/cellranger/refdata-gex-GRCh38-2020-A"

# Example conceptual steps for single-cell data processing:

# 1. Quality Control (e.g., using FastQC)
# fastqc -o qc_output/ sample_R1.fastq.gz sample_R2.fastq.gz

# 2. Demultiplexing and Barcode Processing (if applicable, e.g., for 10x Genomics)
# For 10x Genomics, this is often done with 'cellranger mkfastq' or handled by the sequencing provider.
# cellranger mkfastq --id=sample_id --run=/path/to/sequencing_run_folder --csv=/path/to/sample_sheet.csv

# 3. Alignment and Quantification (e.g., for scRNA-seq)
# For 10x Genomics scRNA-seq, 'cellranger count' is the standard tool:
# cellranger count \
#   --id=sample_id \
#   --transcriptome=${CELLRANGER_TRANSCRIPTOME_REF} \
#   --fastqs=/path/to/fastqs_directory \
#   --sample=sample_name \
#   --expect-cells=3000

# For generic scRNA-seq (e.g., Smart-seq2, using a splice-aware aligner like STAR and a quantifier like Salmon):
# # Build STAR index (run once per reference genome)
# # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index --genomeFastaFiles ${REFERENCE_GENOME_FASTA} --sjdbGTFfile ${REFERENCE_GENOME_GTF} --runThreadN 8

# # Align reads with STAR
# # STAR \
# #   --genomeDir /path/to/STAR_index \
# #   --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
# #   --readFilesCommand zcat \
# #   --outFileNamePrefix star_output/sample_ \
# #   --outSAMtype BAM SortedByCoordinate \
# #   --outSAMunmapped Within \
# #   --outFilterType BySJout \
# #   --outFilterMultimapNmax 20 \
# #   --alignSJDBoverhangMin 1 \
# #   --alignSJoverhangMin 8 \
# #   --alignIntronMin 20 \
# #   --alignIntronMax 1000000 \
# #   --alignMatesGapMax 1000000 \
# #   --runThreadN 8

# # Quantify transcripts with Salmon (requires a Salmon index built from a transcriptome FASTA)
# # salmon index -t /path/to/transcripts.fa -i salmon_index
# # salmon quant \
# #   -i salmon_index \
# #   -l A \
# #   -1 sample_R1.fastq.gz \
# #   -2 sample_R2.fastq.gz \
# #   -o salmon_output/sample_quant \
# #   --validateMappings \
# #   --gcBias

# Further downstream analysis typically involves dimension reduction, clustering, and differential expression analysis
# using tools like Seurat, Scanpy, or Bioconductor packages in R.

View on GitHub

Single-cell RNA-seq data pre-processing.

RNA-seq v7.1.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Cell Ranger (example using conda, though 10x Genomics provides binaries and recommends direct download)
# conda create -n cellranger_env -c bioconda cellranger=7.1.0
# conda activate cellranger_env

# Download a pre-built human transcriptome reference (e.g., GRCh38 from 10x Genomics)
# mkdir -p /path/to/references
# cd /path/to/references
# wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
# tar -xzf refdata-gex-GRCh38-2020-A.tar.gz

# Run cellranger count for single-cell RNA-seq pre-processing (alignment and quantification)
# Replace /path/to/fastq_files with the actual directory containing your FASTQ files.
# Ensure FASTQ files follow 10x Genomics naming conventions (e.g., SAMPLE_S1_L001_R1_001.fastq.gz).
# Replace my_sample_analysis with a unique identifier for your analysis output directory.
# Replace my_sample_prefix with the prefix of your FASTQ files (e.g., 'SAMPLE' if files are SAMPLE_S1_L001_R1_001.fastq.gz).
cellranger count \
    --id=my_sample_analysis \
    --transcriptome=/path/to/references/refdata-gex-GRCh38-2020-A \
    --fastqs=/path/to/fastq_files \
    --sample=my_sample_prefix \
    --localcores=8 \
    --localmem=64

Single-cell mRNA sequencing data from 256 murine CD8+ T cells were processed with a bioinformatics pipeline focusing on quality control (QC) and robust expression quantification.

RNA-seq v1.10.2 GitHub

$ Bash example

# Install Salmon and FastQC (example using conda)
# conda create -n sc_rna_seq_env salmon fastqc -y
# conda activate sc_rna_seq_env

# --- Reference Data Setup ---
# Download murine (GRCm39) cDNA transcriptome from Ensembl (example release 110)
# mkdir -p reference
# wget -P reference ftp://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/cdna/Mus_musculus.GRCm39.cdna.all.fa.gz

# Build Salmon index for GRCm39 transcriptome
# salmon index -t reference/Mus_musculus.GRCm39.cdna.all.fa.gz -i salmon_index_GRCm39 --gencode

# --- Quality Control (QC) ---
# Assuming raw paired-end reads are named 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz'
# Replace with actual input file names
mkdir -p fastqc_results
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o fastqc_results

# --- Robust Expression Quantification with Salmon ---
# Replace 'salmon_index_GRCm39' with the path to your built Salmon index
# Replace 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz' with actual input file names
# -l A: Auto-detect library type (common for scRNA-seq)
# -p 8: Use 8 threads for quantification
# --validateMappings: Perform additional validation of mappings for improved accuracy
mkdir -p salmon_quant_output
salmon quant -i salmon_index_GRCm39 -l A \
             -1 sample_R1.fastq.gz \
             -2 sample_R2.fastq.gz \
             -p 8 --validateMappings \
             -o salmon_quant_output/sample_quant

View on GitHub

For each cell, raw RNA-seq reads were: checked for quality metrics with fastqc (v0.10.1); poly-A and adaptor-trimmed with cutadapt (v1.8.1); quantified by kallisto (v0.42.1) to a reference transcriptome (Gencode vM3) without bias correction; and aligned by STAR (v2.4.1b) to the reference mouse genome (mm10) with default parameters for quality control and downstream analysis.

STAR v2.4.1b GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Placeholder for STAR genome index for the reference mouse genome (mm10/GRCm38).
# You would typically build this once using the mm10 genome FASTA and GTF files.
# Example for building index (uncomment and modify paths as needed):
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/mm10_star_index \
#      --genomeFastaFiles /path/to/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
#      --sjdbGTFfile /path/to/Mus_musculus.GRCm38.98.gtf.gz \
#      --runThreadN 8 # Adjust threads as needed

# Align RNA-seq reads to the mm10 reference genome using STAR with default parameters.
# Input: reads.fastq.gz (replace with your actual input file, e.g., from fastqc/cutadapt output)
# Output: Aligned BAM file (e.g., aligned_reads_Aligned.sortedByCoord.out.bam)
STAR --genomeDir /path/to/mm10_star_index \
     --readFilesIn reads.fastq.gz \
     --runThreadN 8 \
     --outFileNamePrefix aligned_reads_ \
     --outSAMtype BAM SortedByCoordinate

View on GitHub

Next, the transcript per million (TPM) outputs of kallisto for all cells were combined into a cell-by-gene expression matrix (C=288 cells=rows, G=22425 genes=columns) by summing the expression values for all quantified transcripts of a given gene.

kallisto v0.46.2 GitHub

$ Bash example

# Install kallisto (if not already installed)
# conda install -c bioconda kallisto
# pip install pandas

# --- Reference Data Placeholders ---
# For kallisto quantification, a transcriptome index is needed.
# Example: Homo_sapiens.GRCh38.cdna.all.fa.gz from Ensembl
# kallisto index -i Homo_sapiens.GRCh38.idx Homo_sapiens.GRCh38.cdna.all.fa.gz

# A gene-transcript mapping file is crucial for summing transcript TPMs to gene TPMs.
# This can be generated by parsing a GTF/GFF file (e.g., Homo_sapiens.GRCh38.109.gtf from Ensembl).
# Example command to create gene_transcript_map.tsv (gene_id\ttranscript_id):
# Replace 'Homo_sapiens.GRCh38.109.gtf' with your actual GTF file path.
# awk '$3 == "transcript" { 
#     gene_id=""; transcript_id=""; 
#     split($9, a, ";"); 
#     for (i in a) { 
#         if (a[i] ~ /gene_id/) gene_id=substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); 
#         if (a[i] ~ /transcript_id/) transcript_id=substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1) 
#     } 
#     if (gene_id && transcript_id) print gene_id "\t" transcript_id 
# }' Homo_sapiens.GRCh38.109.gtf > gene_transcript_map.tsv

# --- Define Input and Output Paths ---
# Directory containing kallisto output subdirectories for each cell (e.g., cell1/abundance.tsv, cell2/abundance.tsv)
KALLISTO_OUTPUT_DIR="kallisto_quant_outputs"
# Path to the gene-transcript mapping file
GENE_TRANSCRIPT_MAP="gene_transcript_map.tsv"
# Output file for the combined cell-by-gene expression matrix
OUTPUT_MATRIX="cell_by_gene_expression_matrix.tsv"

# --- Create a Python script to combine kallisto outputs ---
# This script will read abundance.tsv files, map transcripts to genes, 
# sum TPMs per gene, and combine into a single cell-by-gene matrix.
cat << 'EOF' > merge_kallisto_gene_tpm.py
import pandas as pd
import os
import sys

kallisto_output_dir = sys.argv[1]
gene_transcript_map_file = sys.argv[2]
output_matrix_file = sys.argv[3]

# Load gene-transcript map
# Assuming gene_transcript_map.tsv has two columns: gene_id\ttranscript_id
gene_map = pd.read_csv(gene_transcript_map_file, sep='\t', header=None, names=['gene_id', 'transcript_id'])
transcript_to_gene = gene_map.set_index('transcript_id')['gene_id'].to_dict()

all_gene_tpm_dfs = []

# Iterate through each cell's kallisto output directory
for cell_dir_name in sorted(os.listdir(kallisto_output_dir)):
    full_cell_dir = os.path.join(kallisto_output_dir, cell_dir_name)
    if os.path.isdir(full_cell_dir):
        abundance_file = os.path.join(full_cell_dir, 'abundance.tsv')
        if os.path.exists(abundance_file):
            cell_name = cell_dir_name
            
            # Load abundance data (target_id, length, eff_length, est_counts, tpm)
            df_abundance = pd.read_csv(abundance_file, sep='\t')
            
            # Map transcripts (target_id) to genes and sum TPMs
            df_abundance['gene_id'] = df_abundance['target_id'].map(transcript_to_gene)
            
            # Drop transcripts that couldn't be mapped to a gene (if any, e.g., non-coding RNAs not in map)
            df_abundance = df_abundance.dropna(subset=['gene_id'])
            
            # Sum TPMs by gene
            gene_tpm_series = df_abundance.groupby('gene_id')['tpm'].sum()
            all_gene_tpm_dfs.append(gene_tpm_series.rename(cell_name))
        else:
            print(f"Warning: {abundance_file} not found for {cell_dir_name}", file=sys.stderr)

if not all_gene_tpm_dfs:
    print("Error: No kallisto abundance files found to process.", file=sys.stderr)
    sys.exit(1)

# Combine all gene TPM series into a single DataFrame
# Use outer join to include all genes across all cells, filling missing values with 0
final_matrix = pd.concat(all_gene_tpm_dfs, axis=1).fillna(0)

# Transpose the matrix to have cells as rows and genes as columns, as per description
# (C=288 cells=rows, G=22425 genes=columns)
final_matrix = final_matrix.T

# Write the final matrix to a TSV file
final_matrix.to_csv(output_matrix_file, sep='\t', index=True, index_label='CellID')
print(f"Successfully created {output_matrix_file}")
EOF

# --- Execute the Python script ---
# Ensure Python and pandas are installed in your environment.
python merge_kallisto_gene_tpm.py "${KALLISTO_OUTPUT_DIR}" "${GENE_TRANSCRIPT_MAP}" "${OUTPUT_MATRIX}"

# --- Clean up (optional) ---
# rm merge_kallisto_gene_tpm.py

View on GitHub

Finally, the TPM value for each cell c and gene g was natural log-transformed to yield a normalized expression value: EXPRc,g = ln(1+TPMc,g).

awk (Inferred with models/gemini-2.5-flash) vGNU Awk (gawk) (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Assuming input_tpm_file.tsv is a tab-separated file with a header.
# Replace '3' with the actual 1-indexed column number containing TPM values.
# If your file is comma-separated, change 'FS=OFS="\t"' to 'FS=OFS=","'.
awk 'BEGIN {FS=OFS="\t"} NR==1 {print $0, "EXPR_value"; next} {print $0, log(1+$3)}' input_tpm_file.tsv > output_expr_file.tsv

View on GitHub

Dimensionality reduction and cell heterogeneity visualization.

Seurat (Inferred with models/gemini-2.5-flash) v5.0.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Seurat if not already present
# conda install -c conda-forge r-seurat -y

# This command assumes an R script named 'run_seurat_dr.R' exists
# that performs dimensionality reduction (e.g., PCA, UMAP) and visualization
# on a Seurat object. The script would typically take an input Seurat object,
# process it, and save an updated object and a visualization plot.
#
# Example content for 'run_seurat_dr.R':
# library(Seurat)
# args <- commandArgs(trailingOnly = TRUE)
# input_file <- args[1]
# output_file <- args[2]
# output_plot <- args[3]
# seurat_obj <- readRDS(input_file)
# seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(object = seurat_obj))
# seurat_obj <- RunUMAP(seurat_obj, dims = 1:30) # Adjust dims as needed
# png(output_plot, width = 800, height = 600)
# DimPlot(seurat_obj, reduction = "umap", group.by = "seurat_clusters") # Adjust group.by
# dev.off()
# saveRDS(seurat_obj, output_file)

# Execute the R script for dimensionality reduction and visualization
Rscript run_seurat_dr.R input_seurat_object.rds processed_seurat_object.rds umap_visualization.png

View on GitHub

To reduce the dimensionality of the cell-by-gene expression matrix EXPR and visualize the diversity of gene expression among CD8+ T cells of different subtypes in a 2-dimensional scatter plot, we applied the t-distributed Stochastic Neighborhood Embedding (tSNE) algorithm via its Barnes-Hut approximation (bhSNE).

t-SNE v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Create a dummy EXPR.csv for demonstration if it doesn't exist
# In a real pipeline, 'EXPR.csv' would be the actual cell-by-gene expression matrix.
if [ ! -f EXPR.csv ]; then
    echo "Creating dummy EXPR.csv for demonstration..."
    python -c "
import pandas as pd
import numpy as np
num_cells = 100
num_genes = 500
EXPR = pd.DataFrame(np.random.rand(num_cells, num_genes),
                    index=[f'cell_{i}' for i in range(num_cells)],
                    columns=[f'gene_{j}' for j in range(num_genes)])
EXPR.to_csv('EXPR.csv', index=False)
print('Dummy EXPR.csv created.')
"
fi

# Create the Python script for t-SNE dimensionality reduction
cat << 'EOF' > run_tsne.py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import sys

# Load the cell-by-gene expression matrix
# Replace 'EXPR.csv' with the actual path to your expression matrix
# Ensure the matrix is in a format suitable for TSNE (e.g., cells as rows, genes as columns)
try:
    EXPR = pd.read_csv("EXPR.csv")
except FileNotFoundError:
    print("Error: EXPR.csv not found. Please provide the input expression matrix.", file=sys.stderr)
    sys.exit(1)

# Initialize t-SNE with Barnes-Hut approximation (default for sklearn's TSNE)
# n_components=2 for 2-dimensional visualization as specified in the description.
# perplexity and learning_rate are crucial parameters for t-SNE, often tuned.
# Using common default values here as they are not specified in the description.
# n_jobs=-1 utilizes all available CPU cores for faster computation.
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42, method='barnes_hut', n_jobs=-1)

# Apply t-SNE to the expression matrix
tsne_results = tsne.fit_transform(EXPR)

# Create a DataFrame for the 2-dimensional t-SNE coordinates
tsne_df = pd.DataFrame(data=tsne_results, columns=['tSNE_1', 'tSNE_2'])
# If EXPR had cell IDs as an index, you might want to add them back:
# tsne_df.index = EXPR.index

# Save the 2-dimensional t-SNE coordinates to a CSV file
tsne_df.to_csv("tsne_2d_coordinates.csv", index=False)

print("t-SNE dimensionality reduction complete. Results saved to tsne_2d_coordinates.csv")
EOF

# Install required Python packages if not already installed
# conda install scikit-learn pandas numpy
# pip install scikit-learn pandas numpy

# Execute the t-SNE Python script
python run_tsne.py

View on GitHub

We first applied standard Principal Components Analysis (PCA) to reduce the dimensionality down to D=10, and only then applied bhSNE to visualize in D=2 (with perplexity=30 and theta=0.75 parameters).

PCA vNot specified GitHub

$ Bash example

# Install necessary libraries (if not already installed)
# pip install scikit-learn pandas

# Create a Python script for PCA
cat << 'EOF' > run_pca.py
import pandas as pd
from sklearn.decomposition import PCA
import sys

# Expects input_file, output_file, and n_components as command-line arguments
input_file = sys.argv[1]
output_file = sys.argv[2]
n_components = int(sys.argv[3])

# Load data (assuming tab-separated, first column as index for sample IDs)
# Data should be samples x features, where samples are rows and features are columns.
# If your data is features x samples, you might need to transpose it before PCA.
data = pd.read_csv(input_file, sep="\t", index_col=0)

# Initialize PCA with the specified number of components
pca = PCA(n_components=n_components)

# Fit PCA to the data and transform it to get the principal components
principal_components = pca.fit_transform(data)

# Create a DataFrame for the principal components with meaningful column names
pc_df = pd.DataFrame(data=principal_components,
                     columns=[f'PC{i+1}' for i in range(n_components)],
                     index=data.index)

# Save the principal components to a tab-separated file
pc_df.to_csv(output_file, sep="\t")

# Optionally, save the explained variance ratio for each component
explained_variance_ratio_file = output_file.replace('.tsv', '_explained_variance_ratio.tsv')
explained_variance_ratio = pd.DataFrame(pca.explained_variance_ratio_,
                                        index=[f'PC{i+1}' for i in range(n_components)],
                                        columns=['Explained Variance Ratio'])
explained_variance_ratio.to_csv(explained_variance_ratio_file, sep="\t")
EOF

# Example usage:
# Replace 'input_matrix.tsv' with your actual preprocessed data matrix file.
# This file should be tab-separated, with the first column as sample identifiers
# and subsequent columns as features (e.g., gene expression, methylation levels).
# The output 'pca_output_D10.tsv' will contain the top 10 principal components.
python run_pca.py input_matrix.tsv pca_output_D10.tsv 10

View on GitHub

Differential gene expression analysis.

DESeq2 (Inferred with models/gemini-2.5-flash) vBioconductor 3.18 GitHub

$ Bash example

# Install R and Bioconductor if not already present
# conda create -n deseq2_env r-base bioconductor-deseq2 -c conda-forge -c bioconda
# conda activate deseq2_env

# Or, if R is already installed:
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"

# Create dummy count and metadata files for demonstration
# In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto)
# gene_counts.tsv: Tab-separated file with gene IDs as first column, sample names as header, and raw counts.
# sample_metadata.tsv: Tab-separated file with sample IDs as first column, and experimental conditions/covariates.

echo -e "gene\tsample1\tsample2\tsample3\tsample4" > gene_counts.tsv
echo -e "geneA\t100\t120\t500\t550" >> gene_counts.tsv
echo -e "geneB\t50\t60\t30\t35" >> gene_counts.tsv
echo -e "geneC\t200\t210\t220\t230" >> gene_counts.tsv
echo -e "geneD\t10\t12\t100\t110" >> gene_counts.tsv

echo -e "sample_id\tcondition" > sample_metadata.tsv
echo -e "sample1\tcontrol" >> sample_metadata.tsv
echo -e "sample2\tcontrol" >> sample_metadata.tsv
echo -e "sample3\ttreated" >> sample_metadata.tsv
echo -e "sample4\ttreated" >> sample_metadata.tsv

# Create the R script for DESeq2 analysis
cat << 'EOF' > deseq2_analysis.R
library(DESeq2)

# Load count data
# Assumes 'gene_counts.tsv' has genes as rows and samples as columns, with raw counts.
count_data <- read.table("gene_counts.tsv", header = TRUE, row.names = 1, sep = "\t")

# Load sample metadata
# Assumes 'sample_metadata.tsv' has sample names as row names and experimental conditions as columns.
sample_info <- read.table("sample_metadata.tsv", header = TRUE, row.names = 1, sep = "\t")

# Ensure sample order matches between count data and metadata
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# Create DESeqDataSet object
# The 'design' formula specifies the experimental design. Here, we compare 'condition'.
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ condition)

# Filter out genes with very low counts (optional but recommended for better power)
# Keeps genes with at least 10 total counts across all samples.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results for 'treated' vs 'control' comparison
# Adjust levels as per your actual data (e.g., c("condition", "treatment_group", "control_group"))
res <- results(dds, contrast = c("condition", "treated", "control"))

# Order results by adjusted p-value (padj)
res_ordered <- res[order(res$padj), ]

# Write differential expression results to a CSV file
write.csv(as.data.frame(res_ordered), "deseq2_results.csv")

# Optional: Generate a normalized count matrix
normalized_counts <- counts(dds, normalized=TRUE)
write.csv(as.data.frame(normalized_counts), "normalized_counts.csv")
EOF

# Execute the R script to perform differential gene expression analysis
Rscript deseq2_analysis.R

# Clean up dummy files (optional)
# rm gene_counts.tsv sample_metadata.tsv deseq2_analysis.R

View on GitHub

We performed differential gene expression analysis between all pairs of T cell sub-populations from two non-overlapping sets of rows in the log-transformed expression matrix EXPR.

limma (Inferred with models/gemini-2.5-flash) v3.58.1 GitHub

$ Bash example

#!/bin/bash

# Define input files and output directory
EXPR_MATRIX_FILE="EXPR.tsv"
METADATA_FILE="metadata.tsv"
OUTPUT_DIR="dge_results"
GROUP_COLUMN="t_cell_subpopulation" # Column in metadata defining groups

# Create dummy input files for demonstration if they don't exist
# In a real pipeline, these files would be generated by upstream steps.
if [ ! -f "$EXPR_MATRIX_FILE" ]; then
  echo "Creating dummy EXPR.tsv..."
  echo -e "Gene\tSample1\tSample2\tSample3\tSample4\tSample5\tSample6" > "$EXPR_MATRIX_FILE"
  echo -e "GeneA\t10.5\t11.2\t10.8\t15.1\t14.8\t15.5" >> "$EXPR_MATRIX_FILE"
  echo -e "GeneB\t8.2\t8.5\t8.0\t9.1\t9.3\t9.0" >> "$EXPR_MATRIX_FILE"
  echo -e "GeneC\t12.1\t11.9\t12.5\t11.5\t11.8\t12.0" >> "$EXPR_MATRIX_FILE"
  echo -e "GeneD\t5.0\t5.2\t5.1\t4.8\t4.9\t5.0" >> "$EXPR_MATRIX_FILE"
fi

if [ ! -f "$METADATA_FILE" ]; then
  echo "Creating dummy metadata.tsv..."
  echo -e "Sample\tt_cell_subpopulation\tother_info" > "$METADATA_FILE"
  echo -e "Sample1\tCD4_T_cell\tinfo1" >> "$METADATA_FILE"
  echo -e "Sample2\tCD4_T_cell\tinfo2" >> "$METADATA_FILE"
  echo -e "Sample3\tCD8_T_cell\tinfo3" >> "$METADATA_FILE"
  echo -e "Sample4\tCD8_T_cell\tinfo4" >> "$METADATA_FILE"
  echo -e "Sample5\tRegulatory_T_cell\tinfo5" >> "$METADATA_FILE"
  echo -e "Sample6\tRegulatory_T_cell\tinfo6" >> "$METADATA_FILE"
fi

# Create output directory
mkdir -p "$OUTPUT_DIR"

# R script content
R_SCRIPT=$(cat <<EOF
# Load necessary libraries
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("limma")
library(limma)

# --- Configuration ---
EXPR_MATRIX_FILE <- "$EXPR_MATRIX_FILE"
METADATA_FILE <- "$METADATA_FILE"
OUTPUT_DIR <- "$OUTPUT_DIR"
GROUP_COLUMN <- "$GROUP_COLUMN"

# --- Load Data ---
expr_matrix <- as.matrix(read.delim(EXPR_MATRIX_FILE, row.names = 1, sep="\\t"))
metadata <- read.delim(METADATA_FILE, row.names = 1, sep="\\t")

# Ensure sample names match and are in the same order
# Assuming sample names are column names in expr_matrix and row names in metadata
if (!all(colnames(expr_matrix) %in% rownames(metadata))) {
  stop("Sample names in expression matrix and metadata do not match.")
}
metadata <- metadata[colnames(expr_matrix), ] # Reorder metadata to match expr_matrix columns

# Define the grouping factor
groups <- factor(metadata[[GROUP_COLUMN]])
design <- model.matrix(~0 + groups)
colnames(design) <- levels(groups)

# --- Differential Expression Analysis ---
# Fit linear model
fit <- lmFit(expr_matrix, design)

# Define all pairwise contrasts
group_levels <- levels(groups)
contrasts_list <- list()
k <- 1
for (i in 1:(length(group_levels) - 1)) {
  for (j in (i + 1):length(group_levels)) {
    contrast_name <- paste0(group_levels[j], "_vs_", group_levels[i])
    contrast_formula <- paste0(group_levels[j], " - ", group_levels[i])
    contrasts_list[[k]] <- makeContrasts(contrasts = contrast_formula, levels = design)
    names(contrasts_list)[k] <- contrast_name
    k <- k + 1
  }
}

# Apply contrasts
contrast_matrix <- do.call(cbind, contrasts_list)
fit2 <- contrasts.fit(fit, contrast_matrix)

# Empirical Bayes moderation
fit2 <- eBayes(fit2)

# Extract and save results for each contrast
for (contrast_name in names(contrasts_list)) {
  results <- topTable(fit2, coef = contrast_name, number = Inf, adjust.method = "BH")
  output_file <- file.path(OUTPUT_DIR, paste0("dge_results_", contrast_name, ".tsv"))
  write.table(results, file = output_file, sep = "\\t", quote = FALSE, row.names = TRUE)
  message(paste("Results for", contrast_name, "saved to", output_file))
}

message("Differential gene expression analysis complete.")
EOF
)

# Execute the R script
Rscript -e "$R_SCRIPT"

View on GitHub

Since single-cell gene expression does not conform to the usual negative binomial distribution and can even be bimodal due to dropout we used two non-parametric statistical tests for heterogeneity of expression: Mann Whitney Wilcoxon (MWW, also known as MWU) which is a rank-sum test but relies on a large sample to approximate normality, and Kolmogorov-Smirnov 2-sample (KS2) test which finds the largest difference between the empirical cumulative distributions, even between two small samples such as our 1st division sub-types Div1TE (n =36) and Div1MEM (n =24).

R (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# This script demonstrates how to perform Mann-Whitney-Wilcoxon (MWW) and Kolmogorov-Smirnov 2-sample (KS2) tests using R.
# The description indicates comparing gene expression between two groups: Div1TE (n=36) and Div1MEM (n=24).
# We simulate data for demonstration purposes. In a real pipeline, 'div1te_expr' and 'div1mem_expr' would be loaded from actual gene expression data files.

# Example of how to install R (if not already installed)
# sudo apt-get update
# sudo apt-get install r-base

Rscript -e '
  # Set seed for reproducibility of simulated data
  set.seed(123)

  # Simulate gene expression data for Div1TE (n=36)
  # Assuming expression values might follow a log-normal distribution, common for gene expression
  div1te_expr <- rlnorm(36, meanlog = 1.5, sdlog = 0.8)

  # Simulate gene expression data for Div1MEM (n=24)
  # Assuming a potentially different distribution to test for heterogeneity
  div1mem_expr <- rlnorm(24, meanlog = 1.2, sdlog = 0.7)

  cat("Simulated Div1TE expression (first 5 values):\n")
  print(head(div1te_expr))
  cat("Simulated Div1MEM expression (first 5 values):\n")
  print(head(div1mem_expr))

  # Perform Mann-Whitney-Wilcoxon (MWW) test
  # This is a non-parametric test for comparing two independent samples.
  # The description implies testing for "heterogeneity of expression", so a two-sided alternative is appropriate.
  cat("\n--- Mann-Whitney-Wilcoxon Test ---\n")
  mww_result <- wilcox.test(div1te_expr, div1mem_expr, alternative = "two.sided")
  print(mww_result)

  # Perform Kolmogorov-Smirnov 2-sample (KS2) test
  # This test compares the empirical cumulative distribution functions of two samples.
  cat("\n--- Kolmogorov-Smirnov 2-sample Test ---\n")
  ks2_result <- ks.test(div1te_expr, div1mem_expr)
  print(ks2_result)

  # In a real scenario, you might save these results to a file:
  # sink("statistical_test_results.txt")
  # cat("Mann-Whitney-Wilcoxon Test Result:\n")
  # print(mww_result)
  # cat("\nKolmogorov-Smirnov 2-sample Test Result:\n")
  # print(ks2_result)
  # sink()
'

View on GitHub

Cell type classifier.

SingleR (Inferred with models/gemini-2.5-flash) vUnknown GitHub

$ Bash example

# Define input and output files
INPUT_EXPRESSION_MATRIX="expression_matrix.tsv" # Path to the input single-cell expression matrix (e.g., counts or normalized data)
REFERENCE_DATA="human_pbmc_reference.rds" # Path to a pre-built SingleR reference object (e.g., from celldex or a custom one)
OUTPUT_ANNOTATIONS="cell_type_annotations.tsv" # Output file for predicted cell type annotations

# Commented out installation instructions
# # Install BiocManager if not already installed
# # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
# # Install SingleR and celldex (if using celldex references)
# # R -e "BiocManager::install(c('SingleR', 'celldex'))"
# # Install Matrix package if needed for sparse matrices
# # R -e "install.packages('Matrix')"

# Set environment variables for the R script
export INPUT_EXPRESSION_MATRIX
export REFERENCE_DATA
export OUTPUT_ANNOTATIONS

# Execute SingleR for cell type classification
# This R script loads an expression matrix and a SingleR reference, then performs classification.
# The reference data should be an RDS file containing a list with 'data' (expression matrix) and 'labels' (cell types)
# or a SummarizedExperiment object from celldex.
Rscript -e "
    library(SingleR)
    library(Matrix) # Often needed for sparse matrices

    # Retrieve file paths from environment variables
    input_matrix_file <- Sys.getenv('INPUT_EXPRESSION_MATRIX')
    reference_file <- Sys.getenv('REFERENCE_DATA')
    output_annotations_file <- Sys.getenv('OUTPUT_ANNOTATIONS')

    # Load expression matrix
    # Assuming a tab-separated file with genes as rows and cells as columns.
    # Adjust 'read.table' parameters or use other functions (e.g., 'readRDS', 'Seurat::Read10X')
    # based on the actual format of your input expression data.
    # Example for a dense matrix:
    expr_matrix <- as.matrix(read.table(input_matrix_file, sep='\t', header=TRUE, row.names=1))
    # Example for a sparse matrix (e.g., from 10x Genomics, after loading with Matrix::readMM):
    # expr_matrix <- readMM(input_matrix_file) # If input_matrix_file is a .mtx file
    # colnames(expr_matrix) <- readLines('barcodes.tsv') # Assuming barcodes file
    # rownames(expr_matrix) <- readLines('features.tsv') # Assuming features file

    # Load SingleR reference data
    # This should be an RDS file containing a SingleR-compatible reference object.
    # For example, a list with 'data' (expression matrix) and 'labels' (cell type vector),
    # or a SummarizedExperiment object from the 'celldex' package.
    ref_obj <- readRDS(reference_file)

    # Perform cell type annotation
    # If 'ref_obj' is a SummarizedExperiment from celldex (e.g., HumanPrimaryCellAtlasData()),
    # SingleR can directly use it: predictions <- SingleR(test = expr_matrix, ref = ref_obj, labels = ref_obj$label.main)
    # If 'ref_obj' is a custom list:
    predictions <- SingleR(test = expr_matrix, ref = ref_obj$data, labels = ref_obj$labels)

    # Extract and save results
    cell_annotations <- data.frame(
        cell_id = rownames(predictions),
        predicted_label = predictions$labels,
        score = predictions$scores[, predictions$labels] # Score for the predicted label
    )
    write.table(cell_annotations, output_annotations_file, sep='\t', quote=FALSE, row.names=FALSE)
"

View on GitHub

We trained two binary T cell classifiers to identify gene expression signatures that not only differentiate the examined T cell sub-populations (like the differential gene expression described above) but can also be used to predict the âmemory-â or âeffector-nessâ of previously unseen cells.

Machine Learning Framework (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# This script demonstrates a conceptual approach to training binary T cell classifiers using Python and scikit-learn.
# It involves loading gene expression data, preparing labels, training two separate classifiers,
# and saving the trained models. Specific algorithms (e.g., SVM) are used as examples.

# Install necessary libraries (example for Python/scikit-learn)
# conda create -n ml_env python=3.9
# conda activate ml_env
# pip install pandas scikit-learn numpy joblib

# Placeholder for a Python script that would perform the classification
# This script would:
# 1. Load gene expression data (e.g., from a CSV file)
# 2. Load labels for T cell sub-populations and 'memory-ness'/'effector-ness'
# 3. Preprocess data (e.g., normalization, scaling)
# 4. Split data into training and testing sets
# 5. Train two binary classifiers (e.g., for sub-population differentiation and memory/effector prediction)
# 6. Evaluate the classifiers
# 7. Save the trained models and/or prediction results

cat << 'EOF' > train_tcell_classifiers.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC # Example classifier: Support Vector Classifier
from sklearn.metrics import accuracy_score, classification_report
import joblib # For saving models
import numpy as np

# --- Configuration --- 
GENE_EXPRESSION_FILE = "gene_expression_data.csv"
CELL_LABELS_FILE = "cell_labels.csv" # Contains 'subpopulation' and 'memory_effector_status'

# --- Load Data ---
print("Loading data...")
try:
    gene_expr_df = pd.read_csv(GENE_EXPRESSION_FILE, index_col=0)
    labels_df = pd.read_csv(CELL_LABELS_FILE, index_col=0)
except FileNotFoundError as e:
    print(f"Error loading data: {e}. Please ensure '{GENE_EXPRESSION_FILE}' and '{CELL_LABELS_FILE}' exist.")
    exit(1)

# Ensure indices match and align data
common_cells = gene_expr_df.index.intersection(labels_df.index)
gene_expr_df = gene_expr_df.loc[common_cells]
labels_df = labels_df.loc[common_cells]

X = gene_expr_df.values
y_subpop = labels_df['subpopulation'].values
y_mem_eff = labels_df['memory_effector_status'].values

# --- Preprocessing ---
print("Preprocessing data (scaling features)...")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Train Classifier 1: Differentiate T cell sub-populations ---
print("\nTraining classifier for T cell sub-populations...")
# Split data for sub-population classification
X_train_subpop, X_test_subpop, y_train_subpop, y_test_subpop = train_test_split(X_scaled, y_subpop, test_size=0.2, random_state=42, stratify=y_subpop)

# Initialize and train the classifier
classifier_subpop = SVC(kernel='linear', probability=True, random_state=42)
classifier_subpop.fit(X_train_subpop, y_train_subpop)

# Evaluate the classifier
y_pred_subpop = classifier_subpop.predict(X_test_subpop)
print("\nClassifier 1 (Sub-population) Evaluation:")
print(f"Accuracy: {accuracy_score(y_test_subpop, y_pred_subpop):.4f}")
print(classification_report(y_test_subpop, y_pred_subpop, zero_division=0))

# Save the trained model
joblib.dump(classifier_subpop, 'tcell_subpop_classifier.pkl')
print("Saved T cell sub-population classifier to 'tcell_subpop_classifier.pkl'")

# --- Train Classifier 2: Predict 'memory-' or 'effector-ness' ---
print("\nTraining classifier for 'memory-' or 'effector-ness'...")
# Split data for memory/effector classification
X_train_mem_eff, X_test_mem_eff, y_train_mem_eff, y_test_mem_eff = train_test_split(X_scaled, y_mem_eff, test_size=0.2, random_state=42, stratify=y_mem_eff)

# Initialize and train the classifier
classifier_mem_eff = SVC(kernel='rbf', probability=True, random_state=42)
classifier_mem_eff.fit(X_train_mem_eff, y_train_mem_eff)

# Evaluate the classifier
y_pred_mem_eff = classifier_mem_eff.predict(X_test_mem_eff)
print("\nClassifier 2 (Memory/Effector) Evaluation:")
print(f"Accuracy: {accuracy_score(y_test_mem_eff, y_pred_mem_eff):.4f}")
print(classification_report(y_test_mem_eff, y_pred_mem_eff, zero_division=0))

# Save the trained model
joblib.dump(classifier_mem_eff, 'tcell_mem_eff_classifier.pkl')
print("Saved T cell memory/effector classifier to 'tcell_mem_eff_classifier.pkl'")

print("\nTraining complete. Models saved.")
EOF

# Create dummy input files for demonstration purposes
echo "Creating dummy input files for demonstration..."
cat << 'EOF' > gene_expression_data.csv
cell_id,geneA,geneB,geneC,geneD,geneE,geneF,geneG,geneH,geneI,geneJ
cell1,10,20,5,15,25,30,12,18,22,8
cell2,12,22,6,16,26,32,14,20,24,10
cell3,8,18,4,14,24,28,10,16,20,6
cell4,11,21,7,17,27,31,13,19,23,9
cell5,9,19,3,13,23,29,11,17,21,7
cell6,13,23,8,18,28,33,15,21,25,11
cell7,7,17,2,12,22,27,9,15,19,5
cell8,14,24,9,19,29,34,16,22,26,12
cell9,6,16,1,11,21,26,8,14,18,4
cell10,15,25,10,20,30,35,17,23,27,13
cell11,10.5,20.5,5.5,15.5,25.5,30.5,12.5,18.5,22.5,8.5
cell12,12.5,22.5,6.5,16.5,26.5,32.5,14.5,20.5,24.5,10.5
cell13,8.5,18.5,4.5,14.5,24.5,28.5,10.5,16.5,20.5,6.5
cell14,11.5,21.5,7.5,17.5,27.5,31.5,13.5,19.5,23.5,9.5
cell15,9.5,19.5,3.5,13.5,23.5,29.5,11.5,17.5,21.5,7.5
cell16,13.5,23.5,8.5,18.5,28.5,33.5,15.5,21.5,25.5,11.5
cell17,7.5,17.5,2.5,12.5,22.5,27.5,9.5,15.5,19.5,5.5
cell18,14.5,24.5,9.5,19.5,29.5,34.5,16.5,22.5,26.5,12.5
cell19,6.5,16.5,1.5,11.5,21.5,26.5,8.5,14.5,18.5,4.5
cell20,15.5,25.5,10.5,20.5,30.5,35.5,17.5,23.5,27.5,13.5
EOF

cat << 'EOF' > cell_labels.csv
cell_id,subpopulation,memory_effector_status
cell1,CD4_Treg,Memory
cell2,CD8_Cytotoxic,Effector
cell3,CD4_Helper,Memory
cell4,CD8_Cytotoxic,Memory
cell5,CD4_Treg,Effector
cell6,CD8_Cytotoxic,Memory
cell7,CD4_Helper,Effector
cell8,CD8_Cytotoxic,Memory
cell9,CD4_Treg,Memory
cell10,CD8_Cytotoxic,Effector
cell11,CD4_Treg,Memory
cell12,CD8_Cytotoxic,Effector
cell13,CD4_Helper,Memory
cell14,CD8_Cytotoxic,Memory
cell15,CD4_Treg,Effector
cell16,CD8_Cytotoxic,Memory
cell17,CD4_Helper,Effector
cell18,CD8_Cytotoxic,Memory
cell19,CD4_Treg,Memory
cell20,CD8_Cytotoxic,Effector
EOF

# Execute the Python script to train the classifiers
python train_tcell_classifiers.py

View on GitHub

Each classifier constructed an independent ensemble of Extremely Randomized Trees.

scikit-learn (Inferred with models/gemini-2.5-flash) v1.4.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install scikit-learn and joblib if not already installed
# pip install scikit-learn joblib

# Create a Python script to train the Extremely Randomized Trees classifier
cat << 'EOF' > train_extra_trees.py
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
import joblib
import os

# Placeholder for input data (features X and labels y)
# In a real pipeline, X and y would be loaded from files (e.g., CSV, HDF5).
# X: Feature matrix (n_samples, n_features)
# y: Target labels (n_samples,)

# Example: Generate synthetic data for demonstration purposes.
# Replace this with actual data loading in a real application.
print("Generating synthetic data for demonstration...")
n_samples = 1000
n_features = 20
n_classes = 2
X = np.random.rand(n_samples, n_features)
y = np.random.randint(0, n_classes, n_samples)

# Split data into training and testing sets (optional, but good practice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Extremely Randomized Trees classifier (ExtraTreesClassifier)
# Parameters like n_estimators, max_features, min_samples_leaf, etc.,
# would typically be optimized or set based on prior knowledge or cross-validation.
# n_jobs=-1 uses all available CPU cores for parallel processing.
print("Initializing ExtraTreesClassifier...")
classifier = ExtraTreesClassifier(
    n_estimators=100,      # Number of trees in the forest
    random_state=42,       # Seed for reproducibility
    n_jobs=-1              # Use all available cores for fitting and prediction
)

# Train the classifier
print("Training the ExtraTreesClassifier...")
classifier.fit(X_train, y_train)

# Evaluate the classifier (optional)
accuracy = classifier.score(X_test, y_test)
print(f"Model accuracy on test set: {accuracy:.4f}")

# Save the trained model to a file using joblib
output_model_path = "extra_trees_classifier_model.joblib"
print(f"Saving the trained model to {output_model_path}...")
joblib.dump(classifier, output_model_path)
print("Model training and saving complete.")

# No specific reference genomic datasets are used for this general machine learning model training step.
# The model is trained on the provided input features and labels.
EOF

# Execute the Python script to train the classifier
python train_extra_trees.py

View on GitHub

Using the terminally differentiated effector and memory (TCM, TEM) populations, we built a training set for a fate classifier for CD8+ T cells.

scikit-learn (Inferred with models/gemini-2.5-flash) v1.3.0 GitHub

$ Bash example

# Install necessary libraries if not already present
# conda create -n classifier_env python=3.9 scikit-learn pandas joblib numpy
# conda activate classifier_env
# pip install scikit-learn pandas joblib numpy

# Create a dummy input data file for demonstration if it doesn't exist.
# In a real bioinformatics pipeline, 'cd8_tcell_expression_data.csv' would be
# generated by a preceding data processing step, containing features (e.g., gene expression)
# and a 'cell_fate' column (e.g., 0 for TCM, 1 for TEM).
if [ ! -f "cd8_tcell_expression_data.csv" ]; then
    echo "Creating dummy input data for demonstration..."
    python -c "
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

# Generate synthetic data representing CD8+ T cells with features and a fate label
X, y = make_classification(n_samples=200, n_features=50, n_informative=20, n_redundant=10, n_classes=2, random_state=42, weights=[0.5, 0.5])
data = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
data['cell_fate'] = y # 0 for TCM, 1 for TEM (example labels)
data.to_csv('cd8_tcell_expression_data.csv', index=False)
print('Dummy data created: cd8_tcell_expression_data.csv')
"
fi

# Create a Python script to build the fate classifier
cat << 'EOF' > classifier_training.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib # For saving the model

# Load the input data containing features and cell fate labels
try:
    data = pd.read_csv("cd8_tcell_expression_data.csv")
    X = data.drop("cell_fate", axis=1) # Features
    y = data["cell_fate"] # Target variable (cell fate: TCM/TEM)
except FileNotFoundError:
    print("Error: Input data file 'cd8_tcell_expression_data.csv' not found.")
    exit(1)

# Split the data into training and testing sets
# The training set will be used to build the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")

# Initialize and train a classifier (e.g., RandomForestClassifier)
# The choice of classifier can be adjusted based on data characteristics and performance requirements
classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
print("Training the classifier...")
classifier.fit(X_train, y_train)
print("Classifier training complete.")

# Evaluate the classifier on the test set (optional, but good practice)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Classifier accuracy on test set: {accuracy:.4f}")

# Save the trained classifier model for future use (e.g., prediction on new data)
model_filename = "cd8_tcell_fate_classifier.joblib"
joblib.dump(classifier, model_filename)
print(f"Trained classifier saved to {model_filename}")

# Save the training data (features and labels) that were used to build the classifier
# This is important for reproducibility and understanding the training set composition.
training_features_output_filename = "cd8_tcell_training_set_features.csv"
training_labels_output_filename = "cd8_tcell_training_set_labels.csv"
X_train.to_csv(training_features_output_filename, index=False)
y_train.to_csv(training_labels_output_filename, index=False)
print(f"Training set features saved to {training_features_output_filename}")
print(f"Training set labels saved to {training_labels_output_filename}")
EOF

# Execute the Python script to build the fate classifier
python classifier_training.py

echo "Classifier training process finished. A trained model and training set components are saved."

View on GitHub

Using the newly observed segregation of daughter T cells into Div1TE and Div1MEM subpopulations after the first division, we built a second training set for another early state classifier.

Machine Learning Classifier Training (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This is a conceptual script to illustrate building a training set for a classifier.
# The actual implementation would depend on the specific data format, features, and chosen machine learning library.

# Assuming input data (e.g., CSV, TSV) contains features and a 'subpopulation' column (Div1TE/Div1MEM)
# Example: data.csv
# cell_id,feature1,feature2,...,subpopulation
# C1,0.5,1.2,...,Div1TE
# C2,0.8,0.9,...,Div1MEM

# Install scikit-learn if not already installed
# pip install scikit-learn pandas numpy

# Create a Python script (e.g., build_training_set.py)
cat << 'EOF' > build_training_set.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Placeholder for input data file
# In a real scenario, this would be the output from a previous step that identified Div1TE/Div1MEM cells.
input_data_file = "processed_t_cell_data.csv" # This file should contain features and the 'subpopulation' label

# Placeholder for output training and testing set files
output_train_file = "classifier_training_set.csv"
output_test_file = "classifier_testing_set.csv"

# Placeholder for features to be used in the classifier
# Replace with actual feature column names from your data
feature_columns = ["feature1", "feature2", "feature3"] # Example features

try:
    # Load the data
    df = pd.read_csv(input_data_file)

    # Ensure the 'subpopulation' column exists
    if 'subpopulation' not in df.columns:
        raise ValueError(f"Column 'subpopulation' not found in {input_data_file}")

    # Filter for the relevant subpopulations if necessary (Div1TE, Div1MEM)
    # Assuming the input_data_file already contains only these or we need to filter
    df_filtered = df[df['subpopulation'].isin(['Div1TE', 'Div1MEM'])].copy()

    if df_filtered.empty:
        print(f"No data found for Div1TE or Div1MEM subpopulations in {input_data_file}. Exiting.")
        exit(1)

    # Prepare features (X) and labels (y)
    X = df_filtered[feature_columns]
    y = df_filtered['subpopulation']

    # Encode labels if they are strings (e.g., Div1TE -> 0, Div1MEM -> 1)
    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(y)

    # Split data into training and testing sets
    # Using a common split ratio, e.g., 80% training, 20% testing
    # random_state ensures reproducibility of the split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
    )

    # Combine features and encoded labels for saving
    train_df = pd.DataFrame(X_train, columns=feature_columns)
    train_df['label'] = y_train
    train_df['original_label'] = label_encoder.inverse_transform(y_train)

    test_df = pd.DataFrame(X_test, columns=feature_columns)
    test_df['label'] = y_test
    test_df['original_label'] = label_encoder.inverse_transform(y_test)

    # Save the training and testing sets
    train_df.to_csv(output_train_file, index=False)
    test_df.to_csv(output_test_file, index=False)

    print(f"Training set saved to {output_train_file}")
    print(f"Testing set saved to {output_test_file}")
    print(f"Label mapping: {list(label_encoder.classes_)} -> {list(range(len(label_encoder.classes_))))}")

except FileNotFoundError:
    print(f"Error: Input file '{input_data_file}' not found. Please ensure the file exists.")
    exit(1)
except ValueError as e:
    print(f"Error processing data: {e}")
    exit(1)
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    exit(1)

EOF

# Execute the Python script
python build_training_set.py

Both classifiers were fed their respective training sets using 10-fold cross-validation

scikit-learn (Inferred with models/gemini-2.5-flash) v1.x GitHub

$ Bash example

# Install scikit-learn (a common Python library for machine learning tasks)
# pip install scikit-learn

# This command represents the execution of a Python script that would implement
# 10-fold cross-validation for the specified classifiers using their respective training sets.
# The exact script name and parameters would vary based on the specific implementation.
# 'training_data.csv' is a placeholder for the input training data.
python run_classifier_cross_validation.py \
    --input_training_data training_data.csv \
    --num_folds 10 \
    --output_results cross_validation_metrics.json

View on GitHub

After both the fate and early state classifiers were trained on their respective subpopulations, they were both applied on previously unseen intermediate Day 4 CD8+ T cells.

Machine Learning Classifier Application (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

bash
# This script conceptually represents the application of pre-trained machine learning classifiers
# to new, unseen data. The specific implementation (e.g., Python with scikit-learn, R with caret)
# would depend on how the classifiers were originally trained.

# Example installation for a Python-based scikit-learn environment:
# conda create -n classifier_env python=3.9 scikit-learn pandas numpy -y
# conda activate classifier_env

# Define paths for the trained classifier models and the input data
# These models would have been generated in a previous 'training' step.
TRAINED_FATE_CLASSIFIER_MODEL="path/to/trained_fate_classifier_model.pkl" # e.g., a pickled scikit-learn model
TRAINED_EARLY_STATE_CLASSIFIER_MODEL="path/to/trained_early_state_classifier_model.pkl" # e.g., a pickled scikit-learn model

# Input data: Previously unseen intermediate Day 4 CD8+ T cells
# This could be a CSV, TSV, H5AD (for single-cell data), or other format.
INPUT_DAY4_CD8_T_CELLS_DATA="path/to/day4_cd8_t_cells_features.h5ad" # Example: AnnData object for single-cell features

# Define output paths for the predictions
OUTPUT_FATE_PREDICTIONS="path/to/day4_cd8_t_cells_fate_predictions.csv"
OUTPUT_EARLY_STATE_PREDICTIONS="path/to/day4_cd8_t_cells_early_state_predictions.csv"

# Execute a conceptual Python script to load the models and apply them to the new data.
# This script would handle loading the input data, preprocessing if necessary,
# making predictions with both classifiers, and saving the results.
python apply_trained_classifiers.py \
    --fate_model "${TRAINED_FATE_CLASSIFIER_MODEL}" \
    --early_state_model "${TRAINED_EARLY_STATE_CLASSIFIER_MODEL}" \
    --input_data "${INPUT_DAY4_CD8_T_CELLS_DATA}" \
    --output_fate_predictions "${OUTPUT_FATE_PREDICTIONS}" \
    --output_early_state_predictions "${OUTPUT_EARLY_STATE_PREDICTIONS}"

View on GitHub

Their predicted âmemory-nessâ scores were scatter-plotted and shown to correlate in Fig.

R (Inferred with models/gemini-2.5-flash) v4.x.x (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R if not already installed (example for Ubuntu/Debian)
# sudo apt-get update
# sudo apt-get install r-base

# Create a dummy input file for demonstration purposes.
# In a real scenario, 'memory_scores.csv' would be generated by a previous step
# and contain the predicted 'memory-ness' scores for two sets of data to be correlated.
# Assuming a CSV format with two columns, e.g., 'score_A' and 'score_B'.
echo "score_A,score_B" > memory_scores.csv
echo "0.1,0.15" >> memory_scores.csv
echo "0.2,0.22" >> memory_scores.csv
echo "0.3,0.35" >> memory_scores.csv
echo "0.4,0.41" >> memory_scores.csv
echo "0.5,0.53" >> memory_scores.csv
echo "0.6,0.60" >> memory_scores.csv
echo "0.7,0.75" >> memory_scores.csv
echo "0.8,0.82" >> memory_scores.csv
echo "0.9,0.90" >> memory_scores.csv

# Create an R script to perform scatter plotting and correlation calculation
cat << 'EOF' > plot_memory_scores.R
# Read the input data
data <- read.csv("memory_scores.csv")

# Create a scatter plot and save it as a PNG file
png("memory_scores_scatterplot.png", width = 800, height = 600)
plot(data$score_A, data$score_B,
     xlab = "Predicted Memory-ness Score (Set A)",
     ylab = "Predicted Memory-ness Score (Set B)",
     main = "Correlation of Predicted Memory-ness Scores",
     pch = 16, col = "blue", cex = 1.2)

# Calculate Pearson correlation coefficient
correlation <- cor(data$score_A, data$score_B, method = "pearson")

# Add correlation value to the plot
legend("topleft", legend = paste("Pearson Correlation:", round(correlation, 3)), bty = "n", cex = 1.2)

# Add a linear regression line to visualize the trend
abline(lm(data$score_B ~ data$score_A), col = "red", lwd = 2)
dev.off()

# Print the correlation coefficient to standard output
cat(paste("Pearson Correlation Coefficient:", correlation, "\n"))
EOF

# Execute the R script
Rscript plot_memory_scores.R

View on GitHub

3f.

CLIPper (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# This code block assumes step 3f refers to eCLIP peak calling, as per the provided context and common eCLIP pipeline steps.
# Parameters are set to common defaults for CLIPper in the absence of specific instructions.

# Define input and output files
IP_BAM="aligned_ip.bam" # Placeholder for aligned IP reads BAM file
INPUT_BAM="aligned_input.bam" # Placeholder for aligned input control reads BAM file
CHROM_SIZES="hg38.chrom.sizes" # Placeholder for genome chromosome sizes file (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes)
OUTPUT_BED="clipper_peaks.bed" # Output BED file for identified peaks

# Example command for CLIPper peak calling
# Assuming clipper.py is in the system's PATH or specified with its full path
# Common parameters: -p for p-value threshold (default 0.01), -w for minimum peak width (default 20)
clipper.py \
    -s $CHROM_SIZES \
    -i $IP_BAM \
    -c $INPUT_BAM \
    -p 0.01 \
    -w 20 \
    -o $OUTPUT_BED

View on GitHub

For each T cell, its âeffector-nessâ scores is 1 minus the âmemory-nessâ score and is redundant for this analysis.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This step calculates 'effector-ness' scores based on 'memory-ness' scores.
# The description states this calculation is redundant for the current analysis, but the code block demonstrates how it would be performed.

# Assuming an input file 'input_memory_ness_scores.tsv' with T cell IDs and their memory-ness scores.
# The output will be 'output_effector_ness_scores.tsv'.
# The input file is expected to be tab-separated and contain a column named 'MEMORY_NESS_SCORE'.

# Python script to calculate effector-ness
# effector_ness_calculator.py
cat << 'EOF' > effector_ness_calculator.py
import pandas as pd
import sys

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python effector_ness_calculator.py <input_memory_ness_file> <output_effector_ness_file>")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2]

    try:
        df = pd.read_csv(input_file, sep='\t')
        if 'MEMORY_NESS_SCORE' not in df.columns:
            raise ValueError("Input file must contain a 'MEMORY_NESS_SCORE' column.")

        # Calculate 'effector-ness' as 1 minus 'memory-ness'
        df['EFFECTOR_NESS_SCORE'] = 1 - df['MEMORY_NESS_SCORE']

        # Output the results
        df.to_csv(output_file, sep='\t', index=False)
        print(f"Successfully calculated effector-ness scores and saved to {output_file}")
    except FileNotFoundError:
        print(f"Error: Input file '{input_file}' not found.")
        sys.exit(1)
    except ValueError as e:
        print(f"Error processing data: {e}")
        sys.exit(1)
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        sys.exit(1)
EOF

# Execute the Python script
# Replace 'input_memory_ness_scores.tsv' with your actual input file path
# Replace 'output_effector_ness_scores.tsv' with your desired output file path
python effector_ness_calculator.py input_memory_ness_scores.tsv output_effector_ness_scores.tsv

The signature genes for each classifier were selected from all G=22,425 genes by their GINI score.

Inferred with models/gemini-2.5-flash vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install necessary Python libraries if not already installed
# pip install scikit-learn pandas numpy

# Create a Python script to select signature genes based on GINI score.
# This script simulates the process of training a classifier (e.g., RandomForest)
# and extracting feature importances (GINI-based) from a total of G=22,425 genes.
# It outputs a list of top N signature genes.
cat << 'EOF' > select_signature_genes.py
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import argparse

def main():
    parser = argparse.ArgumentParser(description="Select signature genes based on GINI score from a classifier.")
    parser.add_argument("--total_genes", type=int, default=22425, help="Total number of genes (G) available for selection.")
    parser.add_argument("--num_samples", type=int, default=100, help="Number of samples (N) for the simulated dataset.")
    parser.add_argument("--top_n_genes", type=int, default=100, help="Number of top genes to select as signature genes.")
    parser.add_argument("--output_file", type=str, default="signature_genes_gini.tsv", help="Output file for the selected signature genes.")
    args = parser.parse_args()

    G = args.total_genes
    N_samples = args.num_samples
    num_signature_genes = args.top_n_genes

    np.random.seed(42)

    # Generate dummy gene names
    gene_names = [f'gene_{i+1}' for i in range(G)]

    # Simulate gene expression data (N_samples rows, G columns)
    # Introduce some signal for a few genes to make them 'important'
    X = np.random.rand(N_samples, G) * 10
    X[:, 0] += np.random.rand(N_samples) * 5  # Make gene_1 important
    X[:, 100] -= np.random.rand(N_samples) * 5 # Make gene_101 important

    # Simulate binary labels based on the 'important' genes
    y = (X[:, 0] + X[:, 100] > np.median(X[:, 0] + X[:, 100])).astype(int)

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.0, random_state=42) # Use all for training for simplicity

    # Train a RandomForestClassifier
    # RandomForest uses GINI impurity by default for splitting nodes and calculating feature importances.
    classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    classifier.fit(X_train, y_train)

    # Get feature importances (GINI-based scores)
    gini_importances = classifier.feature_importances_

    # Create a DataFrame to store gene names and their GINI scores
    feature_importance_df = pd.DataFrame({
        'gene': gene_names,
        'gini_score': gini_importances
    })

    # Sort genes by GINI score in descending order
    feature_importance_df = feature_importance_df.sort_values(by='gini_score', ascending=False)

    # Select the top N genes as signature genes
    signature_genes_df = feature_importance_df.head(num_signature_genes)

    print(f"Selected {num_signature_genes} signature genes based on GINI score from {G} total genes.")
    signature_genes_df.to_csv(args.output_file, sep="\t", index=False)

if __name__ == "__main__":
    main()
EOF

# Execute the Python script to select signature genes.
# The --total_genes parameter reflects the G=22,425 genes mentioned in the description.
# The --top_n_genes parameter is an example of how many signature genes might be selected.
python select_signature_genes.py --total_genes 22425 --top_n_genes 100 --output_file signature_genes_gini.tsv

View on GitHub

Temporal expression trajectories through inferred lineage paths.

Monocle (Inferred with models/gemini-2.5-flash) vMonocle 3 GitHub

$ Bash example

# This script demonstrates how to perform temporal expression trajectory analysis using Monocle 3.
# It assumes you have an R environment set up with Monocle 3 installed.

# --- Installation (commented out) ---
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install(version = "3.18") # Use your Bioconductor version
# BiocManager::install(c('monocle3'))
# install.packages("devtools")
# devtools::install_github('cole-trapnell-lab/monocle3')

# --- Execution ---
# Assuming 'cds_object.rds' is a pre-processed CellDataSet object saved from a previous step.
# This object typically contains normalized expression data, cell metadata, and gene metadata.

Rscript -e '
library(monocle3)

# Load pre-processed CellDataSet object
# Replace "path/to/your/cds_object.rds" with the actual path to your input data.
# This object should contain expression data, cell metadata, and gene metadata.
# For demonstration, we will assume `cds` is already loaded or created from previous steps.
# In a real pipeline, you would load your actual CDS object:
# cds <- readRDS("path/to/your/cds_object.rds")

# Placeholder for a loaded CDS object. In a real scenario, `cds` would be the output
# of upstream steps like `preprocess_cds`, `reduce_dimension`, and `cluster_cells`.
# For example, if starting from a count matrix:
# expression_matrix <- as.matrix(read.csv("counts.csv", row.names = 1))
# cell_metadata <- read.csv("cell_metadata.csv", row.names = 1)
# gene_metadata <- read.csv("gene_metadata.csv", row.names = 1)
# cds <- new_cell_data_set(expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_metadata)
# cds <- preprocess_cds(cds, num_dim = 100)
# cds <- reduce_dimension(cds, reduction_method = "UMAP")
# cds <- cluster_cells(cds)

# For this example, we assume `cds` is a pre-processed and dimension-reduced CellDataSet object.
# If you are running this as a standalone step, ensure `cds` is properly initialized.

# Learn the trajectory graph
cds <- learn_graph(cds)

# Order cells in pseudotime
# You might need to specify a root node for the trajectory based on biological knowledge.
# This often involves selecting a cell or a cluster that represents the start of the lineage.
# Example: root_cells <- cells_in_partition(cds)[which(colData(cds)$cell_type == "Stem_Cell")]
# If no specific root cells are provided, Monocle will attempt to infer one.
cds <- order_cells(cds)

# Plot the trajectory
# You can color cells by pseudotime, cell type, or other metadata
plot_cells(cds,
           color_cells_by = "pseudotime",
           label_groups_by_cluster = FALSE,
           label_leaves = TRUE,
           label_branch_points = TRUE,
           graph_label_size = 1.5)

# Save the updated CellDataSet object with pseudotime information
saveRDS(cds, "monocle3_trajectory_cds.rds")

# You can also extract pseudotime values:
pseudotime_values <- pseudotime(cds)
write.csv(as.data.frame(pseudotime_values), "cell_pseudotime.csv")

# Further analysis: differential expression along pseudotime (optional)
# gene_fits <- fit_models(cds, model_formula_str = "~pseudotime")
# pseudotime_de_results <- differential_test(gene_fits, full_model_formula_str = "~pseudotime",
#                                            reduced_model_formula_str = "~1")
# write.csv(pseudotime_de_results, "pseudotime_de_genes.csv")
'

View on GitHub

To understand the temporal dynamics of expression for key genes along the effector and memory lineages, we constructed hypothetical differentiation time-courses for each lineage.

Custom R script (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Install R if not already present (example for Debian/Ubuntu)
# sudo apt-get update
# sudo apt-get install r-base

# Create a placeholder R script for demonstration purposes.
# In a real scenario, this script would contain the actual logic
# for modeling gene expression over time and constructing hypothetical time-courses.
cat << 'EOF' > construct_time_courses.R
#!/usr/bin/env Rscript

# This is a placeholder R script to demonstrate the execution of a custom analysis.
# In a real scenario, this script would contain the logic to:
# 1. Load gene expression data (e.g., normalized counts, TPMs).
# 2. Load sample metadata including lineage and time point information.
# 3. For each gene and lineage, model the expression changes over time.
#    This could involve spline fitting, LOESS regression, or other time-series analysis methods.
# 4. Generate "hypothetical differentiation time-courses" by predicting expression
#    at finely spaced time points along the observed differentiation trajectory.
# 5. Save the results (e.g., predicted expression values, fitted models, plots).

# Example usage: Rscript construct_time_courses.R <expression_file> <metadata_file> <output_prefix>

args <- commandArgs(trailingOnly = TRUE)

if (length(args) < 3) {
  stop("Usage: Rscript construct_time_courses.R <expression_file> <metadata_file> <output_prefix>", call. = FALSE)
}

expression_file <- args[1]
metadata_file <- args[2]
output_prefix <- args[3]

message(paste0("Simulating construction of time-courses using:\n",
               "  Expression file: ", expression_file, "\n",
               "  Metadata file: ", metadata_file, "\n",
               "  Output prefix: ", output_prefix))

# Simulate creating an output file
output_file <- paste0(output_prefix, "_hypothetical_time_courses_summary.txt")
writeLines(c(
  "# Hypothetical Differentiation Time-Courses Summary",
  paste0("Analysis performed on expression data from: ", expression_file),
  paste0("Using metadata from: ", metadata_file),
  "This file represents a summary of the constructed hypothetical differentiation time-courses.",
  "Actual output would include fitted curves, predicted expression values, or visualizations."
), output_file)

message(paste0("Simulated output saved to: ", output_file))

EOF

# Make the R script executable
chmod +x construct_time_courses.R

# Create dummy input files for demonstration
# gene_expression_data.tsv: Contains normalized gene expression values (e.g., TPMs, counts)
#                           Rows are genes, columns are samples.
# sample_metadata.tsv: Contains metadata for each sample, including lineage and time point.

echo "gene_id\tsample1\tsample2\tsample3\tsample4" > gene_expression_data.tsv
echo "geneA\t10.5\t20.1\t30.8\t40.2" >> gene_expression_data.tsv
echo "geneB\t50.3\t45.7\t40.1\t35.9" >> gene_expression_data.tsv
echo "geneC\t5.2\t15.8\t25.1\t35.5" >> gene_expression_data.tsv

echo "sample_id\tlineage\ttime_point" > sample_metadata.tsv
echo "sample1\teffector\t0" >> sample_metadata.tsv
echo "sample2\teffector\t1" >> sample_metadata.tsv
echo "sample3\tmemory\t0" >> sample_metadata.tsv
echo "sample4\tmemory\t1" >> sample_metadata.tsv

# Execute the R script to construct hypothetical differentiation time-courses
# Input: gene_expression_data.tsv (gene expression matrix)
#        sample_metadata.tsv (sample annotations)
# Output prefix: lineage_time_course_analysis (output files will be prefixed with this)
./construct_time_courses.R gene_expression_data.tsv sample_metadata.tsv lineage_time_course_analysis

View on GitHub

Briefly, we sampled with replacement 50 cells from each population and constructed all trajectories through the cross-product of populations ordered in a particular lineage.

Single-cell trajectory construction and robustness analysis (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# This step describes a conceptual methodology for single-cell trajectory inference, which involves sampling cells and constructing trajectories based on population cross-products.
# Without a specific software tool mentioned (e.g., Monocle, Slingshot, PAGA, etc.), it is not possible to provide a concrete bash command.
#
# The description implies the following conceptual steps, which would typically be implemented in a scripting language like R or Python:
#
# 1.  **Sampling**: For each biological population, randomly sample 50 cells with replacement.
#     Example (conceptual pseudocode, not executable bash):
#     # sampled_cells_per_population = {}
#     # for population_id in all_population_ids:
#     #     cells_in_population = get_cells_for_population(population_id)
#     #     sampled_cells_per_population[population_id] = random_sample_with_replacement(cells_in_population, num_cells=50)
#
# 2.  **Trajectory Construction (Cross-product)**: Construct all possible ordered trajectories (sequences of populations) that adhere to a predefined lineage structure.
#     This involves generating combinations or permutations of populations based on the lineage graph.
#     Example (conceptual pseudocode, not executable bash):
#     # all_lineage_paths = generate_all_ordered_lineage_paths(population_graph, lineage_constraints)
#
# 3.  **Trajectory Inference**: For each constructed lineage path, combine the corresponding sampled cells and perform a trajectory inference analysis.
#     This is where a specific trajectory inference tool would be applied.
#     Example (conceptual pseudocode, not executable bash):
#     # for lineage_path in all_lineage_paths:
#     #     combined_data_for_path = combine_data_from_sampled_cells(sampled_cells_per_population, lineage_path)
#     #     # Execute a specific trajectory inference tool (e.g., Monocle3, Slingshot, PAGA) on combined_data_for_path
#     #     # e.g., run_monocle3(combined_data_for_path, output_prefix=f"trajectory_{lineage_path}")
#
# This is a high-level description of a method, not a specific tool's execution command.

View on GitHub

35

These orders were determined a priori based on our earlier work with similar timecourses of RT-qPCR study (Arsenio et. al.

Statistical Analysis Software (Inferred with models/gemini-2.5-flash) vN/A

Nature Immunology 2014).

N/A (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# No specific bioinformatics step or tool described in "Nature Immunology 2014)".
# Therefore, no execution command or parameters can be inferred.

View on GitHub

37
Specifically, the âeffector lineageâ starts from the naÃ¯ve population, and progresses through the Div1TE subpopulation, then onto Day 4, and finally Day 7.

(Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash)
$ Bash example
```
No specific bioinformatics tool or computational step is described. This description outlines a biological progression or experimental design, not a command-line execution.
```
38

In contrast, the âeffector memoryâ and âcentral memory lineagesâ start from naÃ¯ve, through the Div1MEM subpopulation, ending with TEM and TCM respectively.

(Inferred with models/gemini-2.5-flash) vN/A

These bootstrapped trajectories were visually summarized by a seaborn timeseries plot which links the average expression for each population sample with a solid line segment and represents the 95% confidence interval by a shaded area around it.

seaborn vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install seaborn and its dependencies if not already installed
# conda install -c conda-forge seaborn pandas numpy matplotlib

# Create a Python script for plotting
cat << 'EOF' > plot_timeseries.py
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# --- Simulate input data ---
# In a real pipeline, this DataFrame would be loaded from a file (e.g., CSV, Parquet)
# generated by the preceding bootstrapping step. It should contain columns for
# time points, population samples, and individual bootstrapped expression values.
# For demonstration, we simulate it here.
np.random.seed(42)
num_time_points = 10
num_populations = 3
num_bootstrap_samples_per_point = 100 # Number of bootstrap samples per time point and population

data = []
for t in range(num_time_points):
    for pop_idx in range(num_populations):
        population_name = f'Population_{chr(65 + pop_idx)}'
        # Simulate expression values for bootstrap samples
        # Mean expression changes over time and differs by population
        mean_expr = 5 + t * 0.5 + pop_idx * 1.5
        bootstrap_expressions = np.random.normal(loc=mean_expr, scale=1.0, size=num_bootstrap_samples_per_point)
        for expr in bootstrap_expressions:
            data.append({'time': t, 'population_sample': population_name, 'expression': expr})

df = pd.DataFrame(data)

# --- Generate the seaborn timeseries plot ---
plt.figure(figsize=(10, 6))
sns.lineplot(
    data=df,
    x='time',
    y='expression',
    hue='population_sample',
    errorbar=('ci', 95), # Calculate and display 95% confidence interval
    marker='o' # Optional: add markers for clarity
)

plt.title('Average Expression Over Time with 95% Confidence Interval')
plt.xlabel('Time Point')
plt.ylabel('Expression Level')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(title='Population Sample')
plt.tight_layout()

# Save the plot to a file
output_plot_file = "timeseries_expression_plot.png"
plt.savefig(output_plot_file)
print(f"Plot saved to {output_plot_file}")
EOF

# Execute the Python script to generate the plot
python plot_timeseries.py

View on GitHub

Tools Used

STAR HOMER RNA-seq

Raw Source Text

Basecalls performed using CASAVA version 1.4
ChIP-seq reads were aligned to the mm10 genome assembly using STAR version 2.4.1b with the following options --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --limitOutSJoneRead --outFilterScoreMin 3 --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 --outFilterMismatchNoverLmax 0.3 and the other parameter kept as default
BAM files of sample1-11 are downsampled, if needed, using samtools to ensure sample 1-4, sample 5-8, sample 9-11 have same read numbers.
for sample 1-4: peaks were called using MACS version 1.4.2 using sample 5-8 as backgournd, with the following setting -g mm --nomodel --nolambda and others as default.
for sample 9-11: tag directories were first generated by HOMER makeTagDirectory with -keepOne -tbp 1 and other options as default.
for sample 9 and 11: peakd were call by HOMER findpeak using sample 12 as background and -style factor -size 100 -fragLength 50 with other options as default
for sample 10: peakd were call by HOMER findpeak using sample 12 as background and -style histone -fragLength 50 -size 200 -minDist 1000 -L 0 with other options as default
for sample 13-16: transcript count were generated by kallisto version 0.42.4 using GENCODE GRCm38.p4 transcriptome as reference, with the following parameter: -l 200 -s 20 --single and other setting as default
For all Single Cell Samples the following processing steps apply:
Single-cell RNA-seq data pre-processing. Single-cell mRNA sequencing data from 256 murine CD8+ T cells were processed with a bioinformatics pipeline focusing on quality control (QC) and robust expression quantification. For each cell, raw RNA-seq reads were: checked for quality metrics with fastqc (v0.10.1); poly-A and adaptor-trimmed with cutadapt (v1.8.1); quantified by kallisto (v0.42.1) to a reference transcriptome (Gencode vM3) without bias correction; and aligned by STAR (v2.4.1b) to the reference mouse genome (mm10) with default parameters for quality control and downstream analysis. Next, the transcript per million (TPM) outputs of kallisto for all cells were combined into a cell-by-gene expression matrix (C=288 cells=rows, G=22425 genes=columns) by summing the expression values for all quantified transcripts of a given gene. Finally, the TPM value for each cell c and gene g was natural log-transformed to yield a normalized expression value: EXPRc,g = ln(1+TPMc,g).
Dimensionality reduction and cell heterogeneity visualization. To reduce the dimensionality of the cell-by-gene expression matrix EXPR and visualize the diversity of gene expression among CD8+ T cells of different subtypes in a 2-dimensional scatter plot, we applied the t-distributed Stochastic Neighborhood Embedding (tSNE) algorithm via its Barnes-Hut approximation (bhSNE). We first applied standard Principal Components Analysis (PCA) to reduce the dimensionality down to D=10, and only then applied bhSNE to visualize in D=2 (with perplexity=30 and theta=0.75 parameters).
Differential gene expression analysis. We performed differential gene expression analysis between all pairs of T cell sub-populations from two non-overlapping sets of rows in the log-transformed expression matrix EXPR. Since single-cell gene expression does not conform to the usual negative binomial distribution and can even be bimodal due to dropout we used two non-parametric statistical tests for heterogeneity of expression: Mann Whitney Wilcoxon (MWW, also known as MWU) which is a rank-sum test but relies on a large sample to approximate normality, and Kolmogorov-Smirnov 2-sample (KS2) test which finds the largest difference between the empirical cumulative distributions, even between two small samples such as our 1st division sub-types Div1TE (n =36) and Div1MEM (n =24).
Cell type classifier. We trained two binary T cell classifiers to identify gene expression signatures that not only differentiate the examined T cell sub-populations (like the differential gene expression described above) but can also be used to predict the âmemory-â or âeffector-nessâ of previously unseen cells. Each classifier constructed an independent ensemble of Extremely Randomized Trees. Using the terminally differentiated effector and memory (TCM, TEM) populations, we built a training set for a fate classifier for CD8+ T cells. Using the newly observed segregation of daughter T cells into Div1TE and Div1MEM subpopulations after the first division, we built a second training set for another early state classifier. Both classifiers were fed their respective training sets using 10-fold cross-validation
After both the fate and early state classifiers were trained on their respective subpopulations, they were both applied on previously unseen intermediate Day 4 CD8+ T cells. Their predicted âmemory-nessâ scores were scatter-plotted and shown to correlate in Fig. 3f. For each T cell, its âeffector-nessâ scores is 1 minus the âmemory-nessâ score and is redundant for this analysis. The signature genes for each classifier were selected from all G=22,425 genes by their GINI score.
Temporal expression trajectories through inferred lineage paths. To understand the temporal dynamics of expression for key genes along the effector and memory lineages, we constructed hypothetical differentiation time-courses for each lineage. Briefly, we sampled with replacement 50 cells from each population and constructed all trajectories through the cross-product of populations ordered in a particular lineage. These orders were determined a priori based on our earlier work with similar timecourses of RT-qPCR study (Arsenio et. al. Nature Immunology 2014). Specifically, the âeffector lineageâ starts from the naÃ¯ve population, and progresses through the Div1TE subpopulation, then onto Day 4, and finally Day 7. In contrast, the âeffector memoryâ and âcentral memory lineagesâ start from naÃ¯ve, through the Div1MEM subpopulation, ending with TEM and TCM respectively. These bootstrapped trajectories were visually summarized by a seaborn timeseries plot which links the average expression for each population sample with a solid line segment and represents the 95% confidence interval by a shaded area around it.
Genome_build: mm10
Supplementary_files_format_and_content: peak BED files generated from MACS and HOMER. Transcript Expression CSV tables generated from kallisto.

← Back to Analysis