GSE89405 Processing Pipeline

RNA-Seq code_examples 39 steps

Publication

Early transcriptional and epigenetic regulation of CD8<sup>+</sup> T cell differentiation revealed by single-cell RNA sequencing.

Nature immunology (2017) — PMID 28218746

Dataset

GSE89405

Early transcriptional and epigenetic regulation of CD8+ T cell differentiation revealed by single-cell RNA-seq

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Basecalls performed using CASAVA version 1.4

    CASAVA v1.4
    $ Bash example
    # CASAVA (Consensus Assessment of Sequence And Variation) is an older, proprietary Illumina software suite for base calling and initial data processing.
    # It is not typically installed via public package managers like conda.
    # Installation would involve obtaining the software package directly from Illumina.
    
    # Example command for base calling and demultiplexing using CASAVA 1.4's configureBclToFastq.pl script.
    # This script processes BCL files from an Illumina run directory into FASTQ files,
    # using a sample sheet for demultiplexing.
    # Replace <path_to_illumina_run_directory>, <path_to_output_fastq_directory>, and <path_to_sample_sheet.csv> with actual paths.
    
    # The exact path to configureBclToFastq.pl might vary based on CASAVA installation.
    # This command assumes the script is in the PATH or specified with its full path.
    configureBclToFastq.pl \
      --input-dir <path_to_illumina_run_directory>/Data/Intensities/BaseCalls \
      --output-dir <path_to_output_fastq_directory> \
      --sample-sheet <path_to_sample_sheet.csv>
  2. 2

    ChIP-seq reads were aligned to the mm10 genome assembly using STAR version 2.4.1b with the following options --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --limitOutSJoneRead --outFilterScoreMin 3 --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 --outFilterMismatchNoverLmax 0.3 and the other parameter kept as default

    $ Bash example
    # Install STAR if not already available
    # conda install -c bioconda star=2.4.1b
    
    # Define variables for clarity
    STAR_GENOME_DIR="/path/to/genome/mm10/STAR_index" # Placeholder for STAR genome index directory. This directory should contain genome.fa, SA, SAindex, etc.
    # Input reads: Assuming paired-end. For single-end, remove READ2 and adjust --readFilesIn to only "${READ1}"
    READ1="input_reads_R1.fastq.gz" # Placeholder for input ChIP-seq reads R1
    READ2="input_reads_R2.fastq.gz" # Placeholder for input ChIP-seq reads R2
    OUTPUT_PREFIX="star_alignment_output/" # Prefix for output files (e.g., star_alignment_output/Aligned.sortedByCoordinate.out.bam)
    
    # Run STAR alignment
    # Note: --runThreadN is often specified for performance but is not explicitly mentioned in the description.
    #       It's omitted here to strictly adhere to "other parameter kept as default".
    STAR \
      --genomeDir "${STAR_GENOME_DIR}" \
      --readFilesIn "${READ1}" "${READ2}" \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outFilterMultimapNmax 10 \
      --outFilterMultimapScoreRange 1 \
      --limitOutSJoneRead \
      --outFilterScoreMin 3 \
      --outFilterScoreMinOverLread 0.66 \
      --outFilterMatchNminOverLread 0.66 \
      --outFilterMismatchNoverLmax 0.3
  3. 3

    BAM files of sample1-11 are downsampled, if needed, using samtools to ensure sample 1-4, sample 5-8, sample 9-11 have same read numbers.

    samtools v1.15.1 GitHub
    $ Bash example
    # Install samtools if not available
    # conda install -c bioconda samtools
    
    # Define input BAM files (replace with actual file paths if different)
    BAM_FILES=(
        "sample1.bam" "sample2.bam" "sample3.bam" "sample4.bam"
        "sample5.bam" "sample6.bam" "sample7.bam" "sample8.bam"
        "sample9.bam" "sample10.bam" "sample11.bam"
    )
    
    # Create an associative array to store read counts
    declare -A READ_COUNTS
    
    # Calculate read counts for each BAM file and find the global minimum
    MIN_READS=-1
    for bam_file in "${BAM_FILES[@]}"; do
        if [ ! -f "$bam_file" ]; then
            echo "Error: Input BAM file '$bam_file' not found. Please ensure all input files exist." >&2
            exit 1
        fi
        current_reads=$(samtools view -c "$bam_file")
        READ_COUNTS["$bam_file"]=$current_reads
        echo "Reads in $bam_file: $current_reads"
    
        if (( MIN_READS == -1 || current_reads < MIN_READS )); then
            MIN_READS=$current_reads
        fi
    done
    
    if (( MIN_READS == -1 )); then
        echo "Error: No valid BAM files processed to determine minimum read count." >&2
        exit 1
    fi
    
    echo "Target read count for downsampling (global minimum): $MIN_READS"
    
    # Downsample BAM files if their read count is above the target minimum
    # A fixed random seed (e.g., 42) is used for reproducibility.
    for bam_file in "${BAM_FILES[@]}"; do
        current_reads=${READ_COUNTS["$bam_file"]}
        output_bam="${bam_file%.bam}_downsampled.bam"
    
        if (( current_reads > MIN_READS )); then
            # Calculate the fraction for samtools -s. samtools -s takes a seed and a fraction (e.g., 42.0.5 for 50% with seed 42).
            fraction=$(awk "BEGIN {printf \"%.6f\", $MIN_READS / $current_reads}")
            seed_fraction="42.$fraction" # Using a fixed seed 42
    
            echo "Downsampling $bam_file from $current_reads to $MIN_READS reads (fraction: $fraction) -> $output_bam"
            samtools view -b -s "$seed_fraction" "$bam_file" > "$output_bam"
            samtools index "$output_bam"
        else
            echo "$bam_file already has $current_reads reads, which is <= target $MIN_READS. Copying to $output_bam."
            cp "$bam_file" "$output_bam"
            samtools index "$output_bam" # Ensure index exists for consistency
        fi
    done
  4. 4

    for sample 1-4: peaks were called using MACS version 1.4.2 using sample 5-8 as backgournd, with the following setting -g mm --nomodel --nolambda and others as default.

    MACS2 v1.4.2 GitHub
    $ Bash example
    # Install MACS2 (if not already installed)
    # conda install -c bioconda macs2
    
    # Define input and control BAM files
    TREATMENT_BAMS="sample1.bam sample2.bam sample3.bam sample4.bam"
    CONTROL_BAMS="sample5.bam sample6.bam sample7.bam sample8.bam"
    
    # Define output prefix
    OUTPUT_PREFIX="peaks_sample1-4_vs_sample5-8"
    
    # Define genome size (e.g., mm10 for mouse)
    GENOME_SIZE="mm"
    
    # Run MACS2 callpeak
    macs2 callpeak \
        -t ${TREATMENT_BAMS} \
        -c ${CONTROL_BAMS} \
        -g ${GENOME_SIZE} \
        --nomodel \
        --nolambda \
        -n ${OUTPUT_PREFIX} \
        --outdir .
  5. 5

    for sample 9-11: tag directories were first generated by HOMER makeTagDirectory with -keepOne -tbp 1 and other options as default.

    HOMER vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install HOMER (if not already installed)
    # For example, using conda:
    # conda install -c bioconda homer
    
    # Generate tag directories for sample 9
    makeTagDirectory sample_9_tag_dir sample_9.bam -keepOne -tbp 1
    
    # Generate tag directories for sample 10
    makeTagDirectory sample_10_tag_dir sample_10.bam -keepOne -tbp 1
    
    # Generate tag directories for sample 11
    makeTagDirectory sample_11_tag_dir sample_11.bam -keepOne -tbp 1
    
    # Note: Replace 'sample_X.bam' with the actual input BAM files for each sample
    # and 'sample_X_tag_dir' with the desired output tag directory names.
  6. 6

    for sample 9 and 11: peakd were call by HOMER findpeak using sample 12 as background and -style factor -size 100 -fragLength 50 with other options as default

    $ Bash example
    # Install HOMER (if not already installed)
    # conda install -c bioconda homer
    
    # Assuming input BAM files are available in the current directory
    # and are named sample9.bam, sample11.bam, and sample12.bam
    
    # Peak calling for sample 9
    findPeaks sample9.bam \
        -o sample9_peaks.txt \
        -i sample12.bam \
        -style factor \
        -size 100 \
        -fragLength 50
    
    # Peak calling for sample 11
    findPeaks sample11.bam \
        -o sample11_peaks.txt \
        -i sample12.bam \
        -style factor \
        -size 100 \
        -fragLength 50
  7. 7

    for sample 10: peakd were call by HOMER findpeak using sample 12 as background and -style histone -fragLength 50 -size 200 -minDist 1000 -L 0 with other options as default

    HOMER vInfer from description GitHub
    $ Bash example
    # Install HOMER (if not already installed)
    # For example, using conda:
    # conda install -c bioconda homer
    # Or from source:
    # wget http://homer.ucsd.edu/homer/configureHomer.pl
    # perl configureHomer.pl -install
    
    # Assuming sample10.bam is the treatment BAM file and sample12.bam is the background BAM file
    # And these files are located in the current directory or specified with full paths
    findPeaks sample10.bam -o sample10_peaks.txt -i sample12.bam -style histone -fragLength 50 -size 200 -minDist 1000 -L 0
  8. 8

    for sample 13-16: transcript count were generated by kallisto version 0.42.4 using GENCODE GRCm38.p4 transcriptome as reference, with the following parameter: -l 200 -s 20 --single and other setting as default

    kallisto v0.42.4 GitHub
    $ Bash example
    # Install kallisto if not already installed
    # conda install -c bioconda kallisto
    
    # Placeholder for kallisto index generation (if not already done)
    # The GENCODE GRCm38.p4 transcriptome FASTA file would be needed here.
    # For example, if the FASTA file is gencode.vM4.transcripts.fa.gz:
    # kallisto index -i gencode.vM4.transcripts.idx gencode.vM4.transcripts.fa.gz
    
    # Run kallisto quantification for a single sample (e.g., sample 13)
    # Assuming input fastq is sample_13.fastq.gz and output directory is kallisto_output_sample_13
    kallisto quant -i gencode.vM4.transcripts.idx -o kallisto_output_sample_13 --single -l 200 -s 20 sample_13.fastq.gz
  9. 9

    For all Single Cell Samples the following processing steps apply:

    (Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # This is a placeholder script for single-cell sample processing.
    # The actual tools and commands depend heavily on the specific single-cell assay (e.g., scRNA-seq, scATAC-seq)
    # and the technology used (e.g., 10x Genomics, Smart-seq2).
    
    # Define reference genome (placeholder for human hg38)
    # Replace with the actual path to your reference genome and annotation files.
    REFERENCE_GENOME_FASTA="/path/to/human/hg38/GRCh38.primary_assembly.genome.fa"
    REFERENCE_GENOME_GTF="/path/to/human/hg38/gencode.v38.annotation.gtf"
    
    # For 10x Genomics data, a Cell Ranger reference transcriptome is typically required:
    # CELLRANGER_TRANSCRIPTOME_REF="/path/to/cellranger/refdata-gex-GRCh38-2020-A"
    
    # Example conceptual steps for single-cell data processing:
    
    # 1. Quality Control (e.g., using FastQC)
    # fastqc -o qc_output/ sample_R1.fastq.gz sample_R2.fastq.gz
    
    # 2. Demultiplexing and Barcode Processing (if applicable, e.g., for 10x Genomics)
    # For 10x Genomics, this is often done with 'cellranger mkfastq' or handled by the sequencing provider.
    # cellranger mkfastq --id=sample_id --run=/path/to/sequencing_run_folder --csv=/path/to/sample_sheet.csv
    
    # 3. Alignment and Quantification (e.g., for scRNA-seq)
    # For 10x Genomics scRNA-seq, 'cellranger count' is the standard tool:
    # cellranger count \
    #   --id=sample_id \
    #   --transcriptome=${CELLRANGER_TRANSCRIPTOME_REF} \
    #   --fastqs=/path/to/fastqs_directory \
    #   --sample=sample_name \
    #   --expect-cells=3000
    
    # For generic scRNA-seq (e.g., Smart-seq2, using a splice-aware aligner like STAR and a quantifier like Salmon):
    # # Build STAR index (run once per reference genome)
    # # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index --genomeFastaFiles ${REFERENCE_GENOME_FASTA} --sjdbGTFfile ${REFERENCE_GENOME_GTF} --runThreadN 8
    
    # # Align reads with STAR
    # # STAR \
    # #   --genomeDir /path/to/STAR_index \
    # #   --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    # #   --readFilesCommand zcat \
    # #   --outFileNamePrefix star_output/sample_ \
    # #   --outSAMtype BAM SortedByCoordinate \
    # #   --outSAMunmapped Within \
    # #   --outFilterType BySJout \
    # #   --outFilterMultimapNmax 20 \
    # #   --alignSJDBoverhangMin 1 \
    # #   --alignSJoverhangMin 8 \
    # #   --alignIntronMin 20 \
    # #   --alignIntronMax 1000000 \
    # #   --alignMatesGapMax 1000000 \
    # #   --runThreadN 8
    
    # # Quantify transcripts with Salmon (requires a Salmon index built from a transcriptome FASTA)
    # # salmon index -t /path/to/transcripts.fa -i salmon_index
    # # salmon quant \
    # #   -i salmon_index \
    # #   -l A \
    # #   -1 sample_R1.fastq.gz \
    # #   -2 sample_R2.fastq.gz \
    # #   -o salmon_output/sample_quant \
    # #   --validateMappings \
    # #   --gcBias
    
    # Further downstream analysis typically involves dimension reduction, clustering, and differential expression analysis
    # using tools like Seurat, Scanpy, or Bioconductor packages in R.
  10. 10

    Single-cell RNA-seq data pre-processing.

    RNA-seq v7.1.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Cell Ranger (example using conda, though 10x Genomics provides binaries and recommends direct download)
    # conda create -n cellranger_env -c bioconda cellranger=7.1.0
    # conda activate cellranger_env
    
    # Download a pre-built human transcriptome reference (e.g., GRCh38 from 10x Genomics)
    # mkdir -p /path/to/references
    # cd /path/to/references
    # wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
    # tar -xzf refdata-gex-GRCh38-2020-A.tar.gz
    
    # Run cellranger count for single-cell RNA-seq pre-processing (alignment and quantification)
    # Replace /path/to/fastq_files with the actual directory containing your FASTQ files.
    # Ensure FASTQ files follow 10x Genomics naming conventions (e.g., SAMPLE_S1_L001_R1_001.fastq.gz).
    # Replace my_sample_analysis with a unique identifier for your analysis output directory.
    # Replace my_sample_prefix with the prefix of your FASTQ files (e.g., 'SAMPLE' if files are SAMPLE_S1_L001_R1_001.fastq.gz).
    cellranger count \
        --id=my_sample_analysis \
        --transcriptome=/path/to/references/refdata-gex-GRCh38-2020-A \
        --fastqs=/path/to/fastq_files \
        --sample=my_sample_prefix \
        --localcores=8 \
        --localmem=64
  11. 11

    Single-cell mRNA sequencing data from 256 murine CD8+ T cells were processed with a bioinformatics pipeline focusing on quality control (QC) and robust expression quantification.

    $ Bash example
    # Install Salmon and FastQC (example using conda)
    # conda create -n sc_rna_seq_env salmon fastqc -y
    # conda activate sc_rna_seq_env
    
    # --- Reference Data Setup ---
    # Download murine (GRCm39) cDNA transcriptome from Ensembl (example release 110)
    # mkdir -p reference
    # wget -P reference ftp://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/cdna/Mus_musculus.GRCm39.cdna.all.fa.gz
    
    # Build Salmon index for GRCm39 transcriptome
    # salmon index -t reference/Mus_musculus.GRCm39.cdna.all.fa.gz -i salmon_index_GRCm39 --gencode
    
    # --- Quality Control (QC) ---
    # Assuming raw paired-end reads are named 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz'
    # Replace with actual input file names
    mkdir -p fastqc_results
    fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o fastqc_results
    
    # --- Robust Expression Quantification with Salmon ---
    # Replace 'salmon_index_GRCm39' with the path to your built Salmon index
    # Replace 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz' with actual input file names
    # -l A: Auto-detect library type (common for scRNA-seq)
    # -p 8: Use 8 threads for quantification
    # --validateMappings: Perform additional validation of mappings for improved accuracy
    mkdir -p salmon_quant_output
    salmon quant -i salmon_index_GRCm39 -l A \
                 -1 sample_R1.fastq.gz \
                 -2 sample_R2.fastq.gz \
                 -p 8 --validateMappings \
                 -o salmon_quant_output/sample_quant
  12. 12

    For each cell, raw RNA-seq reads were: checked for quality metrics with fastqc (v0.10.1); poly-A and adaptor-trimmed with cutadapt (v1.8.1); quantified by kallisto (v0.42.1) to a reference transcriptome (Gencode vM3) without bias correction; and aligned by STAR (v2.4.1b) to the reference mouse genome (mm10) with default parameters for quality control and downstream analysis.

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Placeholder for STAR genome index for the reference mouse genome (mm10/GRCm38).
    # You would typically build this once using the mm10 genome FASTA and GTF files.
    # Example for building index (uncomment and modify paths as needed):
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/mm10_star_index \
    #      --genomeFastaFiles /path/to/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
    #      --sjdbGTFfile /path/to/Mus_musculus.GRCm38.98.gtf.gz \
    #      --runThreadN 8 # Adjust threads as needed
    
    # Align RNA-seq reads to the mm10 reference genome using STAR with default parameters.
    # Input: reads.fastq.gz (replace with your actual input file, e.g., from fastqc/cutadapt output)
    # Output: Aligned BAM file (e.g., aligned_reads_Aligned.sortedByCoord.out.bam)
    STAR --genomeDir /path/to/mm10_star_index \
         --readFilesIn reads.fastq.gz \
         --runThreadN 8 \
         --outFileNamePrefix aligned_reads_ \
         --outSAMtype BAM SortedByCoordinate
  13. 13

    Next, the transcript per million (TPM) outputs of kallisto for all cells were combined into a cell-by-gene expression matrix (C=288 cells=rows, G=22425 genes=columns) by summing the expression values for all quantified transcripts of a given gene.

    kallisto v0.46.2 GitHub
    $ Bash example
    # Install kallisto (if not already installed)
    # conda install -c bioconda kallisto
    # pip install pandas
    
    # --- Reference Data Placeholders ---
    # For kallisto quantification, a transcriptome index is needed.
    # Example: Homo_sapiens.GRCh38.cdna.all.fa.gz from Ensembl
    # kallisto index -i Homo_sapiens.GRCh38.idx Homo_sapiens.GRCh38.cdna.all.fa.gz
    
    # A gene-transcript mapping file is crucial for summing transcript TPMs to gene TPMs.
    # This can be generated by parsing a GTF/GFF file (e.g., Homo_sapiens.GRCh38.109.gtf from Ensembl).
    # Example command to create gene_transcript_map.tsv (gene_id\ttranscript_id):
    # Replace 'Homo_sapiens.GRCh38.109.gtf' with your actual GTF file path.
    # awk '$3 == "transcript" { 
    #     gene_id=""; transcript_id=""; 
    #     split($9, a, ";"); 
    #     for (i in a) { 
    #         if (a[i] ~ /gene_id/) gene_id=substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); 
    #         if (a[i] ~ /transcript_id/) transcript_id=substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1) 
    #     } 
    #     if (gene_id && transcript_id) print gene_id "\t" transcript_id 
    # }' Homo_sapiens.GRCh38.109.gtf > gene_transcript_map.tsv
    
    # --- Define Input and Output Paths ---
    # Directory containing kallisto output subdirectories for each cell (e.g., cell1/abundance.tsv, cell2/abundance.tsv)
    KALLISTO_OUTPUT_DIR="kallisto_quant_outputs"
    # Path to the gene-transcript mapping file
    GENE_TRANSCRIPT_MAP="gene_transcript_map.tsv"
    # Output file for the combined cell-by-gene expression matrix
    OUTPUT_MATRIX="cell_by_gene_expression_matrix.tsv"
    
    # --- Create a Python script to combine kallisto outputs ---
    # This script will read abundance.tsv files, map transcripts to genes, 
    # sum TPMs per gene, and combine into a single cell-by-gene matrix.
    cat << 'EOF' > merge_kallisto_gene_tpm.py
    import pandas as pd
    import os
    import sys
    
    kallisto_output_dir = sys.argv[1]
    gene_transcript_map_file = sys.argv[2]
    output_matrix_file = sys.argv[3]
    
    # Load gene-transcript map
    # Assuming gene_transcript_map.tsv has two columns: gene_id\ttranscript_id
    gene_map = pd.read_csv(gene_transcript_map_file, sep='\t', header=None, names=['gene_id', 'transcript_id'])
    transcript_to_gene = gene_map.set_index('transcript_id')['gene_id'].to_dict()
    
    all_gene_tpm_dfs = []
    
    # Iterate through each cell's kallisto output directory
    for cell_dir_name in sorted(os.listdir(kallisto_output_dir)):
        full_cell_dir = os.path.join(kallisto_output_dir, cell_dir_name)
        if os.path.isdir(full_cell_dir):
            abundance_file = os.path.join(full_cell_dir, 'abundance.tsv')
            if os.path.exists(abundance_file):
                cell_name = cell_dir_name
                
                # Load abundance data (target_id, length, eff_length, est_counts, tpm)
                df_abundance = pd.read_csv(abundance_file, sep='\t')
                
                # Map transcripts (target_id) to genes and sum TPMs
                df_abundance['gene_id'] = df_abundance['target_id'].map(transcript_to_gene)
                
                # Drop transcripts that couldn't be mapped to a gene (if any, e.g., non-coding RNAs not in map)
                df_abundance = df_abundance.dropna(subset=['gene_id'])
                
                # Sum TPMs by gene
                gene_tpm_series = df_abundance.groupby('gene_id')['tpm'].sum()
                all_gene_tpm_dfs.append(gene_tpm_series.rename(cell_name))
            else:
                print(f"Warning: {abundance_file} not found for {cell_dir_name}", file=sys.stderr)
    
    if not all_gene_tpm_dfs:
        print("Error: No kallisto abundance files found to process.", file=sys.stderr)
        sys.exit(1)
    
    # Combine all gene TPM series into a single DataFrame
    # Use outer join to include all genes across all cells, filling missing values with 0
    final_matrix = pd.concat(all_gene_tpm_dfs, axis=1).fillna(0)
    
    # Transpose the matrix to have cells as rows and genes as columns, as per description
    # (C=288 cells=rows, G=22425 genes=columns)
    final_matrix = final_matrix.T
    
    # Write the final matrix to a TSV file
    final_matrix.to_csv(output_matrix_file, sep='\t', index=True, index_label='CellID')
    print(f"Successfully created {output_matrix_file}")
    EOF
    
    # --- Execute the Python script ---
    # Ensure Python and pandas are installed in your environment.
    python merge_kallisto_gene_tpm.py "${KALLISTO_OUTPUT_DIR}" "${GENE_TRANSCRIPT_MAP}" "${OUTPUT_MATRIX}"
    
    # --- Clean up (optional) ---
    # rm merge_kallisto_gene_tpm.py
    
  14. 14

    Finally, the TPM value for each cell c and gene g was natural log-transformed to yield a normalized expression value: EXPRc,g = ln(1+TPMc,g).

    awk (Inferred with models/gemini-2.5-flash) vGNU Awk (gawk) (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Assuming input_tpm_file.tsv is a tab-separated file with a header.
    # Replace '3' with the actual 1-indexed column number containing TPM values.
    # If your file is comma-separated, change 'FS=OFS="\t"' to 'FS=OFS=","'.
    awk 'BEGIN {FS=OFS="\t"} NR==1 {print $0, "EXPR_value"; next} {print $0, log(1+$3)}' input_tpm_file.tsv > output_expr_file.tsv
  15. 15

    Dimensionality reduction and cell heterogeneity visualization.

    Seurat (Inferred with models/gemini-2.5-flash) v5.0.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R and Seurat if not already present
    # conda install -c conda-forge r-seurat -y
    
    # This command assumes an R script named 'run_seurat_dr.R' exists
    # that performs dimensionality reduction (e.g., PCA, UMAP) and visualization
    # on a Seurat object. The script would typically take an input Seurat object,
    # process it, and save an updated object and a visualization plot.
    #
    # Example content for 'run_seurat_dr.R':
    # library(Seurat)
    # args <- commandArgs(trailingOnly = TRUE)
    # input_file <- args[1]
    # output_file <- args[2]
    # output_plot <- args[3]
    # seurat_obj <- readRDS(input_file)
    # seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(object = seurat_obj))
    # seurat_obj <- RunUMAP(seurat_obj, dims = 1:30) # Adjust dims as needed
    # png(output_plot, width = 800, height = 600)
    # DimPlot(seurat_obj, reduction = "umap", group.by = "seurat_clusters") # Adjust group.by
    # dev.off()
    # saveRDS(seurat_obj, output_file)
    
    # Execute the R script for dimensionality reduction and visualization
    Rscript run_seurat_dr.R input_seurat_object.rds processed_seurat_object.rds umap_visualization.png
  16. 16

    To reduce the dimensionality of the cell-by-gene expression matrix EXPR and visualize the diversity of gene expression among CD8+ T cells of different subtypes in a 2-dimensional scatter plot, we applied the t-distributed Stochastic Neighborhood Embedding (tSNE) algorithm via its Barnes-Hut approximation (bhSNE).

    t-SNE v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Create a dummy EXPR.csv for demonstration if it doesn't exist
    # In a real pipeline, 'EXPR.csv' would be the actual cell-by-gene expression matrix.
    if [ ! -f EXPR.csv ]; then
        echo "Creating dummy EXPR.csv for demonstration..."
        python -c "
    import pandas as pd
    import numpy as np
    num_cells = 100
    num_genes = 500
    EXPR = pd.DataFrame(np.random.rand(num_cells, num_genes),
                        index=[f'cell_{i}' for i in range(num_cells)],
                        columns=[f'gene_{j}' for j in range(num_genes)])
    EXPR.to_csv('EXPR.csv', index=False)
    print('Dummy EXPR.csv created.')
    "
    fi
    
    # Create the Python script for t-SNE dimensionality reduction
    cat << 'EOF' > run_tsne.py
    import pandas as pd
    import numpy as np
    from sklearn.manifold import TSNE
    import sys
    
    # Load the cell-by-gene expression matrix
    # Replace 'EXPR.csv' with the actual path to your expression matrix
    # Ensure the matrix is in a format suitable for TSNE (e.g., cells as rows, genes as columns)
    try:
        EXPR = pd.read_csv("EXPR.csv")
    except FileNotFoundError:
        print("Error: EXPR.csv not found. Please provide the input expression matrix.", file=sys.stderr)
        sys.exit(1)
    
    # Initialize t-SNE with Barnes-Hut approximation (default for sklearn's TSNE)
    # n_components=2 for 2-dimensional visualization as specified in the description.
    # perplexity and learning_rate are crucial parameters for t-SNE, often tuned.
    # Using common default values here as they are not specified in the description.
    # n_jobs=-1 utilizes all available CPU cores for faster computation.
    tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42, method='barnes_hut', n_jobs=-1)
    
    # Apply t-SNE to the expression matrix
    tsne_results = tsne.fit_transform(EXPR)
    
    # Create a DataFrame for the 2-dimensional t-SNE coordinates
    tsne_df = pd.DataFrame(data=tsne_results, columns=['tSNE_1', 'tSNE_2'])
    # If EXPR had cell IDs as an index, you might want to add them back:
    # tsne_df.index = EXPR.index
    
    # Save the 2-dimensional t-SNE coordinates to a CSV file
    tsne_df.to_csv("tsne_2d_coordinates.csv", index=False)
    
    print("t-SNE dimensionality reduction complete. Results saved to tsne_2d_coordinates.csv")
    EOF
    
    # Install required Python packages if not already installed
    # conda install scikit-learn pandas numpy
    # pip install scikit-learn pandas numpy
    
    # Execute the t-SNE Python script
    python run_tsne.py
  17. 17

    We first applied standard Principal Components Analysis (PCA) to reduce the dimensionality down to D=10, and only then applied bhSNE to visualize in D=2 (with perplexity=30 and theta=0.75 parameters).

    PCA vNot specified GitHub
    $ Bash example
    # Install necessary libraries (if not already installed)
    # pip install scikit-learn pandas
    
    # Create a Python script for PCA
    cat << 'EOF' > run_pca.py
    import pandas as pd
    from sklearn.decomposition import PCA
    import sys
    
    # Expects input_file, output_file, and n_components as command-line arguments
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    n_components = int(sys.argv[3])
    
    # Load data (assuming tab-separated, first column as index for sample IDs)
    # Data should be samples x features, where samples are rows and features are columns.
    # If your data is features x samples, you might need to transpose it before PCA.
    data = pd.read_csv(input_file, sep="\t", index_col=0)
    
    # Initialize PCA with the specified number of components
    pca = PCA(n_components=n_components)
    
    # Fit PCA to the data and transform it to get the principal components
    principal_components = pca.fit_transform(data)
    
    # Create a DataFrame for the principal components with meaningful column names
    pc_df = pd.DataFrame(data=principal_components,
                         columns=[f'PC{i+1}' for i in range(n_components)],
                         index=data.index)
    
    # Save the principal components to a tab-separated file
    pc_df.to_csv(output_file, sep="\t")
    
    # Optionally, save the explained variance ratio for each component
    explained_variance_ratio_file = output_file.replace('.tsv', '_explained_variance_ratio.tsv')
    explained_variance_ratio = pd.DataFrame(pca.explained_variance_ratio_,
                                            index=[f'PC{i+1}' for i in range(n_components)],
                                            columns=['Explained Variance Ratio'])
    explained_variance_ratio.to_csv(explained_variance_ratio_file, sep="\t")
    EOF
    
    # Example usage:
    # Replace 'input_matrix.tsv' with your actual preprocessed data matrix file.
    # This file should be tab-separated, with the first column as sample identifiers
    # and subsequent columns as features (e.g., gene expression, methylation levels).
    # The output 'pca_output_D10.tsv' will contain the top 10 principal components.
    python run_pca.py input_matrix.tsv pca_output_D10.tsv 10
  18. 18

    Differential gene expression analysis.

    DESeq2 (Inferred with models/gemini-2.5-flash) vBioconductor 3.18 GitHub
    $ Bash example
    # Install R and Bioconductor if not already present
    # conda create -n deseq2_env r-base bioconductor-deseq2 -c conda-forge -c bioconda
    # conda activate deseq2_env
    
    # Or, if R is already installed:
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"
    
    # Create dummy count and metadata files for demonstration
    # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto)
    # gene_counts.tsv: Tab-separated file with gene IDs as first column, sample names as header, and raw counts.
    # sample_metadata.tsv: Tab-separated file with sample IDs as first column, and experimental conditions/covariates.
    
    echo -e "gene\tsample1\tsample2\tsample3\tsample4" > gene_counts.tsv
    echo -e "geneA\t100\t120\t500\t550" >> gene_counts.tsv
    echo -e "geneB\t50\t60\t30\t35" >> gene_counts.tsv
    echo -e "geneC\t200\t210\t220\t230" >> gene_counts.tsv
    echo -e "geneD\t10\t12\t100\t110" >> gene_counts.tsv
    
    echo -e "sample_id\tcondition" > sample_metadata.tsv
    echo -e "sample1\tcontrol" >> sample_metadata.tsv
    echo -e "sample2\tcontrol" >> sample_metadata.tsv
    echo -e "sample3\ttreated" >> sample_metadata.tsv
    echo -e "sample4\ttreated" >> sample_metadata.tsv
    
    # Create the R script for DESeq2 analysis
    cat << 'EOF' > deseq2_analysis.R
    library(DESeq2)
    
    # Load count data
    # Assumes 'gene_counts.tsv' has genes as rows and samples as columns, with raw counts.
    count_data <- read.table("gene_counts.tsv", header = TRUE, row.names = 1, sep = "\t")
    
    # Load sample metadata
    # Assumes 'sample_metadata.tsv' has sample names as row names and experimental conditions as columns.
    sample_info <- read.table("sample_metadata.tsv", header = TRUE, row.names = 1, sep = "\t")
    
    # Ensure sample order matches between count data and metadata
    sample_info <- sample_info[colnames(count_data), , drop = FALSE]
    
    # Create DESeqDataSet object
    # The 'design' formula specifies the experimental design. Here, we compare 'condition'.
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = ~ condition)
    
    # Filter out genes with very low counts (optional but recommended for better power)
    # Keeps genes with at least 10 total counts across all samples.
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Get results for 'treated' vs 'control' comparison
    # Adjust levels as per your actual data (e.g., c("condition", "treatment_group", "control_group"))
    res <- results(dds, contrast = c("condition", "treated", "control"))
    
    # Order results by adjusted p-value (padj)
    res_ordered <- res[order(res$padj), ]
    
    # Write differential expression results to a CSV file
    write.csv(as.data.frame(res_ordered), "deseq2_results.csv")
    
    # Optional: Generate a normalized count matrix
    normalized_counts <- counts(dds, normalized=TRUE)
    write.csv(as.data.frame(normalized_counts), "normalized_counts.csv")
    EOF
    
    # Execute the R script to perform differential gene expression analysis
    Rscript deseq2_analysis.R
    
    # Clean up dummy files (optional)
    # rm gene_counts.tsv sample_metadata.tsv deseq2_analysis.R
  19. 19

    We performed differential gene expression analysis between all pairs of T cell sub-populations from two non-overlapping sets of rows in the log-transformed expression matrix EXPR.

    limma (Inferred with models/gemini-2.5-flash) v3.58.1 GitHub
    $ Bash example
    #!/bin/bash
    
    # Define input files and output directory
    EXPR_MATRIX_FILE="EXPR.tsv"
    METADATA_FILE="metadata.tsv"
    OUTPUT_DIR="dge_results"
    GROUP_COLUMN="t_cell_subpopulation" # Column in metadata defining groups
    
    # Create dummy input files for demonstration if they don't exist
    # In a real pipeline, these files would be generated by upstream steps.
    if [ ! -f "$EXPR_MATRIX_FILE" ]; then
      echo "Creating dummy EXPR.tsv..."
      echo -e "Gene\tSample1\tSample2\tSample3\tSample4\tSample5\tSample6" > "$EXPR_MATRIX_FILE"
      echo -e "GeneA\t10.5\t11.2\t10.8\t15.1\t14.8\t15.5" >> "$EXPR_MATRIX_FILE"
      echo -e "GeneB\t8.2\t8.5\t8.0\t9.1\t9.3\t9.0" >> "$EXPR_MATRIX_FILE"
      echo -e "GeneC\t12.1\t11.9\t12.5\t11.5\t11.8\t12.0" >> "$EXPR_MATRIX_FILE"
      echo -e "GeneD\t5.0\t5.2\t5.1\t4.8\t4.9\t5.0" >> "$EXPR_MATRIX_FILE"
    fi
    
    if [ ! -f "$METADATA_FILE" ]; then
      echo "Creating dummy metadata.tsv..."
      echo -e "Sample\tt_cell_subpopulation\tother_info" > "$METADATA_FILE"
      echo -e "Sample1\tCD4_T_cell\tinfo1" >> "$METADATA_FILE"
      echo -e "Sample2\tCD4_T_cell\tinfo2" >> "$METADATA_FILE"
      echo -e "Sample3\tCD8_T_cell\tinfo3" >> "$METADATA_FILE"
      echo -e "Sample4\tCD8_T_cell\tinfo4" >> "$METADATA_FILE"
      echo -e "Sample5\tRegulatory_T_cell\tinfo5" >> "$METADATA_FILE"
      echo -e "Sample6\tRegulatory_T_cell\tinfo6" >> "$METADATA_FILE"
    fi
    
    # Create output directory
    mkdir -p "$OUTPUT_DIR"
    
    # R script content
    R_SCRIPT=$(cat <<EOF
    # Load necessary libraries
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install("limma")
    library(limma)
    
    # --- Configuration ---
    EXPR_MATRIX_FILE <- "$EXPR_MATRIX_FILE"
    METADATA_FILE <- "$METADATA_FILE"
    OUTPUT_DIR <- "$OUTPUT_DIR"
    GROUP_COLUMN <- "$GROUP_COLUMN"
    
    # --- Load Data ---
    expr_matrix <- as.matrix(read.delim(EXPR_MATRIX_FILE, row.names = 1, sep="\\t"))
    metadata <- read.delim(METADATA_FILE, row.names = 1, sep="\\t")
    
    # Ensure sample names match and are in the same order
    # Assuming sample names are column names in expr_matrix and row names in metadata
    if (!all(colnames(expr_matrix) %in% rownames(metadata))) {
      stop("Sample names in expression matrix and metadata do not match.")
    }
    metadata <- metadata[colnames(expr_matrix), ] # Reorder metadata to match expr_matrix columns
    
    # Define the grouping factor
    groups <- factor(metadata[[GROUP_COLUMN]])
    design <- model.matrix(~0 + groups)
    colnames(design) <- levels(groups)
    
    # --- Differential Expression Analysis ---
    # Fit linear model
    fit <- lmFit(expr_matrix, design)
    
    # Define all pairwise contrasts
    group_levels <- levels(groups)
    contrasts_list <- list()
    k <- 1
    for (i in 1:(length(group_levels) - 1)) {
      for (j in (i + 1):length(group_levels)) {
        contrast_name <- paste0(group_levels[j], "_vs_", group_levels[i])
        contrast_formula <- paste0(group_levels[j], " - ", group_levels[i])
        contrasts_list[[k]] <- makeContrasts(contrasts = contrast_formula, levels = design)
        names(contrasts_list)[k] <- contrast_name
        k <- k + 1
      }
    }
    
    # Apply contrasts
    contrast_matrix <- do.call(cbind, contrasts_list)
    fit2 <- contrasts.fit(fit, contrast_matrix)
    
    # Empirical Bayes moderation
    fit2 <- eBayes(fit2)
    
    # Extract and save results for each contrast
    for (contrast_name in names(contrasts_list)) {
      results <- topTable(fit2, coef = contrast_name, number = Inf, adjust.method = "BH")
      output_file <- file.path(OUTPUT_DIR, paste0("dge_results_", contrast_name, ".tsv"))
      write.table(results, file = output_file, sep = "\\t", quote = FALSE, row.names = TRUE)
      message(paste("Results for", contrast_name, "saved to", output_file))
    }
    
    message("Differential gene expression analysis complete.")
    EOF
    )
    
    # Execute the R script
    Rscript -e "$R_SCRIPT"
  20. 20

    Since single-cell gene expression does not conform to the usual negative binomial distribution and can even be bimodal due to dropout we used two non-parametric statistical tests for heterogeneity of expression: Mann Whitney Wilcoxon (MWW, also known as MWU) which is a rank-sum test but relies on a large sample to approximate normality, and Kolmogorov-Smirnov 2-sample (KS2) test which finds the largest difference between the empirical cumulative distributions, even between two small samples such as our 1st division sub-types Div1TE (n =36) and Div1MEM (n =24).

    R (Inferred with models/gemini-2.5-flash) vNot specified GitHub
    $ Bash example
    # This script demonstrates how to perform Mann-Whitney-Wilcoxon (MWW) and Kolmogorov-Smirnov 2-sample (KS2) tests using R.
    # The description indicates comparing gene expression between two groups: Div1TE (n=36) and Div1MEM (n=24).
    # We simulate data for demonstration purposes. In a real pipeline, 'div1te_expr' and 'div1mem_expr' would be loaded from actual gene expression data files.
    
    # Example of how to install R (if not already installed)
    # sudo apt-get update
    # sudo apt-get install r-base
    
    Rscript -e '
      # Set seed for reproducibility of simulated data
      set.seed(123)
    
      # Simulate gene expression data for Div1TE (n=36)
      # Assuming expression values might follow a log-normal distribution, common for gene expression
      div1te_expr <- rlnorm(36, meanlog = 1.5, sdlog = 0.8)
    
      # Simulate gene expression data for Div1MEM (n=24)
      # Assuming a potentially different distribution to test for heterogeneity
      div1mem_expr <- rlnorm(24, meanlog = 1.2, sdlog = 0.7)
    
      cat("Simulated Div1TE expression (first 5 values):\n")
      print(head(div1te_expr))
      cat("Simulated Div1MEM expression (first 5 values):\n")
      print(head(div1mem_expr))
    
      # Perform Mann-Whitney-Wilcoxon (MWW) test
      # This is a non-parametric test for comparing two independent samples.
      # The description implies testing for "heterogeneity of expression", so a two-sided alternative is appropriate.
      cat("\n--- Mann-Whitney-Wilcoxon Test ---\n")
      mww_result <- wilcox.test(div1te_expr, div1mem_expr, alternative = "two.sided")
      print(mww_result)
    
      # Perform Kolmogorov-Smirnov 2-sample (KS2) test
      # This test compares the empirical cumulative distribution functions of two samples.
      cat("\n--- Kolmogorov-Smirnov 2-sample Test ---\n")
      ks2_result <- ks.test(div1te_expr, div1mem_expr)
      print(ks2_result)
    
      # In a real scenario, you might save these results to a file:
      # sink("statistical_test_results.txt")
      # cat("Mann-Whitney-Wilcoxon Test Result:\n")
      # print(mww_result)
      # cat("\nKolmogorov-Smirnov 2-sample Test Result:\n")
      # print(ks2_result)
      # sink()
    '
    
  21. 21

    Cell type classifier.

    SingleR (Inferred with models/gemini-2.5-flash) vUnknown GitHub
    $ Bash example
    # Define input and output files
    INPUT_EXPRESSION_MATRIX="expression_matrix.tsv" # Path to the input single-cell expression matrix (e.g., counts or normalized data)
    REFERENCE_DATA="human_pbmc_reference.rds" # Path to a pre-built SingleR reference object (e.g., from celldex or a custom one)
    OUTPUT_ANNOTATIONS="cell_type_annotations.tsv" # Output file for predicted cell type annotations
    
    # Commented out installation instructions
    # # Install BiocManager if not already installed
    # # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
    # # Install SingleR and celldex (if using celldex references)
    # # R -e "BiocManager::install(c('SingleR', 'celldex'))"
    # # Install Matrix package if needed for sparse matrices
    # # R -e "install.packages('Matrix')"
    
    # Set environment variables for the R script
    export INPUT_EXPRESSION_MATRIX
    export REFERENCE_DATA
    export OUTPUT_ANNOTATIONS
    
    # Execute SingleR for cell type classification
    # This R script loads an expression matrix and a SingleR reference, then performs classification.
    # The reference data should be an RDS file containing a list with 'data' (expression matrix) and 'labels' (cell types)
    # or a SummarizedExperiment object from celldex.
    Rscript -e "
        library(SingleR)
        library(Matrix) # Often needed for sparse matrices
    
        # Retrieve file paths from environment variables
        input_matrix_file <- Sys.getenv('INPUT_EXPRESSION_MATRIX')
        reference_file <- Sys.getenv('REFERENCE_DATA')
        output_annotations_file <- Sys.getenv('OUTPUT_ANNOTATIONS')
    
        # Load expression matrix
        # Assuming a tab-separated file with genes as rows and cells as columns.
        # Adjust 'read.table' parameters or use other functions (e.g., 'readRDS', 'Seurat::Read10X')
        # based on the actual format of your input expression data.
        # Example for a dense matrix:
        expr_matrix <- as.matrix(read.table(input_matrix_file, sep='\t', header=TRUE, row.names=1))
        # Example for a sparse matrix (e.g., from 10x Genomics, after loading with Matrix::readMM):
        # expr_matrix <- readMM(input_matrix_file) # If input_matrix_file is a .mtx file
        # colnames(expr_matrix) <- readLines('barcodes.tsv') # Assuming barcodes file
        # rownames(expr_matrix) <- readLines('features.tsv') # Assuming features file
    
        # Load SingleR reference data
        # This should be an RDS file containing a SingleR-compatible reference object.
        # For example, a list with 'data' (expression matrix) and 'labels' (cell type vector),
        # or a SummarizedExperiment object from the 'celldex' package.
        ref_obj <- readRDS(reference_file)
    
        # Perform cell type annotation
        # If 'ref_obj' is a SummarizedExperiment from celldex (e.g., HumanPrimaryCellAtlasData()),
        # SingleR can directly use it: predictions <- SingleR(test = expr_matrix, ref = ref_obj, labels = ref_obj$label.main)
        # If 'ref_obj' is a custom list:
        predictions <- SingleR(test = expr_matrix, ref = ref_obj$data, labels = ref_obj$labels)
    
        # Extract and save results
        cell_annotations <- data.frame(
            cell_id = rownames(predictions),
            predicted_label = predictions$labels,
            score = predictions$scores[, predictions$labels] # Score for the predicted label
        )
        write.table(cell_annotations, output_annotations_file, sep='\t', quote=FALSE, row.names=FALSE)
    "
  22. 22

    We trained two binary T cell classifiers to identify gene expression signatures that not only differentiate the examined T cell sub-populations (like the differential gene expression described above) but can also be used to predict the ‘memory-‘ or ‘effector-ness’ of previously unseen cells.

    Machine Learning Framework (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # This script demonstrates a conceptual approach to training binary T cell classifiers using Python and scikit-learn.
    # It involves loading gene expression data, preparing labels, training two separate classifiers,
    # and saving the trained models. Specific algorithms (e.g., SVM) are used as examples.
    
    # Install necessary libraries (example for Python/scikit-learn)
    # conda create -n ml_env python=3.9
    # conda activate ml_env
    # pip install pandas scikit-learn numpy joblib
    
    # Placeholder for a Python script that would perform the classification
    # This script would:
    # 1. Load gene expression data (e.g., from a CSV file)
    # 2. Load labels for T cell sub-populations and 'memory-ness'/'effector-ness'
    # 3. Preprocess data (e.g., normalization, scaling)
    # 4. Split data into training and testing sets
    # 5. Train two binary classifiers (e.g., for sub-population differentiation and memory/effector prediction)
    # 6. Evaluate the classifiers
    # 7. Save the trained models and/or prediction results
    
    cat << 'EOF' > train_tcell_classifiers.py
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC # Example classifier: Support Vector Classifier
    from sklearn.metrics import accuracy_score, classification_report
    import joblib # For saving models
    import numpy as np
    
    # --- Configuration --- 
    GENE_EXPRESSION_FILE = "gene_expression_data.csv"
    CELL_LABELS_FILE = "cell_labels.csv" # Contains 'subpopulation' and 'memory_effector_status'
    
    # --- Load Data ---
    print("Loading data...")
    try:
        gene_expr_df = pd.read_csv(GENE_EXPRESSION_FILE, index_col=0)
        labels_df = pd.read_csv(CELL_LABELS_FILE, index_col=0)
    except FileNotFoundError as e:
        print(f"Error loading data: {e}. Please ensure '{GENE_EXPRESSION_FILE}' and '{CELL_LABELS_FILE}' exist.")
        exit(1)
    
    # Ensure indices match and align data
    common_cells = gene_expr_df.index.intersection(labels_df.index)
    gene_expr_df = gene_expr_df.loc[common_cells]
    labels_df = labels_df.loc[common_cells]
    
    X = gene_expr_df.values
    y_subpop = labels_df['subpopulation'].values
    y_mem_eff = labels_df['memory_effector_status'].values
    
    # --- Preprocessing ---
    print("Preprocessing data (scaling features)...")
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # --- Train Classifier 1: Differentiate T cell sub-populations ---
    print("\nTraining classifier for T cell sub-populations...")
    # Split data for sub-population classification
    X_train_subpop, X_test_subpop, y_train_subpop, y_test_subpop = train_test_split(X_scaled, y_subpop, test_size=0.2, random_state=42, stratify=y_subpop)
    
    # Initialize and train the classifier
    classifier_subpop = SVC(kernel='linear', probability=True, random_state=42)
    classifier_subpop.fit(X_train_subpop, y_train_subpop)
    
    # Evaluate the classifier
    y_pred_subpop = classifier_subpop.predict(X_test_subpop)
    print("\nClassifier 1 (Sub-population) Evaluation:")
    print(f"Accuracy: {accuracy_score(y_test_subpop, y_pred_subpop):.4f}")
    print(classification_report(y_test_subpop, y_pred_subpop, zero_division=0))
    
    # Save the trained model
    joblib.dump(classifier_subpop, 'tcell_subpop_classifier.pkl')
    print("Saved T cell sub-population classifier to 'tcell_subpop_classifier.pkl'")
    
    # --- Train Classifier 2: Predict 'memory-' or 'effector-ness' ---
    print("\nTraining classifier for 'memory-' or 'effector-ness'...")
    # Split data for memory/effector classification
    X_train_mem_eff, X_test_mem_eff, y_train_mem_eff, y_test_mem_eff = train_test_split(X_scaled, y_mem_eff, test_size=0.2, random_state=42, stratify=y_mem_eff)
    
    # Initialize and train the classifier
    classifier_mem_eff = SVC(kernel='rbf', probability=True, random_state=42)
    classifier_mem_eff.fit(X_train_mem_eff, y_train_mem_eff)
    
    # Evaluate the classifier
    y_pred_mem_eff = classifier_mem_eff.predict(X_test_mem_eff)
    print("\nClassifier 2 (Memory/Effector) Evaluation:")
    print(f"Accuracy: {accuracy_score(y_test_mem_eff, y_pred_mem_eff):.4f}")
    print(classification_report(y_test_mem_eff, y_pred_mem_eff, zero_division=0))
    
    # Save the trained model
    joblib.dump(classifier_mem_eff, 'tcell_mem_eff_classifier.pkl')
    print("Saved T cell memory/effector classifier to 'tcell_mem_eff_classifier.pkl'")
    
    print("\nTraining complete. Models saved.")
    EOF
    
    # Create dummy input files for demonstration purposes
    echo "Creating dummy input files for demonstration..."
    cat << 'EOF' > gene_expression_data.csv
    cell_id,geneA,geneB,geneC,geneD,geneE,geneF,geneG,geneH,geneI,geneJ
    cell1,10,20,5,15,25,30,12,18,22,8
    cell2,12,22,6,16,26,32,14,20,24,10
    cell3,8,18,4,14,24,28,10,16,20,6
    cell4,11,21,7,17,27,31,13,19,23,9
    cell5,9,19,3,13,23,29,11,17,21,7
    cell6,13,23,8,18,28,33,15,21,25,11
    cell7,7,17,2,12,22,27,9,15,19,5
    cell8,14,24,9,19,29,34,16,22,26,12
    cell9,6,16,1,11,21,26,8,14,18,4
    cell10,15,25,10,20,30,35,17,23,27,13
    cell11,10.5,20.5,5.5,15.5,25.5,30.5,12.5,18.5,22.5,8.5
    cell12,12.5,22.5,6.5,16.5,26.5,32.5,14.5,20.5,24.5,10.5
    cell13,8.5,18.5,4.5,14.5,24.5,28.5,10.5,16.5,20.5,6.5
    cell14,11.5,21.5,7.5,17.5,27.5,31.5,13.5,19.5,23.5,9.5
    cell15,9.5,19.5,3.5,13.5,23.5,29.5,11.5,17.5,21.5,7.5
    cell16,13.5,23.5,8.5,18.5,28.5,33.5,15.5,21.5,25.5,11.5
    cell17,7.5,17.5,2.5,12.5,22.5,27.5,9.5,15.5,19.5,5.5
    cell18,14.5,24.5,9.5,19.5,29.5,34.5,16.5,22.5,26.5,12.5
    cell19,6.5,16.5,1.5,11.5,21.5,26.5,8.5,14.5,18.5,4.5
    cell20,15.5,25.5,10.5,20.5,30.5,35.5,17.5,23.5,27.5,13.5
    EOF
    
    cat << 'EOF' > cell_labels.csv
    cell_id,subpopulation,memory_effector_status
    cell1,CD4_Treg,Memory
    cell2,CD8_Cytotoxic,Effector
    cell3,CD4_Helper,Memory
    cell4,CD8_Cytotoxic,Memory
    cell5,CD4_Treg,Effector
    cell6,CD8_Cytotoxic,Memory
    cell7,CD4_Helper,Effector
    cell8,CD8_Cytotoxic,Memory
    cell9,CD4_Treg,Memory
    cell10,CD8_Cytotoxic,Effector
    cell11,CD4_Treg,Memory
    cell12,CD8_Cytotoxic,Effector
    cell13,CD4_Helper,Memory
    cell14,CD8_Cytotoxic,Memory
    cell15,CD4_Treg,Effector
    cell16,CD8_Cytotoxic,Memory
    cell17,CD4_Helper,Effector
    cell18,CD8_Cytotoxic,Memory
    cell19,CD4_Treg,Memory
    cell20,CD8_Cytotoxic,Effector
    EOF
    
    # Execute the Python script to train the classifiers
    python train_tcell_classifiers.py
  23. 23

    Each classifier constructed an independent ensemble of Extremely Randomized Trees.

    scikit-learn (Inferred with models/gemini-2.5-flash) v1.4.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install scikit-learn and joblib if not already installed
    # pip install scikit-learn joblib
    
    # Create a Python script to train the Extremely Randomized Trees classifier
    cat << 'EOF' > train_extra_trees.py
    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split
    import joblib
    import os
    
    # Placeholder for input data (features X and labels y)
    # In a real pipeline, X and y would be loaded from files (e.g., CSV, HDF5).
    # X: Feature matrix (n_samples, n_features)
    # y: Target labels (n_samples,)
    
    # Example: Generate synthetic data for demonstration purposes.
    # Replace this with actual data loading in a real application.
    print("Generating synthetic data for demonstration...")
    n_samples = 1000
    n_features = 20
    n_classes = 2
    X = np.random.rand(n_samples, n_features)
    y = np.random.randint(0, n_classes, n_samples)
    
    # Split data into training and testing sets (optional, but good practice)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize the Extremely Randomized Trees classifier (ExtraTreesClassifier)
    # Parameters like n_estimators, max_features, min_samples_leaf, etc.,
    # would typically be optimized or set based on prior knowledge or cross-validation.
    # n_jobs=-1 uses all available CPU cores for parallel processing.
    print("Initializing ExtraTreesClassifier...")
    classifier = ExtraTreesClassifier(
        n_estimators=100,      # Number of trees in the forest
        random_state=42,       # Seed for reproducibility
        n_jobs=-1              # Use all available cores for fitting and prediction
    )
    
    # Train the classifier
    print("Training the ExtraTreesClassifier...")
    classifier.fit(X_train, y_train)
    
    # Evaluate the classifier (optional)
    accuracy = classifier.score(X_test, y_test)
    print(f"Model accuracy on test set: {accuracy:.4f}")
    
    # Save the trained model to a file using joblib
    output_model_path = "extra_trees_classifier_model.joblib"
    print(f"Saving the trained model to {output_model_path}...")
    joblib.dump(classifier, output_model_path)
    print("Model training and saving complete.")
    
    # No specific reference genomic datasets are used for this general machine learning model training step.
    # The model is trained on the provided input features and labels.
    EOF
    
    # Execute the Python script to train the classifier
    python train_extra_trees.py
  24. 24

    Using the terminally differentiated effector and memory (TCM, TEM) populations, we built a training set for a fate classifier for CD8+ T cells.

    scikit-learn (Inferred with models/gemini-2.5-flash) v1.3.0 GitHub
    $ Bash example
    # Install necessary libraries if not already present
    # conda create -n classifier_env python=3.9 scikit-learn pandas joblib numpy
    # conda activate classifier_env
    # pip install scikit-learn pandas joblib numpy
    
    # Create a dummy input data file for demonstration if it doesn't exist.
    # In a real bioinformatics pipeline, 'cd8_tcell_expression_data.csv' would be
    # generated by a preceding data processing step, containing features (e.g., gene expression)
    # and a 'cell_fate' column (e.g., 0 for TCM, 1 for TEM).
    if [ ! -f "cd8_tcell_expression_data.csv" ]; then
        echo "Creating dummy input data for demonstration..."
        python -c "
    import pandas as pd
    import numpy as np
    from sklearn.datasets import make_classification
    
    # Generate synthetic data representing CD8+ T cells with features and a fate label
    X, y = make_classification(n_samples=200, n_features=50, n_informative=20, n_redundant=10, n_classes=2, random_state=42, weights=[0.5, 0.5])
    data = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
    data['cell_fate'] = y # 0 for TCM, 1 for TEM (example labels)
    data.to_csv('cd8_tcell_expression_data.csv', index=False)
    print('Dummy data created: cd8_tcell_expression_data.csv')
    "
    fi
    
    # Create a Python script to build the fate classifier
    cat << 'EOF' > classifier_training.py
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    import joblib # For saving the model
    
    # Load the input data containing features and cell fate labels
    try:
        data = pd.read_csv("cd8_tcell_expression_data.csv")
        X = data.drop("cell_fate", axis=1) # Features
        y = data["cell_fate"] # Target variable (cell fate: TCM/TEM)
    except FileNotFoundError:
        print("Error: Input data file 'cd8_tcell_expression_data.csv' not found.")
        exit(1)
    
    # Split the data into training and testing sets
    # The training set will be used to build the classifier
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    print(f"Training set size: {len(X_train)} samples")
    print(f"Testing set size: {len(X_test)} samples")
    
    # Initialize and train a classifier (e.g., RandomForestClassifier)
    # The choice of classifier can be adjusted based on data characteristics and performance requirements
    classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    print("Training the classifier...")
    classifier.fit(X_train, y_train)
    print("Classifier training complete.")
    
    # Evaluate the classifier on the test set (optional, but good practice)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Classifier accuracy on test set: {accuracy:.4f}")
    
    # Save the trained classifier model for future use (e.g., prediction on new data)
    model_filename = "cd8_tcell_fate_classifier.joblib"
    joblib.dump(classifier, model_filename)
    print(f"Trained classifier saved to {model_filename}")
    
    # Save the training data (features and labels) that were used to build the classifier
    # This is important for reproducibility and understanding the training set composition.
    training_features_output_filename = "cd8_tcell_training_set_features.csv"
    training_labels_output_filename = "cd8_tcell_training_set_labels.csv"
    X_train.to_csv(training_features_output_filename, index=False)
    y_train.to_csv(training_labels_output_filename, index=False)
    print(f"Training set features saved to {training_features_output_filename}")
    print(f"Training set labels saved to {training_labels_output_filename}")
    EOF
    
    # Execute the Python script to build the fate classifier
    python classifier_training.py
    
    echo "Classifier training process finished. A trained model and training set components are saved."
  25. 25

    Using the newly observed segregation of daughter T cells into Div1TE and Div1MEM subpopulations after the first division, we built a second training set for another early state classifier.

    Machine Learning Classifier Training (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # This is a conceptual script to illustrate building a training set for a classifier.
    # The actual implementation would depend on the specific data format, features, and chosen machine learning library.
    
    # Assuming input data (e.g., CSV, TSV) contains features and a 'subpopulation' column (Div1TE/Div1MEM)
    # Example: data.csv
    # cell_id,feature1,feature2,...,subpopulation
    # C1,0.5,1.2,...,Div1TE
    # C2,0.8,0.9,...,Div1MEM
    
    # Install scikit-learn if not already installed
    # pip install scikit-learn pandas numpy
    
    # Create a Python script (e.g., build_training_set.py)
    cat << 'EOF' > build_training_set.py
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    import numpy as np
    
    # Placeholder for input data file
    # In a real scenario, this would be the output from a previous step that identified Div1TE/Div1MEM cells.
    input_data_file = "processed_t_cell_data.csv" # This file should contain features and the 'subpopulation' label
    
    # Placeholder for output training and testing set files
    output_train_file = "classifier_training_set.csv"
    output_test_file = "classifier_testing_set.csv"
    
    # Placeholder for features to be used in the classifier
    # Replace with actual feature column names from your data
    feature_columns = ["feature1", "feature2", "feature3"] # Example features
    
    try:
        # Load the data
        df = pd.read_csv(input_data_file)
    
        # Ensure the 'subpopulation' column exists
        if 'subpopulation' not in df.columns:
            raise ValueError(f"Column 'subpopulation' not found in {input_data_file}")
    
        # Filter for the relevant subpopulations if necessary (Div1TE, Div1MEM)
        # Assuming the input_data_file already contains only these or we need to filter
        df_filtered = df[df['subpopulation'].isin(['Div1TE', 'Div1MEM'])].copy()
    
        if df_filtered.empty:
            print(f"No data found for Div1TE or Div1MEM subpopulations in {input_data_file}. Exiting.")
            exit(1)
    
        # Prepare features (X) and labels (y)
        X = df_filtered[feature_columns]
        y = df_filtered['subpopulation']
    
        # Encode labels if they are strings (e.g., Div1TE -> 0, Div1MEM -> 1)
        label_encoder = LabelEncoder()
        y_encoded = label_encoder.fit_transform(y)
    
        # Split data into training and testing sets
        # Using a common split ratio, e.g., 80% training, 20% testing
        # random_state ensures reproducibility of the split
        X_train, X_test, y_train, y_test = train_test_split(
            X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
        )
    
        # Combine features and encoded labels for saving
        train_df = pd.DataFrame(X_train, columns=feature_columns)
        train_df['label'] = y_train
        train_df['original_label'] = label_encoder.inverse_transform(y_train)
    
        test_df = pd.DataFrame(X_test, columns=feature_columns)
        test_df['label'] = y_test
        test_df['original_label'] = label_encoder.inverse_transform(y_test)
    
        # Save the training and testing sets
        train_df.to_csv(output_train_file, index=False)
        test_df.to_csv(output_test_file, index=False)
    
        print(f"Training set saved to {output_train_file}")
        print(f"Testing set saved to {output_test_file}")
        print(f"Label mapping: {list(label_encoder.classes_)} -> {list(range(len(label_encoder.classes_))))}")
    
    except FileNotFoundError:
        print(f"Error: Input file '{input_data_file}' not found. Please ensure the file exists.")
        exit(1)
    except ValueError as e:
        print(f"Error processing data: {e}")
        exit(1)
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        exit(1)
    
    EOF
    
    # Execute the Python script
    python build_training_set.py
  26. 26

    Both classifiers were fed their respective training sets using 10-fold cross-validation

    scikit-learn (Inferred with models/gemini-2.5-flash) v1.x GitHub
    $ Bash example
    # Install scikit-learn (a common Python library for machine learning tasks)
    # pip install scikit-learn
    
    # This command represents the execution of a Python script that would implement
    # 10-fold cross-validation for the specified classifiers using their respective training sets.
    # The exact script name and parameters would vary based on the specific implementation.
    # 'training_data.csv' is a placeholder for the input training data.
    python run_classifier_cross_validation.py \
        --input_training_data training_data.csv \
        --num_folds 10 \
        --output_results cross_validation_metrics.json
  27. 27

    After both the fate and early state classifiers were trained on their respective subpopulations, they were both applied on previously unseen intermediate Day 4 CD8+ T cells.

    Machine Learning Classifier Application (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    bash
    # This script conceptually represents the application of pre-trained machine learning classifiers
    # to new, unseen data. The specific implementation (e.g., Python with scikit-learn, R with caret)
    # would depend on how the classifiers were originally trained.
    
    # Example installation for a Python-based scikit-learn environment:
    # conda create -n classifier_env python=3.9 scikit-learn pandas numpy -y
    # conda activate classifier_env
    
    # Define paths for the trained classifier models and the input data
    # These models would have been generated in a previous 'training' step.
    TRAINED_FATE_CLASSIFIER_MODEL="path/to/trained_fate_classifier_model.pkl" # e.g., a pickled scikit-learn model
    TRAINED_EARLY_STATE_CLASSIFIER_MODEL="path/to/trained_early_state_classifier_model.pkl" # e.g., a pickled scikit-learn model
    
    # Input data: Previously unseen intermediate Day 4 CD8+ T cells
    # This could be a CSV, TSV, H5AD (for single-cell data), or other format.
    INPUT_DAY4_CD8_T_CELLS_DATA="path/to/day4_cd8_t_cells_features.h5ad" # Example: AnnData object for single-cell features
    
    # Define output paths for the predictions
    OUTPUT_FATE_PREDICTIONS="path/to/day4_cd8_t_cells_fate_predictions.csv"
    OUTPUT_EARLY_STATE_PREDICTIONS="path/to/day4_cd8_t_cells_early_state_predictions.csv"
    
    # Execute a conceptual Python script to load the models and apply them to the new data.
    # This script would handle loading the input data, preprocessing if necessary,
    # making predictions with both classifiers, and saving the results.
    python apply_trained_classifiers.py \
        --fate_model "${TRAINED_FATE_CLASSIFIER_MODEL}" \
        --early_state_model "${TRAINED_EARLY_STATE_CLASSIFIER_MODEL}" \
        --input_data "${INPUT_DAY4_CD8_T_CELLS_DATA}" \
        --output_fate_predictions "${OUTPUT_FATE_PREDICTIONS}" \
        --output_early_state_predictions "${OUTPUT_EARLY_STATE_PREDICTIONS}"
    
  28. 28

    Their predicted ‘memory-ness’ scores were scatter-plotted and shown to correlate in Fig.

    R (Inferred with models/gemini-2.5-flash) v4.x.x (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R if not already installed (example for Ubuntu/Debian)
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Create a dummy input file for demonstration purposes.
    # In a real scenario, 'memory_scores.csv' would be generated by a previous step
    # and contain the predicted 'memory-ness' scores for two sets of data to be correlated.
    # Assuming a CSV format with two columns, e.g., 'score_A' and 'score_B'.
    echo "score_A,score_B" > memory_scores.csv
    echo "0.1,0.15" >> memory_scores.csv
    echo "0.2,0.22" >> memory_scores.csv
    echo "0.3,0.35" >> memory_scores.csv
    echo "0.4,0.41" >> memory_scores.csv
    echo "0.5,0.53" >> memory_scores.csv
    echo "0.6,0.60" >> memory_scores.csv
    echo "0.7,0.75" >> memory_scores.csv
    echo "0.8,0.82" >> memory_scores.csv
    echo "0.9,0.90" >> memory_scores.csv
    
    # Create an R script to perform scatter plotting and correlation calculation
    cat << 'EOF' > plot_memory_scores.R
    # Read the input data
    data <- read.csv("memory_scores.csv")
    
    # Create a scatter plot and save it as a PNG file
    png("memory_scores_scatterplot.png", width = 800, height = 600)
    plot(data$score_A, data$score_B,
         xlab = "Predicted Memory-ness Score (Set A)",
         ylab = "Predicted Memory-ness Score (Set B)",
         main = "Correlation of Predicted Memory-ness Scores",
         pch = 16, col = "blue", cex = 1.2)
    
    # Calculate Pearson correlation coefficient
    correlation <- cor(data$score_A, data$score_B, method = "pearson")
    
    # Add correlation value to the plot
    legend("topleft", legend = paste("Pearson Correlation:", round(correlation, 3)), bty = "n", cex = 1.2)
    
    # Add a linear regression line to visualize the trend
    abline(lm(data$score_B ~ data$score_A), col = "red", lwd = 2)
    dev.off()
    
    # Print the correlation coefficient to standard output
    cat(paste("Pearson Correlation Coefficient:", correlation, "\n"))
    EOF
    
    # Execute the R script
    Rscript plot_memory_scores.R
  29. 29

    3f.

    CLIPper (Inferred with models/gemini-2.5-flash) vNot specified GitHub
    $ Bash example
    # This code block assumes step 3f refers to eCLIP peak calling, as per the provided context and common eCLIP pipeline steps.
    # Parameters are set to common defaults for CLIPper in the absence of specific instructions.
    
    # Define input and output files
    IP_BAM="aligned_ip.bam" # Placeholder for aligned IP reads BAM file
    INPUT_BAM="aligned_input.bam" # Placeholder for aligned input control reads BAM file
    CHROM_SIZES="hg38.chrom.sizes" # Placeholder for genome chromosome sizes file (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes)
    OUTPUT_BED="clipper_peaks.bed" # Output BED file for identified peaks
    
    # Example command for CLIPper peak calling
    # Assuming clipper.py is in the system's PATH or specified with its full path
    # Common parameters: -p for p-value threshold (default 0.01), -w for minimum peak width (default 20)
    clipper.py \
        -s $CHROM_SIZES \
        -i $IP_BAM \
        -c $INPUT_BAM \
        -p 0.01 \
        -w 20 \
        -o $OUTPUT_BED
  30. 30

    For each T cell, its ‘effector-ness’ scores is 1 minus the ‘memory-ness’ score and is redundant for this analysis.

    Custom Script (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # This step calculates 'effector-ness' scores based on 'memory-ness' scores.
    # The description states this calculation is redundant for the current analysis, but the code block demonstrates how it would be performed.
    
    # Assuming an input file 'input_memory_ness_scores.tsv' with T cell IDs and their memory-ness scores.
    # The output will be 'output_effector_ness_scores.tsv'.
    # The input file is expected to be tab-separated and contain a column named 'MEMORY_NESS_SCORE'.
    
    # Python script to calculate effector-ness
    # effector_ness_calculator.py
    cat << 'EOF' > effector_ness_calculator.py
    import pandas as pd
    import sys
    
    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python effector_ness_calculator.py <input_memory_ness_file> <output_effector_ness_file>")
            sys.exit(1)
    
        input_file = sys.argv[1]
        output_file = sys.argv[2]
    
        try:
            df = pd.read_csv(input_file, sep='\t')
            if 'MEMORY_NESS_SCORE' not in df.columns:
                raise ValueError("Input file must contain a 'MEMORY_NESS_SCORE' column.")
    
            # Calculate 'effector-ness' as 1 minus 'memory-ness'
            df['EFFECTOR_NESS_SCORE'] = 1 - df['MEMORY_NESS_SCORE']
    
            # Output the results
            df.to_csv(output_file, sep='\t', index=False)
            print(f"Successfully calculated effector-ness scores and saved to {output_file}")
        except FileNotFoundError:
            print(f"Error: Input file '{input_file}' not found.")
            sys.exit(1)
        except ValueError as e:
            print(f"Error processing data: {e}")
            sys.exit(1)
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            sys.exit(1)
    EOF
    
    # Execute the Python script
    # Replace 'input_memory_ness_scores.tsv' with your actual input file path
    # Replace 'output_effector_ness_scores.tsv' with your desired output file path
    python effector_ness_calculator.py input_memory_ness_scores.tsv output_effector_ness_scores.tsv
  31. 31

    The signature genes for each classifier were selected from all G=22,425 genes by their GINI score.

    Inferred with models/gemini-2.5-flash vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install necessary Python libraries if not already installed
    # pip install scikit-learn pandas numpy
    
    # Create a Python script to select signature genes based on GINI score.
    # This script simulates the process of training a classifier (e.g., RandomForest)
    # and extracting feature importances (GINI-based) from a total of G=22,425 genes.
    # It outputs a list of top N signature genes.
    cat << 'EOF' > select_signature_genes.py
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    import argparse
    
    def main():
        parser = argparse.ArgumentParser(description="Select signature genes based on GINI score from a classifier.")
        parser.add_argument("--total_genes", type=int, default=22425, help="Total number of genes (G) available for selection.")
        parser.add_argument("--num_samples", type=int, default=100, help="Number of samples (N) for the simulated dataset.")
        parser.add_argument("--top_n_genes", type=int, default=100, help="Number of top genes to select as signature genes.")
        parser.add_argument("--output_file", type=str, default="signature_genes_gini.tsv", help="Output file for the selected signature genes.")
        args = parser.parse_args()
    
        G = args.total_genes
        N_samples = args.num_samples
        num_signature_genes = args.top_n_genes
    
        np.random.seed(42)
    
        # Generate dummy gene names
        gene_names = [f'gene_{i+1}' for i in range(G)]
    
        # Simulate gene expression data (N_samples rows, G columns)
        # Introduce some signal for a few genes to make them 'important'
        X = np.random.rand(N_samples, G) * 10
        X[:, 0] += np.random.rand(N_samples) * 5  # Make gene_1 important
        X[:, 100] -= np.random.rand(N_samples) * 5 # Make gene_101 important
    
        # Simulate binary labels based on the 'important' genes
        y = (X[:, 0] + X[:, 100] > np.median(X[:, 0] + X[:, 100])).astype(int)
    
        # Split data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.0, random_state=42) # Use all for training for simplicity
    
        # Train a RandomForestClassifier
        # RandomForest uses GINI impurity by default for splitting nodes and calculating feature importances.
        classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        classifier.fit(X_train, y_train)
    
        # Get feature importances (GINI-based scores)
        gini_importances = classifier.feature_importances_
    
        # Create a DataFrame to store gene names and their GINI scores
        feature_importance_df = pd.DataFrame({
            'gene': gene_names,
            'gini_score': gini_importances
        })
    
        # Sort genes by GINI score in descending order
        feature_importance_df = feature_importance_df.sort_values(by='gini_score', ascending=False)
    
        # Select the top N genes as signature genes
        signature_genes_df = feature_importance_df.head(num_signature_genes)
    
        print(f"Selected {num_signature_genes} signature genes based on GINI score from {G} total genes.")
        signature_genes_df.to_csv(args.output_file, sep="\t", index=False)
    
    if __name__ == "__main__":
        main()
    EOF
    
    # Execute the Python script to select signature genes.
    # The --total_genes parameter reflects the G=22,425 genes mentioned in the description.
    # The --top_n_genes parameter is an example of how many signature genes might be selected.
    python select_signature_genes.py --total_genes 22425 --top_n_genes 100 --output_file signature_genes_gini.tsv
  32. 32

    Temporal expression trajectories through inferred lineage paths.

    Monocle (Inferred with models/gemini-2.5-flash) vMonocle 3 GitHub
    $ Bash example
    # This script demonstrates how to perform temporal expression trajectory analysis using Monocle 3.
    # It assumes you have an R environment set up with Monocle 3 installed.
    
    # --- Installation (commented out) ---
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install(version = "3.18") # Use your Bioconductor version
    # BiocManager::install(c('monocle3'))
    # install.packages("devtools")
    # devtools::install_github('cole-trapnell-lab/monocle3')
    
    # --- Execution ---
    # Assuming 'cds_object.rds' is a pre-processed CellDataSet object saved from a previous step.
    # This object typically contains normalized expression data, cell metadata, and gene metadata.
    
    Rscript -e '
    library(monocle3)
    
    # Load pre-processed CellDataSet object
    # Replace "path/to/your/cds_object.rds" with the actual path to your input data.
    # This object should contain expression data, cell metadata, and gene metadata.
    # For demonstration, we will assume `cds` is already loaded or created from previous steps.
    # In a real pipeline, you would load your actual CDS object:
    # cds <- readRDS("path/to/your/cds_object.rds")
    
    # Placeholder for a loaded CDS object. In a real scenario, `cds` would be the output
    # of upstream steps like `preprocess_cds`, `reduce_dimension`, and `cluster_cells`.
    # For example, if starting from a count matrix:
    # expression_matrix <- as.matrix(read.csv("counts.csv", row.names = 1))
    # cell_metadata <- read.csv("cell_metadata.csv", row.names = 1)
    # gene_metadata <- read.csv("gene_metadata.csv", row.names = 1)
    # cds <- new_cell_data_set(expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_metadata)
    # cds <- preprocess_cds(cds, num_dim = 100)
    # cds <- reduce_dimension(cds, reduction_method = "UMAP")
    # cds <- cluster_cells(cds)
    
    # For this example, we assume `cds` is a pre-processed and dimension-reduced CellDataSet object.
    # If you are running this as a standalone step, ensure `cds` is properly initialized.
    
    # Learn the trajectory graph
    cds <- learn_graph(cds)
    
    # Order cells in pseudotime
    # You might need to specify a root node for the trajectory based on biological knowledge.
    # This often involves selecting a cell or a cluster that represents the start of the lineage.
    # Example: root_cells <- cells_in_partition(cds)[which(colData(cds)$cell_type == "Stem_Cell")]
    # If no specific root cells are provided, Monocle will attempt to infer one.
    cds <- order_cells(cds)
    
    # Plot the trajectory
    # You can color cells by pseudotime, cell type, or other metadata
    plot_cells(cds,
               color_cells_by = "pseudotime",
               label_groups_by_cluster = FALSE,
               label_leaves = TRUE,
               label_branch_points = TRUE,
               graph_label_size = 1.5)
    
    # Save the updated CellDataSet object with pseudotime information
    saveRDS(cds, "monocle3_trajectory_cds.rds")
    
    # You can also extract pseudotime values:
    pseudotime_values <- pseudotime(cds)
    write.csv(as.data.frame(pseudotime_values), "cell_pseudotime.csv")
    
    # Further analysis: differential expression along pseudotime (optional)
    # gene_fits <- fit_models(cds, model_formula_str = "~pseudotime")
    # pseudotime_de_results <- differential_test(gene_fits, full_model_formula_str = "~pseudotime",
    #                                            reduced_model_formula_str = "~1")
    # write.csv(pseudotime_de_results, "pseudotime_de_genes.csv")
    '
  33. 33

    To understand the temporal dynamics of expression for key genes along the effector and memory lineages, we constructed hypothetical differentiation time-courses for each lineage.

    Custom R script (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # Install R if not already present (example for Debian/Ubuntu)
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Create a placeholder R script for demonstration purposes.
    # In a real scenario, this script would contain the actual logic
    # for modeling gene expression over time and constructing hypothetical time-courses.
    cat << 'EOF' > construct_time_courses.R
    #!/usr/bin/env Rscript
    
    # This is a placeholder R script to demonstrate the execution of a custom analysis.
    # In a real scenario, this script would contain the logic to:
    # 1. Load gene expression data (e.g., normalized counts, TPMs).
    # 2. Load sample metadata including lineage and time point information.
    # 3. For each gene and lineage, model the expression changes over time.
    #    This could involve spline fitting, LOESS regression, or other time-series analysis methods.
    # 4. Generate "hypothetical differentiation time-courses" by predicting expression
    #    at finely spaced time points along the observed differentiation trajectory.
    # 5. Save the results (e.g., predicted expression values, fitted models, plots).
    
    # Example usage: Rscript construct_time_courses.R <expression_file> <metadata_file> <output_prefix>
    
    args <- commandArgs(trailingOnly = TRUE)
    
    if (length(args) < 3) {
      stop("Usage: Rscript construct_time_courses.R <expression_file> <metadata_file> <output_prefix>", call. = FALSE)
    }
    
    expression_file <- args[1]
    metadata_file <- args[2]
    output_prefix <- args[3]
    
    message(paste0("Simulating construction of time-courses using:\n",
                   "  Expression file: ", expression_file, "\n",
                   "  Metadata file: ", metadata_file, "\n",
                   "  Output prefix: ", output_prefix))
    
    # Simulate creating an output file
    output_file <- paste0(output_prefix, "_hypothetical_time_courses_summary.txt")
    writeLines(c(
      "# Hypothetical Differentiation Time-Courses Summary",
      paste0("Analysis performed on expression data from: ", expression_file),
      paste0("Using metadata from: ", metadata_file),
      "This file represents a summary of the constructed hypothetical differentiation time-courses.",
      "Actual output would include fitted curves, predicted expression values, or visualizations."
    ), output_file)
    
    message(paste0("Simulated output saved to: ", output_file))
    
    EOF
    
    # Make the R script executable
    chmod +x construct_time_courses.R
    
    # Create dummy input files for demonstration
    # gene_expression_data.tsv: Contains normalized gene expression values (e.g., TPMs, counts)
    #                           Rows are genes, columns are samples.
    # sample_metadata.tsv: Contains metadata for each sample, including lineage and time point.
    
    echo "gene_id\tsample1\tsample2\tsample3\tsample4" > gene_expression_data.tsv
    echo "geneA\t10.5\t20.1\t30.8\t40.2" >> gene_expression_data.tsv
    echo "geneB\t50.3\t45.7\t40.1\t35.9" >> gene_expression_data.tsv
    echo "geneC\t5.2\t15.8\t25.1\t35.5" >> gene_expression_data.tsv
    
    echo "sample_id\tlineage\ttime_point" > sample_metadata.tsv
    echo "sample1\teffector\t0" >> sample_metadata.tsv
    echo "sample2\teffector\t1" >> sample_metadata.tsv
    echo "sample3\tmemory\t0" >> sample_metadata.tsv
    echo "sample4\tmemory\t1" >> sample_metadata.tsv
    
    # Execute the R script to construct hypothetical differentiation time-courses
    # Input: gene_expression_data.tsv (gene expression matrix)
    #        sample_metadata.tsv (sample annotations)
    # Output prefix: lineage_time_course_analysis (output files will be prefixed with this)
    ./construct_time_courses.R gene_expression_data.tsv sample_metadata.tsv lineage_time_course_analysis
  34. 34

    Briefly, we sampled with replacement 50 cells from each population and constructed all trajectories through the cross-product of populations ordered in a particular lineage.

    Single-cell trajectory construction and robustness analysis (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # This step describes a conceptual methodology for single-cell trajectory inference, which involves sampling cells and constructing trajectories based on population cross-products.
    # Without a specific software tool mentioned (e.g., Monocle, Slingshot, PAGA, etc.), it is not possible to provide a concrete bash command.
    #
    # The description implies the following conceptual steps, which would typically be implemented in a scripting language like R or Python:
    #
    # 1.  **Sampling**: For each biological population, randomly sample 50 cells with replacement.
    #     Example (conceptual pseudocode, not executable bash):
    #     # sampled_cells_per_population = {}
    #     # for population_id in all_population_ids:
    #     #     cells_in_population = get_cells_for_population(population_id)
    #     #     sampled_cells_per_population[population_id] = random_sample_with_replacement(cells_in_population, num_cells=50)
    #
    # 2.  **Trajectory Construction (Cross-product)**: Construct all possible ordered trajectories (sequences of populations) that adhere to a predefined lineage structure.
    #     This involves generating combinations or permutations of populations based on the lineage graph.
    #     Example (conceptual pseudocode, not executable bash):
    #     # all_lineage_paths = generate_all_ordered_lineage_paths(population_graph, lineage_constraints)
    #
    # 3.  **Trajectory Inference**: For each constructed lineage path, combine the corresponding sampled cells and perform a trajectory inference analysis.
    #     This is where a specific trajectory inference tool would be applied.
    #     Example (conceptual pseudocode, not executable bash):
    #     # for lineage_path in all_lineage_paths:
    #     #     combined_data_for_path = combine_data_from_sampled_cells(sampled_cells_per_population, lineage_path)
    #     #     # Execute a specific trajectory inference tool (e.g., Monocle3, Slingshot, PAGA) on combined_data_for_path
    #     #     # e.g., run_monocle3(combined_data_for_path, output_prefix=f"trajectory_{lineage_path}")
    #
    # This is a high-level description of a method, not a specific tool's execution command.
  35. 35

    These orders were determined a priori based on our earlier work with similar timecourses of RT-qPCR study (Arsenio et. al.

    Statistical Analysis Software (Inferred with models/gemini-2.5-flash) vN/A
  36. 36

    Nature Immunology 2014).

    N/A (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # No specific bioinformatics step or tool described in "Nature Immunology 2014)".
    # Therefore, no execution command or parameters can be inferred.
  37. 37

    Specifically, the ‘effector lineage’ starts from the naïve population, and progresses through the Div1TE subpopulation, then onto Day 4, and finally Day 7.

    (Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash)
    $ Bash example
    No specific bioinformatics tool or computational step is described. This description outlines a biological progression or experimental design, not a command-line execution.
  38. 38

    In contrast, the ‘effector memory’ and ‘central memory lineages’ start from naïve, through the Div1MEM subpopulation, ending with TEM and TCM respectively.

    (Inferred with models/gemini-2.5-flash) vN/A
  39. 39

    These bootstrapped trajectories were visually summarized by a seaborn timeseries plot which links the average expression for each population sample with a solid line segment and represents the 95% confidence interval by a shaded area around it.

    seaborn vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install seaborn and its dependencies if not already installed
    # conda install -c conda-forge seaborn pandas numpy matplotlib
    
    # Create a Python script for plotting
    cat << 'EOF' > plot_timeseries.py
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # --- Simulate input data ---
    # In a real pipeline, this DataFrame would be loaded from a file (e.g., CSV, Parquet)
    # generated by the preceding bootstrapping step. It should contain columns for
    # time points, population samples, and individual bootstrapped expression values.
    # For demonstration, we simulate it here.
    np.random.seed(42)
    num_time_points = 10
    num_populations = 3
    num_bootstrap_samples_per_point = 100 # Number of bootstrap samples per time point and population
    
    data = []
    for t in range(num_time_points):
        for pop_idx in range(num_populations):
            population_name = f'Population_{chr(65 + pop_idx)}'
            # Simulate expression values for bootstrap samples
            # Mean expression changes over time and differs by population
            mean_expr = 5 + t * 0.5 + pop_idx * 1.5
            bootstrap_expressions = np.random.normal(loc=mean_expr, scale=1.0, size=num_bootstrap_samples_per_point)
            for expr in bootstrap_expressions:
                data.append({'time': t, 'population_sample': population_name, 'expression': expr})
    
    df = pd.DataFrame(data)
    
    # --- Generate the seaborn timeseries plot ---
    plt.figure(figsize=(10, 6))
    sns.lineplot(
        data=df,
        x='time',
        y='expression',
        hue='population_sample',
        errorbar=('ci', 95), # Calculate and display 95% confidence interval
        marker='o' # Optional: add markers for clarity
    )
    
    plt.title('Average Expression Over Time with 95% Confidence Interval')
    plt.xlabel('Time Point')
    plt.ylabel('Expression Level')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend(title='Population Sample')
    plt.tight_layout()
    
    # Save the plot to a file
    output_plot_file = "timeseries_expression_plot.png"
    plt.savefig(output_plot_file)
    print(f"Plot saved to {output_plot_file}")
    EOF
    
    # Execute the Python script to generate the plot
    python plot_timeseries.py

Tools Used

Raw Source Text
Basecalls performed using CASAVA version 1.4
ChIP-seq reads were aligned to the mm10 genome assembly using STAR version 2.4.1b with the following options --outFilterMultimapNmax 10  --outFilterMultimapScoreRange 1 --limitOutSJoneRead --outFilterScoreMin 3  --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66  --outFilterMismatchNoverLmax 0.3 and the other parameter kept as default
BAM files of sample1-11 are downsampled, if needed, using samtools to ensure sample 1-4, sample 5-8, sample 9-11 have same read numbers.
for sample 1-4: peaks were called using MACS version 1.4.2 using sample 5-8 as backgournd, with the following setting -g mm --nomodel --nolambda and others as default.
for sample 9-11: tag directories were first generated by HOMER makeTagDirectory with -keepOne -tbp 1 and other options as default.
for sample 9 and 11: peakd were call by HOMER findpeak using sample 12 as background and -style factor -size 100 -fragLength 50 with other options as default
for sample 10: peakd were call by HOMER findpeak using sample 12 as background and -style histone -fragLength 50 -size 200 -minDist 1000 -L 0 with other options as default
for sample 13-16: transcript count were generated by kallisto version 0.42.4 using GENCODE GRCm38.p4 transcriptome as reference, with the following parameter: -l 200 -s 20 --single and other setting as default
For all Single Cell Samples the following processing steps apply:
Single-cell RNA-seq data pre-processing. Single-cell mRNA sequencing data from 256 murine CD8+ T cells were processed with a bioinformatics pipeline focusing on quality control (QC) and robust expression quantification.  For each cell, raw RNA-seq reads were: checked for quality metrics with fastqc (v0.10.1); poly-A and adaptor-trimmed with cutadapt (v1.8.1); quantified by kallisto (v0.42.1) to a reference transcriptome (Gencode vM3) without bias correction; and aligned by STAR (v2.4.1b) to the reference mouse genome (mm10) with default parameters for quality control and downstream analysis. Next, the transcript per million (TPM) outputs of kallisto for all cells were combined into a cell-by-gene expression matrix (C=288 cells=rows, G=22425 genes=columns) by summing the expression values for all quantified transcripts of a given gene. Finally, the TPM value for each cell c and gene g was natural log-transformed to yield a normalized expression value: EXPRc,g = ln(1+TPMc,g).
Dimensionality reduction and cell heterogeneity visualization. To reduce the dimensionality of the cell-by-gene expression matrix EXPR and visualize the diversity of gene expression among CD8+ T cells of different subtypes in a 2-dimensional scatter plot, we applied the t-distributed Stochastic Neighborhood Embedding (tSNE) algorithm via its Barnes-Hut approximation (bhSNE). We first applied standard Principal Components Analysis (PCA) to reduce the dimensionality down to D=10, and only then applied bhSNE to visualize in D=2 (with perplexity=30 and theta=0.75 parameters).
Differential gene expression analysis. We performed differential gene expression analysis between all pairs of T cell sub-populations from two non-overlapping sets of rows in the log-transformed expression matrix EXPR. Since single-cell gene expression does not conform to the usual negative binomial distribution and can even be bimodal due to dropout we used two non-parametric statistical tests for heterogeneity of expression: Mann Whitney Wilcoxon (MWW, also known as MWU) which is a rank-sum test but relies on a large sample to approximate normality, and Kolmogorov-Smirnov 2-sample (KS2) test which finds the largest difference between the empirical cumulative distributions, even between two small samples such as our 1st division sub-types Div1TE (n =36) and Div1MEM (n =24).
Cell type classifier. We trained two binary T cell classifiers to identify gene expression signatures that not only differentiate the examined T cell sub-populations (like the differential gene expression described above) but can also be used to predict the ‘memory-‘ or ‘effector-ness’ of previously unseen cells. Each classifier constructed an independent ensemble of Extremely Randomized Trees. Using the terminally differentiated effector and memory (TCM, TEM) populations, we built a training set for a fate classifier for CD8+ T cells. Using the newly observed segregation of daughter T cells into Div1TE and Div1MEM subpopulations after the first division, we built a second training set for another early state classifier. Both classifiers were fed their respective training sets using 10-fold cross-validation
After both the fate and early state classifiers were trained on their respective subpopulations, they were both applied on previously unseen intermediate Day 4 CD8+ T cells. Their predicted ‘memory-ness’ scores were scatter-plotted and shown to correlate in Fig. 3f.  For each T cell, its ‘effector-ness’ scores is 1 minus the ‘memory-ness’ score and is redundant for this analysis. The signature genes for each classifier were selected from all G=22,425 genes by their GINI score.
Temporal expression trajectories through inferred lineage paths. To understand the temporal dynamics of expression for key genes along the effector and memory lineages, we constructed hypothetical differentiation time-courses for each lineage. Briefly, we sampled with replacement 50 cells from each population and constructed all trajectories through the cross-product of populations ordered in a particular lineage. These orders were determined a priori based on our earlier work with similar timecourses of RT-qPCR study (Arsenio et. al. Nature Immunology 2014). Specifically, the ‘effector lineage’ starts from the naïve population, and progresses through the Div1TE subpopulation, then onto Day 4, and finally Day 7. In contrast, the ‘effector memory’ and ‘central memory lineages’ start from naïve, through the Div1MEM subpopulation, ending with TEM and TCM respectively. These bootstrapped trajectories were visually summarized by a seaborn timeseries plot which links the average expression for each population sample with a solid line segment and represents the 95% confidence interval by a shaded area around it.
Genome_build: mm10
Supplementary_files_format_and_content: peak BED files generated from MACS and HOMER. Transcript Expression CSV tables generated from kallisto.
← Back to Analysis