GSE89405 Processing Pipeline
Publication
Early transcriptional and epigenetic regulation of CD8<sup>+</sup> T cell differentiation revealed by single-cell RNA sequencing.Nature immunology (2017) — PMID 28218746
Dataset
GSE89405Early transcriptional and epigenetic regulation of CD8+ T cell differentiation revealed by single-cell RNA-seq
Processing Steps
Generate Jupyter Notebook-
1
Basecalls performed using CASAVA version 1.4
CASAVA v1.4$ Bash example
# CASAVA (Consensus Assessment of Sequence And Variation) is an older, proprietary Illumina software suite for base calling and initial data processing. # It is not typically installed via public package managers like conda. # Installation would involve obtaining the software package directly from Illumina. # Example command for base calling and demultiplexing using CASAVA 1.4's configureBclToFastq.pl script. # This script processes BCL files from an Illumina run directory into FASTQ files, # using a sample sheet for demultiplexing. # Replace <path_to_illumina_run_directory>, <path_to_output_fastq_directory>, and <path_to_sample_sheet.csv> with actual paths. # The exact path to configureBclToFastq.pl might vary based on CASAVA installation. # This command assumes the script is in the PATH or specified with its full path. configureBclToFastq.pl \ --input-dir <path_to_illumina_run_directory>/Data/Intensities/BaseCalls \ --output-dir <path_to_output_fastq_directory> \ --sample-sheet <path_to_sample_sheet.csv>
-
2
ChIP-seq reads were aligned to the mm10 genome assembly using STAR version 2.4.1b with the following options --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --limitOutSJoneRead --outFilterScoreMin 3 --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 --outFilterMismatchNoverLmax 0.3 and the other parameter kept as default
$ Bash example
# Install STAR if not already available # conda install -c bioconda star=2.4.1b # Define variables for clarity STAR_GENOME_DIR="/path/to/genome/mm10/STAR_index" # Placeholder for STAR genome index directory. This directory should contain genome.fa, SA, SAindex, etc. # Input reads: Assuming paired-end. For single-end, remove READ2 and adjust --readFilesIn to only "${READ1}" READ1="input_reads_R1.fastq.gz" # Placeholder for input ChIP-seq reads R1 READ2="input_reads_R2.fastq.gz" # Placeholder for input ChIP-seq reads R2 OUTPUT_PREFIX="star_alignment_output/" # Prefix for output files (e.g., star_alignment_output/Aligned.sortedByCoordinate.out.bam) # Run STAR alignment # Note: --runThreadN is often specified for performance but is not explicitly mentioned in the description. # It's omitted here to strictly adhere to "other parameter kept as default". STAR \ --genomeDir "${STAR_GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outFilterMultimapNmax 10 \ --outFilterMultimapScoreRange 1 \ --limitOutSJoneRead \ --outFilterScoreMin 3 \ --outFilterScoreMinOverLread 0.66 \ --outFilterMatchNminOverLread 0.66 \ --outFilterMismatchNoverLmax 0.3 -
3
BAM files of sample1-11 are downsampled, if needed, using samtools to ensure sample 1-4, sample 5-8, sample 9-11 have same read numbers.
$ Bash example
# Install samtools if not available # conda install -c bioconda samtools # Define input BAM files (replace with actual file paths if different) BAM_FILES=( "sample1.bam" "sample2.bam" "sample3.bam" "sample4.bam" "sample5.bam" "sample6.bam" "sample7.bam" "sample8.bam" "sample9.bam" "sample10.bam" "sample11.bam" ) # Create an associative array to store read counts declare -A READ_COUNTS # Calculate read counts for each BAM file and find the global minimum MIN_READS=-1 for bam_file in "${BAM_FILES[@]}"; do if [ ! -f "$bam_file" ]; then echo "Error: Input BAM file '$bam_file' not found. Please ensure all input files exist." >&2 exit 1 fi current_reads=$(samtools view -c "$bam_file") READ_COUNTS["$bam_file"]=$current_reads echo "Reads in $bam_file: $current_reads" if (( MIN_READS == -1 || current_reads < MIN_READS )); then MIN_READS=$current_reads fi done if (( MIN_READS == -1 )); then echo "Error: No valid BAM files processed to determine minimum read count." >&2 exit 1 fi echo "Target read count for downsampling (global minimum): $MIN_READS" # Downsample BAM files if their read count is above the target minimum # A fixed random seed (e.g., 42) is used for reproducibility. for bam_file in "${BAM_FILES[@]}"; do current_reads=${READ_COUNTS["$bam_file"]} output_bam="${bam_file%.bam}_downsampled.bam" if (( current_reads > MIN_READS )); then # Calculate the fraction for samtools -s. samtools -s takes a seed and a fraction (e.g., 42.0.5 for 50% with seed 42). fraction=$(awk "BEGIN {printf \"%.6f\", $MIN_READS / $current_reads}") seed_fraction="42.$fraction" # Using a fixed seed 42 echo "Downsampling $bam_file from $current_reads to $MIN_READS reads (fraction: $fraction) -> $output_bam" samtools view -b -s "$seed_fraction" "$bam_file" > "$output_bam" samtools index "$output_bam" else echo "$bam_file already has $current_reads reads, which is <= target $MIN_READS. Copying to $output_bam." cp "$bam_file" "$output_bam" samtools index "$output_bam" # Ensure index exists for consistency fi done -
4
for sample 1-4: peaks were called using MACS version 1.4.2 using sample 5-8 as backgournd, with the following setting -g mm --nomodel --nolambda and others as default.
$ Bash example
# Install MACS2 (if not already installed) # conda install -c bioconda macs2 # Define input and control BAM files TREATMENT_BAMS="sample1.bam sample2.bam sample3.bam sample4.bam" CONTROL_BAMS="sample5.bam sample6.bam sample7.bam sample8.bam" # Define output prefix OUTPUT_PREFIX="peaks_sample1-4_vs_sample5-8" # Define genome size (e.g., mm10 for mouse) GENOME_SIZE="mm" # Run MACS2 callpeak macs2 callpeak \ -t ${TREATMENT_BAMS} \ -c ${CONTROL_BAMS} \ -g ${GENOME_SIZE} \ --nomodel \ --nolambda \ -n ${OUTPUT_PREFIX} \ --outdir . -
5
for sample 9-11: tag directories were first generated by HOMER makeTagDirectory with -keepOne -tbp 1 and other options as default.
$ Bash example
# Install HOMER (if not already installed) # For example, using conda: # conda install -c bioconda homer # Generate tag directories for sample 9 makeTagDirectory sample_9_tag_dir sample_9.bam -keepOne -tbp 1 # Generate tag directories for sample 10 makeTagDirectory sample_10_tag_dir sample_10.bam -keepOne -tbp 1 # Generate tag directories for sample 11 makeTagDirectory sample_11_tag_dir sample_11.bam -keepOne -tbp 1 # Note: Replace 'sample_X.bam' with the actual input BAM files for each sample # and 'sample_X_tag_dir' with the desired output tag directory names.
-
6
for sample 9 and 11: peakd were call by HOMER findpeak using sample 12 as background and -style factor -size 100 -fragLength 50 with other options as default
$ Bash example
# Install HOMER (if not already installed) # conda install -c bioconda homer # Assuming input BAM files are available in the current directory # and are named sample9.bam, sample11.bam, and sample12.bam # Peak calling for sample 9 findPeaks sample9.bam \ -o sample9_peaks.txt \ -i sample12.bam \ -style factor \ -size 100 \ -fragLength 50 # Peak calling for sample 11 findPeaks sample11.bam \ -o sample11_peaks.txt \ -i sample12.bam \ -style factor \ -size 100 \ -fragLength 50 -
7
for sample 10: peakd were call by HOMER findpeak using sample 12 as background and -style histone -fragLength 50 -size 200 -minDist 1000 -L 0 with other options as default
$ Bash example
# Install HOMER (if not already installed) # For example, using conda: # conda install -c bioconda homer # Or from source: # wget http://homer.ucsd.edu/homer/configureHomer.pl # perl configureHomer.pl -install # Assuming sample10.bam is the treatment BAM file and sample12.bam is the background BAM file # And these files are located in the current directory or specified with full paths findPeaks sample10.bam -o sample10_peaks.txt -i sample12.bam -style histone -fragLength 50 -size 200 -minDist 1000 -L 0
-
8
for sample 13-16: transcript count were generated by kallisto version 0.42.4 using GENCODE GRCm38.p4 transcriptome as reference, with the following parameter: -l 200 -s 20 --single and other setting as default
$ Bash example
# Install kallisto if not already installed # conda install -c bioconda kallisto # Placeholder for kallisto index generation (if not already done) # The GENCODE GRCm38.p4 transcriptome FASTA file would be needed here. # For example, if the FASTA file is gencode.vM4.transcripts.fa.gz: # kallisto index -i gencode.vM4.transcripts.idx gencode.vM4.transcripts.fa.gz # Run kallisto quantification for a single sample (e.g., sample 13) # Assuming input fastq is sample_13.fastq.gz and output directory is kallisto_output_sample_13 kallisto quant -i gencode.vM4.transcripts.idx -o kallisto_output_sample_13 --single -l 200 -s 20 sample_13.fastq.gz
-
9
For all Single Cell Samples the following processing steps apply:
$ Bash example
# This is a placeholder script for single-cell sample processing. # The actual tools and commands depend heavily on the specific single-cell assay (e.g., scRNA-seq, scATAC-seq) # and the technology used (e.g., 10x Genomics, Smart-seq2). # Define reference genome (placeholder for human hg38) # Replace with the actual path to your reference genome and annotation files. REFERENCE_GENOME_FASTA="/path/to/human/hg38/GRCh38.primary_assembly.genome.fa" REFERENCE_GENOME_GTF="/path/to/human/hg38/gencode.v38.annotation.gtf" # For 10x Genomics data, a Cell Ranger reference transcriptome is typically required: # CELLRANGER_TRANSCRIPTOME_REF="/path/to/cellranger/refdata-gex-GRCh38-2020-A" # Example conceptual steps for single-cell data processing: # 1. Quality Control (e.g., using FastQC) # fastqc -o qc_output/ sample_R1.fastq.gz sample_R2.fastq.gz # 2. Demultiplexing and Barcode Processing (if applicable, e.g., for 10x Genomics) # For 10x Genomics, this is often done with 'cellranger mkfastq' or handled by the sequencing provider. # cellranger mkfastq --id=sample_id --run=/path/to/sequencing_run_folder --csv=/path/to/sample_sheet.csv # 3. Alignment and Quantification (e.g., for scRNA-seq) # For 10x Genomics scRNA-seq, 'cellranger count' is the standard tool: # cellranger count \ # --id=sample_id \ # --transcriptome=${CELLRANGER_TRANSCRIPTOME_REF} \ # --fastqs=/path/to/fastqs_directory \ # --sample=sample_name \ # --expect-cells=3000 # For generic scRNA-seq (e.g., Smart-seq2, using a splice-aware aligner like STAR and a quantifier like Salmon): # # Build STAR index (run once per reference genome) # # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index --genomeFastaFiles ${REFERENCE_GENOME_FASTA} --sjdbGTFfile ${REFERENCE_GENOME_GTF} --runThreadN 8 # # Align reads with STAR # # STAR \ # # --genomeDir /path/to/STAR_index \ # # --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \ # # --readFilesCommand zcat \ # # --outFileNamePrefix star_output/sample_ \ # # --outSAMtype BAM SortedByCoordinate \ # # --outSAMunmapped Within \ # # --outFilterType BySJout \ # # --outFilterMultimapNmax 20 \ # # --alignSJDBoverhangMin 1 \ # # --alignSJoverhangMin 8 \ # # --alignIntronMin 20 \ # # --alignIntronMax 1000000 \ # # --alignMatesGapMax 1000000 \ # # --runThreadN 8 # # Quantify transcripts with Salmon (requires a Salmon index built from a transcriptome FASTA) # # salmon index -t /path/to/transcripts.fa -i salmon_index # # salmon quant \ # # -i salmon_index \ # # -l A \ # # -1 sample_R1.fastq.gz \ # # -2 sample_R2.fastq.gz \ # # -o salmon_output/sample_quant \ # # --validateMappings \ # # --gcBias # Further downstream analysis typically involves dimension reduction, clustering, and differential expression analysis # using tools like Seurat, Scanpy, or Bioconductor packages in R. -
10
Single-cell RNA-seq data pre-processing.
RNA-seq v7.1.0 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install Cell Ranger (example using conda, though 10x Genomics provides binaries and recommends direct download) # conda create -n cellranger_env -c bioconda cellranger=7.1.0 # conda activate cellranger_env # Download a pre-built human transcriptome reference (e.g., GRCh38 from 10x Genomics) # mkdir -p /path/to/references # cd /path/to/references # wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz # tar -xzf refdata-gex-GRCh38-2020-A.tar.gz # Run cellranger count for single-cell RNA-seq pre-processing (alignment and quantification) # Replace /path/to/fastq_files with the actual directory containing your FASTQ files. # Ensure FASTQ files follow 10x Genomics naming conventions (e.g., SAMPLE_S1_L001_R1_001.fastq.gz). # Replace my_sample_analysis with a unique identifier for your analysis output directory. # Replace my_sample_prefix with the prefix of your FASTQ files (e.g., 'SAMPLE' if files are SAMPLE_S1_L001_R1_001.fastq.gz). cellranger count \ --id=my_sample_analysis \ --transcriptome=/path/to/references/refdata-gex-GRCh38-2020-A \ --fastqs=/path/to/fastq_files \ --sample=my_sample_prefix \ --localcores=8 \ --localmem=64 -
11
Single-cell mRNA sequencing data from 256 murine CD8+ T cells were processed with a bioinformatics pipeline focusing on quality control (QC) and robust expression quantification.
$ Bash example
# Install Salmon and FastQC (example using conda) # conda create -n sc_rna_seq_env salmon fastqc -y # conda activate sc_rna_seq_env # --- Reference Data Setup --- # Download murine (GRCm39) cDNA transcriptome from Ensembl (example release 110) # mkdir -p reference # wget -P reference ftp://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/cdna/Mus_musculus.GRCm39.cdna.all.fa.gz # Build Salmon index for GRCm39 transcriptome # salmon index -t reference/Mus_musculus.GRCm39.cdna.all.fa.gz -i salmon_index_GRCm39 --gencode # --- Quality Control (QC) --- # Assuming raw paired-end reads are named 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz' # Replace with actual input file names mkdir -p fastqc_results fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o fastqc_results # --- Robust Expression Quantification with Salmon --- # Replace 'salmon_index_GRCm39' with the path to your built Salmon index # Replace 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz' with actual input file names # -l A: Auto-detect library type (common for scRNA-seq) # -p 8: Use 8 threads for quantification # --validateMappings: Perform additional validation of mappings for improved accuracy mkdir -p salmon_quant_output salmon quant -i salmon_index_GRCm39 -l A \ -1 sample_R1.fastq.gz \ -2 sample_R2.fastq.gz \ -p 8 --validateMappings \ -o salmon_quant_output/sample_quant -
12
For each cell, raw RNA-seq reads were: checked for quality metrics with fastqc (v0.10.1); poly-A and adaptor-trimmed with cutadapt (v1.8.1); quantified by kallisto (v0.42.1) to a reference transcriptome (Gencode vM3) without bias correction; and aligned by STAR (v2.4.1b) to the reference mouse genome (mm10) with default parameters for quality control and downstream analysis.
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Placeholder for STAR genome index for the reference mouse genome (mm10/GRCm38). # You would typically build this once using the mm10 genome FASTA and GTF files. # Example for building index (uncomment and modify paths as needed): # STAR --runMode genomeGenerate \ # --genomeDir /path/to/mm10_star_index \ # --genomeFastaFiles /path/to/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \ # --sjdbGTFfile /path/to/Mus_musculus.GRCm38.98.gtf.gz \ # --runThreadN 8 # Adjust threads as needed # Align RNA-seq reads to the mm10 reference genome using STAR with default parameters. # Input: reads.fastq.gz (replace with your actual input file, e.g., from fastqc/cutadapt output) # Output: Aligned BAM file (e.g., aligned_reads_Aligned.sortedByCoord.out.bam) STAR --genomeDir /path/to/mm10_star_index \ --readFilesIn reads.fastq.gz \ --runThreadN 8 \ --outFileNamePrefix aligned_reads_ \ --outSAMtype BAM SortedByCoordinate -
13
Next, the transcript per million (TPM) outputs of kallisto for all cells were combined into a cell-by-gene expression matrix (C=288 cells=rows, G=22425 genes=columns) by summing the expression values for all quantified transcripts of a given gene.
$ Bash example
# Install kallisto (if not already installed) # conda install -c bioconda kallisto # pip install pandas # --- Reference Data Placeholders --- # For kallisto quantification, a transcriptome index is needed. # Example: Homo_sapiens.GRCh38.cdna.all.fa.gz from Ensembl # kallisto index -i Homo_sapiens.GRCh38.idx Homo_sapiens.GRCh38.cdna.all.fa.gz # A gene-transcript mapping file is crucial for summing transcript TPMs to gene TPMs. # This can be generated by parsing a GTF/GFF file (e.g., Homo_sapiens.GRCh38.109.gtf from Ensembl). # Example command to create gene_transcript_map.tsv (gene_id\ttranscript_id): # Replace 'Homo_sapiens.GRCh38.109.gtf' with your actual GTF file path. # awk '$3 == "transcript" { # gene_id=""; transcript_id=""; # split($9, a, ";"); # for (i in a) { # if (a[i] ~ /gene_id/) gene_id=substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1); # if (a[i] ~ /transcript_id/) transcript_id=substr(a[i], index(a[i], "\"")+1, length(a[i])-index(a[i], "\"")-1) # } # if (gene_id && transcript_id) print gene_id "\t" transcript_id # }' Homo_sapiens.GRCh38.109.gtf > gene_transcript_map.tsv # --- Define Input and Output Paths --- # Directory containing kallisto output subdirectories for each cell (e.g., cell1/abundance.tsv, cell2/abundance.tsv) KALLISTO_OUTPUT_DIR="kallisto_quant_outputs" # Path to the gene-transcript mapping file GENE_TRANSCRIPT_MAP="gene_transcript_map.tsv" # Output file for the combined cell-by-gene expression matrix OUTPUT_MATRIX="cell_by_gene_expression_matrix.tsv" # --- Create a Python script to combine kallisto outputs --- # This script will read abundance.tsv files, map transcripts to genes, # sum TPMs per gene, and combine into a single cell-by-gene matrix. cat << 'EOF' > merge_kallisto_gene_tpm.py import pandas as pd import os import sys kallisto_output_dir = sys.argv[1] gene_transcript_map_file = sys.argv[2] output_matrix_file = sys.argv[3] # Load gene-transcript map # Assuming gene_transcript_map.tsv has two columns: gene_id\ttranscript_id gene_map = pd.read_csv(gene_transcript_map_file, sep='\t', header=None, names=['gene_id', 'transcript_id']) transcript_to_gene = gene_map.set_index('transcript_id')['gene_id'].to_dict() all_gene_tpm_dfs = [] # Iterate through each cell's kallisto output directory for cell_dir_name in sorted(os.listdir(kallisto_output_dir)): full_cell_dir = os.path.join(kallisto_output_dir, cell_dir_name) if os.path.isdir(full_cell_dir): abundance_file = os.path.join(full_cell_dir, 'abundance.tsv') if os.path.exists(abundance_file): cell_name = cell_dir_name # Load abundance data (target_id, length, eff_length, est_counts, tpm) df_abundance = pd.read_csv(abundance_file, sep='\t') # Map transcripts (target_id) to genes and sum TPMs df_abundance['gene_id'] = df_abundance['target_id'].map(transcript_to_gene) # Drop transcripts that couldn't be mapped to a gene (if any, e.g., non-coding RNAs not in map) df_abundance = df_abundance.dropna(subset=['gene_id']) # Sum TPMs by gene gene_tpm_series = df_abundance.groupby('gene_id')['tpm'].sum() all_gene_tpm_dfs.append(gene_tpm_series.rename(cell_name)) else: print(f"Warning: {abundance_file} not found for {cell_dir_name}", file=sys.stderr) if not all_gene_tpm_dfs: print("Error: No kallisto abundance files found to process.", file=sys.stderr) sys.exit(1) # Combine all gene TPM series into a single DataFrame # Use outer join to include all genes across all cells, filling missing values with 0 final_matrix = pd.concat(all_gene_tpm_dfs, axis=1).fillna(0) # Transpose the matrix to have cells as rows and genes as columns, as per description # (C=288 cells=rows, G=22425 genes=columns) final_matrix = final_matrix.T # Write the final matrix to a TSV file final_matrix.to_csv(output_matrix_file, sep='\t', index=True, index_label='CellID') print(f"Successfully created {output_matrix_file}") EOF # --- Execute the Python script --- # Ensure Python and pandas are installed in your environment. python merge_kallisto_gene_tpm.py "${KALLISTO_OUTPUT_DIR}" "${GENE_TRANSCRIPT_MAP}" "${OUTPUT_MATRIX}" # --- Clean up (optional) --- # rm merge_kallisto_gene_tpm.py -
14
Finally, the TPM value for each cell c and gene g was natural log-transformed to yield a normalized expression value: EXPRc,g = ln(1+TPMc,g).
awk (Inferred with models/gemini-2.5-flash) vGNU Awk (gawk) (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Assuming input_tpm_file.tsv is a tab-separated file with a header. # Replace '3' with the actual 1-indexed column number containing TPM values. # If your file is comma-separated, change 'FS=OFS="\t"' to 'FS=OFS=","'. awk 'BEGIN {FS=OFS="\t"} NR==1 {print $0, "EXPR_value"; next} {print $0, log(1+$3)}' input_tpm_file.tsv > output_expr_file.tsv -
15
Dimensionality reduction and cell heterogeneity visualization.
Seurat (Inferred with models/gemini-2.5-flash) v5.0.0 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install R and Seurat if not already present # conda install -c conda-forge r-seurat -y # This command assumes an R script named 'run_seurat_dr.R' exists # that performs dimensionality reduction (e.g., PCA, UMAP) and visualization # on a Seurat object. The script would typically take an input Seurat object, # process it, and save an updated object and a visualization plot. # # Example content for 'run_seurat_dr.R': # library(Seurat) # args <- commandArgs(trailingOnly = TRUE) # input_file <- args[1] # output_file <- args[2] # output_plot <- args[3] # seurat_obj <- readRDS(input_file) # seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(object = seurat_obj)) # seurat_obj <- RunUMAP(seurat_obj, dims = 1:30) # Adjust dims as needed # png(output_plot, width = 800, height = 600) # DimPlot(seurat_obj, reduction = "umap", group.by = "seurat_clusters") # Adjust group.by # dev.off() # saveRDS(seurat_obj, output_file) # Execute the R script for dimensionality reduction and visualization Rscript run_seurat_dr.R input_seurat_object.rds processed_seurat_object.rds umap_visualization.png
-
16
To reduce the dimensionality of the cell-by-gene expression matrix EXPR and visualize the diversity of gene expression among CD8+ T cells of different subtypes in a 2-dimensional scatter plot, we applied the t-distributed Stochastic Neighborhood Embedding (tSNE) algorithm via its Barnes-Hut approximation (bhSNE).
$ Bash example
# Create a dummy EXPR.csv for demonstration if it doesn't exist # In a real pipeline, 'EXPR.csv' would be the actual cell-by-gene expression matrix. if [ ! -f EXPR.csv ]; then echo "Creating dummy EXPR.csv for demonstration..." python -c " import pandas as pd import numpy as np num_cells = 100 num_genes = 500 EXPR = pd.DataFrame(np.random.rand(num_cells, num_genes), index=[f'cell_{i}' for i in range(num_cells)], columns=[f'gene_{j}' for j in range(num_genes)]) EXPR.to_csv('EXPR.csv', index=False) print('Dummy EXPR.csv created.') " fi # Create the Python script for t-SNE dimensionality reduction cat << 'EOF' > run_tsne.py import pandas as pd import numpy as np from sklearn.manifold import TSNE import sys # Load the cell-by-gene expression matrix # Replace 'EXPR.csv' with the actual path to your expression matrix # Ensure the matrix is in a format suitable for TSNE (e.g., cells as rows, genes as columns) try: EXPR = pd.read_csv("EXPR.csv") except FileNotFoundError: print("Error: EXPR.csv not found. Please provide the input expression matrix.", file=sys.stderr) sys.exit(1) # Initialize t-SNE with Barnes-Hut approximation (default for sklearn's TSNE) # n_components=2 for 2-dimensional visualization as specified in the description. # perplexity and learning_rate are crucial parameters for t-SNE, often tuned. # Using common default values here as they are not specified in the description. # n_jobs=-1 utilizes all available CPU cores for faster computation. tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42, method='barnes_hut', n_jobs=-1) # Apply t-SNE to the expression matrix tsne_results = tsne.fit_transform(EXPR) # Create a DataFrame for the 2-dimensional t-SNE coordinates tsne_df = pd.DataFrame(data=tsne_results, columns=['tSNE_1', 'tSNE_2']) # If EXPR had cell IDs as an index, you might want to add them back: # tsne_df.index = EXPR.index # Save the 2-dimensional t-SNE coordinates to a CSV file tsne_df.to_csv("tsne_2d_coordinates.csv", index=False) print("t-SNE dimensionality reduction complete. Results saved to tsne_2d_coordinates.csv") EOF # Install required Python packages if not already installed # conda install scikit-learn pandas numpy # pip install scikit-learn pandas numpy # Execute the t-SNE Python script python run_tsne.py -
17
We first applied standard Principal Components Analysis (PCA) to reduce the dimensionality down to D=10, and only then applied bhSNE to visualize in D=2 (with perplexity=30 and theta=0.75 parameters).
$ Bash example
# Install necessary libraries (if not already installed) # pip install scikit-learn pandas # Create a Python script for PCA cat << 'EOF' > run_pca.py import pandas as pd from sklearn.decomposition import PCA import sys # Expects input_file, output_file, and n_components as command-line arguments input_file = sys.argv[1] output_file = sys.argv[2] n_components = int(sys.argv[3]) # Load data (assuming tab-separated, first column as index for sample IDs) # Data should be samples x features, where samples are rows and features are columns. # If your data is features x samples, you might need to transpose it before PCA. data = pd.read_csv(input_file, sep="\t", index_col=0) # Initialize PCA with the specified number of components pca = PCA(n_components=n_components) # Fit PCA to the data and transform it to get the principal components principal_components = pca.fit_transform(data) # Create a DataFrame for the principal components with meaningful column names pc_df = pd.DataFrame(data=principal_components, columns=[f'PC{i+1}' for i in range(n_components)], index=data.index) # Save the principal components to a tab-separated file pc_df.to_csv(output_file, sep="\t") # Optionally, save the explained variance ratio for each component explained_variance_ratio_file = output_file.replace('.tsv', '_explained_variance_ratio.tsv') explained_variance_ratio = pd.DataFrame(pca.explained_variance_ratio_, index=[f'PC{i+1}' for i in range(n_components)], columns=['Explained Variance Ratio']) explained_variance_ratio.to_csv(explained_variance_ratio_file, sep="\t") EOF # Example usage: # Replace 'input_matrix.tsv' with your actual preprocessed data matrix file. # This file should be tab-separated, with the first column as sample identifiers # and subsequent columns as features (e.g., gene expression, methylation levels). # The output 'pca_output_D10.tsv' will contain the top 10 principal components. python run_pca.py input_matrix.tsv pca_output_D10.tsv 10 -
18
Differential gene expression analysis.
$ Bash example
# Install R and Bioconductor if not already present # conda create -n deseq2_env r-base bioconductor-deseq2 -c conda-forge -c bioconda # conda activate deseq2_env # Or, if R is already installed: # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # Create dummy count and metadata files for demonstration # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto) # gene_counts.tsv: Tab-separated file with gene IDs as first column, sample names as header, and raw counts. # sample_metadata.tsv: Tab-separated file with sample IDs as first column, and experimental conditions/covariates. echo -e "gene\tsample1\tsample2\tsample3\tsample4" > gene_counts.tsv echo -e "geneA\t100\t120\t500\t550" >> gene_counts.tsv echo -e "geneB\t50\t60\t30\t35" >> gene_counts.tsv echo -e "geneC\t200\t210\t220\t230" >> gene_counts.tsv echo -e "geneD\t10\t12\t100\t110" >> gene_counts.tsv echo -e "sample_id\tcondition" > sample_metadata.tsv echo -e "sample1\tcontrol" >> sample_metadata.tsv echo -e "sample2\tcontrol" >> sample_metadata.tsv echo -e "sample3\ttreated" >> sample_metadata.tsv echo -e "sample4\ttreated" >> sample_metadata.tsv # Create the R script for DESeq2 analysis cat << 'EOF' > deseq2_analysis.R library(DESeq2) # Load count data # Assumes 'gene_counts.tsv' has genes as rows and samples as columns, with raw counts. count_data <- read.table("gene_counts.tsv", header = TRUE, row.names = 1, sep = "\t") # Load sample metadata # Assumes 'sample_metadata.tsv' has sample names as row names and experimental conditions as columns. sample_info <- read.table("sample_metadata.tsv", header = TRUE, row.names = 1, sep = "\t") # Ensure sample order matches between count data and metadata sample_info <- sample_info[colnames(count_data), , drop = FALSE] # Create DESeqDataSet object # The 'design' formula specifies the experimental design. Here, we compare 'condition'. dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = ~ condition) # Filter out genes with very low counts (optional but recommended for better power) # Keeps genes with at least 10 total counts across all samples. keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # Run DESeq2 analysis dds <- DESeq(dds) # Get results for 'treated' vs 'control' comparison # Adjust levels as per your actual data (e.g., c("condition", "treatment_group", "control_group")) res <- results(dds, contrast = c("condition", "treated", "control")) # Order results by adjusted p-value (padj) res_ordered <- res[order(res$padj), ] # Write differential expression results to a CSV file write.csv(as.data.frame(res_ordered), "deseq2_results.csv") # Optional: Generate a normalized count matrix normalized_counts <- counts(dds, normalized=TRUE) write.csv(as.data.frame(normalized_counts), "normalized_counts.csv") EOF # Execute the R script to perform differential gene expression analysis Rscript deseq2_analysis.R # Clean up dummy files (optional) # rm gene_counts.tsv sample_metadata.tsv deseq2_analysis.R -
19
We performed differential gene expression analysis between all pairs of T cell sub-populations from two non-overlapping sets of rows in the log-transformed expression matrix EXPR.
$ Bash example
#!/bin/bash # Define input files and output directory EXPR_MATRIX_FILE="EXPR.tsv" METADATA_FILE="metadata.tsv" OUTPUT_DIR="dge_results" GROUP_COLUMN="t_cell_subpopulation" # Column in metadata defining groups # Create dummy input files for demonstration if they don't exist # In a real pipeline, these files would be generated by upstream steps. if [ ! -f "$EXPR_MATRIX_FILE" ]; then echo "Creating dummy EXPR.tsv..." echo -e "Gene\tSample1\tSample2\tSample3\tSample4\tSample5\tSample6" > "$EXPR_MATRIX_FILE" echo -e "GeneA\t10.5\t11.2\t10.8\t15.1\t14.8\t15.5" >> "$EXPR_MATRIX_FILE" echo -e "GeneB\t8.2\t8.5\t8.0\t9.1\t9.3\t9.0" >> "$EXPR_MATRIX_FILE" echo -e "GeneC\t12.1\t11.9\t12.5\t11.5\t11.8\t12.0" >> "$EXPR_MATRIX_FILE" echo -e "GeneD\t5.0\t5.2\t5.1\t4.8\t4.9\t5.0" >> "$EXPR_MATRIX_FILE" fi if [ ! -f "$METADATA_FILE" ]; then echo "Creating dummy metadata.tsv..." echo -e "Sample\tt_cell_subpopulation\tother_info" > "$METADATA_FILE" echo -e "Sample1\tCD4_T_cell\tinfo1" >> "$METADATA_FILE" echo -e "Sample2\tCD4_T_cell\tinfo2" >> "$METADATA_FILE" echo -e "Sample3\tCD8_T_cell\tinfo3" >> "$METADATA_FILE" echo -e "Sample4\tCD8_T_cell\tinfo4" >> "$METADATA_FILE" echo -e "Sample5\tRegulatory_T_cell\tinfo5" >> "$METADATA_FILE" echo -e "Sample6\tRegulatory_T_cell\tinfo6" >> "$METADATA_FILE" fi # Create output directory mkdir -p "$OUTPUT_DIR" # R script content R_SCRIPT=$(cat <<EOF # Load necessary libraries # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("limma") library(limma) # --- Configuration --- EXPR_MATRIX_FILE <- "$EXPR_MATRIX_FILE" METADATA_FILE <- "$METADATA_FILE" OUTPUT_DIR <- "$OUTPUT_DIR" GROUP_COLUMN <- "$GROUP_COLUMN" # --- Load Data --- expr_matrix <- as.matrix(read.delim(EXPR_MATRIX_FILE, row.names = 1, sep="\\t")) metadata <- read.delim(METADATA_FILE, row.names = 1, sep="\\t") # Ensure sample names match and are in the same order # Assuming sample names are column names in expr_matrix and row names in metadata if (!all(colnames(expr_matrix) %in% rownames(metadata))) { stop("Sample names in expression matrix and metadata do not match.") } metadata <- metadata[colnames(expr_matrix), ] # Reorder metadata to match expr_matrix columns # Define the grouping factor groups <- factor(metadata[[GROUP_COLUMN]]) design <- model.matrix(~0 + groups) colnames(design) <- levels(groups) # --- Differential Expression Analysis --- # Fit linear model fit <- lmFit(expr_matrix, design) # Define all pairwise contrasts group_levels <- levels(groups) contrasts_list <- list() k <- 1 for (i in 1:(length(group_levels) - 1)) { for (j in (i + 1):length(group_levels)) { contrast_name <- paste0(group_levels[j], "_vs_", group_levels[i]) contrast_formula <- paste0(group_levels[j], " - ", group_levels[i]) contrasts_list[[k]] <- makeContrasts(contrasts = contrast_formula, levels = design) names(contrasts_list)[k] <- contrast_name k <- k + 1 } } # Apply contrasts contrast_matrix <- do.call(cbind, contrasts_list) fit2 <- contrasts.fit(fit, contrast_matrix) # Empirical Bayes moderation fit2 <- eBayes(fit2) # Extract and save results for each contrast for (contrast_name in names(contrasts_list)) { results <- topTable(fit2, coef = contrast_name, number = Inf, adjust.method = "BH") output_file <- file.path(OUTPUT_DIR, paste0("dge_results_", contrast_name, ".tsv")) write.table(results, file = output_file, sep = "\\t", quote = FALSE, row.names = TRUE) message(paste("Results for", contrast_name, "saved to", output_file)) } message("Differential gene expression analysis complete.") EOF ) # Execute the R script Rscript -e "$R_SCRIPT" -
20
Since single-cell gene expression does not conform to the usual negative binomial distribution and can even be bimodal due to dropout we used two non-parametric statistical tests for heterogeneity of expression: Mann Whitney Wilcoxon (MWW, also known as MWU) which is a rank-sum test but relies on a large sample to approximate normality, and Kolmogorov-Smirnov 2-sample (KS2) test which finds the largest difference between the empirical cumulative distributions, even between two small samples such as our 1st division sub-types Div1TE (n =36) and Div1MEM (n =24).
$ Bash example
# This script demonstrates how to perform Mann-Whitney-Wilcoxon (MWW) and Kolmogorov-Smirnov 2-sample (KS2) tests using R. # The description indicates comparing gene expression between two groups: Div1TE (n=36) and Div1MEM (n=24). # We simulate data for demonstration purposes. In a real pipeline, 'div1te_expr' and 'div1mem_expr' would be loaded from actual gene expression data files. # Example of how to install R (if not already installed) # sudo apt-get update # sudo apt-get install r-base Rscript -e ' # Set seed for reproducibility of simulated data set.seed(123) # Simulate gene expression data for Div1TE (n=36) # Assuming expression values might follow a log-normal distribution, common for gene expression div1te_expr <- rlnorm(36, meanlog = 1.5, sdlog = 0.8) # Simulate gene expression data for Div1MEM (n=24) # Assuming a potentially different distribution to test for heterogeneity div1mem_expr <- rlnorm(24, meanlog = 1.2, sdlog = 0.7) cat("Simulated Div1TE expression (first 5 values):\n") print(head(div1te_expr)) cat("Simulated Div1MEM expression (first 5 values):\n") print(head(div1mem_expr)) # Perform Mann-Whitney-Wilcoxon (MWW) test # This is a non-parametric test for comparing two independent samples. # The description implies testing for "heterogeneity of expression", so a two-sided alternative is appropriate. cat("\n--- Mann-Whitney-Wilcoxon Test ---\n") mww_result <- wilcox.test(div1te_expr, div1mem_expr, alternative = "two.sided") print(mww_result) # Perform Kolmogorov-Smirnov 2-sample (KS2) test # This test compares the empirical cumulative distribution functions of two samples. cat("\n--- Kolmogorov-Smirnov 2-sample Test ---\n") ks2_result <- ks.test(div1te_expr, div1mem_expr) print(ks2_result) # In a real scenario, you might save these results to a file: # sink("statistical_test_results.txt") # cat("Mann-Whitney-Wilcoxon Test Result:\n") # print(mww_result) # cat("\nKolmogorov-Smirnov 2-sample Test Result:\n") # print(ks2_result) # sink() ' -
21
Cell type classifier.
$ Bash example
# Define input and output files INPUT_EXPRESSION_MATRIX="expression_matrix.tsv" # Path to the input single-cell expression matrix (e.g., counts or normalized data) REFERENCE_DATA="human_pbmc_reference.rds" # Path to a pre-built SingleR reference object (e.g., from celldex or a custom one) OUTPUT_ANNOTATIONS="cell_type_annotations.tsv" # Output file for predicted cell type annotations # Commented out installation instructions # # Install BiocManager if not already installed # # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')" # # Install SingleR and celldex (if using celldex references) # # R -e "BiocManager::install(c('SingleR', 'celldex'))" # # Install Matrix package if needed for sparse matrices # # R -e "install.packages('Matrix')" # Set environment variables for the R script export INPUT_EXPRESSION_MATRIX export REFERENCE_DATA export OUTPUT_ANNOTATIONS # Execute SingleR for cell type classification # This R script loads an expression matrix and a SingleR reference, then performs classification. # The reference data should be an RDS file containing a list with 'data' (expression matrix) and 'labels' (cell types) # or a SummarizedExperiment object from celldex. Rscript -e " library(SingleR) library(Matrix) # Often needed for sparse matrices # Retrieve file paths from environment variables input_matrix_file <- Sys.getenv('INPUT_EXPRESSION_MATRIX') reference_file <- Sys.getenv('REFERENCE_DATA') output_annotations_file <- Sys.getenv('OUTPUT_ANNOTATIONS') # Load expression matrix # Assuming a tab-separated file with genes as rows and cells as columns. # Adjust 'read.table' parameters or use other functions (e.g., 'readRDS', 'Seurat::Read10X') # based on the actual format of your input expression data. # Example for a dense matrix: expr_matrix <- as.matrix(read.table(input_matrix_file, sep='\t', header=TRUE, row.names=1)) # Example for a sparse matrix (e.g., from 10x Genomics, after loading with Matrix::readMM): # expr_matrix <- readMM(input_matrix_file) # If input_matrix_file is a .mtx file # colnames(expr_matrix) <- readLines('barcodes.tsv') # Assuming barcodes file # rownames(expr_matrix) <- readLines('features.tsv') # Assuming features file # Load SingleR reference data # This should be an RDS file containing a SingleR-compatible reference object. # For example, a list with 'data' (expression matrix) and 'labels' (cell type vector), # or a SummarizedExperiment object from the 'celldex' package. ref_obj <- readRDS(reference_file) # Perform cell type annotation # If 'ref_obj' is a SummarizedExperiment from celldex (e.g., HumanPrimaryCellAtlasData()), # SingleR can directly use it: predictions <- SingleR(test = expr_matrix, ref = ref_obj, labels = ref_obj$label.main) # If 'ref_obj' is a custom list: predictions <- SingleR(test = expr_matrix, ref = ref_obj$data, labels = ref_obj$labels) # Extract and save results cell_annotations <- data.frame( cell_id = rownames(predictions), predicted_label = predictions$labels, score = predictions$scores[, predictions$labels] # Score for the predicted label ) write.table(cell_annotations, output_annotations_file, sep='\t', quote=FALSE, row.names=FALSE) " -
22
We trained two binary T cell classifiers to identify gene expression signatures that not only differentiate the examined T cell sub-populations (like the differential gene expression described above) but can also be used to predict the âmemory-â or âeffector-nessâ of previously unseen cells.
Machine Learning Framework (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# This script demonstrates a conceptual approach to training binary T cell classifiers using Python and scikit-learn. # It involves loading gene expression data, preparing labels, training two separate classifiers, # and saving the trained models. Specific algorithms (e.g., SVM) are used as examples. # Install necessary libraries (example for Python/scikit-learn) # conda create -n ml_env python=3.9 # conda activate ml_env # pip install pandas scikit-learn numpy joblib # Placeholder for a Python script that would perform the classification # This script would: # 1. Load gene expression data (e.g., from a CSV file) # 2. Load labels for T cell sub-populations and 'memory-ness'/'effector-ness' # 3. Preprocess data (e.g., normalization, scaling) # 4. Split data into training and testing sets # 5. Train two binary classifiers (e.g., for sub-population differentiation and memory/effector prediction) # 6. Evaluate the classifiers # 7. Save the trained models and/or prediction results cat << 'EOF' > train_tcell_classifiers.py import pandas as pd from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.svm import SVC # Example classifier: Support Vector Classifier from sklearn.metrics import accuracy_score, classification_report import joblib # For saving models import numpy as np # --- Configuration --- GENE_EXPRESSION_FILE = "gene_expression_data.csv" CELL_LABELS_FILE = "cell_labels.csv" # Contains 'subpopulation' and 'memory_effector_status' # --- Load Data --- print("Loading data...") try: gene_expr_df = pd.read_csv(GENE_EXPRESSION_FILE, index_col=0) labels_df = pd.read_csv(CELL_LABELS_FILE, index_col=0) except FileNotFoundError as e: print(f"Error loading data: {e}. Please ensure '{GENE_EXPRESSION_FILE}' and '{CELL_LABELS_FILE}' exist.") exit(1) # Ensure indices match and align data common_cells = gene_expr_df.index.intersection(labels_df.index) gene_expr_df = gene_expr_df.loc[common_cells] labels_df = labels_df.loc[common_cells] X = gene_expr_df.values y_subpop = labels_df['subpopulation'].values y_mem_eff = labels_df['memory_effector_status'].values # --- Preprocessing --- print("Preprocessing data (scaling features)...") scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # --- Train Classifier 1: Differentiate T cell sub-populations --- print("\nTraining classifier for T cell sub-populations...") # Split data for sub-population classification X_train_subpop, X_test_subpop, y_train_subpop, y_test_subpop = train_test_split(X_scaled, y_subpop, test_size=0.2, random_state=42, stratify=y_subpop) # Initialize and train the classifier classifier_subpop = SVC(kernel='linear', probability=True, random_state=42) classifier_subpop.fit(X_train_subpop, y_train_subpop) # Evaluate the classifier y_pred_subpop = classifier_subpop.predict(X_test_subpop) print("\nClassifier 1 (Sub-population) Evaluation:") print(f"Accuracy: {accuracy_score(y_test_subpop, y_pred_subpop):.4f}") print(classification_report(y_test_subpop, y_pred_subpop, zero_division=0)) # Save the trained model joblib.dump(classifier_subpop, 'tcell_subpop_classifier.pkl') print("Saved T cell sub-population classifier to 'tcell_subpop_classifier.pkl'") # --- Train Classifier 2: Predict 'memory-' or 'effector-ness' --- print("\nTraining classifier for 'memory-' or 'effector-ness'...") # Split data for memory/effector classification X_train_mem_eff, X_test_mem_eff, y_train_mem_eff, y_test_mem_eff = train_test_split(X_scaled, y_mem_eff, test_size=0.2, random_state=42, stratify=y_mem_eff) # Initialize and train the classifier classifier_mem_eff = SVC(kernel='rbf', probability=True, random_state=42) classifier_mem_eff.fit(X_train_mem_eff, y_train_mem_eff) # Evaluate the classifier y_pred_mem_eff = classifier_mem_eff.predict(X_test_mem_eff) print("\nClassifier 2 (Memory/Effector) Evaluation:") print(f"Accuracy: {accuracy_score(y_test_mem_eff, y_pred_mem_eff):.4f}") print(classification_report(y_test_mem_eff, y_pred_mem_eff, zero_division=0)) # Save the trained model joblib.dump(classifier_mem_eff, 'tcell_mem_eff_classifier.pkl') print("Saved T cell memory/effector classifier to 'tcell_mem_eff_classifier.pkl'") print("\nTraining complete. Models saved.") EOF # Create dummy input files for demonstration purposes echo "Creating dummy input files for demonstration..." cat << 'EOF' > gene_expression_data.csv cell_id,geneA,geneB,geneC,geneD,geneE,geneF,geneG,geneH,geneI,geneJ cell1,10,20,5,15,25,30,12,18,22,8 cell2,12,22,6,16,26,32,14,20,24,10 cell3,8,18,4,14,24,28,10,16,20,6 cell4,11,21,7,17,27,31,13,19,23,9 cell5,9,19,3,13,23,29,11,17,21,7 cell6,13,23,8,18,28,33,15,21,25,11 cell7,7,17,2,12,22,27,9,15,19,5 cell8,14,24,9,19,29,34,16,22,26,12 cell9,6,16,1,11,21,26,8,14,18,4 cell10,15,25,10,20,30,35,17,23,27,13 cell11,10.5,20.5,5.5,15.5,25.5,30.5,12.5,18.5,22.5,8.5 cell12,12.5,22.5,6.5,16.5,26.5,32.5,14.5,20.5,24.5,10.5 cell13,8.5,18.5,4.5,14.5,24.5,28.5,10.5,16.5,20.5,6.5 cell14,11.5,21.5,7.5,17.5,27.5,31.5,13.5,19.5,23.5,9.5 cell15,9.5,19.5,3.5,13.5,23.5,29.5,11.5,17.5,21.5,7.5 cell16,13.5,23.5,8.5,18.5,28.5,33.5,15.5,21.5,25.5,11.5 cell17,7.5,17.5,2.5,12.5,22.5,27.5,9.5,15.5,19.5,5.5 cell18,14.5,24.5,9.5,19.5,29.5,34.5,16.5,22.5,26.5,12.5 cell19,6.5,16.5,1.5,11.5,21.5,26.5,8.5,14.5,18.5,4.5 cell20,15.5,25.5,10.5,20.5,30.5,35.5,17.5,23.5,27.5,13.5 EOF cat << 'EOF' > cell_labels.csv cell_id,subpopulation,memory_effector_status cell1,CD4_Treg,Memory cell2,CD8_Cytotoxic,Effector cell3,CD4_Helper,Memory cell4,CD8_Cytotoxic,Memory cell5,CD4_Treg,Effector cell6,CD8_Cytotoxic,Memory cell7,CD4_Helper,Effector cell8,CD8_Cytotoxic,Memory cell9,CD4_Treg,Memory cell10,CD8_Cytotoxic,Effector cell11,CD4_Treg,Memory cell12,CD8_Cytotoxic,Effector cell13,CD4_Helper,Memory cell14,CD8_Cytotoxic,Memory cell15,CD4_Treg,Effector cell16,CD8_Cytotoxic,Memory cell17,CD4_Helper,Effector cell18,CD8_Cytotoxic,Memory cell19,CD4_Treg,Memory cell20,CD8_Cytotoxic,Effector EOF # Execute the Python script to train the classifiers python train_tcell_classifiers.py -
23
Each classifier constructed an independent ensemble of Extremely Randomized Trees.
scikit-learn (Inferred with models/gemini-2.5-flash) v1.4.0 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install scikit-learn and joblib if not already installed # pip install scikit-learn joblib # Create a Python script to train the Extremely Randomized Trees classifier cat << 'EOF' > train_extra_trees.py import numpy as np from sklearn.ensemble import ExtraTreesClassifier from sklearn.model_selection import train_test_split import joblib import os # Placeholder for input data (features X and labels y) # In a real pipeline, X and y would be loaded from files (e.g., CSV, HDF5). # X: Feature matrix (n_samples, n_features) # y: Target labels (n_samples,) # Example: Generate synthetic data for demonstration purposes. # Replace this with actual data loading in a real application. print("Generating synthetic data for demonstration...") n_samples = 1000 n_features = 20 n_classes = 2 X = np.random.rand(n_samples, n_features) y = np.random.randint(0, n_classes, n_samples) # Split data into training and testing sets (optional, but good practice) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Initialize the Extremely Randomized Trees classifier (ExtraTreesClassifier) # Parameters like n_estimators, max_features, min_samples_leaf, etc., # would typically be optimized or set based on prior knowledge or cross-validation. # n_jobs=-1 uses all available CPU cores for parallel processing. print("Initializing ExtraTreesClassifier...") classifier = ExtraTreesClassifier( n_estimators=100, # Number of trees in the forest random_state=42, # Seed for reproducibility n_jobs=-1 # Use all available cores for fitting and prediction ) # Train the classifier print("Training the ExtraTreesClassifier...") classifier.fit(X_train, y_train) # Evaluate the classifier (optional) accuracy = classifier.score(X_test, y_test) print(f"Model accuracy on test set: {accuracy:.4f}") # Save the trained model to a file using joblib output_model_path = "extra_trees_classifier_model.joblib" print(f"Saving the trained model to {output_model_path}...") joblib.dump(classifier, output_model_path) print("Model training and saving complete.") # No specific reference genomic datasets are used for this general machine learning model training step. # The model is trained on the provided input features and labels. EOF # Execute the Python script to train the classifier python train_extra_trees.py -
24
Using the terminally differentiated effector and memory (TCM, TEM) populations, we built a training set for a fate classifier for CD8+ T cells.
$ Bash example
# Install necessary libraries if not already present # conda create -n classifier_env python=3.9 scikit-learn pandas joblib numpy # conda activate classifier_env # pip install scikit-learn pandas joblib numpy # Create a dummy input data file for demonstration if it doesn't exist. # In a real bioinformatics pipeline, 'cd8_tcell_expression_data.csv' would be # generated by a preceding data processing step, containing features (e.g., gene expression) # and a 'cell_fate' column (e.g., 0 for TCM, 1 for TEM). if [ ! -f "cd8_tcell_expression_data.csv" ]; then echo "Creating dummy input data for demonstration..." python -c " import pandas as pd import numpy as np from sklearn.datasets import make_classification # Generate synthetic data representing CD8+ T cells with features and a fate label X, y = make_classification(n_samples=200, n_features=50, n_informative=20, n_redundant=10, n_classes=2, random_state=42, weights=[0.5, 0.5]) data = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])]) data['cell_fate'] = y # 0 for TCM, 1 for TEM (example labels) data.to_csv('cd8_tcell_expression_data.csv', index=False) print('Dummy data created: cd8_tcell_expression_data.csv') " fi # Create a Python script to build the fate classifier cat << 'EOF' > classifier_training.py import pandas as pd from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score import joblib # For saving the model # Load the input data containing features and cell fate labels try: data = pd.read_csv("cd8_tcell_expression_data.csv") X = data.drop("cell_fate", axis=1) # Features y = data["cell_fate"] # Target variable (cell fate: TCM/TEM) except FileNotFoundError: print("Error: Input data file 'cd8_tcell_expression_data.csv' not found.") exit(1) # Split the data into training and testing sets # The training set will be used to build the classifier X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) print(f"Training set size: {len(X_train)} samples") print(f"Testing set size: {len(X_test)} samples") # Initialize and train a classifier (e.g., RandomForestClassifier) # The choice of classifier can be adjusted based on data characteristics and performance requirements classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1) print("Training the classifier...") classifier.fit(X_train, y_train) print("Classifier training complete.") # Evaluate the classifier on the test set (optional, but good practice) y_pred = classifier.predict(X_test) accuracy = accuracy_score(y_test, y_pred) print(f"Classifier accuracy on test set: {accuracy:.4f}") # Save the trained classifier model for future use (e.g., prediction on new data) model_filename = "cd8_tcell_fate_classifier.joblib" joblib.dump(classifier, model_filename) print(f"Trained classifier saved to {model_filename}") # Save the training data (features and labels) that were used to build the classifier # This is important for reproducibility and understanding the training set composition. training_features_output_filename = "cd8_tcell_training_set_features.csv" training_labels_output_filename = "cd8_tcell_training_set_labels.csv" X_train.to_csv(training_features_output_filename, index=False) y_train.to_csv(training_labels_output_filename, index=False) print(f"Training set features saved to {training_features_output_filename}") print(f"Training set labels saved to {training_labels_output_filename}") EOF # Execute the Python script to build the fate classifier python classifier_training.py echo "Classifier training process finished. A trained model and training set components are saved." -
25
Using the newly observed segregation of daughter T cells into Div1TE and Div1MEM subpopulations after the first division, we built a second training set for another early state classifier.
Machine Learning Classifier Training (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This is a conceptual script to illustrate building a training set for a classifier. # The actual implementation would depend on the specific data format, features, and chosen machine learning library. # Assuming input data (e.g., CSV, TSV) contains features and a 'subpopulation' column (Div1TE/Div1MEM) # Example: data.csv # cell_id,feature1,feature2,...,subpopulation # C1,0.5,1.2,...,Div1TE # C2,0.8,0.9,...,Div1MEM # Install scikit-learn if not already installed # pip install scikit-learn pandas numpy # Create a Python script (e.g., build_training_set.py) cat << 'EOF' > build_training_set.py import pandas as pd from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder import numpy as np # Placeholder for input data file # In a real scenario, this would be the output from a previous step that identified Div1TE/Div1MEM cells. input_data_file = "processed_t_cell_data.csv" # This file should contain features and the 'subpopulation' label # Placeholder for output training and testing set files output_train_file = "classifier_training_set.csv" output_test_file = "classifier_testing_set.csv" # Placeholder for features to be used in the classifier # Replace with actual feature column names from your data feature_columns = ["feature1", "feature2", "feature3"] # Example features try: # Load the data df = pd.read_csv(input_data_file) # Ensure the 'subpopulation' column exists if 'subpopulation' not in df.columns: raise ValueError(f"Column 'subpopulation' not found in {input_data_file}") # Filter for the relevant subpopulations if necessary (Div1TE, Div1MEM) # Assuming the input_data_file already contains only these or we need to filter df_filtered = df[df['subpopulation'].isin(['Div1TE', 'Div1MEM'])].copy() if df_filtered.empty: print(f"No data found for Div1TE or Div1MEM subpopulations in {input_data_file}. Exiting.") exit(1) # Prepare features (X) and labels (y) X = df_filtered[feature_columns] y = df_filtered['subpopulation'] # Encode labels if they are strings (e.g., Div1TE -> 0, Div1MEM -> 1) label_encoder = LabelEncoder() y_encoded = label_encoder.fit_transform(y) # Split data into training and testing sets # Using a common split ratio, e.g., 80% training, 20% testing # random_state ensures reproducibility of the split X_train, X_test, y_train, y_test = train_test_split( X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded ) # Combine features and encoded labels for saving train_df = pd.DataFrame(X_train, columns=feature_columns) train_df['label'] = y_train train_df['original_label'] = label_encoder.inverse_transform(y_train) test_df = pd.DataFrame(X_test, columns=feature_columns) test_df['label'] = y_test test_df['original_label'] = label_encoder.inverse_transform(y_test) # Save the training and testing sets train_df.to_csv(output_train_file, index=False) test_df.to_csv(output_test_file, index=False) print(f"Training set saved to {output_train_file}") print(f"Testing set saved to {output_test_file}") print(f"Label mapping: {list(label_encoder.classes_)} -> {list(range(len(label_encoder.classes_))))}") except FileNotFoundError: print(f"Error: Input file '{input_data_file}' not found. Please ensure the file exists.") exit(1) except ValueError as e: print(f"Error processing data: {e}") exit(1) except Exception as e: print(f"An unexpected error occurred: {e}") exit(1) EOF # Execute the Python script python build_training_set.py -
26
Both classifiers were fed their respective training sets using 10-fold cross-validation
$ Bash example
# Install scikit-learn (a common Python library for machine learning tasks) # pip install scikit-learn # This command represents the execution of a Python script that would implement # 10-fold cross-validation for the specified classifiers using their respective training sets. # The exact script name and parameters would vary based on the specific implementation. # 'training_data.csv' is a placeholder for the input training data. python run_classifier_cross_validation.py \ --input_training_data training_data.csv \ --num_folds 10 \ --output_results cross_validation_metrics.json -
27
After both the fate and early state classifiers were trained on their respective subpopulations, they were both applied on previously unseen intermediate Day 4 CD8+ T cells.
$ Bash example
bash # This script conceptually represents the application of pre-trained machine learning classifiers # to new, unseen data. The specific implementation (e.g., Python with scikit-learn, R with caret) # would depend on how the classifiers were originally trained. # Example installation for a Python-based scikit-learn environment: # conda create -n classifier_env python=3.9 scikit-learn pandas numpy -y # conda activate classifier_env # Define paths for the trained classifier models and the input data # These models would have been generated in a previous 'training' step. TRAINED_FATE_CLASSIFIER_MODEL="path/to/trained_fate_classifier_model.pkl" # e.g., a pickled scikit-learn model TRAINED_EARLY_STATE_CLASSIFIER_MODEL="path/to/trained_early_state_classifier_model.pkl" # e.g., a pickled scikit-learn model # Input data: Previously unseen intermediate Day 4 CD8+ T cells # This could be a CSV, TSV, H5AD (for single-cell data), or other format. INPUT_DAY4_CD8_T_CELLS_DATA="path/to/day4_cd8_t_cells_features.h5ad" # Example: AnnData object for single-cell features # Define output paths for the predictions OUTPUT_FATE_PREDICTIONS="path/to/day4_cd8_t_cells_fate_predictions.csv" OUTPUT_EARLY_STATE_PREDICTIONS="path/to/day4_cd8_t_cells_early_state_predictions.csv" # Execute a conceptual Python script to load the models and apply them to the new data. # This script would handle loading the input data, preprocessing if necessary, # making predictions with both classifiers, and saving the results. python apply_trained_classifiers.py \ --fate_model "${TRAINED_FATE_CLASSIFIER_MODEL}" \ --early_state_model "${TRAINED_EARLY_STATE_CLASSIFIER_MODEL}" \ --input_data "${INPUT_DAY4_CD8_T_CELLS_DATA}" \ --output_fate_predictions "${OUTPUT_FATE_PREDICTIONS}" \ --output_early_state_predictions "${OUTPUT_EARLY_STATE_PREDICTIONS}" -
28
Their predicted âmemory-nessâ scores were scatter-plotted and shown to correlate in Fig.
$ Bash example
# Install R if not already installed (example for Ubuntu/Debian) # sudo apt-get update # sudo apt-get install r-base # Create a dummy input file for demonstration purposes. # In a real scenario, 'memory_scores.csv' would be generated by a previous step # and contain the predicted 'memory-ness' scores for two sets of data to be correlated. # Assuming a CSV format with two columns, e.g., 'score_A' and 'score_B'. echo "score_A,score_B" > memory_scores.csv echo "0.1,0.15" >> memory_scores.csv echo "0.2,0.22" >> memory_scores.csv echo "0.3,0.35" >> memory_scores.csv echo "0.4,0.41" >> memory_scores.csv echo "0.5,0.53" >> memory_scores.csv echo "0.6,0.60" >> memory_scores.csv echo "0.7,0.75" >> memory_scores.csv echo "0.8,0.82" >> memory_scores.csv echo "0.9,0.90" >> memory_scores.csv # Create an R script to perform scatter plotting and correlation calculation cat << 'EOF' > plot_memory_scores.R # Read the input data data <- read.csv("memory_scores.csv") # Create a scatter plot and save it as a PNG file png("memory_scores_scatterplot.png", width = 800, height = 600) plot(data$score_A, data$score_B, xlab = "Predicted Memory-ness Score (Set A)", ylab = "Predicted Memory-ness Score (Set B)", main = "Correlation of Predicted Memory-ness Scores", pch = 16, col = "blue", cex = 1.2) # Calculate Pearson correlation coefficient correlation <- cor(data$score_A, data$score_B, method = "pearson") # Add correlation value to the plot legend("topleft", legend = paste("Pearson Correlation:", round(correlation, 3)), bty = "n", cex = 1.2) # Add a linear regression line to visualize the trend abline(lm(data$score_B ~ data$score_A), col = "red", lwd = 2) dev.off() # Print the correlation coefficient to standard output cat(paste("Pearson Correlation Coefficient:", correlation, "\n")) EOF # Execute the R script Rscript plot_memory_scores.R -
29
3f.
$ Bash example
# This code block assumes step 3f refers to eCLIP peak calling, as per the provided context and common eCLIP pipeline steps. # Parameters are set to common defaults for CLIPper in the absence of specific instructions. # Define input and output files IP_BAM="aligned_ip.bam" # Placeholder for aligned IP reads BAM file INPUT_BAM="aligned_input.bam" # Placeholder for aligned input control reads BAM file CHROM_SIZES="hg38.chrom.sizes" # Placeholder for genome chromosome sizes file (e.g., from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes) OUTPUT_BED="clipper_peaks.bed" # Output BED file for identified peaks # Example command for CLIPper peak calling # Assuming clipper.py is in the system's PATH or specified with its full path # Common parameters: -p for p-value threshold (default 0.01), -w for minimum peak width (default 20) clipper.py \ -s $CHROM_SIZES \ -i $IP_BAM \ -c $INPUT_BAM \ -p 0.01 \ -w 20 \ -o $OUTPUT_BED -
30
For each T cell, its âeffector-nessâ scores is 1 minus the âmemory-nessâ score and is redundant for this analysis.
Custom Script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This step calculates 'effector-ness' scores based on 'memory-ness' scores. # The description states this calculation is redundant for the current analysis, but the code block demonstrates how it would be performed. # Assuming an input file 'input_memory_ness_scores.tsv' with T cell IDs and their memory-ness scores. # The output will be 'output_effector_ness_scores.tsv'. # The input file is expected to be tab-separated and contain a column named 'MEMORY_NESS_SCORE'. # Python script to calculate effector-ness # effector_ness_calculator.py cat << 'EOF' > effector_ness_calculator.py import pandas as pd import sys if __name__ == "__main__": if len(sys.argv) != 3: print("Usage: python effector_ness_calculator.py <input_memory_ness_file> <output_effector_ness_file>") sys.exit(1) input_file = sys.argv[1] output_file = sys.argv[2] try: df = pd.read_csv(input_file, sep='\t') if 'MEMORY_NESS_SCORE' not in df.columns: raise ValueError("Input file must contain a 'MEMORY_NESS_SCORE' column.") # Calculate 'effector-ness' as 1 minus 'memory-ness' df['EFFECTOR_NESS_SCORE'] = 1 - df['MEMORY_NESS_SCORE'] # Output the results df.to_csv(output_file, sep='\t', index=False) print(f"Successfully calculated effector-ness scores and saved to {output_file}") except FileNotFoundError: print(f"Error: Input file '{input_file}' not found.") sys.exit(1) except ValueError as e: print(f"Error processing data: {e}") sys.exit(1) except Exception as e: print(f"An unexpected error occurred: {e}") sys.exit(1) EOF # Execute the Python script # Replace 'input_memory_ness_scores.tsv' with your actual input file path # Replace 'output_effector_ness_scores.tsv' with your desired output file path python effector_ness_calculator.py input_memory_ness_scores.tsv output_effector_ness_scores.tsv -
31
The signature genes for each classifier were selected from all G=22,425 genes by their GINI score.
$ Bash example
# Install necessary Python libraries if not already installed # pip install scikit-learn pandas numpy # Create a Python script to select signature genes based on GINI score. # This script simulates the process of training a classifier (e.g., RandomForest) # and extracting feature importances (GINI-based) from a total of G=22,425 genes. # It outputs a list of top N signature genes. cat << 'EOF' > select_signature_genes.py import pandas as pd import numpy as np from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split import argparse def main(): parser = argparse.ArgumentParser(description="Select signature genes based on GINI score from a classifier.") parser.add_argument("--total_genes", type=int, default=22425, help="Total number of genes (G) available for selection.") parser.add_argument("--num_samples", type=int, default=100, help="Number of samples (N) for the simulated dataset.") parser.add_argument("--top_n_genes", type=int, default=100, help="Number of top genes to select as signature genes.") parser.add_argument("--output_file", type=str, default="signature_genes_gini.tsv", help="Output file for the selected signature genes.") args = parser.parse_args() G = args.total_genes N_samples = args.num_samples num_signature_genes = args.top_n_genes np.random.seed(42) # Generate dummy gene names gene_names = [f'gene_{i+1}' for i in range(G)] # Simulate gene expression data (N_samples rows, G columns) # Introduce some signal for a few genes to make them 'important' X = np.random.rand(N_samples, G) * 10 X[:, 0] += np.random.rand(N_samples) * 5 # Make gene_1 important X[:, 100] -= np.random.rand(N_samples) * 5 # Make gene_101 important # Simulate binary labels based on the 'important' genes y = (X[:, 0] + X[:, 100] > np.median(X[:, 0] + X[:, 100])).astype(int) # Split data into training and testing sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.0, random_state=42) # Use all for training for simplicity # Train a RandomForestClassifier # RandomForest uses GINI impurity by default for splitting nodes and calculating feature importances. classifier = RandomForestClassifier(n_estimators=100, random_state=42) classifier.fit(X_train, y_train) # Get feature importances (GINI-based scores) gini_importances = classifier.feature_importances_ # Create a DataFrame to store gene names and their GINI scores feature_importance_df = pd.DataFrame({ 'gene': gene_names, 'gini_score': gini_importances }) # Sort genes by GINI score in descending order feature_importance_df = feature_importance_df.sort_values(by='gini_score', ascending=False) # Select the top N genes as signature genes signature_genes_df = feature_importance_df.head(num_signature_genes) print(f"Selected {num_signature_genes} signature genes based on GINI score from {G} total genes.") signature_genes_df.to_csv(args.output_file, sep="\t", index=False) if __name__ == "__main__": main() EOF # Execute the Python script to select signature genes. # The --total_genes parameter reflects the G=22,425 genes mentioned in the description. # The --top_n_genes parameter is an example of how many signature genes might be selected. python select_signature_genes.py --total_genes 22425 --top_n_genes 100 --output_file signature_genes_gini.tsv -
32
Temporal expression trajectories through inferred lineage paths.
$ Bash example
# This script demonstrates how to perform temporal expression trajectory analysis using Monocle 3. # It assumes you have an R environment set up with Monocle 3 installed. # --- Installation (commented out) --- # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install(version = "3.18") # Use your Bioconductor version # BiocManager::install(c('monocle3')) # install.packages("devtools") # devtools::install_github('cole-trapnell-lab/monocle3') # --- Execution --- # Assuming 'cds_object.rds' is a pre-processed CellDataSet object saved from a previous step. # This object typically contains normalized expression data, cell metadata, and gene metadata. Rscript -e ' library(monocle3) # Load pre-processed CellDataSet object # Replace "path/to/your/cds_object.rds" with the actual path to your input data. # This object should contain expression data, cell metadata, and gene metadata. # For demonstration, we will assume `cds` is already loaded or created from previous steps. # In a real pipeline, you would load your actual CDS object: # cds <- readRDS("path/to/your/cds_object.rds") # Placeholder for a loaded CDS object. In a real scenario, `cds` would be the output # of upstream steps like `preprocess_cds`, `reduce_dimension`, and `cluster_cells`. # For example, if starting from a count matrix: # expression_matrix <- as.matrix(read.csv("counts.csv", row.names = 1)) # cell_metadata <- read.csv("cell_metadata.csv", row.names = 1) # gene_metadata <- read.csv("gene_metadata.csv", row.names = 1) # cds <- new_cell_data_set(expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_metadata) # cds <- preprocess_cds(cds, num_dim = 100) # cds <- reduce_dimension(cds, reduction_method = "UMAP") # cds <- cluster_cells(cds) # For this example, we assume `cds` is a pre-processed and dimension-reduced CellDataSet object. # If you are running this as a standalone step, ensure `cds` is properly initialized. # Learn the trajectory graph cds <- learn_graph(cds) # Order cells in pseudotime # You might need to specify a root node for the trajectory based on biological knowledge. # This often involves selecting a cell or a cluster that represents the start of the lineage. # Example: root_cells <- cells_in_partition(cds)[which(colData(cds)$cell_type == "Stem_Cell")] # If no specific root cells are provided, Monocle will attempt to infer one. cds <- order_cells(cds) # Plot the trajectory # You can color cells by pseudotime, cell type, or other metadata plot_cells(cds, color_cells_by = "pseudotime", label_groups_by_cluster = FALSE, label_leaves = TRUE, label_branch_points = TRUE, graph_label_size = 1.5) # Save the updated CellDataSet object with pseudotime information saveRDS(cds, "monocle3_trajectory_cds.rds") # You can also extract pseudotime values: pseudotime_values <- pseudotime(cds) write.csv(as.data.frame(pseudotime_values), "cell_pseudotime.csv") # Further analysis: differential expression along pseudotime (optional) # gene_fits <- fit_models(cds, model_formula_str = "~pseudotime") # pseudotime_de_results <- differential_test(gene_fits, full_model_formula_str = "~pseudotime", # reduced_model_formula_str = "~1") # write.csv(pseudotime_de_results, "pseudotime_de_genes.csv") ' -
33
To understand the temporal dynamics of expression for key genes along the effector and memory lineages, we constructed hypothetical differentiation time-courses for each lineage.
$ Bash example
# Install R if not already present (example for Debian/Ubuntu) # sudo apt-get update # sudo apt-get install r-base # Create a placeholder R script for demonstration purposes. # In a real scenario, this script would contain the actual logic # for modeling gene expression over time and constructing hypothetical time-courses. cat << 'EOF' > construct_time_courses.R #!/usr/bin/env Rscript # This is a placeholder R script to demonstrate the execution of a custom analysis. # In a real scenario, this script would contain the logic to: # 1. Load gene expression data (e.g., normalized counts, TPMs). # 2. Load sample metadata including lineage and time point information. # 3. For each gene and lineage, model the expression changes over time. # This could involve spline fitting, LOESS regression, or other time-series analysis methods. # 4. Generate "hypothetical differentiation time-courses" by predicting expression # at finely spaced time points along the observed differentiation trajectory. # 5. Save the results (e.g., predicted expression values, fitted models, plots). # Example usage: Rscript construct_time_courses.R <expression_file> <metadata_file> <output_prefix> args <- commandArgs(trailingOnly = TRUE) if (length(args) < 3) { stop("Usage: Rscript construct_time_courses.R <expression_file> <metadata_file> <output_prefix>", call. = FALSE) } expression_file <- args[1] metadata_file <- args[2] output_prefix <- args[3] message(paste0("Simulating construction of time-courses using:\n", " Expression file: ", expression_file, "\n", " Metadata file: ", metadata_file, "\n", " Output prefix: ", output_prefix)) # Simulate creating an output file output_file <- paste0(output_prefix, "_hypothetical_time_courses_summary.txt") writeLines(c( "# Hypothetical Differentiation Time-Courses Summary", paste0("Analysis performed on expression data from: ", expression_file), paste0("Using metadata from: ", metadata_file), "This file represents a summary of the constructed hypothetical differentiation time-courses.", "Actual output would include fitted curves, predicted expression values, or visualizations." ), output_file) message(paste0("Simulated output saved to: ", output_file)) EOF # Make the R script executable chmod +x construct_time_courses.R # Create dummy input files for demonstration # gene_expression_data.tsv: Contains normalized gene expression values (e.g., TPMs, counts) # Rows are genes, columns are samples. # sample_metadata.tsv: Contains metadata for each sample, including lineage and time point. echo "gene_id\tsample1\tsample2\tsample3\tsample4" > gene_expression_data.tsv echo "geneA\t10.5\t20.1\t30.8\t40.2" >> gene_expression_data.tsv echo "geneB\t50.3\t45.7\t40.1\t35.9" >> gene_expression_data.tsv echo "geneC\t5.2\t15.8\t25.1\t35.5" >> gene_expression_data.tsv echo "sample_id\tlineage\ttime_point" > sample_metadata.tsv echo "sample1\teffector\t0" >> sample_metadata.tsv echo "sample2\teffector\t1" >> sample_metadata.tsv echo "sample3\tmemory\t0" >> sample_metadata.tsv echo "sample4\tmemory\t1" >> sample_metadata.tsv # Execute the R script to construct hypothetical differentiation time-courses # Input: gene_expression_data.tsv (gene expression matrix) # sample_metadata.tsv (sample annotations) # Output prefix: lineage_time_course_analysis (output files will be prefixed with this) ./construct_time_courses.R gene_expression_data.tsv sample_metadata.tsv lineage_time_course_analysis -
34
Briefly, we sampled with replacement 50 cells from each population and constructed all trajectories through the cross-product of populations ordered in a particular lineage.
Single-cell trajectory construction and robustness analysis (Inferred with models/gemini-2.5-flash) vN/A GitHub$ Bash example
# This step describes a conceptual methodology for single-cell trajectory inference, which involves sampling cells and constructing trajectories based on population cross-products. # Without a specific software tool mentioned (e.g., Monocle, Slingshot, PAGA, etc.), it is not possible to provide a concrete bash command. # # The description implies the following conceptual steps, which would typically be implemented in a scripting language like R or Python: # # 1. **Sampling**: For each biological population, randomly sample 50 cells with replacement. # Example (conceptual pseudocode, not executable bash): # # sampled_cells_per_population = {} # # for population_id in all_population_ids: # # cells_in_population = get_cells_for_population(population_id) # # sampled_cells_per_population[population_id] = random_sample_with_replacement(cells_in_population, num_cells=50) # # 2. **Trajectory Construction (Cross-product)**: Construct all possible ordered trajectories (sequences of populations) that adhere to a predefined lineage structure. # This involves generating combinations or permutations of populations based on the lineage graph. # Example (conceptual pseudocode, not executable bash): # # all_lineage_paths = generate_all_ordered_lineage_paths(population_graph, lineage_constraints) # # 3. **Trajectory Inference**: For each constructed lineage path, combine the corresponding sampled cells and perform a trajectory inference analysis. # This is where a specific trajectory inference tool would be applied. # Example (conceptual pseudocode, not executable bash): # # for lineage_path in all_lineage_paths: # # combined_data_for_path = combine_data_from_sampled_cells(sampled_cells_per_population, lineage_path) # # # Execute a specific trajectory inference tool (e.g., Monocle3, Slingshot, PAGA) on combined_data_for_path # # # e.g., run_monocle3(combined_data_for_path, output_prefix=f"trajectory_{lineage_path}") # # This is a high-level description of a method, not a specific tool's execution command. -
35
These orders were determined a priori based on our earlier work with similar timecourses of RT-qPCR study (Arsenio et. al.
Statistical Analysis Software (Inferred with models/gemini-2.5-flash) vN/A -
36
Nature Immunology 2014).
$ Bash example
# No specific bioinformatics step or tool described in "Nature Immunology 2014)". # Therefore, no execution command or parameters can be inferred.
-
37
Specifically, the âeffector lineageâ starts from the naïve population, and progresses through the Div1TE subpopulation, then onto Day 4, and finally Day 7.
(Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash)$ Bash example
No specific bioinformatics tool or computational step is described. This description outlines a biological progression or experimental design, not a command-line execution.
-
38
In contrast, the âeffector memoryâ and âcentral memory lineagesâ start from naïve, through the Div1MEM subpopulation, ending with TEM and TCM respectively.
(Inferred with models/gemini-2.5-flash) vN/A -
39
These bootstrapped trajectories were visually summarized by a seaborn timeseries plot which links the average expression for each population sample with a solid line segment and represents the 95% confidence interval by a shaded area around it.
$ Bash example
# Install seaborn and its dependencies if not already installed # conda install -c conda-forge seaborn pandas numpy matplotlib # Create a Python script for plotting cat << 'EOF' > plot_timeseries.py import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt # --- Simulate input data --- # In a real pipeline, this DataFrame would be loaded from a file (e.g., CSV, Parquet) # generated by the preceding bootstrapping step. It should contain columns for # time points, population samples, and individual bootstrapped expression values. # For demonstration, we simulate it here. np.random.seed(42) num_time_points = 10 num_populations = 3 num_bootstrap_samples_per_point = 100 # Number of bootstrap samples per time point and population data = [] for t in range(num_time_points): for pop_idx in range(num_populations): population_name = f'Population_{chr(65 + pop_idx)}' # Simulate expression values for bootstrap samples # Mean expression changes over time and differs by population mean_expr = 5 + t * 0.5 + pop_idx * 1.5 bootstrap_expressions = np.random.normal(loc=mean_expr, scale=1.0, size=num_bootstrap_samples_per_point) for expr in bootstrap_expressions: data.append({'time': t, 'population_sample': population_name, 'expression': expr}) df = pd.DataFrame(data) # --- Generate the seaborn timeseries plot --- plt.figure(figsize=(10, 6)) sns.lineplot( data=df, x='time', y='expression', hue='population_sample', errorbar=('ci', 95), # Calculate and display 95% confidence interval marker='o' # Optional: add markers for clarity ) plt.title('Average Expression Over Time with 95% Confidence Interval') plt.xlabel('Time Point') plt.ylabel('Expression Level') plt.grid(True, linestyle='--', alpha=0.7) plt.legend(title='Population Sample') plt.tight_layout() # Save the plot to a file output_plot_file = "timeseries_expression_plot.png" plt.savefig(output_plot_file) print(f"Plot saved to {output_plot_file}") EOF # Execute the Python script to generate the plot python plot_timeseries.py
Raw Source Text
Basecalls performed using CASAVA version 1.4 ChIP-seq reads were aligned to the mm10 genome assembly using STAR version 2.4.1b with the following options --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --limitOutSJoneRead --outFilterScoreMin 3 --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 --outFilterMismatchNoverLmax 0.3 and the other parameter kept as default BAM files of sample1-11 are downsampled, if needed, using samtools to ensure sample 1-4, sample 5-8, sample 9-11 have same read numbers. for sample 1-4: peaks were called using MACS version 1.4.2 using sample 5-8 as backgournd, with the following setting -g mm --nomodel --nolambda and others as default. for sample 9-11: tag directories were first generated by HOMER makeTagDirectory with -keepOne -tbp 1 and other options as default. for sample 9 and 11: peakd were call by HOMER findpeak using sample 12 as background and -style factor -size 100 -fragLength 50 with other options as default for sample 10: peakd were call by HOMER findpeak using sample 12 as background and -style histone -fragLength 50 -size 200 -minDist 1000 -L 0 with other options as default for sample 13-16: transcript count were generated by kallisto version 0.42.4 using GENCODE GRCm38.p4 transcriptome as reference, with the following parameter: -l 200 -s 20 --single and other setting as default For all Single Cell Samples the following processing steps apply: Single-cell RNA-seq data pre-processing. Single-cell mRNA sequencing data from 256 murine CD8+ T cells were processed with a bioinformatics pipeline focusing on quality control (QC) and robust expression quantification. For each cell, raw RNA-seq reads were: checked for quality metrics with fastqc (v0.10.1); poly-A and adaptor-trimmed with cutadapt (v1.8.1); quantified by kallisto (v0.42.1) to a reference transcriptome (Gencode vM3) without bias correction; and aligned by STAR (v2.4.1b) to the reference mouse genome (mm10) with default parameters for quality control and downstream analysis. Next, the transcript per million (TPM) outputs of kallisto for all cells were combined into a cell-by-gene expression matrix (C=288 cells=rows, G=22425 genes=columns) by summing the expression values for all quantified transcripts of a given gene. Finally, the TPM value for each cell c and gene g was natural log-transformed to yield a normalized expression value: EXPRc,g = ln(1+TPMc,g). Dimensionality reduction and cell heterogeneity visualization. To reduce the dimensionality of the cell-by-gene expression matrix EXPR and visualize the diversity of gene expression among CD8+ T cells of different subtypes in a 2-dimensional scatter plot, we applied the t-distributed Stochastic Neighborhood Embedding (tSNE) algorithm via its Barnes-Hut approximation (bhSNE). We first applied standard Principal Components Analysis (PCA) to reduce the dimensionality down to D=10, and only then applied bhSNE to visualize in D=2 (with perplexity=30 and theta=0.75 parameters). Differential gene expression analysis. We performed differential gene expression analysis between all pairs of T cell sub-populations from two non-overlapping sets of rows in the log-transformed expression matrix EXPR. Since single-cell gene expression does not conform to the usual negative binomial distribution and can even be bimodal due to dropout we used two non-parametric statistical tests for heterogeneity of expression: Mann Whitney Wilcoxon (MWW, also known as MWU) which is a rank-sum test but relies on a large sample to approximate normality, and Kolmogorov-Smirnov 2-sample (KS2) test which finds the largest difference between the empirical cumulative distributions, even between two small samples such as our 1st division sub-types Div1TE (n =36) and Div1MEM (n =24). Cell type classifier. We trained two binary T cell classifiers to identify gene expression signatures that not only differentiate the examined T cell sub-populations (like the differential gene expression described above) but can also be used to predict the âmemory-â or âeffector-nessâ of previously unseen cells. Each classifier constructed an independent ensemble of Extremely Randomized Trees. Using the terminally differentiated effector and memory (TCM, TEM) populations, we built a training set for a fate classifier for CD8+ T cells. Using the newly observed segregation of daughter T cells into Div1TE and Div1MEM subpopulations after the first division, we built a second training set for another early state classifier. Both classifiers were fed their respective training sets using 10-fold cross-validation After both the fate and early state classifiers were trained on their respective subpopulations, they were both applied on previously unseen intermediate Day 4 CD8+ T cells. Their predicted âmemory-nessâ scores were scatter-plotted and shown to correlate in Fig. 3f. For each T cell, its âeffector-nessâ scores is 1 minus the âmemory-nessâ score and is redundant for this analysis. The signature genes for each classifier were selected from all G=22,425 genes by their GINI score. Temporal expression trajectories through inferred lineage paths. To understand the temporal dynamics of expression for key genes along the effector and memory lineages, we constructed hypothetical differentiation time-courses for each lineage. Briefly, we sampled with replacement 50 cells from each population and constructed all trajectories through the cross-product of populations ordered in a particular lineage. These orders were determined a priori based on our earlier work with similar timecourses of RT-qPCR study (Arsenio et. al. Nature Immunology 2014). Specifically, the âeffector lineageâ starts from the naïve population, and progresses through the Div1TE subpopulation, then onto Day 4, and finally Day 7. In contrast, the âeffector memoryâ and âcentral memory lineagesâ start from naïve, through the Div1MEM subpopulation, ending with TEM and TCM respectively. These bootstrapped trajectories were visually summarized by a seaborn timeseries plot which links the average expression for each population sample with a solid line segment and represents the 95% confidence interval by a shaded area around it. Genome_build: mm10 Supplementary_files_format_and_content: peak BED files generated from MACS and HOMER. Transcript Expression CSV tables generated from kallisto.