GSE280895 Processing Pipeline


Publication

Tissue-resident memory CD8 T cell diversity is spatiotemporally imprinted.

Nature (2025) — PMID 39843748

Dataset

Tissue-resident memory CD8 T Cell Diversity is Spatiotemporally Imprinted

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Cell nuclei were segmented using Cellpose 2.0 (Pachitariu et al. 2022).

Cellpose v2.0

$ Bash example

# Install Cellpose 2.x (e.g., via pip)
# pip install "cellpose<3"

# Example command for segmenting cell nuclei with the Cellpose 2.x command-line interface.
# Replace 'input_images/' with the folder holding your nuclear-stain images and 'output_directory' with your desired output folder.
# '--pretrained_model nuclei' selects the built-in nuclei model, '--chan 0' treats the images as grayscale,
# and '--diameter 0' asks Cellpose to estimate the nucleus diameter for each image.
python -m cellpose --dir input_images/ --pretrained_model nuclei --chan 0 --diameter 0 --save_tif --savedir output_directory
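After segmentation, a quick sanity check is to look at the pixel area of each labelled nucleus. The following is a minimal sketch in plain Python, assuming the mask is a 2D grid of integer labels in which 0 is background and each nucleus has its own positive label. The function names `nucleus_areas` and `filter_small`, the toy mask, and the `min_area` cutoff are hypothetical and not taken from the publication.

```python
from collections import Counter


def nucleus_areas(mask):
    """Count pixels per nucleus label in a 2D label mask (0 = background)."""
    areas = Counter()
    for row in mask:
        for label in row:
            if label != 0:
                areas[label] += 1
    return dict(areas)


def filter_small(areas, min_area):
    """Keep labels whose pixel area is at least min_area (hypothetical cutoff)."""
    return {label: area for label, area in areas.items() if area >= min_area}


# Toy 4x5 label mask with three nuclei (labels 1, 2, 3); values are illustrative.
toy_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 0],
    [3, 3, 3, 0, 0],
]

areas = nucleus_areas(toy_mask)
kept = filter_small(areas, min_area=3)
print(areas)
print(kept)
```

Very small objects are often debris or split nuclei, so a cutoff like this is usually chosen by inspecting the area distribution rather than fixed in advance.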

Skipper (Inferred with models/gemini-2.5-flash), main branch (Inferred with models/gemini-2.5-flash)

$ Bash example

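# Clone the Skipper repository
# (Skipper is an eCLIP analysis workflow; it is not named in the raw source text for this dataset.)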
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create a sample configuration file (e.g., samples.tsv)
# This file should list your input FASTQ files and sample names.
# Example samples.tsv:
# sample_name\tfastq_replicate1\tfastq_replicate2
# CLIP_sample1\tdata/CLIP_R1.fastq.gz\tdata/CLIP_R2.fastq.gz
# INPUT_sample1\tdata/INPUT_R1.fastq.gz\tdata/INPUT_R2.fastq.gz

# Create a configuration file (e.g., config.yaml)
# This file specifies genome build, paths, and other parameters.
# Example config.yaml (adjust paths and parameters as needed):
# GENOME_BUILD: hg38
# GENOME_FASTA: /path/to/genome/hg38.fa
# GENOME_GTF: /path/to/genome/hg38.gtf
# STAR_INDEX: /path/to/star_index/hg38
# SAMPLES: samples.tsv
# OUTPUT_DIR: results

# Ensure Snakemake and Conda are installed
# conda install -c bioconda -c conda-forge snakemake

# Execute the Skipper workflow
# Replace 'config.yaml' and 'samples.tsv' with your actual file paths.
# Adjust --cores based on available resources.
# The workflow will automatically manage software dependencies via Conda environments.
snakemake --use-conda --cores 8 --configfile config.yaml --config SAMPLES=samples.tsv GENOME_BUILD=hg38


Cell boundaries were predicted from the nuclear staining and transcript positions using Baysor (Petukhov et al. 2021).

Baysor (version not specified)

$ Bash example

# Baysor is a Julia package, not a pip package: install a release binary or use the Docker image
# (see the installation notes in the Baysor README).

# Assumed inputs (hypothetical file and column names):
# transcript_positions.csv: one row per transcript with columns x, y, gene_id
# nuclei_masks.tif: integer label image from the nuclear segmentation, used as the prior

# Predict cell boundaries from transcript positions, with the nuclear segmentation as a prior.
# The values for -m (minimum molecules per cell) and --prior-segmentation-confidence are illustrative and should be tuned.
baysor run \
    -x x -y y -g gene_id \
    -m 30 \
    --prior-segmentation-confidence 0.5 \
    -o baysor_cell_boundaries_output \
    transcript_positions.csv nuclei_masks.tif
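To illustrate what a nuclear prior contributes, the sketch below labels each transcript with the nucleus found at its pixel position and leaves everything else unassigned. This is a naive nucleus-overlap baseline, not Baysor's own algorithm, which goes further and assigns cytoplasmic transcripts as well. The function name `assign_to_nuclei`, the column names, and the toy data are assumptions made for illustration.

```python
import math


def assign_to_nuclei(transcripts, mask):
    """Label each transcript with the nucleus ID at its (x, y) pixel; 0 = unassigned."""
    height, width = len(mask), len(mask[0])
    assigned = []
    for t in transcripts:
        # Floor (not truncate) so that negative coordinates fall outside the image.
        col, row = math.floor(t["x"]), math.floor(t["y"])
        inside = 0 <= row < height and 0 <= col < width
        label = mask[row][col] if inside else 0
        assigned.append({**t, "nucleus": label})
    return assigned


# Toy 3x4 label mask (rows are y, columns are x) with two nuclei.
toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
]

# Toy transcript table; gene names and coordinates are illustrative only.
toy_transcripts = [
    {"x": 0.4, "y": 1.2, "gene": "Cd8a"},   # inside nucleus 1
    {"x": 3.7, "y": 2.1, "gene": "Itgae"},  # inside nucleus 2
    {"x": 2.5, "y": 0.5, "gene": "Cd8a"},   # background pixel
    {"x": 9.0, "y": 9.0, "gene": "Klf2"},   # outside the image
]

assigned = assign_to_nuclei(toy_transcripts, toy_mask)
for t in assigned:
    print(t["gene"], t["nucleus"])
```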


Experiments were integrated using scVI (Lopez et al. 2018).

scVI v0.x

$ Bash example

# Install scvi-tools and its dependencies
# pip install scvi-tools anndata pandas numpy

# Create a Python script for scVI integration
cat << 'EOF' > run_scvi_integration.py
import scvi
import anndata as ad
import pandas as pd
import numpy as np

# --- Configuration ---
INPUT_H5AD="input_data.h5ad" # Path to your input AnnData object (e.g., containing raw counts and batch information)
OUTPUT_H5AD="integrated_output.h5ad" # Path for the output AnnData object with integrated embeddings
BATCH_KEY="batch_id" # Key in adata.obs for batch information (e.g., 'batch', 'sample_id', 'experiment')
LAYER_KEY="counts" # Layer in adata.layers containing raw counts (if applicable, otherwise set to None if .X contains raw counts)

# --- Load Data ---
print(f"Loading data from {INPUT_H5AD}...")
try:
    adata = ad.read_h5ad(INPUT_H5AD)
except FileNotFoundError:
    # Raising SystemExit prints the message to stderr and exits with status 1 without relying on the interactive exit() helper.
    raise SystemExit(f"Error: Input file '{INPUT_H5AD}' not found. Please ensure the file exists and is correctly specified.")

print("Data loaded successfully.")
print(f"AnnData object shape: {adata.shape}")

# --- Setup AnnData for scVI ---
# Register the count layer and batch column with scVI before building the model.
# Recent scvi-tools releases do this through the model classmethod rather than scvi.data.setup_anndata.
# 'layer' names the raw-count layer; None means .X holds the raw counts.
# 'batch_key' is crucial for integrating multiple experiments and must be a column in adata.obs.
layer = LAYER_KEY if LAYER_KEY in adata.layers else None
print(f"Setting up AnnData for scVI with batch_key='{BATCH_KEY}' and layer={layer!r}...")
if BATCH_KEY not in adata.obs.columns:
    print(f"Warning: '{BATCH_KEY}' not found in adata.obs. Proceeding without batch correction, so experiments will not be integrated.")
    scvi.model.SCVI.setup_anndata(adata, layer=layer)
else:
    scvi.model.SCVI.setup_anndata(adata, layer=layer, batch_key=BATCH_KEY)

# --- Initialize and Train scVI Model ---
print("Initializing scVI model...")
# You can customize model parameters (e.g., n_latent, n_hidden, n_layers)
# based on your dataset size and complexity. Default values are often a good starting point.
model = scvi.model.SCVI(adata, n_latent=30, n_hidden=128, n_layers=2)

print("Training scVI model (this may take a while, depending on data size and hardware)...")
# Adjust max_epochs and other training parameters as needed. Early stopping helps prevent overfitting.
model.train(max_epochs=400, early_stopping=True, early_stopping_patience=30)

# --- Get Latent Representation ---
print("Extracting latent representation (integrated embeddings)...")
latent_representation = model.get_latent_representation()
adata.obsm["X_scVI"] = latent_representation # Store the integrated embeddings in .obsm

# --- Save Integrated Data ---
print(f"Saving integrated AnnData object to {OUTPUT_H5AD}...")
adata.write_h5ad(OUTPUT_H5AD)

print("scVI integration complete. The output file contains the original data along with 'X_scVI' in .obsm for integrated embeddings.")
EOF

# Execute the Python script
python run_scvi_integration.py

Cell types were assigned manually, and the processed object is stored as an AnnData object in h5ad format containing both metadata and raw gene expression counts (Wolf et al. 2018).
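In practice, manual assignment often comes down to a curated lookup from cluster ID to cell type label. The following is a minimal sketch, assuming clusters were computed beforehand. The cluster IDs, the labels, and the function names `annotate` and `unmapped_clusters` are hypothetical and are not the annotations used in the publication.

```python
def annotate(cluster_per_cell, cluster_to_type, default="Unassigned"):
    """Map each cell's cluster ID to a curated label, falling back to a default."""
    return [cluster_to_type.get(cluster, default) for cluster in cluster_per_cell]


def unmapped_clusters(cluster_per_cell, cluster_to_type):
    """Return the cluster IDs that still lack a curated label, sorted."""
    return sorted(set(cluster_per_cell) - set(cluster_to_type))


# Hypothetical curated mapping produced by inspecting marker genes per cluster.
cluster_to_type = {
    "0": "CD8 T cell",
    "1": "Epithelial",
    "2": "Fibroblast",
}

# Hypothetical cluster assignment for four cells; cluster "3" has no label yet.
clusters = ["0", "1", "0", "3"]

labels = annotate(clusters, cluster_to_type)
missing = unmapped_clusters(clusters, cluster_to_type)
print(labels)
print(missing)
```

With an AnnData object, the same lookup could be written with pandas `Series.map` on a cluster column in `adata.obs`; the name of that column depends on how clustering was run and is an assumption here.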

scanpy (Inferred with models/gemini-2.5-flash) v1.x.x (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install scanpy and dependencies if not already installed
# conda create -n scanpy_env python=3.9
# conda activate scanpy_env
# pip install scanpy pandas

# Define input and output filenames
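# 'export' is required so that os.getenv in the Python snippet below can see these values.
export INPUT_H5AD="input_raw_data.h5ad" # Placeholder for the AnnData object before manual assignment
export OUTPUT_H5AD="output_processed_data_with_celltypes.h5ad" # The AnnData object after manual assignment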

# Execute a Python script to load an Anndata object,
# simulate the addition of manual cell type assignments, and save it.
# The actual manual assignment process is typically interactive and not a command-line tool.
# This script represents the final step of saving the annotated object.
python -c "
import scanpy as sc
import pandas as pd
import numpy as np
import os

input_file = os.getenv('INPUT_H5AD', 'input_raw_data.h5ad')
output_file = os.getenv('OUTPUT_H5AD', 'output_processed_data_with_celltypes.h5ad')

# --- Placeholder for loading the Anndata object ---
# In a real scenario, 'input_file' would be the result of previous processing steps.
# For this example, we'll create a dummy Anndata object if the input doesn't exist.
if not os.path.exists(input_file):
    print(f'Warning: Input file {input_file} not found. Creating a dummy Anndata object for demonstration.')
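    # Poisson draws stand in for raw integer counts (uniform floats would not resemble count data).
    adata = sc.AnnData(np.random.poisson(1.0, size=(100, 50)).astype(np.float32),
                       obs=pd.DataFrame(index=[f'cell_{i}' for i in range(100)]),
                       var=pd.DataFrame(index=[f'gene_{i}' for i in range(50)]))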
    # Save the dummy object as the input for this step
    adata.write(input_file)
    print(f'Dummy input Anndata object saved to {input_file}')
else:
    print(f'Loading Anndata object from {input_file}')
    adata = sc.read_h5ad(input_file)

# --- Simulate manual cell type assignment ---
# The description states 'Celltypes were assigned manually'.
# This part represents adding the results of that manual assignment to adata.obs.
if 'manual_cell_type' not in adata.obs.columns:
    num_cells = adata.n_obs
    # Hypothetical placeholder labels; replace them with the labels from your own manual annotation.
    cell_types = ['type_A', 'type_B', 'type_C', 'type_D']
    # Assign random labels for demonstration purposes only
    adata.obs['manual_cell_type'] = np.random.choice(cell_types, num_cells)
    print(f\"Added a 'manual_cell_type' column with {len(cell_types)} types (simulated manual assignment).\" )
else:
    print(\"'manual_cell_type' column already exists. Assuming it contains manual assignments.\" )

# --- Store the processed object as an Anndata object in h5ad format ---
print(f'Saving processed Anndata object to {output_file}')
adata.write(output_file)
print('Operation complete.')
"

Raw detected RNA counts per cell

Salmon (Inferred with models/gemini-2.5-flash) v1.10.2 (Inferred)

$ Bash example

# Install Salmon (if not already installed)
# conda install -c bioconda salmon

# --- Reference Data Setup ---
# Download human transcriptome (e.g., Ensembl GRCh38 cDNA)
# wget -O Homo_sapiens.GRCh38.cdna.all.fa.gz http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
# gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz

# Build Salmon index (run once for the reference transcriptome)
# salmon index -t Homo_sapiens.GRCh38.cdna.all.fa -i salmon_grch38_index

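# --- Quantification for a single cell (example) ---
# This command quantifies transcript expression from raw RNA-seq reads for a single cell.
# It assumes sequencing reads as input; the raw source text describes segmented transcript positions and does not name Salmon.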
# The output directory 'cell1_salmon_quant' will contain 'quant.sf' with transcript-level counts.
# Assuming input FASTQ files for a cell are cell1_R1.fastq.gz and cell1_R2.fastq.gz
# And the Salmon index 'salmon_grch38_index' has been built from a reference transcriptome (e.g., GRCh38).

salmon quant \
    -i salmon_grch38_index \
    -l A \
    -1 cell1_R1.fastq.gz \
    -2 cell1_R2.fastq.gz \
    --validateMappings \
    -o cell1_salmon_quant
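For data built from transcript positions, "raw detected RNA counts per cell" can also be read as the number of detected transcripts of each gene assigned to each segmented cell. The sketch below tallies such a cell-by-gene matrix in plain Python, assuming a list of (cell ID, gene) assignments in which cell ID 0 means unassigned. The function name `count_matrix` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def count_matrix(assignments):
    """Build a cell-by-gene count matrix from (cell_id, gene) pairs; cell 0 is skipped."""
    counts = defaultdict(Counter)
    for cell, gene in assignments:
        if cell != 0:
            counts[cell][gene] += 1
    cells = sorted(counts)
    genes = sorted({gene for per_cell in counts.values() for gene in per_cell})
    # Counter returns 0 for genes that were never detected in a cell.
    matrix = [[counts[cell][gene] for gene in genes] for cell in cells]
    return cells, genes, matrix


# Toy assignments; gene names and cell IDs are illustrative only.
toy_assignments = [
    (1, "Cd8a"),
    (1, "Cd8a"),
    (1, "Itgae"),
    (2, "Itgae"),
    (0, "Cd8a"),  # unassigned transcript, ignored
    (2, "Klf2"),
]

cells, genes, matrix = count_matrix(toy_assignments)
print(cells)
print(genes)
print(matrix)
```

Rows follow the sorted cell IDs and columns the sorted gene names, which mirrors the layout of the count matrix in an AnnData object (cells by genes).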


Raw Source Text

Cell nuclei were segmented using Cellpose 2.0 (Pachitariu et al. 2022). Cell boundaries were predicted using the nuclear staining and transcript positions using Baysor (Petukhov et al. 2021). Experiments were integrated using scVI (Lopez et al 2018). Celltypes were assigned manually, and the processed object is stored as an Anndata object in h5ad format containing both metadata and raw gene expression counts (Wolf et al 2018).
Raw detected RNA counts per cell
