GSE213470 Processing Pipeline

RNA-Seq code_examples 2 steps

Publication

Early transcriptional and epigenetic divergence of CD8+ T cells responding to acute versus chronic infection.

PLoS biology (2023) — PMID 36716323

Dataset

Gene expression profiles at the single cell level of CD8+ T cells from the spleens of LCMV-Armstrong or LCMV-Clone 13 infected mice

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Alignment, cell barcode processing, umis processing, abundance measurements: cellranger count (version 2.1.1)

Cell Ranger v2.1.1

$ Bash example

# Download and install Cell Ranger from 10x Genomics website (version 2.1.1 or compatible)
# https://www.10xgenomics.com/support/software/cell-ranger/downloads

# Define variables for clarity
OUTPUT_ID="my_cellranger_run"
TRANSCRIPTOME_REF="/path/to/refdata-gex-GRCh38-2020-A" # Placeholder: Use a Cell Ranger-compatible transcriptome reference (e.g., human GRCh38)
FASTQ_DIR="/path/to/your/fastq_files" # Directory containing your FASTQ files
SAMPLE_NAME="my_sample" # Name of your sample (matches FASTQ file prefix, e.g., my_sample_S1_L001_R1_001.fastq.gz)
EXPECTED_CELLS=3000 # Optional: Adjust based on your experiment's expected cell count

# Execute cellranger count
cellranger count \
    --id="${OUTPUT_ID}" \
    --transcriptome="${TRANSCRIPTOME_REF}" \
    --fastqs="${FASTQ_DIR}" \
    --sample="${SAMPLE_NAME}" \
    --expect-cells="${EXPECTED_CELLS}" \
    --localcores=8 \
    --localmem=64

Cells and genes filtering, dimension reduction: Scanpy (version 1.7.2)

Scanpy v1.7.2 GitHub

$ Bash example

# Install Scanpy and its dependencies (if not already installed)
# conda create -n scanpy_env python=3.8
# conda activate scanpy_env
# conda install -c conda-forge scanpy python-igraph leidenalg

# Create a Python script for Scanpy operations
cat << 'EOF' > run_scanpy_processing.py
import scanpy as sc
import numpy as np
import pandas as pd

# Set verbosity for scanpy
sc.settings.verbosity = 3             # errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

# --- Configuration --- #
INPUT_H5AD = "input_raw_counts.h5ad" # Replace with your actual input file
OUTPUT_H5AD = "output_processed.h5ad"

# --- Load Data ---
try:
    adata = sc.read_h5ad(INPUT_H5AD)
except FileNotFoundError:
    print(f"Error: Input file {INPUT_H5AD} not found. Creating a dummy AnnData object for demonstration.")
    # Create a dummy AnnData object for demonstration purposes
    n_cells = 1000
    n_genes = 2000
    counts = np.random.randint(0, 100, size=(n_cells, n_genes))
    gene_names = [f'gene_{i}' for i in range(n_genes)]
    cell_names = [f'cell_{i}' for i in range(n_cells)]
    adata = sc.AnnData(counts, obs=pd.DataFrame(index=cell_names), var=pd.DataFrame(index=gene_names))
    adata.var['mt'] = adata.var_names.str.startswith('MT-') # Dummy mitochondrial genes

print(f"Initial AnnData object: {adata}")

# --- Quality Control and Filtering ---
# Calculate QC metrics
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Filter cells based on number of genes and counts
# Example thresholds: min_genes=200, max_genes=2500, max_counts=10000, max_mt_percent=5
print("Filtering cells...")
initial_cells = adata.n_obs
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.total_counts < 10000, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]
print(f"Filtered {initial_cells - adata.n_obs} cells. Remaining cells: {adata.n_obs}")

# Filter genes based on number of cells
print("Filtering genes...")
initial_genes = adata.n_vars
sc.pp.filter_genes(adata, min_cells=3)
print(f"Filtered {initial_genes - adata.n_vars} genes. Remaining genes: {adata.n_vars}")

# --- Normalization and Log-transformation ---
print("Normalizing and log-transforming data...")
sc.pp.normalize_total(adata, target_sum=1e4) # Normalize each cell's total counts to 10,000
sc.pp.log1p(adata) # Log-transform the data

# --- Feature Selection (Highly Variable Genes) ---
print("Identifying highly variable genes...")
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

# Subset to highly variable genes for dimension reduction
adata = adata[:, adata.var.highly_variable]
print(f"Number of highly variable genes: {adata.n_vars}")

# --- Scaling ---
print("Scaling data...")
sc.pp.scale(adata, max_value=10) # Scale data to unit variance, clip values above 10

# --- Dimension Reduction ---
print("Performing PCA...")
sc.tl.pca(adata, svd_solver='arpack')

print("Computing neighbors graph...")
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40) # Use 40 principal components for neighbor graph

print("Performing UMAP...")
sc.tl.umap(adata) # Uniform Manifold Approximation and Projection

# --- Save Processed Data ---
print(f"Saving processed AnnData object to {OUTPUT_H5AD}")
adata.write(OUTPUT_H5AD)
print("Processing complete.")
EOF

# Execute the Python script
python run_scanpy_processing.py

View on GitHub

Tools Used

Scanpy

Raw Source Text

Alignment, cell barcode processing, umis processing, abundance measurements: cellranger count (version 2.1.1)
Cells and genes filtering, dimension reduction: Scanpy (version 1.7.2)
Assembly: mm10
Supplementary files format and content: Tab-separated values files and matrix files

← Back to Analysis