GSE213470 Processing Pipeline

RNA-Seq code_examples 2 steps

Publication

Early transcriptional and epigenetic divergence of CD8+ T cells responding to acute versus chronic infection.

PLoS biology (2023) — PMID 36716323

Dataset

GSE213470

Gene expression profiles at the single cell level of CD8+ T cells from the spleens of LCMV-Armstrong or LCMV-Clone 13 infected mice

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Alignment, cell barcode processing, umis processing, abundance measurements: cellranger count (version 2.1.1)

    Cell Ranger v2.1.1
    $ Bash example
    # Download and install Cell Ranger from 10x Genomics website (version 2.1.1 or compatible)
    # https://www.10xgenomics.com/support/software/cell-ranger/downloads
    
    # Define variables for clarity
    OUTPUT_ID="my_cellranger_run"
    TRANSCRIPTOME_REF="/path/to/refdata-gex-GRCh38-2020-A" # Placeholder: Use a Cell Ranger-compatible transcriptome reference (e.g., human GRCh38)
    FASTQ_DIR="/path/to/your/fastq_files" # Directory containing your FASTQ files
    SAMPLE_NAME="my_sample" # Name of your sample (matches FASTQ file prefix, e.g., my_sample_S1_L001_R1_001.fastq.gz)
    EXPECTED_CELLS=3000 # Optional: Adjust based on your experiment's expected cell count
    
    # Execute cellranger count
    cellranger count \
        --id="${OUTPUT_ID}" \
        --transcriptome="${TRANSCRIPTOME_REF}" \
        --fastqs="${FASTQ_DIR}" \
        --sample="${SAMPLE_NAME}" \
        --expect-cells="${EXPECTED_CELLS}" \
        --localcores=8 \
        --localmem=64
  2. 2

    Cells and genes filtering, dimension reduction: Scanpy (version 1.7.2)

    $ Bash example
    # Install Scanpy and its dependencies (if not already installed)
    # conda create -n scanpy_env python=3.8
    # conda activate scanpy_env
    # conda install -c conda-forge scanpy python-igraph leidenalg
    
    # Create a Python script for Scanpy operations
    cat << 'EOF' > run_scanpy_processing.py
    import scanpy as sc
    import numpy as np
    import pandas as pd
    
    # Set verbosity for scanpy
    sc.settings.verbosity = 3             # errors (0), warnings (1), info (2), hints (3)
    sc.logging.print_header()
    sc.settings.set_figure_params(dpi=80, facecolor='white')
    
    # --- Configuration --- #
    INPUT_H5AD = "input_raw_counts.h5ad" # Replace with your actual input file
    OUTPUT_H5AD = "output_processed.h5ad"
    
    # --- Load Data ---
    try:
        adata = sc.read_h5ad(INPUT_H5AD)
    except FileNotFoundError:
        print(f"Error: Input file {INPUT_H5AD} not found. Creating a dummy AnnData object for demonstration.")
        # Create a dummy AnnData object for demonstration purposes
        n_cells = 1000
        n_genes = 2000
        counts = np.random.randint(0, 100, size=(n_cells, n_genes))
        gene_names = [f'gene_{i}' for i in range(n_genes)]
        cell_names = [f'cell_{i}' for i in range(n_cells)]
        adata = sc.AnnData(counts, obs=pd.DataFrame(index=cell_names), var=pd.DataFrame(index=gene_names))
        adata.var['mt'] = adata.var_names.str.startswith('MT-') # Dummy mitochondrial genes
    
    print(f"Initial AnnData object: {adata}")
    
    # --- Quality Control and Filtering ---
    # Calculate QC metrics
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    
    # Filter cells based on number of genes and counts
    # Example thresholds: min_genes=200, max_genes=2500, max_counts=10000, max_mt_percent=5
    print("Filtering cells...")
    initial_cells = adata.n_obs
    sc.pp.filter_cells(adata, min_genes=200)
    adata = adata[adata.obs.n_genes_by_counts < 2500, :]
    adata = adata[adata.obs.total_counts < 10000, :]
    adata = adata[adata.obs.pct_counts_mt < 5, :]
    print(f"Filtered {initial_cells - adata.n_obs} cells. Remaining cells: {adata.n_obs}")
    
    # Filter genes based on number of cells
    print("Filtering genes...")
    initial_genes = adata.n_vars
    sc.pp.filter_genes(adata, min_cells=3)
    print(f"Filtered {initial_genes - adata.n_vars} genes. Remaining genes: {adata.n_vars}")
    
    # --- Normalization and Log-transformation ---
    print("Normalizing and log-transforming data...")
    sc.pp.normalize_total(adata, target_sum=1e4) # Normalize each cell's total counts to 10,000
    sc.pp.log1p(adata) # Log-transform the data
    
    # --- Feature Selection (Highly Variable Genes) ---
    print("Identifying highly variable genes...")
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
    
    # Subset to highly variable genes for dimension reduction
    adata = adata[:, adata.var.highly_variable]
    print(f"Number of highly variable genes: {adata.n_vars}")
    
    # --- Scaling ---
    print("Scaling data...")
    sc.pp.scale(adata, max_value=10) # Scale data to unit variance, clip values above 10
    
    # --- Dimension Reduction ---
    print("Performing PCA...")
    sc.tl.pca(adata, svd_solver='arpack')
    
    print("Computing neighbors graph...")
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40) # Use 40 principal components for neighbor graph
    
    print("Performing UMAP...")
    sc.tl.umap(adata) # Uniform Manifold Approximation and Projection
    
    # --- Save Processed Data ---
    print(f"Saving processed AnnData object to {OUTPUT_H5AD}")
    adata.write(OUTPUT_H5AD)
    print("Processing complete.")
    EOF
    
    # Execute the Python script
    python run_scanpy_processing.py

Tools Used

Raw Source Text
Alignment, cell barcode processing, umis processing, abundance measurements: cellranger count (version 2.1.1)
Cells and genes filtering, dimension reduction: Scanpy (version 1.7.2)
Assembly: mm10
Supplementary files format and content: Tab-separated values files and matrix files
← Back to Analysis