GSE147081 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Heterogenous Populations of Tissue-Resident CD8<sup>+</sup> T Cells Are Generated in Response to Infection and Malignancy.

Immunity (2020) — PMID 32433949

Dataset

GSE147081

Single Cell RNA-Seq of CD8+ T cell subsets responding to tumor

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.

Cell Ranger v2.1.0

$ Bash example

# Install Cell Ranger (example, adjust for specific system and version)
# Note: Cell Ranger 2.1.0 is an older version. Ensure compatibility with your system and reference.
# wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-2.1.0.tar.gz
# tar -xzf cellranger-2.1.0.tar.gz
# export PATH=/path/to/cellranger-2.1.0:$PATH

# Download or build mm10 reference compatible with Cell Ranger 2.1.0 (example, adjust path)
# 10x Genomics typically provides pre-built references. For 2.1.0, this might be refdata-cellranger-mm10-1.2.0 or similar.
# Example placeholder for reference path:
# REF_PATH="/path/to/cellranger_ref/mm10_compatible_with_2.1.0"
# If building, ensure the GTF and FASTA files are appropriate for mm10 and the Cell Ranger version.
# cellranger mkref --genome=mm10 --fasta=genome.fa --genes=genes.gtf

# Define input and output paths
SAMPLE_ID="my_cellranger_run"
FASTQ_DIR="/path/to/your/fastq_files" # Directory containing FASTQ files (e.g., from mkfastq output)
REFERENCE_PATH="/path/to/cellranger_ref/mm10_2.1.0_compatible" # Path to the Cell Ranger compatible mm10 reference transcriptome

# Run cellranger count with default parameters
# The description states "default parameter", so we use the standard 'count' command
# with essential inputs. Other parameters like --expect-cells, --chemistry, etc.,
# would use their default values for Cell Ranger 2.1.0.
cellranger count \
    --id="${SAMPLE_ID}" \
    --transcriptome="${REFERENCE_PATH}" \
    --fastqs="${FASTQ_DIR}" \
    --sample="your_sample_name" # Replace with actual sample name(s) as found in FASTQ file names

Raw cell-reads were then loaded to R using the cellrangerRkit package.

R vNot specified in description (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install cellrangerRkit package (if not already installed)
# R -e 'install.packages("cellrangerRkit")'

# Create a placeholder R script to load Cell Ranger data
cat <<EOF > load_cell_reads.R
library(cellrangerRkit)

# Define the path to the Cell Ranger output directory
# This directory should contain files like matrix.mtx, barcodes.tsv, features.tsv
# Replace 'path/to/cellranger_output_directory' with the actual path to your Cell Ranger output
cellranger_output_dir <- "path/to/cellranger_output_directory"

# Load the raw cell-reads matrix
# The function load_cellranger_matrix reads the sparse matrix, barcodes, and features
cell_reads_data <- load_cellranger_matrix(cellranger_output_dir)

# You can now work with 'cell_reads_data' object, e.g., inspect dimensions, save it
# For example, to save the loaded data for further R analysis:
saveRDS(cell_reads_data, "loaded_cell_reads.rds")

message(paste("Successfully loaded data from:", cellranger_output_dir))
message(paste("Dimensions of the loaded matrix:", paste(dim(cell_reads_data@exprs), collapse = " x ")))
EOF

# Execute the R script
Rscript load_cell_reads.R

The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio.

scRNA-seq v4.x GitHub

$ Bash example

# Install Seurat if not already installed
# R -e 'install.packages("Seurat")'
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("Seurat")'

# Example R script for filtering scRNA-seq data using Seurat
Rscript -e '
library(Seurat)

# Load your Seurat object (replace with your actual input file)
# Assuming the input is a Seurat object saved as an .rds file
seurat_object <- readRDS("input_seurat_object.rds")

# Calculate mitochondrial percentage (if not already present)
# This assumes mitochondrial genes are prefixed with "MT-" for human or "mt-" for mouse.
# Adjust the pattern based on your organism.
seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")

# Filter cells based on:
# 1. Number of genes detected (nFeature_RNA): typically > 200 and < 2500-5000
# 2. Mitochondrial percentage (percent.mt): typically < 5-10%
# These thresholds are common starting points and may need adjustment based on data quality.
# The description implies filtering based on "gene numbers" (nFeature_RNA) and "mitochondria gene counts to total counts ratio" (percent.mt).
filtered_seurat_object <- subset(seurat_object, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

# Save the filtered Seurat object
saveRDS(filtered_seurat_object, file = "filtered_seurat_object.rds")

print(paste("Original number of cells:", ncol(seurat_object)))
print(paste("Filtered number of cells:", ncol(filtered_seurat_object)))
'

View on GitHub

Only cells with > 400 genes,UMI > 0,and 0.5% ~ 20% of their UMIs mappingto mitochondria genes were kept for downstream analysis.

scanpy (Inferred with models/gemini-2.5-flash) v1.9.8 GitHub

$ Bash example

#!/bin/bash

# Define input and output files
INPUT_H5AD="raw_adata.h5ad" # Placeholder for input AnnData object (e.g., from cellranger aggr or similar)
OUTPUT_H5AD="filtered_adata.h5ad" # Placeholder for output AnnData object
MT_GENE_PREFIX="MT-" # Common prefix for mitochondrial genes (e.g., "MT-" for human GRCh38, "mt-" for mouse GRCm38)

# Check if input file exists
if [ ! -f "$INPUT_H5AD" ]; then
    echo "Error: Input file $INPUT_H5AD not found." >&2
    exit 1
fi

# Install scanpy if not already installed (commented out)
# conda create -n scanpy_env python=3.9 -y
# conda activate scanpy_env
# pip install scanpy anndata pandas numpy

# Execute Python script for filtering
python -c "
import scanpy as sc
import pandas as pd
import numpy as np
import sys

input_h5ad = \"$INPUT_H5AD\"
output_h5ad = \"$OUTPUT_H5AD\"
mt_gene_prefix = \"$MT_GENE_PREFIX\"

try:
    adata = sc.read_h5ad(input_h5ad)
except FileNotFoundError:
    print(f\"Error: Input file {input_h5ad} not found.\", file=sys.stderr)
    sys.exit(1)
except Exception as e:
    print(f\"Error loading AnnData object: {e}\", file=sys.stderr)
    sys.exit(1)

print(f\"Initial number of cells: {adata.n_obs}\")

# Identify mitochondrial genes based on prefix in adata.var_names
# This step is crucial and depends on the reference genome annotation used during quantification.
# For human GRCh38, 'MT-' is a common prefix. Adjust 'mt_gene_prefix' if needed for other species/annotations.
adata.var['mt'] = adata.var_names.str.startswith(mt_gene_prefix)

# Calculate QC metrics including mitochondrial percentage if not already present
# 'qc_vars=['mt']' ensures 'pct_counts_mt' is calculated based on the 'mt' column in adata.var
# 'n_genes_by_counts' and 'total_counts' (UMIs) are also calculated by default.
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Apply filtering criteria:
# 1. Cells with > 400 genes detected (n_genes_by_counts)
# 2. Cells with > 0 UMIs (total_counts)
# 3. Cells with 0.5% ~ 20% of UMIs mapping to mitochondrial genes (pct_counts_mt)

# Filter by number of genes
adata = adata[adata.obs['n_genes_by_counts'] > 400, :].copy()

# Filter by total UMIs (total_counts)
adata = adata[adata.obs['total_counts'] > 0, :].copy()

# Filter by mitochondrial percentage
adata = adata[adata.obs['pct_counts_mt'] > 0.5, :].copy()
adata = adata[adata.obs['pct_counts_mt'] < 20.0, :].copy()

print(f\"Number of cells after filtering: {adata.n_obs}\")

try:
    adata.write(output_h5ad)
except Exception as e:
    print(f\"Error saving filtered AnnData object: {e}\", file=sys.stderr)
    sys.exit(1)

print(f\"Filtered data saved to {output_h5ad}\")
"

View on GitHub

For the scRNA-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.

scRNA-seq v1.1.2 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Example usage: Replace R1.fastq.gz and R2.fastq.gz with actual input files
# and R1_extracted.fastq.gz, R2_extracted.fastq.gz with desired output files.
# The cell barcode (C{16}) and UMI (N{10}) are extracted from R1.
umi_tools extract \
    --bc-pattern=C{16}N{10} \
    --extract-method=regex \
    --stdin=R1.fastq.gz \
    --stdout=R1_extracted.fastq.gz \
    --read2-in=R2.fastq.gz \
    --read2-out=R2_extracted.fastq.gz \
    --log=umi_extract.log

View on GitHub

Tools Used

R scRNA-seq

Raw Source Text

data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.
Raw cell-reads were then loaded to R using the cellrangerRkit package. The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio. Only cells with > 400 genes,UMI > 0,and 0.5% ~ 20% of their UMIs mappingto mitochondria genes were kept for downstream analysis.
For the scRNA-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.
Genome_build: mm10
Supplementary_files_format_and_content: CellRanger output of filtered gene matrix with barcodes & features files

← Back to Analysis