GSE125527 Processing Pipeline

OTHER code_examples 9 steps

Publication

Heterogeneity and clonal relationships of adaptive immune cells in ulcerative colitis revealed by single-cell analyses.

Science immunology (2020) — PMID 32826341

Dataset

Single-Cell Analyses Elucidate the Cellular and Molecular Landscape of Ulcerative Colitis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.

Cell Ranger v2.1.0

$ Bash example

# Install Cell Ranger (example using conda, adjust for your system)
# conda create -n cellranger-2.1.0 python=2.7
# conda activate cellranger-2.1.0
# Download and install Cell Ranger 2.1.0 from 10x Genomics website
# For example: wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-2.1.0.tar.gz
# tar -xzf cellranger-2.1.0.tar.gz
# export PATH=/path/to/cellranger-2.1.0:$PATH

# Download or build the mm10 reference transcriptome (if not already available)
# For example, from 10x Genomics support site for Cell Ranger 2.1.0:
# wget https://cf.10xgenomics.com/releases/cell-ranger/refdata-cellranger-mm10-1.2.0.tar.gz
# tar -xzf refdata-cellranger-mm10-1.2.0.tar.gz

# Define variables (replace with actual paths and sample names)
MM10_REFERENCE_PATH="/path/to/refdata-cellranger-mm10-1.2.0"
FASTQ_PATH="/path/to/your/fastq_directory"
SAMPLE_ID="your_sample_name"
OUTPUT_DIR="./cellranger_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run cellranger count with default parameters
cellranger count \
    --id="${SAMPLE_ID}" \
    --transcriptome="${MM10_REFERENCE_PATH}" \
    --fastqs="${FASTQ_PATH}" \
    --localcores=8 \
    --localmem=32

Raw cell-reads were then loaded to R using the cellrangerRkit package.

R vInfer from description

$ Bash example

# Install R if not already installed
# sudo apt-get update && sudo apt-get install -y r-base

# Install cellrangerRkit package in R
# R -e 'install.packages("cellrangerRkit", repos="http://cran.us.r-project.org")'

# Create a placeholder R script to load cell-reads
# This script assumes 'path/to/your/cellranger_output_dir' is the path to the Cell Ranger output directory
# and 'loaded_cellranger_data.rds' is where the loaded data will be saved.
cat << 'EOF' > load_cellranger_data.R
library(cellrangerRkit)

# Define the path to your Cell Ranger output directory
# This directory typically contains 'matrix.mtx', 'barcodes.tsv', 'features.tsv'
# or a 'filtered_feature_bc_matrix.h5' file.
# Replace 'path/to/your/cellranger_output_dir' with the actual path to your Cell Ranger output.
cellranger_output_dir <- "path/to/your/cellranger_output_dir" 

# Load the cell-reads data using cellrangerRkit's primary function.
# The 'load_cellranger_output' function is designed to read the output of 10x Cell Ranger.
# It typically returns a list containing the expression matrix, cell barcodes, and gene information.
# In some versions or configurations, it might directly return a Seurat object or similar.
if (dir.exists(cellranger_output_dir)) {
    message(paste("Loading data from:", cellranger_output_dir))
    
    # Attempt to load data. The exact structure of 'cell_data' will depend on the package version
    # and the specific Cell Ranger output format.
    # This function is designed to handle the standard Cell Ranger output directory structure.
    cell_data <- load_cellranger_output(cellranger_output_dir)
    
    # Save the loaded R object for subsequent analysis steps.
    # This creates an R Data Serialization (RDS) file.
    saveRDS(cell_data, file = "loaded_cellranger_data.rds")
    message("Cell Ranger data loaded and saved to loaded_cellranger_data.rds")
} else {
    stop(paste("Error: Cell Ranger output directory not found at", cellranger_output_dir))
}
EOF

# Execute the R script
Rscript load_cellranger_data.R

The pre-filtered VDJ information table (all_contig_annotations.csv, output by cellranger) were loaded at the same time.

Cell Ranger v7.0.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Cell Ranger (example, specific version might vary)
# For Cell Ranger v7.0.0, download from 10x Genomics website:
# wget https://cf.10xgenomics.com/releases/cell-vdj/cellranger-vdj-7.0.0.tar.gz
# tar -xzf cellranger-vdj-7.0.0.tar.gz
# export PATH=/path/to/cellranger-7.0.0:$PATH

# Download a VDJ reference (example: human VDJ GRCh38 v7.0.0)
# wget https://cf.10xgenomics.com/releases/vdj-ref/ref-GRCh38-d_STAR_vdj-7.0.0.tar.gz
# tar -xzf ref-GRCh38-d_STAR_vdj-7.0.0.tar.gz
# VDJ_REFERENCE_PATH="./ref-GRCh38-d_STAR_vdj-7.0.0"

# Define input FASTQ directory and sample ID
# Replace with your actual FASTQ directory and sample name
FASTQ_DIR="/path/to/your/vdj_fastqs"
SAMPLE_ID="your_sample_name"

# Run cellranger vdj to process raw VDJ sequencing data
# This command generates various output files, including 'all_contig_annotations.csv'
# The output directory will be named after the --id parameter (e.g., your_sample_name)
cellranger vdj \
    --id="${SAMPLE_ID}" \
    --fastqs="${FASTQ_DIR}" \
    --reference="${VDJ_REFERENCE_PATH}" \
    --sample="${SAMPLE_ID}" # Optional, if fastqs are named with sample prefix

A cell was considered âvalidâ if it satisfied the following criteria: (1) Cell has enough RNA reads to pass the Cell Ranger threshold. (2) Cell has productive full-length BCR or TCR detected.

Cell Ranger v7.0.0

$ Bash example

    # Define input FASTQ directories and sample names
    # Replace with actual paths to your FASTQ files
    FASTQ_DIR_GEX="/path/to/your/gex_fastqs"
    FASTQ_DIR_VDJ="/path/to/your/vdj_fastqs"
    SAMPLE_ID="my_immune_sample"
    OUTPUT_DIR="${SAMPLE_ID}_cellranger_output"

    # Define reference paths
    # Download reference data from 10x Genomics website if not already available
    # Example: https://www.10xgenomics.com/support/software/cell-ranger/latest/references
    # For GEX: refdata-gex-GRCh38-2020-A (or a newer version)
    # For VDJ: refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0 (or a newer version)
    GEX_REF="/path/to/your/refdata-gex-GRCh38-2020-A"
    VDJ_REF="/path/to/your/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0"

    # Create a multi config CSV file
    # This file specifies the input libraries and references for cellranger multi
    cat <<EOF > multi_config.csv
[gene-expression]
reference,$GEX_REF
fastqs,$FASTQ_DIR_GEX
sample,$SAMPLE_ID

[vdj]
reference,$VDJ_REF
fastqs,$FASTQ_DIR_VDJ
sample,$SAMPLE_ID
EOF

    # Run Cell Ranger multi to process both gene expression and VDJ data.
    # This command generates the necessary output files (e.g., filtered_feature_bc_matrix,
    # filtered_contig_annotations.csv) from which cells satisfying the described criteria
    # can be identified in a subsequent analysis step.
    # Adjust --localcores and --localmem based on available system resources.
    cellranger multi --id=$OUTPUT_DIR --csv=multi_config.csv --localcores=16 --localmem=64

Only valid cells were kept for downstream analysis in both the scRNA-seq and scTCR-seq/scBCR-seq datasets.

scRNA-seq v4.x.x GitHub

$ Bash example

# Install Seurat (if not already installed)
# R -e "install.packages('Seurat')"
# R -e "install.packages('dplyr')"

# Define input and output paths
# INPUT_CELLRANGER_DIR: Path to the directory containing CellRanger output (e.g., 'raw_feature_bc_matrix')
# This directory should contain matrix.mtx, features.tsv, and barcodes.tsv files.
INPUT_CELLRANGER_DIR="path/to/cellranger/outs/raw_feature_bc_matrix"
OUTPUT_RDS="path/to/filtered_seurat_object.rds"

# Create an R script for filtering cells
cat << 'EOF' > filter_cells.R
library(Seurat)
library(dplyr)

# Get input and output paths from environment variables
input_dir <- Sys.getenv("INPUT_CELLRANGER_DIR")
output_rds_path <- Sys.getenv("OUTPUT_RDS")

# Check if input directory exists
if (!dir.exists(input_dir)) {
  stop(paste("Input CellRanger directory not found:", input_dir))
}

# Load raw data from CellRanger output
raw_data <- Read10X(data.dir = input_dir)

# Create a Seurat object
seurat_object <- CreateSeuratObject(counts = raw_data, project = "scRNAseq_filtered")

# Calculate mitochondrial percentage
# Assuming mitochondrial genes start with "MT-" or "mt-"
seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")

# Filter cells based on common QC metrics
# Parameters are inferred as they are not specified in the description.
# Common thresholds used for filtering low-quality cells:
# nFeature_RNA: number of genes detected per cell (e.g., min 200, max 5000)
# nCount_RNA: total number of UMIs per cell (e.g., min 500)
# percent.mt: percentage of mitochondrial reads (e.g., max 10%)
min_features <- 200
max_features <- 5000
min_umis <- 500
max_mt_percent <- 10

message(paste("Original number of cells:", ncol(seurat_object)))

filtered_seurat_object <- subset(seurat_object,
                                 subset = nFeature_RNA > min_features &
                                          nFeature_RNA < max_features &
                                          nCount_RNA > min_umis &
                                          percent.mt < max_mt_percent)

message(paste("Filtered number of cells:", ncol(filtered_seurat_object)))

# Save the filtered Seurat object
saveRDS(filtered_seurat_object, file = output_rds_path)
EOF

# Set environment variables for the R script
export INPUT_CELLRANGER_DIR="path/to/cellranger/outs/raw_feature_bc_matrix" # Placeholder for raw counts
export OUTPUT_RDS="path/to/filtered_seurat_object.rds"

# Execute the R script
Rscript filter_cells.R

View on GitHub

The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio.

scRNA-seq v4.x GitHub

$ Bash example

# Install R and Seurat if not already installed
# conda create -n r_seurat r-base r-essentials -y
# conda activate r_seurat
# R -q -e "install.packages('Seurat', repos='http://cran.us.r-project.org')"

# Define input and output files
INPUT_SEURAT_OBJECT="input_seurat_object.rds"
OUTPUT_SEURAT_OBJECT="filtered_seurat_object.rds"

# Define filtering parameters based on common scRNA-seq QC thresholds
# The description implies filtering based on 'gene numbers' (min and max)
# and 'mitochondria gene counts to total counts ratio' (max percentage)
MIN_GENES=200
MAX_GENES=2500
MAX_MITO_PERCENT=5

# Run Seurat filtering using Rscript
Rscript -e "
    library(Seurat)

    # Load the Seurat object
    seurat_obj <- readRDS('${INPUT_SEURAT_OBJECT}')

    # Calculate mitochondrial percentage (assuming human data with 'MT-' prefix)
    # Adjust pattern if organism is different (e.g., '^mt-' for mouse)
    seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-')

    # Filter cells based on gene numbers and mitochondrial content
    filtered_seurat_obj <- subset(seurat_obj, 
                                  subset = nFeature_RNA > ${MIN_GENES} & 
                                           nFeature_RNA < ${MAX_GENES} & 
                                           percent.mt < ${MAX_MITO_PERCENT})

    # Save the filtered Seurat object
    saveRDS(filtered_seurat_obj, '${OUTPUT_SEURAT_OBJECT}')
"

View on GitHub

Only cells with > 400 genes,UMI > 0,and 0.5% ~ 30% of their UMIs mappingto mitochondria genes were kept for downstream analysis.

scanpy (Inferred with models/gemini-2.5-flash) v1.9.8 GitHub

$ Bash example

#!/bin/bash

# This script performs quality control filtering on single-cell RNA-seq data
# using scanpy, based on the specified criteria.

# Define input and output file paths
# Replace 'input_raw_cells.h5ad' with your actual input AnnData file (e.g., from cellranger output)
# Replace 'output_filtered_cells.h5ad' with your desired output file name
INPUT_H5AD="input_raw_cells.h5ad"
OUTPUT_H5AD="output_filtered_cells.h5ad"

# Check if the input file exists
if [ ! -f "$INPUT_H5AD" ]; then
    echo "Error: Input file $INPUT_H5AD not found. Please provide a valid .h5ad file." >&2
    exit 1
fi

# Install scanpy and its dependencies if not already installed (uncomment to run)
# Note: It's recommended to use a dedicated conda environment for bioinformatics tools.
# conda create -n sc_env python=3.9
# conda activate sc_env
# pip install scanpy pandas numpy scipy

# Execute the Python script for filtering using scanpy
python -c "
import scanpy as sc
import pandas as pd
import numpy as np

# Load the AnnData object
try:
    adata = sc.read_h5ad('$INPUT_H5AD')
except FileNotFoundError:
    print(f'Error: Input file $INPUT_H5AD not found.')
    exit(1)
except Exception as e:
    print(f'Error loading AnnData object from $INPUT_H5AD: {e}')
    exit(1)

print(f'\n--- Starting QC Filtering ---\n')
print(f'Original number of cells: {adata.n_obs}')
print(f'Original number of genes: {adata.n_vars}')

# Identify mitochondrial genes
# This assumes mitochondrial genes start with 'MT-' (common for human/mouse).
# Adjust the pattern (e.g., 'mt-') if your data uses a different convention.
adata.var['mt'] = adata.var_names.str.startswith('MT-')

# Calculate QC metrics for cells
# 'qc_vars' specifies variables for which to calculate percentage of counts (e.g., mitochondrial genes)
# 'percent_top' is set to None as we are not interested in top expressed genes for this filtering step
# 'log1p' is set to False as we want raw counts for filtering criteria
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# --- Apply Filtering Criteria ---

# 1. Filter cells with more than 400 genes detected
initial_cells = adata.n_obs
adata = adata[adata.obs['n_genes_by_counts'] > 400, :].copy()
print(f'Cells after filtering for > 400 genes: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)')

# 2. Filter cells with more than 0 UMIs (total_counts)
# This criterion ensures that only cells with actual expression data are kept.
initial_cells = adata.n_obs
adata = adata[adata.obs['total_counts'] > 0, :].copy()
print(f'Cells after filtering for > 0 UMIs: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)')

# 3. Filter cells based on mitochondrial gene percentage (0.5% to 30%)
# 'pct_counts_mt' is the percentage of UMIs mapping to mitochondrial genes.
initial_cells = adata.n_obs
adata = adata[adata.obs['pct_counts_mt'] > 0.5, :].copy()
adata = adata[adata.obs['pct_counts_mt'] < 30, :].copy()
print(f'Cells after filtering for 0.5% < MT% < 30%: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)')

print(f'\nFinal number of cells after all filters: {adata.n_obs}')

# Save the filtered AnnData object to the specified output file
try:
    adata.write('$OUTPUT_H5AD')
    print(f'Filtered data saved to $OUTPUT_H5AD')
except Exception as e:
    print(f'Error saving filtered AnnData object to $OUTPUT_H5AD: {e}')
    exit(1)
" || {
    echo "Error: Python script execution failed." >&2
    exit 1
}

View on GitHub

All the VDJ data that don't have the following flag of cellranger were filtered: 1. the is_cell flag, 2. the full-length flag and 3. the productive flag.

Cell Ranger vNot specified (Inferred with models/gemini-2.5-flash)

$ Bash example

# Filter VDJ data from Cell Ranger's all_contig_annotations.csv
# to keep only contigs that are cells, full-length, and productive.
# This assumes the input file is 'all_contig_annotations.csv'
# and the output file will be 'filtered_vdj_data.csv'.

# Installation (awk is typically pre-installed on Linux/macOS)
# No specific installation command needed for awk.

# The filtering logic: keep rows where 'is_cell', 'full_length', and 'productive' are all "TRUE".
# This command first identifies the column indices from the header,
# then filters subsequent rows based on the values in those columns.
awk -F',' '
BEGIN { OFS = "," }
NR == 1 {
    for (i=1; i<=NF; i++) {
        if ($i == "is_cell") is_cell_col = i;
        if ($i == "full_length") full_length_col = i;
        if ($i == "productive") productive_col = i;
    }
    print; # Print header
    next;
}
$is_cell_col == "TRUE" && $full_length_col == "TRUE" && $productive_col == "TRUE" {
    print;
}
' all_contig_annotations.csv > filtered_vdj_data.csv

For the 5' scRNA-seq and 5' VDJ-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.

scRNA-seq v1.1.2 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Define input and output file names (placeholders)
INPUT_R1="R1_input.fastq.gz"
OUTPUT_R1="R1_output_umi_extracted.fastq.gz"
LOG_FILE="umi_extract.log"

# Extract cell barcode (first 16bp) and UMI (next 10bp, i.e., 17-26bp) from R1
# C: Cell barcode, X: UMI
umi_tools extract \
    --bc-pattern=CCCCCCCCCCCCCCCCXXXXXXXXXX \
    --read1-in "${INPUT_R1}" \
    --read1-out "${OUTPUT_R1}" \
    --log "${LOG_FILE}"

View on GitHub

Tools Used

R scRNA-seq

Raw Source Text

data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.
Raw cell-reads were then loaded to R using the cellrangerRkit package. The pre-filtered VDJ information table (all_contig_annotations.csv, output by cellranger) were loaded at the same time. A cell was considered âvalidâ if it satisfied the following criteria: (1) Cell has enough RNA reads to pass the Cell Ranger threshold. (2) Cell has productive full-length BCR or TCR detected. Only valid cells were kept for downstream analysis in both the scRNA-seq and scTCR-seq/scBCR-seq datasets. The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio. Only cells with > 400 genes,UMI > 0,and 0.5% ~ 30% of their UMIs mappingto mitochondria genes were kept for downstream analysis.
All the VDJ data that don't have the following flag of cellranger were filtered: 1. the is_cell flag, 2. the full-length flag and 3. the productive flag.
For the 5' scRNA-seq and 5' VDJ-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.
Genome_build: hg38
Supplementary_files_format_and_content: cell-gene UMI table: UMI table with each row represent a cell and each column represent a gene.   BCR table: table containing BCR information, including V/D/J/C genes and CDR3  TCR table:  table containing TCR information, including V/D/J/C genes and CDR3

← Back to Analysis