GSE125527 Processing Pipeline
Publication
Heterogeneity and clonal relationships of adaptive immune cells in ulcerative colitis revealed by single-cell analyses.Science immunology (2020) — PMID 32826341
Dataset
GSE125527Single-Cell Analyses Elucidate the Cellular and Molecular Landscape of Ulcerative Colitis
Processing Steps
Generate Jupyter Notebook-
1
data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.
Cell Ranger v2.1.0$ Bash example
# Install Cell Ranger (example using conda, adjust for your system) # conda create -n cellranger-2.1.0 python=2.7 # conda activate cellranger-2.1.0 # Download and install Cell Ranger 2.1.0 from 10x Genomics website # For example: wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-2.1.0.tar.gz # tar -xzf cellranger-2.1.0.tar.gz # export PATH=/path/to/cellranger-2.1.0:$PATH # Download or build the mm10 reference transcriptome (if not already available) # For example, from 10x Genomics support site for Cell Ranger 2.1.0: # wget https://cf.10xgenomics.com/releases/cell-ranger/refdata-cellranger-mm10-1.2.0.tar.gz # tar -xzf refdata-cellranger-mm10-1.2.0.tar.gz # Define variables (replace with actual paths and sample names) MM10_REFERENCE_PATH="/path/to/refdata-cellranger-mm10-1.2.0" FASTQ_PATH="/path/to/your/fastq_directory" SAMPLE_ID="your_sample_name" OUTPUT_DIR="./cellranger_output" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run cellranger count with default parameters cellranger count \ --id="${SAMPLE_ID}" \ --transcriptome="${MM10_REFERENCE_PATH}" \ --fastqs="${FASTQ_PATH}" \ --localcores=8 \ --localmem=32 -
2
Raw cell-reads were then loaded to R using the cellrangerRkit package.
R vInfer from description$ Bash example
# Install R if not already installed # sudo apt-get update && sudo apt-get install -y r-base # Install cellrangerRkit package in R # R -e 'install.packages("cellrangerRkit", repos="http://cran.us.r-project.org")' # Create a placeholder R script to load cell-reads # This script assumes 'path/to/your/cellranger_output_dir' is the path to the Cell Ranger output directory # and 'loaded_cellranger_data.rds' is where the loaded data will be saved. cat << 'EOF' > load_cellranger_data.R library(cellrangerRkit) # Define the path to your Cell Ranger output directory # This directory typically contains 'matrix.mtx', 'barcodes.tsv', 'features.tsv' # or a 'filtered_feature_bc_matrix.h5' file. # Replace 'path/to/your/cellranger_output_dir' with the actual path to your Cell Ranger output. cellranger_output_dir <- "path/to/your/cellranger_output_dir" # Load the cell-reads data using cellrangerRkit's primary function. # The 'load_cellranger_output' function is designed to read the output of 10x Cell Ranger. # It typically returns a list containing the expression matrix, cell barcodes, and gene information. # In some versions or configurations, it might directly return a Seurat object or similar. if (dir.exists(cellranger_output_dir)) { message(paste("Loading data from:", cellranger_output_dir)) # Attempt to load data. The exact structure of 'cell_data' will depend on the package version # and the specific Cell Ranger output format. # This function is designed to handle the standard Cell Ranger output directory structure. cell_data <- load_cellranger_output(cellranger_output_dir) # Save the loaded R object for subsequent analysis steps. # This creates an R Data Serialization (RDS) file. saveRDS(cell_data, file = "loaded_cellranger_data.rds") message("Cell Ranger data loaded and saved to loaded_cellranger_data.rds") } else { stop(paste("Error: Cell Ranger output directory not found at", cellranger_output_dir)) } EOF # Execute the R script Rscript load_cellranger_data.R -
3
The pre-filtered VDJ information table (all_contig_annotations.csv, output by cellranger) were loaded at the same time.
Cell Ranger v7.0.0 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install Cell Ranger (example, specific version might vary) # For Cell Ranger v7.0.0, download from 10x Genomics website: # wget https://cf.10xgenomics.com/releases/cell-vdj/cellranger-vdj-7.0.0.tar.gz # tar -xzf cellranger-vdj-7.0.0.tar.gz # export PATH=/path/to/cellranger-7.0.0:$PATH # Download a VDJ reference (example: human VDJ GRCh38 v7.0.0) # wget https://cf.10xgenomics.com/releases/vdj-ref/ref-GRCh38-d_STAR_vdj-7.0.0.tar.gz # tar -xzf ref-GRCh38-d_STAR_vdj-7.0.0.tar.gz # VDJ_REFERENCE_PATH="./ref-GRCh38-d_STAR_vdj-7.0.0" # Define input FASTQ directory and sample ID # Replace with your actual FASTQ directory and sample name FASTQ_DIR="/path/to/your/vdj_fastqs" SAMPLE_ID="your_sample_name" # Run cellranger vdj to process raw VDJ sequencing data # This command generates various output files, including 'all_contig_annotations.csv' # The output directory will be named after the --id parameter (e.g., your_sample_name) cellranger vdj \ --id="${SAMPLE_ID}" \ --fastqs="${FASTQ_DIR}" \ --reference="${VDJ_REFERENCE_PATH}" \ --sample="${SAMPLE_ID}" # Optional, if fastqs are named with sample prefix -
4
A cell was considered âvalidâ if it satisfied the following criteria: (1) Cell has enough RNA reads to pass the Cell Ranger threshold. (2) Cell has productive full-length BCR or TCR detected.
Cell Ranger v7.0.0$ Bash example
# Define input FASTQ directories and sample names # Replace with actual paths to your FASTQ files FASTQ_DIR_GEX="/path/to/your/gex_fastqs" FASTQ_DIR_VDJ="/path/to/your/vdj_fastqs" SAMPLE_ID="my_immune_sample" OUTPUT_DIR="${SAMPLE_ID}_cellranger_output" # Define reference paths # Download reference data from 10x Genomics website if not already available # Example: https://www.10xgenomics.com/support/software/cell-ranger/latest/references # For GEX: refdata-gex-GRCh38-2020-A (or a newer version) # For VDJ: refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0 (or a newer version) GEX_REF="/path/to/your/refdata-gex-GRCh38-2020-A" VDJ_REF="/path/to/your/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0" # Create a multi config CSV file # This file specifies the input libraries and references for cellranger multi cat <<EOF > multi_config.csv [gene-expression] reference,$GEX_REF fastqs,$FASTQ_DIR_GEX sample,$SAMPLE_ID [vdj] reference,$VDJ_REF fastqs,$FASTQ_DIR_VDJ sample,$SAMPLE_ID EOF # Run Cell Ranger multi to process both gene expression and VDJ data. # This command generates the necessary output files (e.g., filtered_feature_bc_matrix, # filtered_contig_annotations.csv) from which cells satisfying the described criteria # can be identified in a subsequent analysis step. # Adjust --localcores and --localmem based on available system resources. cellranger multi --id=$OUTPUT_DIR --csv=multi_config.csv --localcores=16 --localmem=64 -
5
Only valid cells were kept for downstream analysis in both the scRNA-seq and scTCR-seq/scBCR-seq datasets.
$ Bash example
# Install Seurat (if not already installed) # R -e "install.packages('Seurat')" # R -e "install.packages('dplyr')" # Define input and output paths # INPUT_CELLRANGER_DIR: Path to the directory containing CellRanger output (e.g., 'raw_feature_bc_matrix') # This directory should contain matrix.mtx, features.tsv, and barcodes.tsv files. INPUT_CELLRANGER_DIR="path/to/cellranger/outs/raw_feature_bc_matrix" OUTPUT_RDS="path/to/filtered_seurat_object.rds" # Create an R script for filtering cells cat << 'EOF' > filter_cells.R library(Seurat) library(dplyr) # Get input and output paths from environment variables input_dir <- Sys.getenv("INPUT_CELLRANGER_DIR") output_rds_path <- Sys.getenv("OUTPUT_RDS") # Check if input directory exists if (!dir.exists(input_dir)) { stop(paste("Input CellRanger directory not found:", input_dir)) } # Load raw data from CellRanger output raw_data <- Read10X(data.dir = input_dir) # Create a Seurat object seurat_object <- CreateSeuratObject(counts = raw_data, project = "scRNAseq_filtered") # Calculate mitochondrial percentage # Assuming mitochondrial genes start with "MT-" or "mt-" seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-") # Filter cells based on common QC metrics # Parameters are inferred as they are not specified in the description. # Common thresholds used for filtering low-quality cells: # nFeature_RNA: number of genes detected per cell (e.g., min 200, max 5000) # nCount_RNA: total number of UMIs per cell (e.g., min 500) # percent.mt: percentage of mitochondrial reads (e.g., max 10%) min_features <- 200 max_features <- 5000 min_umis <- 500 max_mt_percent <- 10 message(paste("Original number of cells:", ncol(seurat_object))) filtered_seurat_object <- subset(seurat_object, subset = nFeature_RNA > min_features & nFeature_RNA < max_features & nCount_RNA > min_umis & percent.mt < max_mt_percent) message(paste("Filtered number of cells:", ncol(filtered_seurat_object))) # Save the filtered Seurat object saveRDS(filtered_seurat_object, file = output_rds_path) EOF # Set environment variables for the R script export INPUT_CELLRANGER_DIR="path/to/cellranger/outs/raw_feature_bc_matrix" # Placeholder for raw counts export OUTPUT_RDS="path/to/filtered_seurat_object.rds" # Execute the R script Rscript filter_cells.R -
6
The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio.
$ Bash example
# Install R and Seurat if not already installed # conda create -n r_seurat r-base r-essentials -y # conda activate r_seurat # R -q -e "install.packages('Seurat', repos='http://cran.us.r-project.org')" # Define input and output files INPUT_SEURAT_OBJECT="input_seurat_object.rds" OUTPUT_SEURAT_OBJECT="filtered_seurat_object.rds" # Define filtering parameters based on common scRNA-seq QC thresholds # The description implies filtering based on 'gene numbers' (min and max) # and 'mitochondria gene counts to total counts ratio' (max percentage) MIN_GENES=200 MAX_GENES=2500 MAX_MITO_PERCENT=5 # Run Seurat filtering using Rscript Rscript -e " library(Seurat) # Load the Seurat object seurat_obj <- readRDS('${INPUT_SEURAT_OBJECT}') # Calculate mitochondrial percentage (assuming human data with 'MT-' prefix) # Adjust pattern if organism is different (e.g., '^mt-' for mouse) seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-') # Filter cells based on gene numbers and mitochondrial content filtered_seurat_obj <- subset(seurat_obj, subset = nFeature_RNA > ${MIN_GENES} & nFeature_RNA < ${MAX_GENES} & percent.mt < ${MAX_MITO_PERCENT}) # Save the filtered Seurat object saveRDS(filtered_seurat_obj, '${OUTPUT_SEURAT_OBJECT}') " -
7
Only cells with > 400 genes,UMI > 0,and 0.5% ~ 30% of their UMIs mappingto mitochondria genes were kept for downstream analysis.
$ Bash example
#!/bin/bash # This script performs quality control filtering on single-cell RNA-seq data # using scanpy, based on the specified criteria. # Define input and output file paths # Replace 'input_raw_cells.h5ad' with your actual input AnnData file (e.g., from cellranger output) # Replace 'output_filtered_cells.h5ad' with your desired output file name INPUT_H5AD="input_raw_cells.h5ad" OUTPUT_H5AD="output_filtered_cells.h5ad" # Check if the input file exists if [ ! -f "$INPUT_H5AD" ]; then echo "Error: Input file $INPUT_H5AD not found. Please provide a valid .h5ad file." >&2 exit 1 fi # Install scanpy and its dependencies if not already installed (uncomment to run) # Note: It's recommended to use a dedicated conda environment for bioinformatics tools. # conda create -n sc_env python=3.9 # conda activate sc_env # pip install scanpy pandas numpy scipy # Execute the Python script for filtering using scanpy python -c " import scanpy as sc import pandas as pd import numpy as np # Load the AnnData object try: adata = sc.read_h5ad('$INPUT_H5AD') except FileNotFoundError: print(f'Error: Input file $INPUT_H5AD not found.') exit(1) except Exception as e: print(f'Error loading AnnData object from $INPUT_H5AD: {e}') exit(1) print(f'\n--- Starting QC Filtering ---\n') print(f'Original number of cells: {adata.n_obs}') print(f'Original number of genes: {adata.n_vars}') # Identify mitochondrial genes # This assumes mitochondrial genes start with 'MT-' (common for human/mouse). # Adjust the pattern (e.g., 'mt-') if your data uses a different convention. adata.var['mt'] = adata.var_names.str.startswith('MT-') # Calculate QC metrics for cells # 'qc_vars' specifies variables for which to calculate percentage of counts (e.g., mitochondrial genes) # 'percent_top' is set to None as we are not interested in top expressed genes for this filtering step # 'log1p' is set to False as we want raw counts for filtering criteria sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True) # --- Apply Filtering Criteria --- # 1. Filter cells with more than 400 genes detected initial_cells = adata.n_obs adata = adata[adata.obs['n_genes_by_counts'] > 400, :].copy() print(f'Cells after filtering for > 400 genes: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)') # 2. Filter cells with more than 0 UMIs (total_counts) # This criterion ensures that only cells with actual expression data are kept. initial_cells = adata.n_obs adata = adata[adata.obs['total_counts'] > 0, :].copy() print(f'Cells after filtering for > 0 UMIs: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)') # 3. Filter cells based on mitochondrial gene percentage (0.5% to 30%) # 'pct_counts_mt' is the percentage of UMIs mapping to mitochondrial genes. initial_cells = adata.n_obs adata = adata[adata.obs['pct_counts_mt'] > 0.5, :].copy() adata = adata[adata.obs['pct_counts_mt'] < 30, :].copy() print(f'Cells after filtering for 0.5% < MT% < 30%: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)') print(f'\nFinal number of cells after all filters: {adata.n_obs}') # Save the filtered AnnData object to the specified output file try: adata.write('$OUTPUT_H5AD') print(f'Filtered data saved to $OUTPUT_H5AD') except Exception as e: print(f'Error saving filtered AnnData object to $OUTPUT_H5AD: {e}') exit(1) " || { echo "Error: Python script execution failed." >&2 exit 1 } -
8
All the VDJ data that don't have the following flag of cellranger were filtered: 1. the is_cell flag, 2. the full-length flag and 3. the productive flag.
Cell Ranger vNot specified (Inferred with models/gemini-2.5-flash)$ Bash example
# Filter VDJ data from Cell Ranger's all_contig_annotations.csv # to keep only contigs that are cells, full-length, and productive. # This assumes the input file is 'all_contig_annotations.csv' # and the output file will be 'filtered_vdj_data.csv'. # Installation (awk is typically pre-installed on Linux/macOS) # No specific installation command needed for awk. # The filtering logic: keep rows where 'is_cell', 'full_length', and 'productive' are all "TRUE". # This command first identifies the column indices from the header, # then filters subsequent rows based on the values in those columns. awk -F',' ' BEGIN { OFS = "," } NR == 1 { for (i=1; i<=NF; i++) { if ($i == "is_cell") is_cell_col = i; if ($i == "full_length") full_length_col = i; if ($i == "productive") productive_col = i; } print; # Print header next; } $is_cell_col == "TRUE" && $full_length_col == "TRUE" && $productive_col == "TRUE" { print; } ' all_contig_annotations.csv > filtered_vdj_data.csv -
9
For the 5' scRNA-seq and 5' VDJ-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi_tools # Define input and output file names (placeholders) INPUT_R1="R1_input.fastq.gz" OUTPUT_R1="R1_output_umi_extracted.fastq.gz" LOG_FILE="umi_extract.log" # Extract cell barcode (first 16bp) and UMI (next 10bp, i.e., 17-26bp) from R1 # C: Cell barcode, X: UMI umi_tools extract \ --bc-pattern=CCCCCCCCCCCCCCCCXXXXXXXXXX \ --read1-in "${INPUT_R1}" \ --read1-out "${OUTPUT_R1}" \ --log "${LOG_FILE}"
Raw Source Text
data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference. Raw cell-reads were then loaded to R using the cellrangerRkit package. The pre-filtered VDJ information table (all_contig_annotations.csv, output by cellranger) were loaded at the same time. A cell was considered âvalidâ if it satisfied the following criteria: (1) Cell has enough RNA reads to pass the Cell Ranger threshold. (2) Cell has productive full-length BCR or TCR detected. Only valid cells were kept for downstream analysis in both the scRNA-seq and scTCR-seq/scBCR-seq datasets. The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio. Only cells with > 400 genes,UMI > 0,and 0.5% ~ 30% of their UMIs mappingto mitochondria genes were kept for downstream analysis. All the VDJ data that don't have the following flag of cellranger were filtered: 1. the is_cell flag, 2. the full-length flag and 3. the productive flag. For the 5' scRNA-seq and 5' VDJ-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI. Genome_build: hg38 Supplementary_files_format_and_content: cell-gene UMI table: UMI table with each row represent a cell and each column represent a gene. BCR table: table containing BCR information, including V/D/J/C genes and CDR3 TCR table: table containing TCR information, including V/D/J/C genes and CDR3