GSE125527 Processing Pipeline

OTHER code_examples 9 steps

Publication

Heterogeneity and clonal relationships of adaptive immune cells in ulcerative colitis revealed by single-cell analyses.

Science immunology (2020) — PMID 32826341

Dataset

GSE125527

Single-Cell Analyses Elucidate the Cellular and Molecular Landscape of Ulcerative Colitis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.

    Cell Ranger v2.1.0
    $ Bash example
    # Install Cell Ranger (example using conda, adjust for your system)
    # conda create -n cellranger-2.1.0 python=2.7
    # conda activate cellranger-2.1.0
    # Download and install Cell Ranger 2.1.0 from 10x Genomics website
    # For example: wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-2.1.0.tar.gz
    # tar -xzf cellranger-2.1.0.tar.gz
    # export PATH=/path/to/cellranger-2.1.0:$PATH
    
    # Download or build the mm10 reference transcriptome (if not already available)
    # For example, from 10x Genomics support site for Cell Ranger 2.1.0:
    # wget https://cf.10xgenomics.com/releases/cell-ranger/refdata-cellranger-mm10-1.2.0.tar.gz
    # tar -xzf refdata-cellranger-mm10-1.2.0.tar.gz
    
    # Define variables (replace with actual paths and sample names)
    MM10_REFERENCE_PATH="/path/to/refdata-cellranger-mm10-1.2.0"
    FASTQ_PATH="/path/to/your/fastq_directory"
    SAMPLE_ID="your_sample_name"
    OUTPUT_DIR="./cellranger_output"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run cellranger count with default parameters
    cellranger count \
        --id="${SAMPLE_ID}" \
        --transcriptome="${MM10_REFERENCE_PATH}" \
        --fastqs="${FASTQ_PATH}" \
        --localcores=8 \
        --localmem=32
    
  2. 2

    Raw cell-reads were then loaded to R using the cellrangerRkit package.

    R vInfer from description
    $ Bash example
    # Install R if not already installed
    # sudo apt-get update && sudo apt-get install -y r-base
    
    # Install cellrangerRkit package in R
    # R -e 'install.packages("cellrangerRkit", repos="http://cran.us.r-project.org")'
    
    # Create a placeholder R script to load cell-reads
    # This script assumes 'path/to/your/cellranger_output_dir' is the path to the Cell Ranger output directory
    # and 'loaded_cellranger_data.rds' is where the loaded data will be saved.
    cat << 'EOF' > load_cellranger_data.R
    library(cellrangerRkit)
    
    # Define the path to your Cell Ranger output directory
    # This directory typically contains 'matrix.mtx', 'barcodes.tsv', 'features.tsv'
    # or a 'filtered_feature_bc_matrix.h5' file.
    # Replace 'path/to/your/cellranger_output_dir' with the actual path to your Cell Ranger output.
    cellranger_output_dir <- "path/to/your/cellranger_output_dir" 
    
    # Load the cell-reads data using cellrangerRkit's primary function.
    # The 'load_cellranger_output' function is designed to read the output of 10x Cell Ranger.
    # It typically returns a list containing the expression matrix, cell barcodes, and gene information.
    # In some versions or configurations, it might directly return a Seurat object or similar.
    if (dir.exists(cellranger_output_dir)) {
        message(paste("Loading data from:", cellranger_output_dir))
        
        # Attempt to load data. The exact structure of 'cell_data' will depend on the package version
        # and the specific Cell Ranger output format.
        # This function is designed to handle the standard Cell Ranger output directory structure.
        cell_data <- load_cellranger_output(cellranger_output_dir)
        
        # Save the loaded R object for subsequent analysis steps.
        # This creates an R Data Serialization (RDS) file.
        saveRDS(cell_data, file = "loaded_cellranger_data.rds")
        message("Cell Ranger data loaded and saved to loaded_cellranger_data.rds")
    } else {
        stop(paste("Error: Cell Ranger output directory not found at", cellranger_output_dir))
    }
    EOF
    
    # Execute the R script
    Rscript load_cellranger_data.R
  3. 3

    The pre-filtered VDJ information table (all_contig_annotations.csv, output by cellranger) were loaded at the same time.

    Cell Ranger v7.0.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Cell Ranger (example, specific version might vary)
    # For Cell Ranger v7.0.0, download from 10x Genomics website:
    # wget https://cf.10xgenomics.com/releases/cell-vdj/cellranger-vdj-7.0.0.tar.gz
    # tar -xzf cellranger-vdj-7.0.0.tar.gz
    # export PATH=/path/to/cellranger-7.0.0:$PATH
    
    # Download a VDJ reference (example: human VDJ GRCh38 v7.0.0)
    # wget https://cf.10xgenomics.com/releases/vdj-ref/ref-GRCh38-d_STAR_vdj-7.0.0.tar.gz
    # tar -xzf ref-GRCh38-d_STAR_vdj-7.0.0.tar.gz
    # VDJ_REFERENCE_PATH="./ref-GRCh38-d_STAR_vdj-7.0.0"
    
    # Define input FASTQ directory and sample ID
    # Replace with your actual FASTQ directory and sample name
    FASTQ_DIR="/path/to/your/vdj_fastqs"
    SAMPLE_ID="your_sample_name"
    
    # Run cellranger vdj to process raw VDJ sequencing data
    # This command generates various output files, including 'all_contig_annotations.csv'
    # The output directory will be named after the --id parameter (e.g., your_sample_name)
    cellranger vdj \
        --id="${SAMPLE_ID}" \
        --fastqs="${FASTQ_DIR}" \
        --reference="${VDJ_REFERENCE_PATH}" \
        --sample="${SAMPLE_ID}" # Optional, if fastqs are named with sample prefix
    
  4. 4

    A cell was considered “valid” if it satisfied the following criteria: (1) Cell has enough RNA reads to pass the Cell Ranger threshold. (2) Cell has productive full-length BCR or TCR detected.

    Cell Ranger v7.0.0
    $ Bash example
        # Define input FASTQ directories and sample names
        # Replace with actual paths to your FASTQ files
        FASTQ_DIR_GEX="/path/to/your/gex_fastqs"
        FASTQ_DIR_VDJ="/path/to/your/vdj_fastqs"
        SAMPLE_ID="my_immune_sample"
        OUTPUT_DIR="${SAMPLE_ID}_cellranger_output"
    
        # Define reference paths
        # Download reference data from 10x Genomics website if not already available
        # Example: https://www.10xgenomics.com/support/software/cell-ranger/latest/references
        # For GEX: refdata-gex-GRCh38-2020-A (or a newer version)
        # For VDJ: refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0 (or a newer version)
        GEX_REF="/path/to/your/refdata-gex-GRCh38-2020-A"
        VDJ_REF="/path/to/your/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0"
    
        # Create a multi config CSV file
        # This file specifies the input libraries and references for cellranger multi
        cat <<EOF > multi_config.csv
    [gene-expression]
    reference,$GEX_REF
    fastqs,$FASTQ_DIR_GEX
    sample,$SAMPLE_ID
    
    [vdj]
    reference,$VDJ_REF
    fastqs,$FASTQ_DIR_VDJ
    sample,$SAMPLE_ID
    EOF
    
        # Run Cell Ranger multi to process both gene expression and VDJ data.
        # This command generates the necessary output files (e.g., filtered_feature_bc_matrix,
        # filtered_contig_annotations.csv) from which cells satisfying the described criteria
        # can be identified in a subsequent analysis step.
        # Adjust --localcores and --localmem based on available system resources.
        cellranger multi --id=$OUTPUT_DIR --csv=multi_config.csv --localcores=16 --localmem=64
  5. 5

    Only valid cells were kept for downstream analysis in both the scRNA-seq and scTCR-seq/scBCR-seq datasets.

    $ Bash example
    # Install Seurat (if not already installed)
    # R -e "install.packages('Seurat')"
    # R -e "install.packages('dplyr')"
    
    # Define input and output paths
    # INPUT_CELLRANGER_DIR: Path to the directory containing CellRanger output (e.g., 'raw_feature_bc_matrix')
    # This directory should contain matrix.mtx, features.tsv, and barcodes.tsv files.
    INPUT_CELLRANGER_DIR="path/to/cellranger/outs/raw_feature_bc_matrix"
    OUTPUT_RDS="path/to/filtered_seurat_object.rds"
    
    # Create an R script for filtering cells
    cat << 'EOF' > filter_cells.R
    library(Seurat)
    library(dplyr)
    
    # Get input and output paths from environment variables
    input_dir <- Sys.getenv("INPUT_CELLRANGER_DIR")
    output_rds_path <- Sys.getenv("OUTPUT_RDS")
    
    # Check if input directory exists
    if (!dir.exists(input_dir)) {
      stop(paste("Input CellRanger directory not found:", input_dir))
    }
    
    # Load raw data from CellRanger output
    raw_data <- Read10X(data.dir = input_dir)
    
    # Create a Seurat object
    seurat_object <- CreateSeuratObject(counts = raw_data, project = "scRNAseq_filtered")
    
    # Calculate mitochondrial percentage
    # Assuming mitochondrial genes start with "MT-" or "mt-"
    seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")
    
    # Filter cells based on common QC metrics
    # Parameters are inferred as they are not specified in the description.
    # Common thresholds used for filtering low-quality cells:
    # nFeature_RNA: number of genes detected per cell (e.g., min 200, max 5000)
    # nCount_RNA: total number of UMIs per cell (e.g., min 500)
    # percent.mt: percentage of mitochondrial reads (e.g., max 10%)
    min_features <- 200
    max_features <- 5000
    min_umis <- 500
    max_mt_percent <- 10
    
    message(paste("Original number of cells:", ncol(seurat_object)))
    
    filtered_seurat_object <- subset(seurat_object,
                                     subset = nFeature_RNA > min_features &
                                              nFeature_RNA < max_features &
                                              nCount_RNA > min_umis &
                                              percent.mt < max_mt_percent)
    
    message(paste("Filtered number of cells:", ncol(filtered_seurat_object)))
    
    # Save the filtered Seurat object
    saveRDS(filtered_seurat_object, file = output_rds_path)
    EOF
    
    # Set environment variables for the R script
    export INPUT_CELLRANGER_DIR="path/to/cellranger/outs/raw_feature_bc_matrix" # Placeholder for raw counts
    export OUTPUT_RDS="path/to/filtered_seurat_object.rds"
    
    # Execute the R script
    Rscript filter_cells.R
  6. 6

    The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio.

    $ Bash example
    # Install R and Seurat if not already installed
    # conda create -n r_seurat r-base r-essentials -y
    # conda activate r_seurat
    # R -q -e "install.packages('Seurat', repos='http://cran.us.r-project.org')"
    
    # Define input and output files
    INPUT_SEURAT_OBJECT="input_seurat_object.rds"
    OUTPUT_SEURAT_OBJECT="filtered_seurat_object.rds"
    
    # Define filtering parameters based on common scRNA-seq QC thresholds
    # The description implies filtering based on 'gene numbers' (min and max)
    # and 'mitochondria gene counts to total counts ratio' (max percentage)
    MIN_GENES=200
    MAX_GENES=2500
    MAX_MITO_PERCENT=5
    
    # Run Seurat filtering using Rscript
    Rscript -e "
        library(Seurat)
    
        # Load the Seurat object
        seurat_obj <- readRDS('${INPUT_SEURAT_OBJECT}')
    
        # Calculate mitochondrial percentage (assuming human data with 'MT-' prefix)
        # Adjust pattern if organism is different (e.g., '^mt-' for mouse)
        seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-')
    
        # Filter cells based on gene numbers and mitochondrial content
        filtered_seurat_obj <- subset(seurat_obj, 
                                      subset = nFeature_RNA > ${MIN_GENES} & 
                                               nFeature_RNA < ${MAX_GENES} & 
                                               percent.mt < ${MAX_MITO_PERCENT})
    
        # Save the filtered Seurat object
        saveRDS(filtered_seurat_obj, '${OUTPUT_SEURAT_OBJECT}')
    "
  7. 7

    Only cells with > 400 genes,UMI > 0,and 0.5% ~ 30% of their UMIs mappingto mitochondria genes were kept for downstream analysis.

    scanpy (Inferred with models/gemini-2.5-flash) v1.9.8 GitHub
    $ Bash example
    #!/bin/bash
    
    # This script performs quality control filtering on single-cell RNA-seq data
    # using scanpy, based on the specified criteria.
    
    # Define input and output file paths
    # Replace 'input_raw_cells.h5ad' with your actual input AnnData file (e.g., from cellranger output)
    # Replace 'output_filtered_cells.h5ad' with your desired output file name
    INPUT_H5AD="input_raw_cells.h5ad"
    OUTPUT_H5AD="output_filtered_cells.h5ad"
    
    # Check if the input file exists
    if [ ! -f "$INPUT_H5AD" ]; then
        echo "Error: Input file $INPUT_H5AD not found. Please provide a valid .h5ad file." >&2
        exit 1
    fi
    
    # Install scanpy and its dependencies if not already installed (uncomment to run)
    # Note: It's recommended to use a dedicated conda environment for bioinformatics tools.
    # conda create -n sc_env python=3.9
    # conda activate sc_env
    # pip install scanpy pandas numpy scipy
    
    # Execute the Python script for filtering using scanpy
    python -c "
    import scanpy as sc
    import pandas as pd
    import numpy as np
    
    # Load the AnnData object
    try:
        adata = sc.read_h5ad('$INPUT_H5AD')
    except FileNotFoundError:
        print(f'Error: Input file $INPUT_H5AD not found.')
        exit(1)
    except Exception as e:
        print(f'Error loading AnnData object from $INPUT_H5AD: {e}')
        exit(1)
    
    print(f'\n--- Starting QC Filtering ---\n')
    print(f'Original number of cells: {adata.n_obs}')
    print(f'Original number of genes: {adata.n_vars}')
    
    # Identify mitochondrial genes
    # This assumes mitochondrial genes start with 'MT-' (common for human/mouse).
    # Adjust the pattern (e.g., 'mt-') if your data uses a different convention.
    adata.var['mt'] = adata.var_names.str.startswith('MT-')
    
    # Calculate QC metrics for cells
    # 'qc_vars' specifies variables for which to calculate percentage of counts (e.g., mitochondrial genes)
    # 'percent_top' is set to None as we are not interested in top expressed genes for this filtering step
    # 'log1p' is set to False as we want raw counts for filtering criteria
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    
    # --- Apply Filtering Criteria ---
    
    # 1. Filter cells with more than 400 genes detected
    initial_cells = adata.n_obs
    adata = adata[adata.obs['n_genes_by_counts'] > 400, :].copy()
    print(f'Cells after filtering for > 400 genes: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)')
    
    # 2. Filter cells with more than 0 UMIs (total_counts)
    # This criterion ensures that only cells with actual expression data are kept.
    initial_cells = adata.n_obs
    adata = adata[adata.obs['total_counts'] > 0, :].copy()
    print(f'Cells after filtering for > 0 UMIs: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)')
    
    # 3. Filter cells based on mitochondrial gene percentage (0.5% to 30%)
    # 'pct_counts_mt' is the percentage of UMIs mapping to mitochondrial genes.
    initial_cells = adata.n_obs
    adata = adata[adata.obs['pct_counts_mt'] > 0.5, :].copy()
    adata = adata[adata.obs['pct_counts_mt'] < 30, :].copy()
    print(f'Cells after filtering for 0.5% < MT% < 30%: {adata.n_obs} (removed {initial_cells - adata.n_obs} cells)')
    
    print(f'\nFinal number of cells after all filters: {adata.n_obs}')
    
    # Save the filtered AnnData object to the specified output file
    try:
        adata.write('$OUTPUT_H5AD')
        print(f'Filtered data saved to $OUTPUT_H5AD')
    except Exception as e:
        print(f'Error saving filtered AnnData object to $OUTPUT_H5AD: {e}')
        exit(1)
    " || {
        echo "Error: Python script execution failed." >&2
        exit 1
    }
    
  8. 8

    All the VDJ data that don't have the following flag of cellranger were filtered: 1. the is_cell flag, 2. the full-length flag and 3. the productive flag.

    Cell Ranger vNot specified (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Filter VDJ data from Cell Ranger's all_contig_annotations.csv
    # to keep only contigs that are cells, full-length, and productive.
    # This assumes the input file is 'all_contig_annotations.csv'
    # and the output file will be 'filtered_vdj_data.csv'.
    
    # Installation (awk is typically pre-installed on Linux/macOS)
    # No specific installation command needed for awk.
    
    # The filtering logic: keep rows where 'is_cell', 'full_length', and 'productive' are all "TRUE".
    # This command first identifies the column indices from the header,
    # then filters subsequent rows based on the values in those columns.
    awk -F',' '
    BEGIN { OFS = "," }
    NR == 1 {
        for (i=1; i<=NF; i++) {
            if ($i == "is_cell") is_cell_col = i;
            if ($i == "full_length") full_length_col = i;
            if ($i == "productive") productive_col = i;
        }
        print; # Print header
        next;
    }
    $is_cell_col == "TRUE" && $full_length_col == "TRUE" && $productive_col == "TRUE" {
        print;
    }
    ' all_contig_annotations.csv > filtered_vdj_data.csv
  9. 9

    For the 5' scRNA-seq and 5' VDJ-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.

    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi_tools
    
    # Define input and output file names (placeholders)
    INPUT_R1="R1_input.fastq.gz"
    OUTPUT_R1="R1_output_umi_extracted.fastq.gz"
    LOG_FILE="umi_extract.log"
    
    # Extract cell barcode (first 16bp) and UMI (next 10bp, i.e., 17-26bp) from R1
    # C: Cell barcode, X: UMI
    umi_tools extract \
        --bc-pattern=CCCCCCCCCCCCCCCCXXXXXXXXXX \
        --read1-in "${INPUT_R1}" \
        --read1-out "${OUTPUT_R1}" \
        --log "${LOG_FILE}"

Tools Used

Raw Source Text
data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.
Raw cell-reads were then loaded to R using the cellrangerRkit package. The pre-filtered VDJ information table (all_contig_annotations.csv, output by cellranger) were loaded at the same time. A cell was considered “valid” if it satisfied the following criteria: (1) Cell has enough RNA reads to pass the Cell Ranger threshold. (2) Cell has productive full-length BCR or TCR detected. Only valid cells were kept for downstream analysis in both the scRNA-seq and scTCR-seq/scBCR-seq datasets. The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio. Only cells with > 400 genes,UMI > 0,and 0.5% ~ 30% of their UMIs mappingto mitochondria genes were kept for downstream analysis.
All the VDJ data that don't have the following flag of cellranger were filtered: 1. the is_cell flag, 2. the full-length flag and 3. the productive flag.
For the 5' scRNA-seq and 5' VDJ-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.
Genome_build: hg38
Supplementary_files_format_and_content: cell-gene UMI table: UMI table with each row represent a cell and each column represent a gene.   BCR table: table containing BCR information, including V/D/J/C genes and CDR3  TCR table:  table containing TCR information, including V/D/J/C genes and CDR3
← Back to Analysis