GSE147081 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Heterogenous Populations of Tissue-Resident CD8<sup>+</sup> T Cells Are Generated in Response to Infection and Malignancy.

Immunity (2020) — PMID 32433949

Dataset

GSE147081

Single Cell RNA-Seq of CD8+ T cell subsets responding to tumor

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.

    Cell Ranger v2.1.0
    $ Bash example
    # Install Cell Ranger (example, adjust for specific system and version)
    # Note: Cell Ranger 2.1.0 is an older version. Ensure compatibility with your system and reference.
    # wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-2.1.0.tar.gz
    # tar -xzf cellranger-2.1.0.tar.gz
    # export PATH=/path/to/cellranger-2.1.0:$PATH
    
    # Download or build mm10 reference compatible with Cell Ranger 2.1.0 (example, adjust path)
    # 10x Genomics typically provides pre-built references. For 2.1.0, this might be refdata-cellranger-mm10-1.2.0 or similar.
    # Example placeholder for reference path:
    # REF_PATH="/path/to/cellranger_ref/mm10_compatible_with_2.1.0"
    # If building, ensure the GTF and FASTA files are appropriate for mm10 and the Cell Ranger version.
    # cellranger mkref --genome=mm10 --fasta=genome.fa --genes=genes.gtf
    
    # Define input and output paths
    SAMPLE_ID="my_cellranger_run"
    FASTQ_DIR="/path/to/your/fastq_files" # Directory containing FASTQ files (e.g., from mkfastq output)
    REFERENCE_PATH="/path/to/cellranger_ref/mm10_2.1.0_compatible" # Path to the Cell Ranger compatible mm10 reference transcriptome
    
    # Run cellranger count with default parameters
    # The description states "default parameter", so we use the standard 'count' command
    # with essential inputs. Other parameters like --expect-cells, --chemistry, etc.,
    # would use their default values for Cell Ranger 2.1.0.
    cellranger count \
        --id="${SAMPLE_ID}" \
        --transcriptome="${REFERENCE_PATH}" \
        --fastqs="${FASTQ_DIR}" \
        --sample="your_sample_name" # Replace with actual sample name(s) as found in FASTQ file names
    
  2. 2

    Raw cell-reads were then loaded to R using the cellrangerRkit package.

    R vNot specified in description (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install cellrangerRkit package (if not already installed)
    # R -e 'install.packages("cellrangerRkit")'
    
    # Create a placeholder R script to load Cell Ranger data
    cat <<EOF > load_cell_reads.R
    library(cellrangerRkit)
    
    # Define the path to the Cell Ranger output directory
    # This directory should contain files like matrix.mtx, barcodes.tsv, features.tsv
    # Replace 'path/to/cellranger_output_directory' with the actual path to your Cell Ranger output
    cellranger_output_dir <- "path/to/cellranger_output_directory"
    
    # Load the raw cell-reads matrix
    # The function load_cellranger_matrix reads the sparse matrix, barcodes, and features
    cell_reads_data <- load_cellranger_matrix(cellranger_output_dir)
    
    # You can now work with 'cell_reads_data' object, e.g., inspect dimensions, save it
    # For example, to save the loaded data for further R analysis:
    saveRDS(cell_reads_data, "loaded_cell_reads.rds")
    
    message(paste("Successfully loaded data from:", cellranger_output_dir))
    message(paste("Dimensions of the loaded matrix:", paste(dim(cell_reads_data@exprs), collapse = " x ")))
    EOF
    
    # Execute the R script
    Rscript load_cell_reads.R
  3. 3

    The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio.

    $ Bash example
    # Install Seurat if not already installed
    # R -e 'install.packages("Seurat")'
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("Seurat")'
    
    # Example R script for filtering scRNA-seq data using Seurat
    Rscript -e '
    library(Seurat)
    
    # Load your Seurat object (replace with your actual input file)
    # Assuming the input is a Seurat object saved as an .rds file
    seurat_object <- readRDS("input_seurat_object.rds")
    
    # Calculate mitochondrial percentage (if not already present)
    # This assumes mitochondrial genes are prefixed with "MT-" for human or "mt-" for mouse.
    # Adjust the pattern based on your organism.
    seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")
    
    # Filter cells based on:
    # 1. Number of genes detected (nFeature_RNA): typically > 200 and < 2500-5000
    # 2. Mitochondrial percentage (percent.mt): typically < 5-10%
    # These thresholds are common starting points and may need adjustment based on data quality.
    # The description implies filtering based on "gene numbers" (nFeature_RNA) and "mitochondria gene counts to total counts ratio" (percent.mt).
    filtered_seurat_object <- subset(seurat_object, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
    
    # Save the filtered Seurat object
    saveRDS(filtered_seurat_object, file = "filtered_seurat_object.rds")
    
    print(paste("Original number of cells:", ncol(seurat_object)))
    print(paste("Filtered number of cells:", ncol(filtered_seurat_object)))
    '
  4. 4

    Only cells with > 400 genes,UMI > 0,and 0.5% ~ 20% of their UMIs mappingto mitochondria genes were kept for downstream analysis.

    scanpy (Inferred with models/gemini-2.5-flash) v1.9.8 GitHub
    $ Bash example
    #!/bin/bash
    
    # Define input and output files
    INPUT_H5AD="raw_adata.h5ad" # Placeholder for input AnnData object (e.g., from cellranger aggr or similar)
    OUTPUT_H5AD="filtered_adata.h5ad" # Placeholder for output AnnData object
    MT_GENE_PREFIX="MT-" # Common prefix for mitochondrial genes (e.g., "MT-" for human GRCh38, "mt-" for mouse GRCm38)
    
    # Check if input file exists
    if [ ! -f "$INPUT_H5AD" ]; then
        echo "Error: Input file $INPUT_H5AD not found." >&2
        exit 1
    fi
    
    # Install scanpy if not already installed (commented out)
    # conda create -n scanpy_env python=3.9 -y
    # conda activate scanpy_env
    # pip install scanpy anndata pandas numpy
    
    # Execute Python script for filtering
    python -c "
    import scanpy as sc
    import pandas as pd
    import numpy as np
    import sys
    
    input_h5ad = \"$INPUT_H5AD\"
    output_h5ad = \"$OUTPUT_H5AD\"
    mt_gene_prefix = \"$MT_GENE_PREFIX\"
    
    try:
        adata = sc.read_h5ad(input_h5ad)
    except FileNotFoundError:
        print(f\"Error: Input file {input_h5ad} not found.\", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f\"Error loading AnnData object: {e}\", file=sys.stderr)
        sys.exit(1)
    
    print(f\"Initial number of cells: {adata.n_obs}\")
    
    # Identify mitochondrial genes based on prefix in adata.var_names
    # This step is crucial and depends on the reference genome annotation used during quantification.
    # For human GRCh38, 'MT-' is a common prefix. Adjust 'mt_gene_prefix' if needed for other species/annotations.
    adata.var['mt'] = adata.var_names.str.startswith(mt_gene_prefix)
    
    # Calculate QC metrics including mitochondrial percentage if not already present
    # 'qc_vars=['mt']' ensures 'pct_counts_mt' is calculated based on the 'mt' column in adata.var
    # 'n_genes_by_counts' and 'total_counts' (UMIs) are also calculated by default.
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    
    # Apply filtering criteria:
    # 1. Cells with > 400 genes detected (n_genes_by_counts)
    # 2. Cells with > 0 UMIs (total_counts)
    # 3. Cells with 0.5% ~ 20% of UMIs mapping to mitochondrial genes (pct_counts_mt)
    
    # Filter by number of genes
    adata = adata[adata.obs['n_genes_by_counts'] > 400, :].copy()
    
    # Filter by total UMIs (total_counts)
    adata = adata[adata.obs['total_counts'] > 0, :].copy()
    
    # Filter by mitochondrial percentage
    adata = adata[adata.obs['pct_counts_mt'] > 0.5, :].copy()
    adata = adata[adata.obs['pct_counts_mt'] < 20.0, :].copy()
    
    print(f\"Number of cells after filtering: {adata.n_obs}\")
    
    try:
        adata.write(output_h5ad)
    except Exception as e:
        print(f\"Error saving filtered AnnData object: {e}\", file=sys.stderr)
        sys.exit(1)
    
    print(f\"Filtered data saved to {output_h5ad}\")
    "
    
  5. 5

    For the scRNA-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.

    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi_tools
    
    # Example usage: Replace R1.fastq.gz and R2.fastq.gz with actual input files
    # and R1_extracted.fastq.gz, R2_extracted.fastq.gz with desired output files.
    # The cell barcode (C{16}) and UMI (N{10}) are extracted from R1.
    umi_tools extract \
        --bc-pattern=C{16}N{10} \
        --extract-method=regex \
        --stdin=R1.fastq.gz \
        --stdout=R1_extracted.fastq.gz \
        --read2-in=R2.fastq.gz \
        --read2-out=R2_extracted.fastq.gz \
        --log=umi_extract.log

Tools Used

Raw Source Text
data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.
Raw cell-reads were then loaded to R using the cellrangerRkit package. The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts tototal counts ratio. Only cells with > 400 genes,UMI > 0,and 0.5% ~ 20% of their UMIs mappingto mitochondria genes were kept for downstream analysis.
For the scRNA-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.
Genome_build: mm10
Supplementary_files_format_and_content: CellRanger output of filtered gene matrix with barcodes & features files
← Back to Analysis