GSE131847 Processing Pipeline

RNA-Seq, 5 steps

Publication

Heterogenous Populations of Tissue-Resident CD8+ T Cells Are Generated in Response to Infection and Malignancy.

Immunity (2020) — PMID 32433949

Dataset

GSE131847

Molecular determinants and heterogeneity of circulating and tissue-resident memory CD8+ T lymphocytes revealed by single-cell RNA sequencing

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Data were processed with Cell Ranger 2.1.0 using default parameters, with mm10 as the reference.

    Cell Ranger v2.1.0
    $ Bash example
    # Cell Ranger is a proprietary software from 10x Genomics.
    # Download and installation instructions are available on the 10x Genomics website:
    # https://www.10xgenomics.com/support/software/cell-ranger/downloads
    # Ensure cellranger 2.1.0 is in your PATH.
    
    # Download pre-built mm10 reference from 10x Genomics (or build your own using cellranger mkref):
    # https://www.10xgenomics.com/support/software/cell-ranger/downloads/latest
    REFERENCE_PATH="/path/to/cellranger_ref/mm10"
    
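    # Define input and output paths
    SAMPLE_ID="my_sample"
    FASTQS_DIR="/path/to/fastq_files"
    OUTPUT_DIR="/path/to/output_directory"
    
    # cellranger count writes its results to ./${SAMPLE_ID} in the current
    # working directory, so change into OUTPUT_DIR first.
    mkdir -p "${OUTPUT_DIR}"
    cd "${OUTPUT_DIR}"
    
    # Run cellranger count with default analysis parameters.
    # --sample must match the sample-name prefix of the FASTQ files; it is assumed here to equal SAMPLE_ID.
    # --localcores and --localmem (in GB) only cap resource usage; the values are placeholders, not taken from the source.
    # Replace SAMPLE_ID, FASTQS_DIR, OUTPUT_DIR, and REFERENCE_PATH with actual values.
    cellranger count \
        --id="${SAMPLE_ID}" \
        --transcriptome="${REFERENCE_PATH}" \
        --fastqs="${FASTQS_DIR}" \
        --sample="${SAMPLE_ID}" \
        --localcores=8 \
        --localmem=64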
  2.

    Raw cell reads were then loaded into R using the cellrangerRkit package.

    R, cellrangerRkit (versions not stated; inferred from description with models/gemini-2.5-flash)
    $ Bash example
    # Install R if not already installed (e.g., using conda)
    # conda create -n r_env r-base -y
    # conda activate r_env
    
    # Install the 'cellrangerRkit' package.
    # Note: 'cellrangerRkit' is not on CRAN or Bioconductor. It was distributed by 10x Genomics
    # and is no longer maintained, so it has to be installed from an archived source.
    # Example for a hypothetical GitHub installation:
    # R -e 'install.packages("devtools")'
    # R -e 'devtools::install_github("your_organization/cellrangerRkit")' # Placeholder; replace with the actual repo
    
    # Create an R script to load the Cell Ranger output
    cat << 'EOF' > load_cellranger_data.R
    # Load the cellrangerRkit package
    library(cellrangerRkit)
    
    # Read the Cell Ranger pipestance path from an environment variable.
    # This is the top-level directory created by 'cellranger count' (the one containing 'outs/').
    # With Cell Ranger 2.x the matrices sit under outs/filtered_gene_bc_matrices/<genome>/
    # as 'matrix.mtx', 'genes.tsv', and 'barcodes.tsv'.
    cellranger_output_dir <- Sys.getenv("CELLRANGER_OUTPUT_DIR")
    
    if (cellranger_output_dir == "") {
        stop("CELLRANGER_OUTPUT_DIR environment variable is not set. Please provide the path to the Cell Ranger output directory.")
    }
    
    message(paste("Loading raw cell-reads data from:", cellranger_output_dir))
    
    # Load the gene-barcode matrix with load_cellranger_matrix(),
    # which takes the pipestance path rather than the matrix directory itself.
    cell_reads_data <- load_cellranger_matrix(cellranger_output_dir)
    
    # Further processing or saving the loaded data can be added here
    # For example, converting to a Seurat object, or saving to an RData file
    # saveRDS(cell_reads_data, file = "loaded_cell_reads.rds")
    
    message("Raw cell-reads data loaded successfully into R.")
    # You can inspect the loaded data, e.g.,
    # print(cell_reads_data)
    EOF
    
    # Set the environment variable to the Cell Ranger pipestance directory
    # (the directory that contains 'outs/'); replace the placeholder with the actual path
    export CELLRANGER_OUTPUT_DIR="path/to/your/cellranger/output"
    
    # Execute the R script
    Rscript load_cellranger_data.R
  3.

    The scRNA-seq dataset was then further filtered based on the number of genes detected per cell and the ratio of mitochondrial gene counts to total counts.

    Seurat v5.0.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    cat << 'EOF' > filter_scrnaseq.R
    library(Seurat)
    
    # Load your Seurat object (replace 'input_seurat_object.rds' with your actual input path)
    # This script assumes you have a Seurat object saved as an RDS file.
    # If starting from raw counts (e.g., 10x Genomics output), you would first create the Seurat object:
    # pbmc.data <- Read10X(data.dir = "path/to/10x/data/")
    # seurat_obj <- CreateSeuratObject(counts = pbmc.data, project = "scRNAseq_analysis")
    seurat_obj <- readRDS("input_seurat_object.rds")
    
    # Calculate mitochondrial percentage
    # The pattern for mitochondrial genes depends on the species and annotation.
    # This dataset is mouse (mm10), where mitochondrial gene symbols start with "mt-"; for human it is typically "^MT-".
    seurat_obj[["percent.mt"]] <- PercentageFeatureSet(seurat_obj, pattern = "^mt-")
    
    # Filter cells on genes detected (nFeature_RNA), total UMIs (nCount_RNA), and mitochondrial percentage (percent.mt).
    # Thresholds follow the dataset description: > 400 genes, UMI > 0, and 0.5% to 20% mitochondrial UMIs.
    # Whether the 0.5% and 20% bounds are inclusive is not stated; strict inequalities are assumed here.
    # For other datasets, choose thresholds by inspecting violin or scatter plots of these metrics.
    seurat_obj_filtered <- subset(seurat_obj, subset = nFeature_RNA > 400 & nCount_RNA > 0 & percent.mt > 0.5 & percent.mt < 20)
    
    # Save the filtered Seurat object
    saveRDS(seurat_obj_filtered, "output_seurat_object_filtered.rds")
    
    # Optional: Print summary of filtering results
    cat("Original number of cells: ", ncol(seurat_obj), "\n")
    cat("Filtered number of cells: ", ncol(seurat_obj_filtered), "\n")
    EOF
    
    # Install Seurat (if not already installed)
    # R -q -e "install.packages('Seurat')"
    # R -q -e "install.packages('SeuratObject')"
    
    # Execute the R script
    Rscript filter_scrnaseq.R
  4.

    Only cells with > 400 genes, UMI > 0, and 0.5% to 20% of their UMIs mapping to mitochondrial genes were kept for downstream analysis.

    Scanpy v1.9.1 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Scanpy if not already installed
    # pip install scanpy
    
    # Create a Python script to perform cell filtering based on QC metrics
    cat << 'EOF' > filter_cells_qc.py
    import scanpy as sc
    import sys
    
    # Define input and output file paths
    input_h5ad = sys.argv[1]
    output_h5ad = sys.argv[2]
    
    # Load the AnnData object
    adata = sc.read_h5ad(input_h5ad)
    
    print(f"Initial number of cells: {adata.n_obs}")
    
    # Compute the QC metrics if they are not already stored in adata.obs.
    # Mouse (mm10) mitochondrial gene symbols are assumed to start with "mt-".
    if 'pct_counts_mt' not in adata.obs:
        adata.var['mt'] = adata.var_names.str.startswith('mt-')
        sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    
    # Apply filtering criteria:
    # 1. Number of genes detected (n_genes_by_counts) > 400
    # 2. Total UMIs (total_counts) > 0
    # 3. Percentage of mitochondrial UMIs (pct_counts_mt) strictly between 0.5% and 20%
    #    (the source does not say whether the bounds are inclusive; adjust if needed)
    filtered_adata = adata[
        (adata.obs['n_genes_by_counts'] > 400) &
        (adata.obs['total_counts'] > 0) &
        (adata.obs['pct_counts_mt'] > 0.5) &
        (adata.obs['pct_counts_mt'] < 20.0)
    ].copy() # Use .copy() to ensure a new AnnData object is created
    
    print(f"Number of cells after filtering: {filtered_adata.n_obs}")
    
    # Save the filtered AnnData object
    filtered_adata.write(output_h5ad)
    EOF
    
    # Execute the Python script with placeholder input and output files
    # Replace 'input_raw_cells.h5ad' with your actual input AnnData file
    # Replace 'output_filtered_cells.h5ad' with your desired output file name
    python filter_cells_qc.py input_raw_cells.h5ad output_filtered_cells.h5ad
  5.

    For the scRNA-seq data, the first 16 bp of R1 are the cell barcode and the next 10 bp (positions 17-26) are the UMI.

    $ Bash example
    # Install UMI-tools if not already installed
    # conda install -c bioconda umi-tools
    
    # Define input and output file names
    R1_IN="R1.fastq.gz"
    R2_IN="R2.fastq.gz"
    R1_OUT="R1_extracted.fastq.gz"
    R2_OUT="R2_extracted.fastq.gz"
    
    # Define the barcode and UMI pattern for R1
    # 16 x C for the 16 bp cell barcode, then 10 x U for the 10 bp UMI:
    # the cell barcode is the first 16 bp of R1, followed immediately by the 10 bp UMI.
    BC_PATTERN="CCCCCCCCCCCCCCCCUUUUUUUUUU"
    
    # Extract the cell barcode and UMI from R1 and append them to the read names of both R1 and R2.
    # umi_tools extract takes R1 via --stdin/--stdout and the pattern via --bc-pattern.
    umi_tools extract \
        --bc-pattern="${BC_PATTERN}" \
        --stdin "${R1_IN}" \
        --stdout "${R1_OUT}" \
        --read2-in "${R2_IN}" \
        --read2-out "${R2_OUT}"
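
The R1 layout described in step 5 can also be illustrated without any external tool. The following is a minimal sketch in plain Python, assuming R1 sequences are available as strings; the function name `split_r1` and the example read are hypothetical and are not part of the original pipeline.

```python
def split_r1(sequence, barcode_len=16, umi_len=10):
    """Split an R1 sequence into (cell barcode, UMI).

    Layout taken from the dataset description: bases 1-16 are the
    cell barcode and bases 17-26 are the UMI.
    """
    if len(sequence) < barcode_len + umi_len:
        raise ValueError("R1 is shorter than barcode + UMI length")
    barcode = sequence[:barcode_len]
    umi = sequence[barcode_len:barcode_len + umi_len]
    return barcode, umi


# Hypothetical R1 read: 16 bp barcode + 10 bp UMI + 2 trailing bases
r1 = "AAACCTGAGAAACCAT" + "GGTTCCAAGT" + "TT"
barcode, umi = split_r1(r1)
print(barcode, umi)
```

Bases beyond position 26 are ignored here. In practice, Cell Ranger parses this layout itself during `cellranger count`.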
    

Raw Source Text
data were processed by cellranger 2.1.0 with default parameter, using mm10 as reference.
Raw cell-reads were then loaded to R using the cellrangerRkit package. The scRNA-seq dataset was then further filtered based on gene numbers and mitochondria gene counts total counts ratio. Only cells with > 400 genes,UMI > 0,and 0.5% ~ 20% of their UMIs mappingto mitochondria genes were kept for downstream analysis.
For the scRNA-seq, the first 16 bp of R1 is the cell barcode and the next 10bp (17-26bp) is the UMI.
Genome_build: mm10
Supplementary_files_format_and_content: cell-gene UMI table: UMI table with each row represent a cell and each column represent a gene.
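
The deposited UMI table has one row per cell and one column per gene. As a rough sketch, the QC rule quoted above (> 400 genes, UMI > 0, 0.5% to 20% mitochondrial UMIs) could be applied to a table in that layout using only the Python standard library. Several details are assumptions rather than facts from the source: the comma-delimited format, the `cell` header for the first column, the `mt-` prefix for mitochondrial gene symbols, the strict treatment of the 0.5% and 20% bounds, and the toy data (which uses a lowered gene threshold so that a four-gene example can pass).

```python
import csv
import io

MITO_PREFIX = "mt-"  # assumed prefix of mouse mitochondrial gene symbols


def qc_metrics(row, gene_names):
    """Return (genes detected, total UMIs, percent mitochondrial UMIs) for one cell."""
    counts = {gene: int(row[gene]) for gene in gene_names}
    total = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.startswith(MITO_PREFIX))
    pct_mito = 100.0 * mito / total if total > 0 else 0.0
    return n_genes, total, pct_mito


def passes_qc(n_genes, total, pct_mito, min_genes=400, min_pct=0.5, max_pct=20.0):
    """Apply the stated thresholds; bounds are treated as strict (an assumption)."""
    return n_genes > min_genes and total > 0 and min_pct < pct_mito < max_pct


# Toy cell-by-gene table (hypothetical values): rows are cells, columns are genes
toy_table = """cell,Cd8a,Itgae,mt-Nd1,mt-Co1
cell_ok,8,10,1,1
cell_high_mito,2,2,3,3
cell_no_mito,5,5,0,0
cell_empty,0,0,0,0
"""

reader = csv.DictReader(io.StringIO(toy_table))
gene_names = reader.fieldnames[1:]  # every column after the cell identifier

kept = []
for row in reader:
    n_genes, total, pct_mito = qc_metrics(row, gene_names)
    # min_genes is lowered to 2 only because the toy table has four genes
    if passes_qc(n_genes, total, pct_mito, min_genes=2):
        kept.append(row["cell"])

print(kept)
```

With real data, `passes_qc` would be called with its default `min_genes=400`.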