GSE143437 Processing Pipeline

RNA-Seq code_examples 7 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

GSE143437

Single-cell transcriptomic atlas of the mouse regenerating muscle tissue

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Libraries were sequenced on Illumina NextSeq 500 using paired-end 26 + 8 + 58 (according to 10X Chromium Single Cell 3' protocol)

    $ Bash example
    # Cell Ranger is typically downloaded as a tarball and extracted, then added to PATH.
    # For example:
    # wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-7.0.0.tar.gz
    # tar -xzf cellranger-7.0.0.tar.gz
    # export PATH=/path/to/cellranger-7.0.0:$PATH
    
    # Download 10x Genomics reference data (e.g., GRCh38 for human)
    # wget https://cf.10xgenomics.com/releases/cell-ranger/refdata-gex-GRCh38-2020-A.tar.gz
    # tar -xzf refdata-gex-GRCh38-2020-A.tar.gz
    
    cellranger count \
        --id=my_sample_id \
        --transcriptome=/path/to/refdata-gex-GRCh38-2020-A \
        --fastqs=/path/to/fastq_directory \
        --sample=my_sample_name \
        --chemistry=SC3Pv2
  2. 2

    Sequences were aligned to mm10 using Cell Ranger v2 pipeline (10X Genomics)

    Cell Ranger v2
    $ Bash example
    # Cell Ranger is proprietary software from 10x Genomics.
    # Download the specific version (e.g., v2) from the 10x Genomics website after registration.
    # Note: Direct download links for very old versions like v2 might be difficult to find publicly.
    # The general installation involves downloading a tarball and adding the executable to your PATH.
    # Example download and setup (replace with actual version and path, e.g., cellranger-2.x.x.tar.gz):
    # wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-7.1.0.tar.gz # Example for a recent version
    # tar -xzf cellranger-7.1.0.tar.gz
    # export PATH=/path/to/cellranger-7.1.0:$PATH
    
    # Example command for Cell Ranger v2 'count' pipeline (for single-cell gene expression alignment and quantification).
    # This assumes you have a Cell Ranger-formatted reference transcriptome for mm10.
    # If not, you would first run 'cellranger mkref' or download a pre-built one from 10x Genomics.
    # For mm10, you would typically download a pre-built reference from 10x Genomics for the specific Cell Ranger version.
    
    cellranger count \
        --id=my_sample_output_v2 \
        --transcriptome=/path/to/cellranger_refdata/mm10/ \
        --fastqs=/path/to/your/fastq_directory/ \
        --sample=my_sample_name \
        --localcores=8 \
        --localmem=64
  3. 3

    Gene expression data was processed with Seurat v2.3.4 (Satija lab) using R v3.6.2 (2019-12-12)

    Seurat v2.3.4
    $ Bash example
    # Create a conda environment with R 3.6.2 and Seurat 2.3.4
    # conda create -n seurat_v2_3_4 r-base=3.6.2 -c conda-forge -y
    # conda activate seurat_v2_3_4
    # conda install -c bioconda r-seurat=2.3.4 -y
    
    # Example of an R script (e.g., 'run_seurat_analysis.R') that uses Seurat:
    # # R script content (save this to 'run_seurat_analysis.R'):
    # # library(Seurat)
    # # # Load data, create Seurat object, perform analysis
    # # # For example:
    # # # pbmc.data <- Read10X(data.dir = "path/to/filtered_gene_bc_matrices/hg19/")
    # # # pbmc <- CreateSeuratObject(counts = pbmc.data)
    # # # pbmc <- NormalizeData(pbmc)
    # # # ... further Seurat analysis ...
    
    # To execute an R script containing Seurat commands:
    Rscript run_seurat_analysis.R
  4. 4

    Initial data filtering (minimum cells to express an included gene = 3, minium genes to be expressed in an included cell = 200)

    $ Bash example
    #!/bin/bash
    
    # This script performs initial data filtering using Scanpy.
    # It filters genes based on a minimum number of expressing cells
    # and filters cells based on a minimum number of expressed genes.
    
    # Usage: ./filter_data.sh <input_h5ad_file> <output_h5ad_file>
    
    # Check if input and output files are provided
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "Error: Missing input or output file arguments."
        echo "Usage: $0 <input_h5ad_file> <output_h5ad_file>"
        exit 1
    fi
    
    INPUT_H5AD="$1"
    OUTPUT_H5AD="$2"
    
    # Comment out installation - use your preferred method (conda, pip, etc.)
    # conda install -c conda-forge scanpy
    # pip install scanpy
    
    # Run Scanpy filtering using a Python one-liner
    # This script loads an .h5ad file, applies filters, and saves the result.
    python -c "
    import scanpy as sc
    import sys
    
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    
    print(f'Loading data from {input_file}...')
    adata = sc.read_h5ad(input_file)
    print(f'Initial data shape: {adata.shape} (cells, genes)')
    
    # Filter genes: keep genes expressed in at least 3 cells
    min_cells_for_gene = 3
    print(f'Filtering genes: keeping genes expressed in at least {min_cells_for_gene} cells...')
    sc.pp.filter_genes(adata, min_cells=min_cells_for_gene)
    print(f'Shape after gene filtering: {adata.shape}')
    
    # Filter cells: keep cells expressing at least 200 genes
    min_genes_for_cell = 200
    print(f'Filtering cells: keeping cells expressing at least {min_genes_for_cell} genes...')
    sc.pp.filter_cells(adata, min_genes=min_genes_for_cell)
    print(f'Shape after cell filtering: {adata.shape}')
    
    print(f'Saving filtered data to {output_file}...')
    adata.write(output_file)
    print('Filtering complete.')
    " "$INPUT_H5AD" "$OUTPUT_H5AD"
    
  5. 5

    Filtering by percent mitochondiral genes in included cells (maximum 0.2), number of UMIs in included cells (minimum 1000, maximum 60000), and number of mapped genes in included cells (maximum 8000)

    Scanpy vNot specified GitHub
    $ Bash example
    # conda install -c conda-forge scanpy
    python -c "
    import scanpy as sc
    import pandas as pd
    
    # Load the AnnData object. Replace 'input.h5ad' with your actual input file path.
    # This example assumes the AnnData object already has 'pct_counts_mt', 'n_counts', and 'n_genes' in its .obs dataframe.
    # If these metrics are not present, you would typically calculate them first using:
    # sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    # where 'mt' is a boolean array indicating mitochondrial genes (e.g., adata.var_names.str.startswith('MT-')).
    adata = sc.read_h5ad('input.h5ad')
    
    # --- Filtering steps ---
    
    # Filter by percent mitochondrial genes: maximum 0.2
    # Assuming 'pct_counts_mt' is a percentage (0-100). Adjust '20' to '0.2' if it's a fraction (0-1).
    adata = adata[adata.obs['pct_counts_mt'] <= 20, :].copy()
    
    # Filter by number of UMIs (n_counts): minimum 1000, maximum 60000
    adata = adata[adata.obs['n_counts'] >= 1000, :].copy()
    adata = adata[adata.obs['n_counts'] <= 60000, :].copy()
    
    # Filter by number of mapped genes (n_genes): maximum 8000
    adata = adata[adata.obs['n_genes'] <= 8000, :].copy()
    
    # Save the filtered AnnData object. Replace 'output.h5ad' with your desired output file path.
    adata.write('output.h5ad')
    "
  6. 6

    Log-normalization of expression data

    $ Bash example
    # Install DESeq2 via Bioconda (recommended for pipeline environments)
    # conda install -c bioconda bioconductor-deseq2=1.40.2
    
    # Alternatively, install via R/BiocManager (ensure R and BiocManager are installed)
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
    # R -e "BiocManager::install('DESeq2', version = '3.17')" # Bioconductor 3.17 corresponds to DESeq2 1.40.x
    
    # Create a temporary R script for log-normalization using rlog transformation
    cat << 'EOF' > normalize_deseq2.R
    library(DESeq2)
    
    # --- Configuration ---
    # Input count matrix file (e.g., from featureCounts, HTSeq-count)
    # Expected format: CSV with gene IDs as the first column and sample names as headers.
    # Example: gene_id,sampleA,sampleB,sampleC
    #          gene1,100,150,200
    #          gene2,50,75,100
    input_counts_file <- "counts.csv"
    
    # Input sample metadata file
    # Expected format: CSV with sample IDs as the first column and metadata columns as headers.
    # A 'condition' column is recommended but not strictly required for blind rlog transformation.
    # Example: sample,condition,batch
    #          sampleA,control,batch1
    #          sampleB,treated,batch1
    input_coldata_file <- "coldata.csv"
    
    # Output log-normalized expression matrix file
    output_normalized_file <- "normalized_expression_rlog.csv"
    
    # --- Script Logic ---
    message(paste("Loading count data from:", input_counts_file))
    # Read count data, assuming gene IDs are in the first column and sample names are headers.
    # check.names = FALSE prevents R from modifying column names (e.g., replacing hyphens with dots).
    count_data <- read.csv(input_counts_file, row.names = 1, check.names = FALSE)
    count_data <- as.matrix(count_data)
    
    message(paste("Loading sample metadata from:", input_coldata_file))
    # Read colData, assuming sample IDs are in the first column.
    coldata <- read.csv(input_coldata_file, row.names = 1, check.names = FALSE)
    
    # Ensure sample names in count data columns match and are in the same order as colData rows.
    if (!all(colnames(count_data) %in% rownames(coldata))) {
      stop("Error: Sample names in count data columns do not match sample names in colData rows.")
    }
    coldata <- coldata[colnames(count_data), , drop = FALSE]
    
    # Create DESeqDataSet object.
    # DESeq2 expects integer counts, so round if necessary (though typically counts are integers).
    # A design formula is required. For blind rlog transformation (for general normalization/visualization),
    # a simple design like ~1 or ~condition (if a condition column exists) is sufficient.
    design_formula <- ~ 1 # Default to a simple design for blind transformation
    if ("condition" %in% colnames(coldata)) {
      message("Using 'condition' column from colData for DESeqDataSet design.")
      design_formula <- ~ condition
      # Ensure 'condition' is a factor for DESeq2
      coldata$condition <- factor(coldata$condition)
    } else {
      message("No 'condition' column found in colData. Using a dummy design (~1) for blind transformation.")
    }
    
    dds <- DESeqDataSetFromMatrix(countData = round(count_data), # DESeq2 expects integer counts
                                  colData = coldata,
                                  design = design_formula)
    
    # Filter out genes with zero counts across all samples to avoid issues with log transformation.
    dds <- dds[rowSums(counts(dds)) > 0, ]
    
    message("Performing rlog transformation (blind=TRUE for normalization for downstream analysis)...")
    # The 'rlog' transformation stabilizes variance across the range of mean expression values.
    # 'blind=TRUE' is appropriate when the goal is general normalization for visualization or clustering,
    # not for differential expression where the design should be explicitly considered.
    rld <- rlog(dds, blind = TRUE)
    
    # Extract rlog transformed counts
    rlog_counts <- assay(rld)
    
    message(paste("Saving log-normalized expression data to:", output_normalized_file))
    write.csv(rlog_counts, output_normalized_file)
    
    message("DESeq2 log-normalization complete.")
    EOF
    
    # --- Example Input Files (for demonstration purposes) ---
    # In a real pipeline, these files would be generated by upstream steps.
    echo "gene_id,sampleA,sampleB,sampleC" > counts.csv
    echo "gene1,100,150,200" >> counts.csv
    echo "gene2,50,75,100" >> counts.csv
    echo "gene3,10,15,20" >> counts.csv
    echo "gene4,0,5,10" >> counts.csv
    
    echo "sample,condition" > coldata.csv
    echo "sampleA,control" >> coldata.csv
    echo "sampleB,treated" >> coldata.csv
    echo "sampleC,control" >> coldata.csv
    
    # --- Run the R script ---
    Rscript normalize_deseq2.R
    
    # --- Clean up (optional) ---
    # rm normalize_deseq2.R
    # rm counts.csv coldata.csv
    
  7. 7

    Cell annotations in metadata derived from unsupervised clustering in CCA space (CCA coefficients obtained by MultiCCA from Seurat (Satija Lab).

    Seurat vv3 GitHub
    $ Bash example
    #!/bin/bash
    
    # This script demonstrates a Seurat workflow for unsupervised clustering
    # in a CCA-derived space, as described. It uses the 'FindIntegrationAnchors'
    # function with 'reduction = "cca"' which is the modern Seurat (v3+) approach
    # for CCA-based integration across multiple datasets, conceptually aligning
    # with the 'MultiCCA' mentioned in the description.
    
    # --- Installation (commented out) ---
    # Uncomment the following lines to install Seurat and its dependencies if not already installed.
    # R -e "install.packages('Seurat')"
    # R -e "install.packages('SeuratObject')"
    # R -e "install.packages('patchwork')" # For plotting
    # R -e "install.packages('ggplot2')" # For plotting
    
    # --- Create an R script for Seurat operations ---
    cat << 'EOF' > run_seurat_cca_clustering.R
    library(Seurat)
    library(patchwork)
    library(ggplot2)
    
    # --- 1. Load or create dummy Seurat objects ---
    # In a real scenario, you would load your single-cell RNA-seq data,
    # typically from 10x Genomics output, into multiple Seurat objects
    # (e.g., one object per sample/batch).
    # For this demonstration, we'll create two simple dummy Seurat objects.
    
    # Sample 1 data
    set.seed(123)
    n_cells_1 <- 1000
    n_genes <- 2000
    counts_1 <- matrix(sample(0:5, n_genes * n_cells_1, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.05, 0.03, 0.02)),
                       nrow = n_genes, ncol = n_cells_1,
                       dimnames = list(paste0("gene", 1:n_genes), paste0("cell1_", 1:n_cells_1)))
    seurat_obj_1 <- CreateSeuratObject(counts = counts_1, project = "sample1")
    seurat_obj_1[["percent.mt"]] <- PercentageFeatureSet(seurat_obj_1, pattern = "^MT-")
    
    # Sample 2 data
    set.seed(456)
    n_cells_2 <- 1200
    counts_2 <- matrix(sample(0:5, n_genes * n_cells_2, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.05, 0.03, 0.02)),
                       nrow = n_genes, ncol = n_cells_2,
                       dimnames = list(paste0("gene", 1:n_genes), paste0("cell2_", 1:n_cells_2)))
    seurat_obj_2 <- CreateSeuratObject(counts = counts_2, project = "sample2")
    seurat_obj_2[["percent.mt"]] <- PercentageFeatureSet(seurat_obj_2, pattern = "^MT-")
    
    # --- 2. Pre-process individual objects ---
    # Normalize and find highly variable features for each object.
    message("Normalizing and finding variable features for individual samples...")
    seurat_obj_1 <- NormalizeData(seurat_obj_1) %>%
                    FindVariableFeatures(selection.method = "vst", nfeatures = 2000)
    seurat_obj_2 <- NormalizeData(seurat_obj_2) %>%
                    FindVariableFeatures(selection.method = "vst", nfeatures = 2000)
    
    # --- 3. Prepare for and perform integration using CCA principles ---
    # The description mentions "MultiCCA". In Seurat v3+, integration across
    # multiple datasets using CCA principles is primarily achieved via
    # `FindIntegrationAnchors` with `reduction = "cca"` followed by `IntegrateData`.
    # This generates a common, integrated CCA-derived space.
    
    seurat_list <- list(seurat_obj_1, seurat_obj_2)
    
    # Select features that are repeatedly variable across datasets for integration
    message("Selecting integration features...")
    features <- SelectIntegrationFeatures(object.list = seurat_list)
    
    # Find integration anchors using CCA.
    # This step computes CCA coefficients and identifies 'anchors' (cells in shared
    # biological states) in the CCA space. The 'reduction = "cca"' argument
    # explicitly uses Canonical Correlation Analysis for anchor finding.
    message("Finding integration anchors using CCA...")
    integration_anchors <- FindIntegrationAnchors(object.list = seurat_list,
                                                  anchor.features = features,
                                                  reduction = "cca")
    
    # Integrate the datasets based on the identified anchors.
    # This creates an 'integrated' assay where cells from different samples
    # are projected into a common, batch-corrected, CCA-derived space.
    message("Integrating data...")
    seurat_integrated <- IntegrateData(anchorset = integration_anchors)
    
    # --- 4. Perform unsupervised clustering in the integrated CCA space ---
    # Set the default assay to 'integrated' for downstream analysis.
    DefaultAssay(seurat_integrated) <- "integrated"
    
    # Scale the integrated data.
    message("Scaling integrated data...")
    seurat_integrated <- ScaleData(seurat_integrated, verbose = FALSE)
    
    # Perform PCA on the integrated data.
    # The dimensions here represent the principal components of the integrated
    # CCA space, which are then used for clustering.
    message("Running PCA on integrated data...")
    seurat_integrated <- RunPCA(seurat_integrated, npcs = 30, verbose = FALSE)
    
    # Find neighbors and clusters based on the PCA dimensions from the integrated space.
    message("Finding neighbors and clusters...")
    seurat_integrated <- FindNeighbors(seurat_integrated, dims = 1:30)
    seurat_integrated <- FindClusters(seurat_integrated, resolution = 0.5)
    
    # --- 5. Add cell annotations to metadata ---
    # The cluster IDs are automatically added to the metadata slot 'seurat_clusters'.
    # We can rename this column for clarity or use it directly.
    seurat_integrated$unsupervised_clusters_cca <- seurat_integrated$seurat_clusters
    message("Unsupervised cluster IDs added to metadata as 'unsupervised_clusters_cca'.")
    
    # --- 6. Save results and visualize (optional) ---
    saveRDS(seurat_integrated, file = "seurat_integrated_cca_clustered.rds")
    message("Seurat integrated object with CCA-derived clusters saved to seurat_integrated_cca_clustered.rds")
    
    # Visualization (optional): Generate UMAP plots to visualize the clustering.
    message("Generating UMAP plots...")
    seurat_integrated <- RunUMAP(seurat_integrated, dims = 1:30)
    p1 <- DimPlot(seurat_integrated, reduction = "umap", group.by = "orig.ident", pt.size = 0.5) +
          ggtitle("UMAP by Original Sample")
    p2 <- DimPlot(seurat_integrated, reduction = "umap", group.by = "unsupervised_clusters_cca", label = TRUE, pt.size = 0.5) +
          ggtitle("UMAP by Unsupervised CCA Clusters") + NoLegend()
    
    # Save plots
    ggsave("umap_by_sample.png", plot = p1, width = 8, height = 6)
    ggsave("umap_by_clusters.png", plot = p2, width = 8, height = 6)
    message("UMAP plots saved to umap_by_sample.png and umap_by_clusters.png")
    
    message("Seurat integration and clustering workflow complete.")
    EOF
    
    # --- Run the R script ---
    Rscript run_seurat_cca_clustering.R
    

Tools Used

Raw Source Text
Libraries were sequenced on Illumina NextSeq 500 using paired-end 26 + 8 + 58 (according to 10X Chromium Single Cell 3' protocol)
Sequences were aligned to mm10 using Cell Ranger v2 pipeline (10X Genomics)
Gene expression data was processed with Seurat v2.3.4 (Satija lab) using R  v3.6.2 (2019-12-12)
Initial data filtering (minimum cells to express an included gene = 3, minium genes to be expressed in an included cell = 200)
Filtering by percent mitochondiral genes in included cells (maximum 0.2), number of UMIs in included cells (minimum 1000, maximum 60000), and number of mapped genes in included cells (maximum 8000)
Log-normalization of expression data
Cell annotations in metadata derived from unsupervised clustering in CCA space (CCA coefficients obtained by MultiCCA from Seurat (Satija Lab).
Genome_build: mm10
Supplementary_files_format_and_content: .txt tabulated files of metadata information, normalized gene expression matrix, and raw gene expression matrix
← Back to Analysis