GSE143437 Processing Pipeline
Publication
Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348
Processing Steps
Generate Jupyter Notebook-
1
Libraries were sequenced on Illumina NextSeq 500 using paired-end 26 + 8 + 58 (according to 10X Chromium Single Cell 3' protocol)
scRNA-seq v7.0.0$ Bash example
# Cell Ranger is typically downloaded as a tarball and extracted, then added to PATH. # For example: # wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-7.0.0.tar.gz # tar -xzf cellranger-7.0.0.tar.gz # export PATH=/path/to/cellranger-7.0.0:$PATH # Download 10x Genomics reference data (e.g., GRCh38 for human) # wget https://cf.10xgenomics.com/releases/cell-ranger/refdata-gex-GRCh38-2020-A.tar.gz # tar -xzf refdata-gex-GRCh38-2020-A.tar.gz cellranger count \ --id=my_sample_id \ --transcriptome=/path/to/refdata-gex-GRCh38-2020-A \ --fastqs=/path/to/fastq_directory \ --sample=my_sample_name \ --chemistry=SC3Pv2 -
2
Sequences were aligned to mm10 using Cell Ranger v2 pipeline (10X Genomics)
Cell Ranger v2$ Bash example
# Cell Ranger is proprietary software from 10x Genomics. # Download the specific version (e.g., v2) from the 10x Genomics website after registration. # Note: Direct download links for very old versions like v2 might be difficult to find publicly. # The general installation involves downloading a tarball and adding the executable to your PATH. # Example download and setup (replace with actual version and path, e.g., cellranger-2.x.x.tar.gz): # wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-7.1.0.tar.gz # Example for a recent version # tar -xzf cellranger-7.1.0.tar.gz # export PATH=/path/to/cellranger-7.1.0:$PATH # Example command for Cell Ranger v2 'count' pipeline (for single-cell gene expression alignment and quantification). # This assumes you have a Cell Ranger-formatted reference transcriptome for mm10. # If not, you would first run 'cellranger mkref' or download a pre-built one from 10x Genomics. # For mm10, you would typically download a pre-built reference from 10x Genomics for the specific Cell Ranger version. cellranger count \ --id=my_sample_output_v2 \ --transcriptome=/path/to/cellranger_refdata/mm10/ \ --fastqs=/path/to/your/fastq_directory/ \ --sample=my_sample_name \ --localcores=8 \ --localmem=64 -
3
Gene expression data was processed with Seurat v2.3.4 (Satija lab) using R v3.6.2 (2019-12-12)
Seurat v2.3.4$ Bash example
# Create a conda environment with R 3.6.2 and Seurat 2.3.4 # conda create -n seurat_v2_3_4 r-base=3.6.2 -c conda-forge -y # conda activate seurat_v2_3_4 # conda install -c bioconda r-seurat=2.3.4 -y # Example of an R script (e.g., 'run_seurat_analysis.R') that uses Seurat: # # R script content (save this to 'run_seurat_analysis.R'): # # library(Seurat) # # # Load data, create Seurat object, perform analysis # # # For example: # # # pbmc.data <- Read10X(data.dir = "path/to/filtered_gene_bc_matrices/hg19/") # # # pbmc <- CreateSeuratObject(counts = pbmc.data) # # # pbmc <- NormalizeData(pbmc) # # # ... further Seurat analysis ... # To execute an R script containing Seurat commands: Rscript run_seurat_analysis.R
-
4
Initial data filtering (minimum cells to express an included gene = 3, minium genes to be expressed in an included cell = 200)
$ Bash example
#!/bin/bash # This script performs initial data filtering using Scanpy. # It filters genes based on a minimum number of expressing cells # and filters cells based on a minimum number of expressed genes. # Usage: ./filter_data.sh <input_h5ad_file> <output_h5ad_file> # Check if input and output files are provided if [ -z "$1" ] || [ -z "$2" ]; then echo "Error: Missing input or output file arguments." echo "Usage: $0 <input_h5ad_file> <output_h5ad_file>" exit 1 fi INPUT_H5AD="$1" OUTPUT_H5AD="$2" # Comment out installation - use your preferred method (conda, pip, etc.) # conda install -c conda-forge scanpy # pip install scanpy # Run Scanpy filtering using a Python one-liner # This script loads an .h5ad file, applies filters, and saves the result. python -c " import scanpy as sc import sys input_file = sys.argv[1] output_file = sys.argv[2] print(f'Loading data from {input_file}...') adata = sc.read_h5ad(input_file) print(f'Initial data shape: {adata.shape} (cells, genes)') # Filter genes: keep genes expressed in at least 3 cells min_cells_for_gene = 3 print(f'Filtering genes: keeping genes expressed in at least {min_cells_for_gene} cells...') sc.pp.filter_genes(adata, min_cells=min_cells_for_gene) print(f'Shape after gene filtering: {adata.shape}') # Filter cells: keep cells expressing at least 200 genes min_genes_for_cell = 200 print(f'Filtering cells: keeping cells expressing at least {min_genes_for_cell} genes...') sc.pp.filter_cells(adata, min_genes=min_genes_for_cell) print(f'Shape after cell filtering: {adata.shape}') print(f'Saving filtered data to {output_file}...') adata.write(output_file) print('Filtering complete.') " "$INPUT_H5AD" "$OUTPUT_H5AD" -
5
Filtering by percent mitochondiral genes in included cells (maximum 0.2), number of UMIs in included cells (minimum 1000, maximum 60000), and number of mapped genes in included cells (maximum 8000)
$ Bash example
# conda install -c conda-forge scanpy python -c " import scanpy as sc import pandas as pd # Load the AnnData object. Replace 'input.h5ad' with your actual input file path. # This example assumes the AnnData object already has 'pct_counts_mt', 'n_counts', and 'n_genes' in its .obs dataframe. # If these metrics are not present, you would typically calculate them first using: # sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True) # where 'mt' is a boolean array indicating mitochondrial genes (e.g., adata.var_names.str.startswith('MT-')). adata = sc.read_h5ad('input.h5ad') # --- Filtering steps --- # Filter by percent mitochondrial genes: maximum 0.2 # Assuming 'pct_counts_mt' is a percentage (0-100). Adjust '20' to '0.2' if it's a fraction (0-1). adata = adata[adata.obs['pct_counts_mt'] <= 20, :].copy() # Filter by number of UMIs (n_counts): minimum 1000, maximum 60000 adata = adata[adata.obs['n_counts'] >= 1000, :].copy() adata = adata[adata.obs['n_counts'] <= 60000, :].copy() # Filter by number of mapped genes (n_genes): maximum 8000 adata = adata[adata.obs['n_genes'] <= 8000, :].copy() # Save the filtered AnnData object. Replace 'output.h5ad' with your desired output file path. adata.write('output.h5ad') " -
6
Log-normalization of expression data
$ Bash example
# Install DESeq2 via Bioconda (recommended for pipeline environments) # conda install -c bioconda bioconductor-deseq2=1.40.2 # Alternatively, install via R/BiocManager (ensure R and BiocManager are installed) # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')" # R -e "BiocManager::install('DESeq2', version = '3.17')" # Bioconductor 3.17 corresponds to DESeq2 1.40.x # Create a temporary R script for log-normalization using rlog transformation cat << 'EOF' > normalize_deseq2.R library(DESeq2) # --- Configuration --- # Input count matrix file (e.g., from featureCounts, HTSeq-count) # Expected format: CSV with gene IDs as the first column and sample names as headers. # Example: gene_id,sampleA,sampleB,sampleC # gene1,100,150,200 # gene2,50,75,100 input_counts_file <- "counts.csv" # Input sample metadata file # Expected format: CSV with sample IDs as the first column and metadata columns as headers. # A 'condition' column is recommended but not strictly required for blind rlog transformation. # Example: sample,condition,batch # sampleA,control,batch1 # sampleB,treated,batch1 input_coldata_file <- "coldata.csv" # Output log-normalized expression matrix file output_normalized_file <- "normalized_expression_rlog.csv" # --- Script Logic --- message(paste("Loading count data from:", input_counts_file)) # Read count data, assuming gene IDs are in the first column and sample names are headers. # check.names = FALSE prevents R from modifying column names (e.g., replacing hyphens with dots). count_data <- read.csv(input_counts_file, row.names = 1, check.names = FALSE) count_data <- as.matrix(count_data) message(paste("Loading sample metadata from:", input_coldata_file)) # Read colData, assuming sample IDs are in the first column. coldata <- read.csv(input_coldata_file, row.names = 1, check.names = FALSE) # Ensure sample names in count data columns match and are in the same order as colData rows. if (!all(colnames(count_data) %in% rownames(coldata))) { stop("Error: Sample names in count data columns do not match sample names in colData rows.") } coldata <- coldata[colnames(count_data), , drop = FALSE] # Create DESeqDataSet object. # DESeq2 expects integer counts, so round if necessary (though typically counts are integers). # A design formula is required. For blind rlog transformation (for general normalization/visualization), # a simple design like ~1 or ~condition (if a condition column exists) is sufficient. design_formula <- ~ 1 # Default to a simple design for blind transformation if ("condition" %in% colnames(coldata)) { message("Using 'condition' column from colData for DESeqDataSet design.") design_formula <- ~ condition # Ensure 'condition' is a factor for DESeq2 coldata$condition <- factor(coldata$condition) } else { message("No 'condition' column found in colData. Using a dummy design (~1) for blind transformation.") } dds <- DESeqDataSetFromMatrix(countData = round(count_data), # DESeq2 expects integer counts colData = coldata, design = design_formula) # Filter out genes with zero counts across all samples to avoid issues with log transformation. dds <- dds[rowSums(counts(dds)) > 0, ] message("Performing rlog transformation (blind=TRUE for normalization for downstream analysis)...") # The 'rlog' transformation stabilizes variance across the range of mean expression values. # 'blind=TRUE' is appropriate when the goal is general normalization for visualization or clustering, # not for differential expression where the design should be explicitly considered. rld <- rlog(dds, blind = TRUE) # Extract rlog transformed counts rlog_counts <- assay(rld) message(paste("Saving log-normalized expression data to:", output_normalized_file)) write.csv(rlog_counts, output_normalized_file) message("DESeq2 log-normalization complete.") EOF # --- Example Input Files (for demonstration purposes) --- # In a real pipeline, these files would be generated by upstream steps. echo "gene_id,sampleA,sampleB,sampleC" > counts.csv echo "gene1,100,150,200" >> counts.csv echo "gene2,50,75,100" >> counts.csv echo "gene3,10,15,20" >> counts.csv echo "gene4,0,5,10" >> counts.csv echo "sample,condition" > coldata.csv echo "sampleA,control" >> coldata.csv echo "sampleB,treated" >> coldata.csv echo "sampleC,control" >> coldata.csv # --- Run the R script --- Rscript normalize_deseq2.R # --- Clean up (optional) --- # rm normalize_deseq2.R # rm counts.csv coldata.csv -
7
Cell annotations in metadata derived from unsupervised clustering in CCA space (CCA coefficients obtained by MultiCCA from Seurat (Satija Lab).
$ Bash example
#!/bin/bash # This script demonstrates a Seurat workflow for unsupervised clustering # in a CCA-derived space, as described. It uses the 'FindIntegrationAnchors' # function with 'reduction = "cca"' which is the modern Seurat (v3+) approach # for CCA-based integration across multiple datasets, conceptually aligning # with the 'MultiCCA' mentioned in the description. # --- Installation (commented out) --- # Uncomment the following lines to install Seurat and its dependencies if not already installed. # R -e "install.packages('Seurat')" # R -e "install.packages('SeuratObject')" # R -e "install.packages('patchwork')" # For plotting # R -e "install.packages('ggplot2')" # For plotting # --- Create an R script for Seurat operations --- cat << 'EOF' > run_seurat_cca_clustering.R library(Seurat) library(patchwork) library(ggplot2) # --- 1. Load or create dummy Seurat objects --- # In a real scenario, you would load your single-cell RNA-seq data, # typically from 10x Genomics output, into multiple Seurat objects # (e.g., one object per sample/batch). # For this demonstration, we'll create two simple dummy Seurat objects. # Sample 1 data set.seed(123) n_cells_1 <- 1000 n_genes <- 2000 counts_1 <- matrix(sample(0:5, n_genes * n_cells_1, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.05, 0.03, 0.02)), nrow = n_genes, ncol = n_cells_1, dimnames = list(paste0("gene", 1:n_genes), paste0("cell1_", 1:n_cells_1))) seurat_obj_1 <- CreateSeuratObject(counts = counts_1, project = "sample1") seurat_obj_1[["percent.mt"]] <- PercentageFeatureSet(seurat_obj_1, pattern = "^MT-") # Sample 2 data set.seed(456) n_cells_2 <- 1200 counts_2 <- matrix(sample(0:5, n_genes * n_cells_2, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.05, 0.03, 0.02)), nrow = n_genes, ncol = n_cells_2, dimnames = list(paste0("gene", 1:n_genes), paste0("cell2_", 1:n_cells_2))) seurat_obj_2 <- CreateSeuratObject(counts = counts_2, project = "sample2") seurat_obj_2[["percent.mt"]] <- PercentageFeatureSet(seurat_obj_2, pattern = "^MT-") # --- 2. Pre-process individual objects --- # Normalize and find highly variable features for each object. message("Normalizing and finding variable features for individual samples...") seurat_obj_1 <- NormalizeData(seurat_obj_1) %>% FindVariableFeatures(selection.method = "vst", nfeatures = 2000) seurat_obj_2 <- NormalizeData(seurat_obj_2) %>% FindVariableFeatures(selection.method = "vst", nfeatures = 2000) # --- 3. Prepare for and perform integration using CCA principles --- # The description mentions "MultiCCA". In Seurat v3+, integration across # multiple datasets using CCA principles is primarily achieved via # `FindIntegrationAnchors` with `reduction = "cca"` followed by `IntegrateData`. # This generates a common, integrated CCA-derived space. seurat_list <- list(seurat_obj_1, seurat_obj_2) # Select features that are repeatedly variable across datasets for integration message("Selecting integration features...") features <- SelectIntegrationFeatures(object.list = seurat_list) # Find integration anchors using CCA. # This step computes CCA coefficients and identifies 'anchors' (cells in shared # biological states) in the CCA space. The 'reduction = "cca"' argument # explicitly uses Canonical Correlation Analysis for anchor finding. message("Finding integration anchors using CCA...") integration_anchors <- FindIntegrationAnchors(object.list = seurat_list, anchor.features = features, reduction = "cca") # Integrate the datasets based on the identified anchors. # This creates an 'integrated' assay where cells from different samples # are projected into a common, batch-corrected, CCA-derived space. message("Integrating data...") seurat_integrated <- IntegrateData(anchorset = integration_anchors) # --- 4. Perform unsupervised clustering in the integrated CCA space --- # Set the default assay to 'integrated' for downstream analysis. DefaultAssay(seurat_integrated) <- "integrated" # Scale the integrated data. message("Scaling integrated data...") seurat_integrated <- ScaleData(seurat_integrated, verbose = FALSE) # Perform PCA on the integrated data. # The dimensions here represent the principal components of the integrated # CCA space, which are then used for clustering. message("Running PCA on integrated data...") seurat_integrated <- RunPCA(seurat_integrated, npcs = 30, verbose = FALSE) # Find neighbors and clusters based on the PCA dimensions from the integrated space. message("Finding neighbors and clusters...") seurat_integrated <- FindNeighbors(seurat_integrated, dims = 1:30) seurat_integrated <- FindClusters(seurat_integrated, resolution = 0.5) # --- 5. Add cell annotations to metadata --- # The cluster IDs are automatically added to the metadata slot 'seurat_clusters'. # We can rename this column for clarity or use it directly. seurat_integrated$unsupervised_clusters_cca <- seurat_integrated$seurat_clusters message("Unsupervised cluster IDs added to metadata as 'unsupervised_clusters_cca'.") # --- 6. Save results and visualize (optional) --- saveRDS(seurat_integrated, file = "seurat_integrated_cca_clustered.rds") message("Seurat integrated object with CCA-derived clusters saved to seurat_integrated_cca_clustered.rds") # Visualization (optional): Generate UMAP plots to visualize the clustering. message("Generating UMAP plots...") seurat_integrated <- RunUMAP(seurat_integrated, dims = 1:30) p1 <- DimPlot(seurat_integrated, reduction = "umap", group.by = "orig.ident", pt.size = 0.5) + ggtitle("UMAP by Original Sample") p2 <- DimPlot(seurat_integrated, reduction = "umap", group.by = "unsupervised_clusters_cca", label = TRUE, pt.size = 0.5) + ggtitle("UMAP by Unsupervised CCA Clusters") + NoLegend() # Save plots ggsave("umap_by_sample.png", plot = p1, width = 8, height = 6) ggsave("umap_by_clusters.png", plot = p2, width = 8, height = 6) message("UMAP plots saved to umap_by_sample.png and umap_by_clusters.png") message("Seurat integration and clustering workflow complete.") EOF # --- Run the R script --- Rscript run_seurat_cca_clustering.R
Raw Source Text
Libraries were sequenced on Illumina NextSeq 500 using paired-end 26 + 8 + 58 (according to 10X Chromium Single Cell 3' protocol) Sequences were aligned to mm10 using Cell Ranger v2 pipeline (10X Genomics) Gene expression data was processed with Seurat v2.3.4 (Satija lab) using R v3.6.2 (2019-12-12) Initial data filtering (minimum cells to express an included gene = 3, minium genes to be expressed in an included cell = 200) Filtering by percent mitochondiral genes in included cells (maximum 0.2), number of UMIs in included cells (minimum 1000, maximum 60000), and number of mapped genes in included cells (maximum 8000) Log-normalization of expression data Cell annotations in metadata derived from unsupervised clustering in CCA space (CCA coefficients obtained by MultiCCA from Seurat (Satija Lab). Genome_build: mm10 Supplementary_files_format_and_content: .txt tabulated files of metadata information, normalized gene expression matrix, and raw gene expression matrix