GSE203086 Processing Pipeline

RNA-Seq code_examples 10 steps

Publication

The long noncoding RNA Malat1 regulates CD8+ T cell differentiation by mediating epigenetic repression.

The Journal of experimental medicine (2022) — PMID 35593887

Dataset

GSE203086

The long noncoding RNA Malat1 regulates CD8+ T cell differentiation by mediating epigenetic repression (scRNA-Seq)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Reads from scRNA-seq were aligned to mm10 using the 10x Genomics Cell Ranger software (v 2.1.0).

    Cell Ranger v2.1.0
    $ Bash example
    # Install Cell Ranger (example - adjust path as needed)
    # wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-2.1.0.tar.gz
    # tar -xzf cellranger-2.1.0.tar.gz
    # export PATH=/path/to/cellranger-2.1.0:$PATH
    
    # Download or build mm10 reference transcriptome (example)
    # For Cell Ranger, you typically download a pre-built reference from 10x Genomics or build your own.
    # Example: wget https://cf.10xgenomics.com/releases/cell-ranger/refdata-cellranger-mm10-1.2.0.tar.gz
    # tar -xzf refdata-cellranger-mm10-1.2.0.tar.gz
    # REF_GENOME_DIR="refdata-cellranger-mm10-1.2.0"
    
    # Define variables
    FASTQ_DIR="path/to/your/fastq_files" # Directory containing FASTQ files
    REF_GENOME_DIR="path/to/your/mm10_cellranger_reference" # Path to the Cell Ranger mm10 reference transcriptome
    OUTPUT_ID="my_scrnaseq_alignment" # A unique ID for this run
    SAMPLE_NAME="my_sample" # Sample name (optional, but good practice)
    EXPECTED_CELLS=3000 # Estimated number of cells (adjust as needed)
    
    # Run Cell Ranger count
    cellranger count \
        --id="${OUTPUT_ID}" \
        --transcriptome="${REF_GENOME_DIR}" \
        --fastqs="${FASTQ_DIR}" \
        --sample="${SAMPLE_NAME}" \
        --expect-cells="${EXPECTED_CELLS}"
  2. 2

    Reads were collapsed into unique molecular identifier counts.

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install umi_tools if not already available
    # conda install -c bioconda umi_tools
    
    # Define input and output file paths
    INPUT_BAM="aligned_reads_with_umis.bam" # Input BAM file, typically coordinate-sorted, with UMIs in a tag (e.g., RX)
    OUTPUT_DEDUP_BAM="deduplicated_reads.bam"
    UMI_TAG="RX" # The tag where UMIs are stored (e.g., after umi_tools extract)
    
    # Collapse reads into unique molecular identifier counts (deduplication)
    umi_tools dedup \
      --input "${INPUT_BAM}" \
      --output "${OUTPUT_DEDUP_BAM}" \
      --umi-tag "${UMI_TAG}" \
      --method "unique" \
      --log "umi_tools_dedup.log"
    
    # Note: This command assumes UMIs have already been extracted from read names
    # and placed into a BAM tag (e.g., 'RX') in a prior step (e.g., using 'umi_tools extract').
    # For example, a preceding step might look like:
    # umi_tools extract \
    #   --input "aligned_reads.bam" \
    #   --output "aligned_reads_with_umis.bam" \
    #   --extract-method "regex" \
    #   --regex "^(?P<umi_1>.{6})(?P<discard_1>.*)" \
    #   --umi-tag "RX"
  3. 3

    All samples had >2000 cells detected with >1000 genes per cells with >70% of the coding genome covered.

    CellRanger (Inferred with models/gemini-2.5-flash) v7.1.0
    $ Bash example
    # This script demonstrates how CellRanger might be used to process single-cell RNA-seq data,
    # generating metrics that are then checked against the described quality control thresholds.
    # The description states that "All samples had >2000 cells detected with >1000 genes per cells
    # with >70% of the coding genome covered." This implies a successful run of a tool like CellRanger
    # followed by a quality control filtering step.
    
    # --- Installation (commented out) ---
    # Cell Ranger software is proprietary and typically downloaded from 10x Genomics.
    # Ensure Cell Ranger is installed and in your PATH.
    # Example download (requires registration):
    # # wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.1.0.tar.gz
    # # tar -xzf cellranger-7.1.0.tar.gz
    # # export PATH=/path/to/cellranger-7.1.0:$PATH
    
    # --- Reference Data (commented out) ---
    # Download a 10x Genomics reference transcriptome (e.g., GRCh38 for human)
    # # wget https://cf.10xgenomics.com/releases/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
    # # tar -xzf refdata-gex-GRCh38-2020-A.tar.gz
    # # REFERENCE_PATH="./refdata-gex-GRCh38-2020-A"
    
    # --- Configuration ---
    # Replace with actual paths and sample information
    FASTQ_BASE_DIR="/path/to/your/fastq_files" # Directory containing subdirectories for each sample's fastqs
    REFERENCE_PATH="/path/to/refdata-gex-GRCh38-2020-A" # Path to the Cell Ranger transcriptome reference
    OUTPUT_DIR="./cellranger_output"
    NUM_CORES=16
    MEMORY_GB=64
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # --- Main CellRanger Execution Loop ---
    # This loop processes each sample. The metrics generated by 'cellranger count'
    # are then implicitly checked against the criteria mentioned in the description.
    # The description states that *all samples met* these criteria, implying successful processing
    # and subsequent validation of the output metrics.
    
    # Example: Iterate through sample directories (assuming fastqs are organized by sample ID)
    for SAMPLE_FASTQ_DIR in "${FASTQ_BASE_DIR}"/*/; do
        SAMPLE_ID=$(basename "${SAMPLE_FASTQ_DIR}")
        echo "Processing sample: ${SAMPLE_ID}"
    
        cellranger count \
            --id="${SAMPLE_ID}" \
            --transcriptome="${REFERENCE_PATH}" \
            --fastqs="${SAMPLE_FASTQ_DIR}" \
            --sample="${SAMPLE_ID}" \
            --localcores="${NUM_CORES}" \
            --localmem="${MEMORY_GB}" \
            --output-directory="${OUTPUT_DIR}/${SAMPLE_ID}"
    
        # After this command runs, Cell Ranger generates a 'metrics_summary.csv' file
        # within the output directory (e.g., ${OUTPUT_DIR}/${SAMPLE_ID}/outs/metrics_summary.csv).
        # The description implies that the values in this summary file for each sample
        # satisfied the following conditions:
        # - "Number of Cells Detected" > 2000
        # - "Median Genes per Cell" > 1000
        # - "Fraction Reads Mapped Confidently to Transcriptome" (as a proxy for coding genome covered) > 0.70
        # A separate script (e.g., in R or Python) would typically parse this CSV
        # and perform these checks.
    done
  4. 4

    Genes that were not expressed in at least 5% of all cells were excluded.

    Seurat (Inferred with models/gemini-2.5-flash) v4.x (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Seurat if not already installed (uncomment to run)
    # R -e "install.packages('Seurat')"
    # R -e "install.packages('SeuratObject')" # SeuratObject is a dependency
    
    # Create an R script for filtering genes based on expression percentage across cells
    cat << 'EOF' > filter_genes_by_expression.R
    library(Seurat)
    
    # Load the Seurat object
    # Replace 'input_seurat_object.rds' with your actual input file path
    # This placeholder assumes a Seurat object saved as an RDS file.
    # If your data is in a different format (e.g., AnnData for Scanpy, or raw counts), 
    # you would need to adapt the loading and object creation steps.
    seurat_obj <- readRDS("input_seurat_object.rds")
    
    # Define the minimum percentage of cells a gene must be expressed in
    # Parameter inferred from description: "at least 5% of all cells"
    min_percent_cells <- 5
    
    # Calculate the percentage of cells expressing each gene
    # This assumes the primary assay is 'RNA' and the data is in the 'counts' slot.
    # A gene is considered 'expressed' if its count is greater than 0 in a cell.
    percent_cells_expressed <- rowSums(seurat_obj@assays$RNA@counts > 0) / ncol(seurat_obj) * 100
    
    # Identify genes to keep (those expressed in at least min_percent_cells of cells)
    genes_to_keep <- names(percent_cells_expressed[percent_cells_expressed >= min_percent_cells])
    
    # Subset the Seurat object to retain only the identified genes
    seurat_obj_filtered <- subset(seurat_obj, features = genes_to_keep)
    
    # Save the filtered Seurat object
    # Replace 'filtered_seurat_object.rds' with your desired output file path
    saveRDS(seurat_obj_filtered, "filtered_seurat_object.rds")
    
    message(paste("Original number of genes:", nrow(seurat_obj)))
    message(paste("Number of genes after filtering:", nrow(seurat_obj_filtered)))
    EOF
    
    # Execute the R script to perform the filtering
    Rscript filter_genes_by_expression.R
  5. 5

    As previously described (Boland et al., 2020), replicates of single cell libraries were normalized removing batch effects using RUVnormalize (v1.15.0).

    RUVnormalize v1.15.0 GitHub
    $ Bash example
    # Install R and RUVnormalize if not already installed
    # sudo apt-get update
    # sudo apt-get install -y r-base
    # R -e "install.packages('BiocManager', repos='http://cran.us.r-project.org')"
    # R -e "BiocManager::install('RUVnormalize')"
    
    # Create an R script to run RUVnormalize
    cat << 'EOF' > run_ruvnormalize.R
    # Load RUVnormalize library
    library(RUVnormalize)
    
    # Placeholder for input data (e.g., a count matrix)
    # Replace 'input_counts.tsv' with your actual input file path
    # Example: counts_data <- read.table("input_counts.tsv", header=TRUE, row.names=1)
    # For demonstration, let's create some dummy data
    set.seed(123)
    num_genes <- 100
    num_samples <- 10
    counts_data <- matrix(rnbinom(num_genes * num_samples, size = 10, mu = 100),
                          nrow = num_genes, ncol = num_samples)
    colnames(counts_data) <- paste0("sample", 1:num_samples)
    rownames(counts_data) <- paste0("gene", 1:num_genes)
    
    # Define control genes (ctl)
    # These are genes assumed to be unaffected by the biological variation of interest.
    # Replace with actual control gene indices or names from your data.
    # For demonstration, let's assume the first 10 genes are controls.
    ctl_genes_indices <- 1:10
    
    # Define 'k', the number of unwanted factors to estimate.
    # This parameter often needs to be determined by the user (e.g., using diagnostics).
    k_value <- 1 # Example value, adjust as needed based on your data
    
    # Perform RUV normalization using RUVg
    # RUVg is suitable when you have a set of negative control genes.
    # If you have replicate information, RUVs might be more appropriate.
    # The description "removing batch effects" is general.
    normalized_data_obj <- RUVg(x = counts_data, ctl = ctl_genes_indices, k = k_value)
    normalized_matrix <- normalized_data_obj$normalizedCounts
    
    # Save normalized data to an output file
    write.table(normalized_matrix, "normalized_counts.tsv", sep="\t", quote=FALSE, col.names=NA)
    
    EOF
    
    # Execute the R script
    Rscript run_ruvnormalize.R
  6. 6

    The raw UMI matrix was scaled and input to the naiveRandRUV function with parameters coeff=1e-3 and k=10.

    R (Inferred with models/gemini-2.5-flash) vLatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    #!/bin/bash
    
    # Define the R script content
    R_SCRIPT_CONTENT='
    #!/usr/bin/env Rscript
    
    # This script performs normalization using a function named naiveRandRUV.
    # The function naiveRandRUV is assumed to be either custom-defined or sourced
    # from a specific R package/script not explicitly stated but based on RUV principles.
    # For demonstration, a placeholder function is provided.
    
    # Load necessary libraries
    # library(RUVSeq) # Uncomment if RUVSeq functions are directly used within naiveRandRUV
    
    # --- Placeholder for naiveRandRUV function definition ---
    # If naiveRandRUV is a custom function, it would be defined here or sourced.
    # This example provides a conceptual implementation based on RUV principles.
    # A real implementation would involve more complex statistical modeling.
    naiveRandRUV <- function(data_matrix, coeff, k) {
      message(paste("Running custom naiveRandRUV with coeff =", coeff, "and k =", k))
    
      if (!is.matrix(data_matrix) && !is.data.frame(data_matrix)) {
        stop("Input data_matrix must be a matrix or data frame.")
      }
      if (!is.numeric(data_matrix)) {
        stop("Input data_matrix must contain numeric values.")
      }
    
      # Example: Simple scaling (not actual RUV, just for demonstration)
      # In a real RUV implementation, this would involve more sophisticated
      # statistical modeling to estimate and remove unwanted variation.
      scaled_data <- data_matrix * coeff
    
      # Example: Generating dummy unwanted factors (actual RUV would derive these)
      if (nrow(data_matrix) > 0 && k > 0) {
        unwanted_factors <- matrix(rnorm(nrow(data_matrix) * k), ncol = k)
        colnames(unwanted_factors) <- paste0("W", 1:k)
      } else {
        unwanted_factors <- matrix(0, nrow = nrow(data_matrix), ncol = k)
      }
    
      return(list(normalized_data = scaled_data, unwanted_factors = unwanted_factors))
    }
    # ----------------------------------------------------------
    
    # Define input and output file paths
    input_umi_matrix_file <- "raw_umi_matrix.tsv"
    output_normalized_matrix_file <- "normalized_umi_matrix.tsv"
    output_unwanted_factors_file <- "ruv_unwanted_factors.tsv"
    
    # Check if input file exists
    if (!file.exists(input_umi_matrix_file)) {
      stop(paste("Input UMI matrix file not found:", input_umi_matrix_file))
    }
    
    # Load the raw UMI matrix
    umi_matrix <- as.matrix(read.delim(input_umi_matrix_file, row.names = 1, sep = "\t", check.names = FALSE))
    
    # Parameters for naiveRandRUV
    param_coeff <- 1e-3
    param_k <- 10
    
    # Call the naiveRandRUV function
    message(paste("Calling naiveRandRUV with coeff =", param_coeff, "and k =", param_k))
    ruv_results <- naiveRandRUV(data_matrix = umi_matrix, coeff = param_coeff, k = param_k)
    
    # Save the output normalized matrix
    message(paste("Saving normalized UMI matrix to:", output_normalized_matrix_file))
    write.table(ruv_results$normalized_data, output_normalized_matrix_file, sep = "\t", quote = FALSE, col.names = NA)
    
    # Save the estimated unwanted factors
    message(paste("Saving RUV unwanted factors to:", output_unwanted_factors_file))
    write.table(ruv_results$unwanted_factors, output_unwanted_factors_file, sep = "\t", quote = FALSE, col.names = NA)
    
    message("Normalization complete.")
    '
    
    # Create a dummy raw UMI matrix for demonstration purposes
    # In a real pipeline, this file would be generated by a previous step.
    echo -e "Gene\tSample1\tSample2\tSample3" > raw_umi_matrix.tsv
    echo -e "GeneA\t100\t120\t90" >> raw_umi_matrix.tsv
    echo -e "GeneB\t50\t60\t55" >> raw_umi_matrix.tsv
    echo -e "GeneC\t200\t210\t190" >> raw_umi_matrix.tsv
    echo -e "GeneD\t10\t15\t12" >> raw_umi_matrix.tsv
    echo -e "GeneE\t70\t80\t75" >> raw_umi_matrix.tsv
    
    # Write the R script content to a file
    echo "$R_SCRIPT_CONTENT" > run_naiveRandRUV.R
    
    # Ensure R is installed and Rscript is in PATH
    # conda install -c conda-forge r-base r-matrix r-data.table # Example R installation
    # Bioconductor packages like RUVSeq might need specific installation:
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("RUVSeq")'
    
    # Execute the R script
    Rscript run_naiveRandRUV.R
    
    # Clean up the R script file (optional)
    # rm run_naiveRandRUV.R
  7. 7

    Fifty negative control genes were taken from a list of housekeeping genes (Eisenberg and Levanon, 2013) with least variability in all datasets.

    Python (Inferred with models/gemini-2.5-flash) vNot applicable GitHub
    $ Bash example
    # This script conceptually demonstrates the selection of negative control genes
    # based on variability from a list of housekeeping genes.
    # Actual implementation would depend on the specific data formats and variability metric.
    
    # Input files assumed:
    # 1. housekeeping_genes.txt: A file listing known housekeeping genes (e.g., one gene ID per line).
    #    Source: Eisenberg and Levanon, 2013 (Trends Genet. 2013 Oct;29(10):569-74)
    # 2. expression_data.tsv: A tab-separated file with gene expression data across multiple datasets/samples.
    #    Format: GeneID \t Dataset1_expr \t Dataset2_expr \t ...
    
    # Output: A file containing the 50 selected negative control genes.
    
    # Define input and output file names (placeholders)
    HOUSEKEEPING_GENES_FILE="housekeeping_genes.txt"
    EXPRESSION_DATA_FILE="expression_data.tsv"
    OUTPUT_FILE="negative_control_genes_50.txt"
    NUM_CONTROL_GENES=50
    
    # --- Conceptual Python script (select_control_genes.py) ---
    # #!/usr/bin/env python3
    # import pandas as pd
    # import numpy as np
    # import sys
    #
    # def calculate_variability(series):
    #     """Calculate a variability metric, e.g., coefficient of variation (CV)."""
    #     if series.mean() == 0:
    #         return np.inf # Avoid division by zero for genes with zero expression
    #     return series.std() / series.mean() if series.mean() != 0 else np.inf
    #
    # def select_negative_control_genes(housekeeping_genes_path, expression_data_path, num_genes_to_select):
    #     # Load housekeeping genes
    #     try:
    #         with open(housekeeping_genes_path, 'r') as f:
    #             housekeeping_genes = {line.strip() for line in f if line.strip()}
    #     except FileNotFoundError:
    #         print(f"Error: Housekeeping genes file not found at {housekeeping_genes_path}", file=sys.stderr)
    #         return []
    #
    #     # Load expression data
    #     try:
    #         expression_df = pd.read_csv(expression_data_path, sep='\t', index_col='GeneID')
    #     except FileNotFoundError:
    #         print(f"Error: Expression data file not found at {expression_data_path}", file=sys.stderr)
    #         return []
    #     except Exception as e:
    #         print(f"Error loading expression data: {e}", file=sys.stderr)
    #         return []
    #
    #     # Filter expression data to include only housekeeping genes
    #     hk_expression_df = expression_df[expression_df.index.isin(housekeeping_genes)]
    #
    #     if hk_expression_df.empty:
    #         print("No housekeeping genes found in the expression data or no overlap.", file=sys.stderr)
    #         return []
    #
    #     # Calculate variability for each housekeeping gene across all datasets
    #     variability_scores = hk_expression_df.apply(calculate_variability, axis=1)
    #
    #     # Sort genes by variability (ascending) and select the top N
    #     selected_genes = variability_scores.sort_values().head(num_genes_to_select).index.tolist()
    #
    #     return selected_genes
    #
    # if __name__ == "__main__":
    #     if len(sys.argv) != 5: # Expecting 4 arguments: script_name, hk_file, expr_file, out_file, num_genes
    #         print("Usage: python3 select_control_genes.py <housekeeping_genes_file> <expression_data_file> <output_file> <num_genes_to_select>", file=sys.stderr)
    #         sys.exit(1)
    #
    #     hk_file = sys.argv[1]
    #     expr_file = sys.argv[2]
    #     out_file = sys.argv[3]
    #     num_genes = int(sys.argv[4]) # Convert to integer
    #
    #     selected_genes = select_negative_control_genes(hk_file, expr_file, num_genes)
    #
    #     if selected_genes:
    #         with open(out_file, 'w') as f:
    #             for gene in selected_genes:
    #                 f.write(f"{gene}\n")
    #         print(f"Successfully selected {len(selected_genes)} negative control genes and saved to {out_file}")
    #     else:
    #         print("No genes were selected.", file=sys.stderr)
    #
    # --- End of conceptual Python script ---
    
    # To run this, you would first create the 'select_control_genes.py' file
    # with the content above, and ensure pandas and numpy are installed.
    # conda install pandas numpy
    # pip install pandas numpy
    
    # Execute the script
    python3 select_control_genes.py "${HOUSEKEEPING_GENES_FILE}" "${EXPRESSION_DATA_FILE}" "${OUTPUT_FILE}" "${NUM_CONTROL_GENES}"
  8. 8

    Seurat (v3.0.1) functions were used to calculate top variable genes, principal components analysis (PCA), and tSNE with FindVariableGenes, RunPCA, and RunTSNE.

    Seurat v3.0.1 GitHub
    $ Bash example
    # Install Seurat if not already installed
    # R -q -e "if (!requireNamespace('Seurat', quietly = TRUE)) install.packages('Seurat')"
    # R -q -e "if (!requireNamespace('devtools', quietly = TRUE)) install.packages('devtools')"
    # R -q -e "devtools::install_github('satijalab/seurat', ref = 'release/3.0.1')" # For specific version from GitHub
    
    # Create an R script to perform Seurat analysis
    cat << 'EOF' > run_seurat_analysis.R
    library(Seurat)
    
    # Define input and output file paths
    input_seurat_object_path <- "seurat_object.rds" # Assumed input from a previous step
    output_seurat_object_path <- "processed_seurat_object.rds"
    
    # Load the Seurat object
    if (!file.exists(input_seurat_object_path)) {
        stop(paste("Input Seurat object not found:", input_seurat_object_path))
    }
    seurat_object <- readRDS(input_seurat_object_path)
    
    # 1. Calculate top variable genes (using FindVariableFeatures for Seurat v3+)
    # selection.method: Method to use for feature selection (default "vst")
    # nfeatures: Number of features to return (default 2000)
    message("Finding variable features...")
    seurat_object <- FindVariableFeatures(seurat_object, selection.method = "vst", nfeatures = 2000)
    
    # 2. Scale data (often done before PCA, using all genes or variable features)
    # This step is crucial for PCA. Default features is VariableFeatures(object).
    message("Scaling data...")
    all.genes <- rownames(seurat_object)
    seurat_object <- ScaleData(seurat_object, features = all.genes)
    
    # 3. Perform Principal Components Analysis (PCA)
    # features: Features to use as input for PCA (default VariableFeatures(object))
    # npcs: Number of PCs to compute (default 50)
    message("Running PCA...")
    seurat_object <- RunPCA(seurat_object, features = VariableFeatures(object = seurat_object), npcs = 30)
    
    # 4. Perform tSNE
    # dims: Dimensions to use as input for tSNE (default 1:30 for RunTSNE)
    message("Running tSNE...")
    seurat_object <- RunTSNE(seurat_object, dims = 1:30)
    
    # Save the processed Seurat object
    message(paste("Saving processed Seurat object to:", output_seurat_object_path))
    saveRDS(seurat_object, file = output_seurat_object_path)
    
    message("Seurat analysis complete.")
    EOF
    
    # Execute the R script
    Rscript run_seurat_analysis.R
  9. 9

    The top 5000 genes were considered as input for the PCA calculation, and only the top 25 principal components (PCs) were used in tSNE.

    PCA v1.3.0 GitHub
    $ Bash example
    # Install scikit-learn and pandas if not already installed
    # pip install scikit-learn pandas numpy
    
    # Python script for PCA calculation
    python -c "
    import pandas as pd
    from sklearn.decomposition import PCA
    import numpy as np
    
    # --- Configuration ---
    INPUT_FILE = 'gene_expression_matrix_top5000.tsv' # Assumed input file containing top 5000 genes
    OUTPUT_FILE = 'pca_top25_components.tsv'
    n_components_pca = 25 # As specified: 'only the top 25 principal components (PCs) were used'
    
    # --- Load Data ---
    try:
        # Assuming the input file has genes as rows and samples as columns, with gene names as index.
        # The description 'top 5000 genes were considered as input' implies this file is already filtered.
        data = pd.read_csv(INPUT_FILE, sep='\t', index_col=0)
    except FileNotFoundError:
        print(f'Error: Input file {INPUT_FILE} not found. Please ensure the file exists.')
        exit(1)
    except Exception as e:
        print(f'Error loading input file: {e}')
        exit(1)
    
    # PCA expects features (genes) as columns and samples as rows.
    # If data is genes x samples, transpose it.
    # We check if the number of rows matches 5000 (number of genes).
    if data.shape[0] == 5000:
        data = data.T # Transpose to samples x genes
    
    # Ensure all data is numeric and handle potential missing values (e.g., drop columns with NaNs)
    original_columns = data.columns
    data = data.apply(pd.to_numeric, errors='coerce')
    data = data.dropna(axis=1) # Drop genes (features) that became all NaN
    
    if data.empty:
        print('Error: Dataframe is empty after numeric conversion and NaN handling. Check input data.')
        exit(1)
    
    if data.shape[1] < n_components_pca:
        print(f'Warning: Number of features ({data.shape[1]}) is less than n_components_pca ({n_components_pca}). Adjusting n_components_pca.')
        n_components_pca = data.shape[1]
        if n_components_pca == 0:
            print('Error: No valid features remaining for PCA.')
            exit(1)
    
    # --- Perform PCA ---
    pca = PCA(n_components=n_components_pca)
    principal_components = pca.fit_transform(data)
    
    # --- Save Results ---
    pc_df = pd.DataFrame(
        data=principal_components,
        columns=[f'PC{i+1}' for i in range(n_components_pca)],
        index=data.index # Keep sample names as index
    )
    
    pc_df.to_csv(OUTPUT_FILE, sep='\t')
    
    print(f'PCA completed successfully. {n_components_pca} principal components saved to {OUTPUT_FILE}')
    print(f'Explained variance ratio per component: {pca.explained_variance_ratio_}')
    print(f'Total explained variance by {n_components_pca} components: {np.sum(pca.explained_variance_ratio_):.2f}')
    "
    
  10. 10

    none provided by the submitter

    Not specified (No description provided) vNot specified (No description provided)
    $ Bash example
    # No step description provided by the submitter. Cannot infer tool, version, or generate a specific code block.
Raw Source Text
Reads from scRNA-seq were aligned to mm10 using the 10x Genomics Cell Ranger software (v 2.1.0). Reads were collapsed into unique molecular identifier counts. All samples had >2000 cells detected with >1000 genes per cells with >70% of the coding genome covered.  Genes that were not expressed in at least 5% of all cells were excluded. As previously described (Boland et al., 2020), replicates of single cell libraries were normalized removing batch effects using RUVnormalize (v1.15.0). The raw UMI matrix was scaled and input to the naiveRandRUV function with parameters coeff=1e-3 and k=10. Fifty negative control genes were taken from a list of housekeeping genes (Eisenberg and Levanon, 2013) with least variability in all datasets. Seurat (v3.0.1) functions were used to calculate top variable genes, principal components analysis (PCA), and tSNE with FindVariableGenes, RunPCA, and RunTSNE. The top 5000 genes were considered as input for the PCA calculation, and only the top 25 principal components (PCs) were used in tSNE.
none provided by the submitter
Assembly: mm10
Supplementary files format and content: Tab-separated values files and matrix files
← Back to Analysis