GSE203092 Processing Pipeline
Publication
The long noncoding RNA Malat1 regulates CD8+ T cell differentiation by mediating epigenetic repression.The Journal of experimental medicine (2022) — PMID 35593887
Dataset
GSE203092The long noncoding RNA Malat1 regulates CD8+ T cell differentiation by mediating epigenetic repression
Processing Steps
Generate Jupyter Notebook-
1
Reads from scRNA-seq were aligned to mm10 using the 10x Genomics Cell Ranger software (v 2.1.0).
Cell Ranger v2.1.0$ Bash example
# Cell Ranger is typically installed by downloading and extracting the tarball. # For example, for version 2.1.0: # wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-2.1.0.tar.gz # tar -xzf cellranger-2.1.0.tar.gz # export PATH=/path/to/cellranger-2.1.0:$PATH # Define paths for reference genome and input/output # The mm10 transcriptome reference must be pre-built or downloaded from 10x Genomics. # Example command to build a reference (if not using a pre-built one): # cellranger mkref --genome=mm10 --fasta=/path/to/mm10.fa --genes=/path/to/mm10_genes.gtf --transcript=/path/to/mm10_transcripts.gtf MM10_TRANSCRIPTOME_PATH="/path/to/10x_genomics_references/mm10/cellranger_refdata_mm10_2020-A" # Placeholder for mm10 transcriptome reference FASTQ_DIR="/path/to/your/input_fastqs" # Placeholder for directory containing FASTQ files OUTPUT_ID="scrnaseq_alignment_mm10" # Placeholder for the output directory name SAMPLE_NAME="my_scrnaseq_sample" # Placeholder for the sample name (matches FASTQ file prefix) # Run Cell Ranger alignment cellranger count \ --id="${OUTPUT_ID}" \ --transcriptome="${MM10_TRANSCRIPTOME_PATH}" \ --fastqs="${FASTQ_DIR}" \ --sample="${SAMPLE_NAME}" \ --expect-cells=3000 # Adjust based on expected number of cells -
2
Reads were collapsed into unique molecular identifier counts.
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi_tools # Assuming input BAM file is 'aligned_reads.bam' where UMIs are in the read names # (e.g., read_name_UMI_sequence) and separated by an underscore. # This command collapses reads with the same mapping position and UMI into a single count. umi_tools count \ --per-contig \ --umi-separator='_' \ -I aligned_reads.bam \ -S umi_counted_reads.bam \ --log umi_counts.log -
3
All samples had >2000 cells detected with >1000 genes per cells with >70% of the coding genome covered.
$ Bash example
# This script conceptually represents the quality control filtering step # described: "All samples had >2000 cells detected with >1000 genes per cells with >70% of the coding genome covered." # It assumes a tab-separated file 'sample_qc_metrics.tsv' exists with relevant columns. # Define thresholds MIN_CELLS=2000 MIN_GENES_PER_CELL=1000 MIN_CODING_GENOME_COVERAGE=70 # Input QC metrics file (example structure: sample_id\tcells_detected\tgenes_per_cell\tcoding_genome_coverage_percent) QC_FILE="sample_qc_metrics.tsv" # Output file for samples passing QC PASSING_SAMPLES_FILE="samples_passing_qc.tsv" # Reference genome annotation (e.g., for calculating coding genome coverage if not pre-calculated) # This would typically be a GTF/GFF file used by an upstream tool like QualiMap or RSeQC. # For demonstration, we'll just note its conceptual presence. # GENOME_ANNOTATION="Homo_sapiens.GRCh38.109.gtf" # Example for human GRCh38 echo "Checking samples against QC criteria:" echo " - Cells detected > ${MIN_CELLS}" echo " - Genes per cell > ${MIN_GENES_PER_CELL}" echo " - Coding genome coverage > ${MIN_CODING_GENOME_COVERAGE}%" echo "" # Process the QC file, skipping header, and filter samples # Using awk for numerical comparison awk -v min_cells="$MIN_CELLS" \ -v min_genes="$MIN_GENES_PER_CELL" \ -v min_coverage="$MIN_CODING_GENOME_COVERAGE" \ 'BEGIN {FS="\t"; OFS="\t"} NR==1 {print; next} # Print header $2 > min_cells && $3 > min_genes && $4 > min_coverage {print}' \ "$QC_FILE" > "$PASSING_SAMPLES_FILE" echo "QC complete. Samples passing criteria are listed in: $PASSING_SAMPLES_FILE" echo "Original QC file: $QC_FILE" # echo "Note: Calculation of 'coding genome coverage' typically requires a genome annotation file (e.g., GTF) and tools like QualiMap or RSeQC, which would be run upstream." -
4
Genes that were not expressed in at least 5% of all cells were excluded.
$ Bash example
# Install Seurat if not already installed # R -e "if (!requireNamespace('Seurat', quietly = TRUE)) install.packages('Seurat')" # Create an R script for filtering genes based on expression percentage across cells cat << 'EOF' > filter_genes_by_expression.R library(Seurat) # --- Configuration --- INPUT_SEURAT_OBJECT_PATH <- "input_seurat_object.rds" # Path to your input Seurat object OUTPUT_SEURAT_OBJECT_PATH <- "filtered_seurat_object.rds" # Path for the filtered output Seurat object MIN_PERCENT_CELLS_EXPRESSING <- 5 # Minimum percentage of cells a gene must be expressed in to be kept # --------------------- # Check if input file exists if (!file.exists(INPUT_SEURAT_OBJECT_PATH)) { stop(paste("Input Seurat object '", INPUT_SEURAT_OBJECT_PATH, "' not found. Please provide a valid path.", sep="")) } message(paste("Loading Seurat object from:", INPUT_SEURAT_OBJECT_PATH)) seurat_object <- readRDS(INPUT_SEURAT_OBJECT_PATH) # Get the raw counts matrix from the 'RNA' assay # Assuming 'RNA' is the default assay and 'counts' slot contains raw data counts <- GetAssayData(object = seurat_object, assay = "RNA", slot = "counts") # Calculate the number of cells where each gene is expressed (count > 0) num_cells_expressing_gene <- rowSums(counts > 0) # Calculate the percentage of cells expressing each gene percent_cells_expressing_gene <- (num_cells_expressing_gene / ncol(counts)) * 100 # Identify genes to keep (expressed in at least MIN_PERCENT_CELLS_EXPRESSING% of cells) # Genes that were not expressed in at least 5% of all cells were excluded. # This means we keep genes expressed in >= 5% of cells. genes_to_keep <- names(percent_cells_expressing_gene[percent_cells_expressing_gene >= MIN_PERCENT_CELLS_EXPRESSING]) message(paste("Original number of genes:", nrow(seurat_object@assays$RNA@counts))) message(paste("Number of genes to keep (expressed in >= ", MIN_PERCENT_CELLS_EXPRESSING, "% of cells): ", length(genes_to_keep))) # Subset the Seurat object to keep only these genes seurat_object_filtered <- subset(seurat_object, features = genes_to_keep) message(paste("Filtered number of genes:", nrow(seurat_object_filtered@assays$RNA@counts))) # Save the filtered Seurat object message(paste("Saving filtered Seurat object to:", OUTPUT_SEURAT_OBJECT_PATH)) saveRDS(seurat_object_filtered, OUTPUT_SEURAT_OBJECT_PATH) message("Gene filtering complete.") EOF # Execute the R script Rscript filter_genes_by_expression.R -
5
As previously described (Boland et al., 2020), replicates of single cell libraries were normalized removing batch effects using RUVnormalize (v1.15.0).
RUVnormalize v1.15.0$ Bash example
# Install R (if not already installed) # For example, on Ubuntu: # sudo apt update # sudo apt install r-base # Install BiocManager and RUVnormalize package (v1.15.0) # This version of RUVnormalize (1.15.0) is part of Bioconductor 3.12, which requires R version 4.0.x. # If your R version is different, you might need to adjust the BiocManager::install call or R version. # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("RUVnormalize", version = "3.12")' # Define input and output file paths # Replace 'your_input_counts.csv' with the actual path to your single-cell count matrix. # The count matrix should have genes as rows and samples as columns, with gene IDs in the first column. INPUT_COUNTS_FILE="your_input_counts.csv" OUTPUT_NORMALIZED_FILE="normalized_single_cell_counts.csv" # Define RUVnormalize specific parameters # These are CRITICAL and cannot be inferred from the description. # You must determine these based on your data and experimental design. # CONTROL_GENES_FILE: Path to a file listing control gene names (one per line). # These are genes assumed to have constant expression across samples, # or spike-ins, used to estimate unwanted variation. # NUM_UNWANTED_FACTORS: Integer 'k', the number of unwanted factors to estimate. # Often determined by exploring variance components (e.g., PCA). # For demonstration, we use placeholders. In a real pipeline, these would be actual files/values. # Example: # CONTROL_GENES_FILE="path/to/your/control_genes.txt" # NUM_UNWANTED_FACTORS="1" # e.g., 1, 2, or more # Create a temporary R script to execute RUVnormalize R_SCRIPT_NAME="run_ruvnormalize.R" cat << 'EOF_R_SCRIPT' > "${R_SCRIPT_NAME}" # R script to perform RUV normalization using RUVnormalize package (v1.15.0) # Load RUVnormalize package library(RUVnormalize) # --- Configuration --- input_counts_file <- Sys.getenv("INPUT_COUNTS_FILE") output_normalized_file <- Sys.getenv("OUTPUT_NORMALIZED_FILE") # --- Load Data --- if (is.null(input_counts_file) || !file.exists(input_counts_file)) { stop(paste("Error: Input counts file not found or not specified via INPUT_COUNTS_FILE environment variable. Expected:", input_counts_file)) } # Assuming a CSV file where the first column is gene IDs and subsequent columns are sample counts counts_data <- read.csv(input_counts_file, row.names = 1, check.names = FALSE) # --- RUV Normalization Parameters (CRITICAL: These need to be determined by the user) --- # The RUVnormalize function requires 'cIdx' (indices of control genes) and 'k' (number of unwanted factors). # These cannot be inferred from the general description "removing batch effects". # Users typically identify control genes (e.g., housekeeping genes, spike-ins, or empirically) # and determine 'k' through data exploration (e.g., PCA, variance analysis). # Placeholder for control gene indices (e.g., from a list of gene names) # In a real pipeline, you would load these from a file or define them. # Example: # control_genes_file <- Sys.getenv("CONTROL_GENES_FILE", "") # if (file.exists(control_genes_file)) { # cIdx_genes <- readLines(control_genes_file) # cIdx <- match(cIdx_genes, rownames(counts_data)) # cIdx <- cIdx[!is.na(cIdx)] # if (length(cIdx) == 0) { # warning("No specified control genes found in the input data. Using first 10 genes as dummy controls.") # cIdx <- 1:min(10, nrow(counts_data)) # } # } else { # warning("Control genes file not found or not specified. Using first 10 genes as dummy controls.") # cIdx <- 1:min(10, nrow(counts_data)) # } # For this example, we will use the first 10 genes as dummy controls if not enough genes. cIdx <- 1:min(10, nrow(counts_data)) if (length(cIdx) == 0) { stop("Error: Not enough genes in the input data to select dummy control genes.") } # Placeholder for 'k' (number of unwanted factors) # In a real pipeline, this would be a user-defined value. # Example: # k_value <- as.numeric(Sys.getenv("NUM_UNWANTED_FACTORS", "1")) # For this example, we will use k=1 as a dummy value. k_value <- 1 message(paste("Using dummy control gene indices (first", length(cIdx), "genes):", paste(cIdx, collapse=", "))) message(paste("Using dummy number of unwanted factors (k):", k_value)) message("WARNING: 'cIdx' and 'k' are critical parameters that should be carefully chosen based on your data and experimental design.") # --- Perform RUV Normalization --- normalized_result <- RUVnormalize(counts = as.matrix(counts_data), cIdx = cIdx, k = k_value) # Extract normalized counts normalized_matrix <- normalized_result$normalizedCounts # --- Save Results --- if (is.null(output_normalized_file)) { stop("Error: Output normalized file not specified via OUTPUT_NORMALIZED_FILE environment variable.") } write.csv(normalized_matrix, file = output_normalized_file, row.names = TRUE) message(paste("Normalization complete. Normalized counts saved to", output_normalized_file)) EOF_R_SCRIPT # Set environment variables for the R script export INPUT_COUNTS_FILE="${INPUT_COUNTS_FILE}" export OUTPUT_NORMALIZED_FILE="${OUTPUT_NORMALIZED_FILE}" # export CONTROL_GENES_FILE="${CONTROL_GENES_FILE}" # Uncomment and set if you have a control genes file # export NUM_UNWANTED_FACTORS="${NUM_UNWANTED_FACTORS}" # Uncomment and set if you have a specific k value # Execute the R script Rscript "${R_SCRIPT_NAME}" # Clean up the temporary R script file rm "${R_SCRIPT_NAME}" -
6
The raw UMI matrix was scaled and input to the naiveRandRUV function with parameters coeff=1e-3 and k=10.
$ Bash example
# Install R and Bioconductor if not already present # sudo apt-get update && sudo apt-get install -y r-base # R -e 'install.packages("BiocManager", repos="https://cloud.r-project.org")' # R -e 'BiocManager::install("RUVSeq")' # Create an R script cat << 'EOF' > run_ruvseq.R # Load necessary library library(RUVSeq) # --- Placeholder for loading the raw UMI matrix --- # Replace 'raw_umi_matrix.csv' with your actual input file path. # The input should be a matrix of raw UMI counts (integers). # Example: # umi_matrix <- read.csv("raw_umi_matrix.csv", row.names=1) # umi_matrix <- as.matrix(umi_matrix) # Dummy data for demonstration purposes if no actual file is provided set.seed(123) num_genes <- 1000 num_samples <- 10 umi_matrix <- matrix(rnbinom(num_genes * num_samples, mu=100, size=10), ncol=num_samples, dimnames=list(paste0("gene", 1:num_genes), paste0("sample", 1:num_samples))) # Convert the UMI matrix to a SeqExpressionSet object, which is commonly used by RUVSeq. # If you have sample metadata (e.g., batch information), you would include it here. # For naiveRandRUV, a basic SeqExpressionSet from the count matrix is sufficient. set <- newSeqExpressionSet(umi_matrix) # Apply the naiveRandRUV function. # The parameter 'k=10' is directly used by naiveRandRUV. # The parameter 'coeff=1e-3' is NOT a direct argument for the naiveRandRUV function # as defined in the RUVSeq package (version 1.38.0 and earlier). # It is possible that 'coeff' refers to: # 1. A parameter for a prior custom scaling step applied to the UMI matrix before input to naiveRandRUV. # 2. A parameter for a custom wrapper function that internally calls naiveRandRUV. # 3. A parameter for a different RUV-like function (e.g., from another package like RUVIII). # Since the description explicitly states "naiveRandRUV function with parameters coeff=1e-3 and k=10", # but 'coeff' is not a valid argument, we proceed with 'k=10' and acknowledge the discrepancy for 'coeff'. ruv_set <- naiveRandRUV(set, k=10) # Extract normalized counts from the RUVSeq output object. normalized_counts <- normCounts(ruv_set) # --- Placeholder for saving the output --- # Replace 'normalized_umi_matrix_ruvseq.csv' with your desired output file path. write.csv(normalized_counts, "normalized_umi_matrix_ruvseq.csv", quote=FALSE) message("RUVSeq normalization complete. Normalized counts saved to normalized_umi_matrix_ruvseq.csv") EOF # Execute the R script Rscript run_ruvseq.R -
7
Fifty negative control genes were taken from a list of housekeeping genes (Eisenberg and Levanon, 2013) with least variability in all datasets.
Custom Script (Inferred with models/gemini-2.5-flash)$ Bash example
# This script conceptually represents the selection of 50 negative control genes. # It assumes that a list of housekeeping genes (e.g., from Eisenberg and Levanon, 2013) # and pre-computed variability scores for these genes across relevant datasets are available. # Placeholder for the file containing housekeeping genes and their variability scores. # Format: gene_id\tvariability_score # Example content (replace with actual data): # geneA\t0.05 # geneB\t0.01 # geneC\t0.12 # ... # For demonstration, create a dummy file if not available: # echo -e "geneA\t0.05\ngeneB\t0.01\ngeneC\t0.12\ngeneD\t0.03\ngeneE\t0.08\ngeneF\t0.02\ngeneG\t0.07\ngeneH\t0.04\ngeneI\t0.06\ngeneJ\t0.09" > gene_variability_scores.tsv # Assuming 'gene_variability_scores.tsv' contains gene IDs and their variability scores, # with variability in the second column. # Sort by variability (ascending) and take the top 50 gene IDs. # The 'cut -f1' extracts only the gene ID. sort -k2,2n gene_variability_scores.tsv | head -n 50 | cut -f1 > negative_control_genes_50.txt # The output file 'negative_control_genes_50.txt' will contain the 50 selected gene IDs.
-
8
Seurat (v3.0.1) functions were used to calculate top variable genes, principal components analysis (PCA), and tSNE with FindVariableGenes, RunPCA, and RunTSNE.
$ Bash example
cat << 'EOF' > run_seurat_analysis.R # Load the Seurat library library(Seurat) # --- Placeholder for loading your Seurat object --- # This script assumes you have a Seurat object already created and normalized # from previous steps in your pipeline. Replace 'path/to/your/normalized_seurat_object.rds' # with the actual path to your input Seurat object. # If starting from raw counts, you would need initial steps like: # counts <- Read10X(data.dir = "path/to/10x/data/") # seurat_object <- CreateSeuratObject(counts = counts) # seurat_object <- NormalizeData(seurat_object) # Example: Load an existing Seurat object # For a real pipeline, ensure 'path/to/your/normalized_seurat_object.rds' exists. if (file.exists("path/to/your/normalized_seurat_object.rds")) { seurat_object <- readRDS("path/to/your/normalized_seurat_object.rds") } else { # --- DEMONSTRATION ONLY: Create a dummy Seurat object if no file is found --- # In a real scenario, this would typically be an error or the output of a previous step. message("WARNING: 'path/to/your/normalized_seurat_object.rds' not found. Creating a dummy Seurat object for demonstration.") dummy_counts <- matrix(sample(0:10, 1000, replace = TRUE), nrow = 50, ncol = 20) rownames(dummy_counts) <- paste0("gene", 1:50) colnames(dummy_counts) <- paste0("cell", 1:20) seurat_object <- CreateSeuratObject(counts = dummy_counts) seurat_object <- NormalizeData(seurat_object) message("Please replace the placeholder path with your actual Seurat object path.") } # 1. Calculate top variable genes using FindVariableFeatures (formerly FindVariableGenes) # Using common default parameters as none were specified in the description. # 'vst' (variance stabilizing transformation) is a robust method. # 'nfeatures = 2000' identifies the top 2000 variable features. seurat_object <- FindVariableFeatures(seurat_object, selection.method = "vst", nfeatures = 2000) # 2. Perform Principal Components Analysis (PCA) # PCA is run on the previously identified variable features. seurat_object <- RunPCA(seurat_object, features = VariableFeatures(seurat_object)) # 3. Perform t-distributed Stochastic Neighbor Embedding (tSNE) # tSNE is typically run after PCA, using a subset of the principal components. # 'dims = 1:30' uses the first 30 principal components, a common choice. seurat_object <- RunTSNE(seurat_object, dims = 1:30) # --- Placeholder for saving the updated Seurat object --- # Replace 'path/to/your/processed_seurat_object.rds' with your desired output path. saveRDS(seurat_object, "path/to/your/processed_seurat_object.rds") # Optional: You can add plotting commands here to visualize the results # For example: # pdf("variable_features_plot.pdf") # VariableFeaturePlot(seurat_object) # dev.off() # # pdf("pca_elbow_plot.pdf") # ElbowPlot(seurat_object) # dev.off() # # pdf("tsne_plot.pdf") # DimPlot(seurat_object, reduction = "tsne") # dev.off() EOF # --- Installation instructions (commented out) --- # To install Seurat in R: # R -e 'install.packages("Seurat")' # Or, if you need BiocManager: # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("Seurat")' # Execute the R script Rscript run_seurat_analysis.R -
9
The top 5000 genes were considered as input for the PCA calculation, and only the top 25 principal components (PCs) were used in tSNE.
$ Bash example
# Install scikit-learn if not already installed # conda install -c conda-forge scikit-learn=1.4.0 pandas numpy # Create a dummy input file for demonstration purposes # In a real scenario, 'top_5000_genes_expression.csv' would be the pre-filtered gene expression matrix. # This example creates a matrix with 100 samples and 5000 genes. python -c " import pandas as pd import numpy as np np.random.seed(42) num_samples = 100 num_genes = 5000 data = np.random.rand(num_samples, num_genes) gene_names = [f'gene_{i+1}' for i in range(num_genes)] sample_names = [f'sample_{i+1}' for i in range(num_samples)] df = pd.DataFrame(data, index=sample_names, columns=gene_names) df.to_csv('top_5000_genes_expression.csv') print('Created dummy input file: top_5000_genes_expression.csv') " # Python script to perform PCA python -c " import pandas as pd from sklearn.decomposition import PCA import numpy as np # Load the pre-filtered gene expression data (top 5000 genes) # Assuming rows are samples and columns are genes input_file = 'top_5000_genes_expression.csv' df = pd.read_csv(input_file, index_col=0) # Initialize PCA with 25 principal components n_components = 25 pca = PCA(n_components=n_components) # Fit PCA to the data and transform it pc_data = pca.fit_transform(df) # Create a DataFrame for the principal components pc_df = pd.DataFrame(data=pc_data, columns=[f'PC_{i+1}' for i in range(n_components)], index=df.index) # Save the principal components to a CSV file output_file = 'principal_components_25.csv' pc_df.to_csv(output_file) print(f'PCA completed. Saved {n_components} principal components to {output_file}') print(f'Explained variance ratio: {pca.explained_variance_ratio_.sum():.4f}') " -
10
none provided by the submitter
(Inferred with models/gemini-2.5-flash)
Raw Source Text
Reads from scRNA-seq were aligned to mm10 using the 10x Genomics Cell Ranger software (v 2.1.0). Reads were collapsed into unique molecular identifier counts. All samples had >2000 cells detected with >1000 genes per cells with >70% of the coding genome covered. Genes that were not expressed in at least 5% of all cells were excluded. As previously described (Boland et al., 2020), replicates of single cell libraries were normalized removing batch effects using RUVnormalize (v1.15.0). The raw UMI matrix was scaled and input to the naiveRandRUV function with parameters coeff=1e-3 and k=10. Fifty negative control genes were taken from a list of housekeeping genes (Eisenberg and Levanon, 2013) with least variability in all datasets. Seurat (v3.0.1) functions were used to calculate top variable genes, principal components analysis (PCA), and tSNE with FindVariableGenes, RunPCA, and RunTSNE. The top 5000 genes were considered as input for the PCA calculation, and only the top 25 principal components (PCs) were used in tSNE. none provided by the submitter Assembly: mm10 Supplementary files format and content: Tab-separated values files and matrix files