GSE203086 Processing Pipeline
Publication
The long noncoding RNA Malat1 regulates CD8+ T cell differentiation by mediating epigenetic repression.The Journal of experimental medicine (2022) — PMID 35593887
Dataset
GSE203086The long noncoding RNA Malat1 regulates CD8+ T cell differentiation by mediating epigenetic repression (scRNA-Seq)
Processing Steps
Generate Jupyter Notebook-
1
Reads from scRNA-seq were aligned to mm10 using the 10x Genomics Cell Ranger software (v 2.1.0).
Cell Ranger v2.1.0$ Bash example
# Install Cell Ranger (example - adjust path as needed) # wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-2.1.0.tar.gz # tar -xzf cellranger-2.1.0.tar.gz # export PATH=/path/to/cellranger-2.1.0:$PATH # Download or build mm10 reference transcriptome (example) # For Cell Ranger, you typically download a pre-built reference from 10x Genomics or build your own. # Example: wget https://cf.10xgenomics.com/releases/cell-ranger/refdata-cellranger-mm10-1.2.0.tar.gz # tar -xzf refdata-cellranger-mm10-1.2.0.tar.gz # REF_GENOME_DIR="refdata-cellranger-mm10-1.2.0" # Define variables FASTQ_DIR="path/to/your/fastq_files" # Directory containing FASTQ files REF_GENOME_DIR="path/to/your/mm10_cellranger_reference" # Path to the Cell Ranger mm10 reference transcriptome OUTPUT_ID="my_scrnaseq_alignment" # A unique ID for this run SAMPLE_NAME="my_sample" # Sample name (optional, but good practice) EXPECTED_CELLS=3000 # Estimated number of cells (adjust as needed) # Run Cell Ranger count cellranger count \ --id="${OUTPUT_ID}" \ --transcriptome="${REF_GENOME_DIR}" \ --fastqs="${FASTQ_DIR}" \ --sample="${SAMPLE_NAME}" \ --expect-cells="${EXPECTED_CELLS}" -
2
Reads were collapsed into unique molecular identifier counts.
umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install umi_tools if not already available # conda install -c bioconda umi_tools # Define input and output file paths INPUT_BAM="aligned_reads_with_umis.bam" # Input BAM file, typically coordinate-sorted, with UMIs in a tag (e.g., RX) OUTPUT_DEDUP_BAM="deduplicated_reads.bam" UMI_TAG="RX" # The tag where UMIs are stored (e.g., after umi_tools extract) # Collapse reads into unique molecular identifier counts (deduplication) umi_tools dedup \ --input "${INPUT_BAM}" \ --output "${OUTPUT_DEDUP_BAM}" \ --umi-tag "${UMI_TAG}" \ --method "unique" \ --log "umi_tools_dedup.log" # Note: This command assumes UMIs have already been extracted from read names # and placed into a BAM tag (e.g., 'RX') in a prior step (e.g., using 'umi_tools extract'). # For example, a preceding step might look like: # umi_tools extract \ # --input "aligned_reads.bam" \ # --output "aligned_reads_with_umis.bam" \ # --extract-method "regex" \ # --regex "^(?P<umi_1>.{6})(?P<discard_1>.*)" \ # --umi-tag "RX" -
3
All samples had >2000 cells detected with >1000 genes per cells with >70% of the coding genome covered.
CellRanger (Inferred with models/gemini-2.5-flash) v7.1.0$ Bash example
# This script demonstrates how CellRanger might be used to process single-cell RNA-seq data, # generating metrics that are then checked against the described quality control thresholds. # The description states that "All samples had >2000 cells detected with >1000 genes per cells # with >70% of the coding genome covered." This implies a successful run of a tool like CellRanger # followed by a quality control filtering step. # --- Installation (commented out) --- # Cell Ranger software is proprietary and typically downloaded from 10x Genomics. # Ensure Cell Ranger is installed and in your PATH. # Example download (requires registration): # # wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.1.0.tar.gz # # tar -xzf cellranger-7.1.0.tar.gz # # export PATH=/path/to/cellranger-7.1.0:$PATH # --- Reference Data (commented out) --- # Download a 10x Genomics reference transcriptome (e.g., GRCh38 for human) # # wget https://cf.10xgenomics.com/releases/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz # # tar -xzf refdata-gex-GRCh38-2020-A.tar.gz # # REFERENCE_PATH="./refdata-gex-GRCh38-2020-A" # --- Configuration --- # Replace with actual paths and sample information FASTQ_BASE_DIR="/path/to/your/fastq_files" # Directory containing subdirectories for each sample's fastqs REFERENCE_PATH="/path/to/refdata-gex-GRCh38-2020-A" # Path to the Cell Ranger transcriptome reference OUTPUT_DIR="./cellranger_output" NUM_CORES=16 MEMORY_GB=64 # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # --- Main CellRanger Execution Loop --- # This loop processes each sample. The metrics generated by 'cellranger count' # are then implicitly checked against the criteria mentioned in the description. # The description states that *all samples met* these criteria, implying successful processing # and subsequent validation of the output metrics. # Example: Iterate through sample directories (assuming fastqs are organized by sample ID) for SAMPLE_FASTQ_DIR in "${FASTQ_BASE_DIR}"/*/; do SAMPLE_ID=$(basename "${SAMPLE_FASTQ_DIR}") echo "Processing sample: ${SAMPLE_ID}" cellranger count \ --id="${SAMPLE_ID}" \ --transcriptome="${REFERENCE_PATH}" \ --fastqs="${SAMPLE_FASTQ_DIR}" \ --sample="${SAMPLE_ID}" \ --localcores="${NUM_CORES}" \ --localmem="${MEMORY_GB}" \ --output-directory="${OUTPUT_DIR}/${SAMPLE_ID}" # After this command runs, Cell Ranger generates a 'metrics_summary.csv' file # within the output directory (e.g., ${OUTPUT_DIR}/${SAMPLE_ID}/outs/metrics_summary.csv). # The description implies that the values in this summary file for each sample # satisfied the following conditions: # - "Number of Cells Detected" > 2000 # - "Median Genes per Cell" > 1000 # - "Fraction Reads Mapped Confidently to Transcriptome" (as a proxy for coding genome covered) > 0.70 # A separate script (e.g., in R or Python) would typically parse this CSV # and perform these checks. done -
4
Genes that were not expressed in at least 5% of all cells were excluded.
$ Bash example
# Install Seurat if not already installed (uncomment to run) # R -e "install.packages('Seurat')" # R -e "install.packages('SeuratObject')" # SeuratObject is a dependency # Create an R script for filtering genes based on expression percentage across cells cat << 'EOF' > filter_genes_by_expression.R library(Seurat) # Load the Seurat object # Replace 'input_seurat_object.rds' with your actual input file path # This placeholder assumes a Seurat object saved as an RDS file. # If your data is in a different format (e.g., AnnData for Scanpy, or raw counts), # you would need to adapt the loading and object creation steps. seurat_obj <- readRDS("input_seurat_object.rds") # Define the minimum percentage of cells a gene must be expressed in # Parameter inferred from description: "at least 5% of all cells" min_percent_cells <- 5 # Calculate the percentage of cells expressing each gene # This assumes the primary assay is 'RNA' and the data is in the 'counts' slot. # A gene is considered 'expressed' if its count is greater than 0 in a cell. percent_cells_expressed <- rowSums(seurat_obj@assays$RNA@counts > 0) / ncol(seurat_obj) * 100 # Identify genes to keep (those expressed in at least min_percent_cells of cells) genes_to_keep <- names(percent_cells_expressed[percent_cells_expressed >= min_percent_cells]) # Subset the Seurat object to retain only the identified genes seurat_obj_filtered <- subset(seurat_obj, features = genes_to_keep) # Save the filtered Seurat object # Replace 'filtered_seurat_object.rds' with your desired output file path saveRDS(seurat_obj_filtered, "filtered_seurat_object.rds") message(paste("Original number of genes:", nrow(seurat_obj))) message(paste("Number of genes after filtering:", nrow(seurat_obj_filtered))) EOF # Execute the R script to perform the filtering Rscript filter_genes_by_expression.R -
5
As previously described (Boland et al., 2020), replicates of single cell libraries were normalized removing batch effects using RUVnormalize (v1.15.0).
$ Bash example
# Install R and RUVnormalize if not already installed # sudo apt-get update # sudo apt-get install -y r-base # R -e "install.packages('BiocManager', repos='http://cran.us.r-project.org')" # R -e "BiocManager::install('RUVnormalize')" # Create an R script to run RUVnormalize cat << 'EOF' > run_ruvnormalize.R # Load RUVnormalize library library(RUVnormalize) # Placeholder for input data (e.g., a count matrix) # Replace 'input_counts.tsv' with your actual input file path # Example: counts_data <- read.table("input_counts.tsv", header=TRUE, row.names=1) # For demonstration, let's create some dummy data set.seed(123) num_genes <- 100 num_samples <- 10 counts_data <- matrix(rnbinom(num_genes * num_samples, size = 10, mu = 100), nrow = num_genes, ncol = num_samples) colnames(counts_data) <- paste0("sample", 1:num_samples) rownames(counts_data) <- paste0("gene", 1:num_genes) # Define control genes (ctl) # These are genes assumed to be unaffected by the biological variation of interest. # Replace with actual control gene indices or names from your data. # For demonstration, let's assume the first 10 genes are controls. ctl_genes_indices <- 1:10 # Define 'k', the number of unwanted factors to estimate. # This parameter often needs to be determined by the user (e.g., using diagnostics). k_value <- 1 # Example value, adjust as needed based on your data # Perform RUV normalization using RUVg # RUVg is suitable when you have a set of negative control genes. # If you have replicate information, RUVs might be more appropriate. # The description "removing batch effects" is general. normalized_data_obj <- RUVg(x = counts_data, ctl = ctl_genes_indices, k = k_value) normalized_matrix <- normalized_data_obj$normalizedCounts # Save normalized data to an output file write.table(normalized_matrix, "normalized_counts.tsv", sep="\t", quote=FALSE, col.names=NA) EOF # Execute the R script Rscript run_ruvnormalize.R -
6
The raw UMI matrix was scaled and input to the naiveRandRUV function with parameters coeff=1e-3 and k=10.
$ Bash example
#!/bin/bash # Define the R script content R_SCRIPT_CONTENT=' #!/usr/bin/env Rscript # This script performs normalization using a function named naiveRandRUV. # The function naiveRandRUV is assumed to be either custom-defined or sourced # from a specific R package/script not explicitly stated but based on RUV principles. # For demonstration, a placeholder function is provided. # Load necessary libraries # library(RUVSeq) # Uncomment if RUVSeq functions are directly used within naiveRandRUV # --- Placeholder for naiveRandRUV function definition --- # If naiveRandRUV is a custom function, it would be defined here or sourced. # This example provides a conceptual implementation based on RUV principles. # A real implementation would involve more complex statistical modeling. naiveRandRUV <- function(data_matrix, coeff, k) { message(paste("Running custom naiveRandRUV with coeff =", coeff, "and k =", k)) if (!is.matrix(data_matrix) && !is.data.frame(data_matrix)) { stop("Input data_matrix must be a matrix or data frame.") } if (!is.numeric(data_matrix)) { stop("Input data_matrix must contain numeric values.") } # Example: Simple scaling (not actual RUV, just for demonstration) # In a real RUV implementation, this would involve more sophisticated # statistical modeling to estimate and remove unwanted variation. scaled_data <- data_matrix * coeff # Example: Generating dummy unwanted factors (actual RUV would derive these) if (nrow(data_matrix) > 0 && k > 0) { unwanted_factors <- matrix(rnorm(nrow(data_matrix) * k), ncol = k) colnames(unwanted_factors) <- paste0("W", 1:k) } else { unwanted_factors <- matrix(0, nrow = nrow(data_matrix), ncol = k) } return(list(normalized_data = scaled_data, unwanted_factors = unwanted_factors)) } # ---------------------------------------------------------- # Define input and output file paths input_umi_matrix_file <- "raw_umi_matrix.tsv" output_normalized_matrix_file <- "normalized_umi_matrix.tsv" output_unwanted_factors_file <- "ruv_unwanted_factors.tsv" # Check if input file exists if (!file.exists(input_umi_matrix_file)) { stop(paste("Input UMI matrix file not found:", input_umi_matrix_file)) } # Load the raw UMI matrix umi_matrix <- as.matrix(read.delim(input_umi_matrix_file, row.names = 1, sep = "\t", check.names = FALSE)) # Parameters for naiveRandRUV param_coeff <- 1e-3 param_k <- 10 # Call the naiveRandRUV function message(paste("Calling naiveRandRUV with coeff =", param_coeff, "and k =", param_k)) ruv_results <- naiveRandRUV(data_matrix = umi_matrix, coeff = param_coeff, k = param_k) # Save the output normalized matrix message(paste("Saving normalized UMI matrix to:", output_normalized_matrix_file)) write.table(ruv_results$normalized_data, output_normalized_matrix_file, sep = "\t", quote = FALSE, col.names = NA) # Save the estimated unwanted factors message(paste("Saving RUV unwanted factors to:", output_unwanted_factors_file)) write.table(ruv_results$unwanted_factors, output_unwanted_factors_file, sep = "\t", quote = FALSE, col.names = NA) message("Normalization complete.") ' # Create a dummy raw UMI matrix for demonstration purposes # In a real pipeline, this file would be generated by a previous step. echo -e "Gene\tSample1\tSample2\tSample3" > raw_umi_matrix.tsv echo -e "GeneA\t100\t120\t90" >> raw_umi_matrix.tsv echo -e "GeneB\t50\t60\t55" >> raw_umi_matrix.tsv echo -e "GeneC\t200\t210\t190" >> raw_umi_matrix.tsv echo -e "GeneD\t10\t15\t12" >> raw_umi_matrix.tsv echo -e "GeneE\t70\t80\t75" >> raw_umi_matrix.tsv # Write the R script content to a file echo "$R_SCRIPT_CONTENT" > run_naiveRandRUV.R # Ensure R is installed and Rscript is in PATH # conda install -c conda-forge r-base r-matrix r-data.table # Example R installation # Bioconductor packages like RUVSeq might need specific installation: # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("RUVSeq")' # Execute the R script Rscript run_naiveRandRUV.R # Clean up the R script file (optional) # rm run_naiveRandRUV.R -
7
Fifty negative control genes were taken from a list of housekeeping genes (Eisenberg and Levanon, 2013) with least variability in all datasets.
$ Bash example
# This script conceptually demonstrates the selection of negative control genes # based on variability from a list of housekeeping genes. # Actual implementation would depend on the specific data formats and variability metric. # Input files assumed: # 1. housekeeping_genes.txt: A file listing known housekeeping genes (e.g., one gene ID per line). # Source: Eisenberg and Levanon, 2013 (Trends Genet. 2013 Oct;29(10):569-74) # 2. expression_data.tsv: A tab-separated file with gene expression data across multiple datasets/samples. # Format: GeneID \t Dataset1_expr \t Dataset2_expr \t ... # Output: A file containing the 50 selected negative control genes. # Define input and output file names (placeholders) HOUSEKEEPING_GENES_FILE="housekeeping_genes.txt" EXPRESSION_DATA_FILE="expression_data.tsv" OUTPUT_FILE="negative_control_genes_50.txt" NUM_CONTROL_GENES=50 # --- Conceptual Python script (select_control_genes.py) --- # #!/usr/bin/env python3 # import pandas as pd # import numpy as np # import sys # # def calculate_variability(series): # """Calculate a variability metric, e.g., coefficient of variation (CV).""" # if series.mean() == 0: # return np.inf # Avoid division by zero for genes with zero expression # return series.std() / series.mean() if series.mean() != 0 else np.inf # # def select_negative_control_genes(housekeeping_genes_path, expression_data_path, num_genes_to_select): # # Load housekeeping genes # try: # with open(housekeeping_genes_path, 'r') as f: # housekeeping_genes = {line.strip() for line in f if line.strip()} # except FileNotFoundError: # print(f"Error: Housekeeping genes file not found at {housekeeping_genes_path}", file=sys.stderr) # return [] # # # Load expression data # try: # expression_df = pd.read_csv(expression_data_path, sep='\t', index_col='GeneID') # except FileNotFoundError: # print(f"Error: Expression data file not found at {expression_data_path}", file=sys.stderr) # return [] # except Exception as e: # print(f"Error loading expression data: {e}", file=sys.stderr) # return [] # # # Filter expression data to include only housekeeping genes # hk_expression_df = expression_df[expression_df.index.isin(housekeeping_genes)] # # if hk_expression_df.empty: # print("No housekeeping genes found in the expression data or no overlap.", file=sys.stderr) # return [] # # # Calculate variability for each housekeeping gene across all datasets # variability_scores = hk_expression_df.apply(calculate_variability, axis=1) # # # Sort genes by variability (ascending) and select the top N # selected_genes = variability_scores.sort_values().head(num_genes_to_select).index.tolist() # # return selected_genes # # if __name__ == "__main__": # if len(sys.argv) != 5: # Expecting 4 arguments: script_name, hk_file, expr_file, out_file, num_genes # print("Usage: python3 select_control_genes.py <housekeeping_genes_file> <expression_data_file> <output_file> <num_genes_to_select>", file=sys.stderr) # sys.exit(1) # # hk_file = sys.argv[1] # expr_file = sys.argv[2] # out_file = sys.argv[3] # num_genes = int(sys.argv[4]) # Convert to integer # # selected_genes = select_negative_control_genes(hk_file, expr_file, num_genes) # # if selected_genes: # with open(out_file, 'w') as f: # for gene in selected_genes: # f.write(f"{gene}\n") # print(f"Successfully selected {len(selected_genes)} negative control genes and saved to {out_file}") # else: # print("No genes were selected.", file=sys.stderr) # # --- End of conceptual Python script --- # To run this, you would first create the 'select_control_genes.py' file # with the content above, and ensure pandas and numpy are installed. # conda install pandas numpy # pip install pandas numpy # Execute the script python3 select_control_genes.py "${HOUSEKEEPING_GENES_FILE}" "${EXPRESSION_DATA_FILE}" "${OUTPUT_FILE}" "${NUM_CONTROL_GENES}" -
8
Seurat (v3.0.1) functions were used to calculate top variable genes, principal components analysis (PCA), and tSNE with FindVariableGenes, RunPCA, and RunTSNE.
$ Bash example
# Install Seurat if not already installed # R -q -e "if (!requireNamespace('Seurat', quietly = TRUE)) install.packages('Seurat')" # R -q -e "if (!requireNamespace('devtools', quietly = TRUE)) install.packages('devtools')" # R -q -e "devtools::install_github('satijalab/seurat', ref = 'release/3.0.1')" # For specific version from GitHub # Create an R script to perform Seurat analysis cat << 'EOF' > run_seurat_analysis.R library(Seurat) # Define input and output file paths input_seurat_object_path <- "seurat_object.rds" # Assumed input from a previous step output_seurat_object_path <- "processed_seurat_object.rds" # Load the Seurat object if (!file.exists(input_seurat_object_path)) { stop(paste("Input Seurat object not found:", input_seurat_object_path)) } seurat_object <- readRDS(input_seurat_object_path) # 1. Calculate top variable genes (using FindVariableFeatures for Seurat v3+) # selection.method: Method to use for feature selection (default "vst") # nfeatures: Number of features to return (default 2000) message("Finding variable features...") seurat_object <- FindVariableFeatures(seurat_object, selection.method = "vst", nfeatures = 2000) # 2. Scale data (often done before PCA, using all genes or variable features) # This step is crucial for PCA. Default features is VariableFeatures(object). message("Scaling data...") all.genes <- rownames(seurat_object) seurat_object <- ScaleData(seurat_object, features = all.genes) # 3. Perform Principal Components Analysis (PCA) # features: Features to use as input for PCA (default VariableFeatures(object)) # npcs: Number of PCs to compute (default 50) message("Running PCA...") seurat_object <- RunPCA(seurat_object, features = VariableFeatures(object = seurat_object), npcs = 30) # 4. Perform tSNE # dims: Dimensions to use as input for tSNE (default 1:30 for RunTSNE) message("Running tSNE...") seurat_object <- RunTSNE(seurat_object, dims = 1:30) # Save the processed Seurat object message(paste("Saving processed Seurat object to:", output_seurat_object_path)) saveRDS(seurat_object, file = output_seurat_object_path) message("Seurat analysis complete.") EOF # Execute the R script Rscript run_seurat_analysis.R -
9
The top 5000 genes were considered as input for the PCA calculation, and only the top 25 principal components (PCs) were used in tSNE.
$ Bash example
# Install scikit-learn and pandas if not already installed # pip install scikit-learn pandas numpy # Python script for PCA calculation python -c " import pandas as pd from sklearn.decomposition import PCA import numpy as np # --- Configuration --- INPUT_FILE = 'gene_expression_matrix_top5000.tsv' # Assumed input file containing top 5000 genes OUTPUT_FILE = 'pca_top25_components.tsv' n_components_pca = 25 # As specified: 'only the top 25 principal components (PCs) were used' # --- Load Data --- try: # Assuming the input file has genes as rows and samples as columns, with gene names as index. # The description 'top 5000 genes were considered as input' implies this file is already filtered. data = pd.read_csv(INPUT_FILE, sep='\t', index_col=0) except FileNotFoundError: print(f'Error: Input file {INPUT_FILE} not found. Please ensure the file exists.') exit(1) except Exception as e: print(f'Error loading input file: {e}') exit(1) # PCA expects features (genes) as columns and samples as rows. # If data is genes x samples, transpose it. # We check if the number of rows matches 5000 (number of genes). if data.shape[0] == 5000: data = data.T # Transpose to samples x genes # Ensure all data is numeric and handle potential missing values (e.g., drop columns with NaNs) original_columns = data.columns data = data.apply(pd.to_numeric, errors='coerce') data = data.dropna(axis=1) # Drop genes (features) that became all NaN if data.empty: print('Error: Dataframe is empty after numeric conversion and NaN handling. Check input data.') exit(1) if data.shape[1] < n_components_pca: print(f'Warning: Number of features ({data.shape[1]}) is less than n_components_pca ({n_components_pca}). Adjusting n_components_pca.') n_components_pca = data.shape[1] if n_components_pca == 0: print('Error: No valid features remaining for PCA.') exit(1) # --- Perform PCA --- pca = PCA(n_components=n_components_pca) principal_components = pca.fit_transform(data) # --- Save Results --- pc_df = pd.DataFrame( data=principal_components, columns=[f'PC{i+1}' for i in range(n_components_pca)], index=data.index # Keep sample names as index ) pc_df.to_csv(OUTPUT_FILE, sep='\t') print(f'PCA completed successfully. {n_components_pca} principal components saved to {OUTPUT_FILE}') print(f'Explained variance ratio per component: {pca.explained_variance_ratio_}') print(f'Total explained variance by {n_components_pca} components: {np.sum(pca.explained_variance_ratio_):.2f}') " -
10
none provided by the submitter
Not specified (No description provided) vNot specified (No description provided)$ Bash example
# No step description provided by the submitter. Cannot infer tool, version, or generate a specific code block.
Raw Source Text
Reads from scRNA-seq were aligned to mm10 using the 10x Genomics Cell Ranger software (v 2.1.0). Reads were collapsed into unique molecular identifier counts. All samples had >2000 cells detected with >1000 genes per cells with >70% of the coding genome covered. Genes that were not expressed in at least 5% of all cells were excluded. As previously described (Boland et al., 2020), replicates of single cell libraries were normalized removing batch effects using RUVnormalize (v1.15.0). The raw UMI matrix was scaled and input to the naiveRandRUV function with parameters coeff=1e-3 and k=10. Fifty negative control genes were taken from a list of housekeeping genes (Eisenberg and Levanon, 2013) with least variability in all datasets. Seurat (v3.0.1) functions were used to calculate top variable genes, principal components analysis (PCA), and tSNE with FindVariableGenes, RunPCA, and RunTSNE. The top 5000 genes were considered as input for the PCA calculation, and only the top 25 principal components (PCs) were used in tSNE. none provided by the submitter Assembly: mm10 Supplementary files format and content: Tab-separated values files and matrix files