GSE202051 Processing Pipeline
RNA-Seq
code_examples
5 steps
Publication
A super-enhancer-regulated RNA-binding protein cascade drives pancreatic cancer.Nature communications (2023) — PMID 37673892
Dataset
GSE202051Refined molecular taxonomy and treatment remodeling of pancreatic cancer using single-cell resolution
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
BCL files were converted to FASTQ using bcl2fastq2-v2.20.
$ Bash example
# Install bcl2fastq2 (example using conda) # conda install -c bioconda bcl2fastq2 # Define input run directory and output directory RUN_DIR="/path/to/your/bcl_run_folder" OUTPUT_DIR="/path/to/your/fastq_output_folder" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Execute bcl2fastq2 bcl2fastq --runfolder-dir "${RUN_DIR}" --output-dir "${OUTPUT_DIR}" -
2
CellRanger v3.0.2 was used to demultiplex the FASTQ reads, align them to the hg38 human transcriptome (pre-mRNA) reference and extract the UMI and nuclei barcodes.
Cell Ranger v3.0.2$ Bash example
# Install Cell Ranger (example using conda, adjust as needed) # # Download and install Cell Ranger from 10x Genomics website # # e.g., wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-3.0.2.tar.gz # # tar -xzf cellranger-3.0.2.tar.gz # # export PATH=/path/to/cellranger-3.0.2:$PATH # Define variables for inputs and outputs FASTQ_DIR="/path/to/your/fastq_files" # Directory containing FASTQ files (e.g., SampleName_S1_L001_R1_001.fastq.gz) SAMPLE_ID="your_sample_name" # A unique ID for your sample, typically matching the prefix of your FASTQ files TRANSCRIPTOME_REF="/path/to/cellranger_references/refdata-gex-GRCh38-2020-A" # Path to the pre-built Cell Ranger transcriptome reference (hg38, including pre-mRNA) OUTPUT_DIR="./cellranger_output" # Output directory for Cell Ranger results # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run Cell Ranger count cellranger count \ --id="${SAMPLE_ID}" \ --transcriptome="${TRANSCRIPTOME_REF}" \ --fastqs="${FASTQ_DIR}" \ --sample="${SAMPLE_ID}" \ --localcores=8 \ --localmem=64 \ --output-directory="${OUTPUT_DIR}" -
3
We used CellBender remove-background to remove ambient RNA and other technical artifacts from the count matrices.
$ Bash example
# Install CellBender (if not already installed) # pip install cellbender # Example usage of CellBender remove-background to remove ambient RNA and other technical artifacts # Replace input_counts.h5 and cleaned_counts.h5 with actual file paths for your count matrices. # Adjust parameters like --expected-cells, --total-droplets-included, --fpr, and --epochs as needed for your dataset. cellbender remove-background \ --input input_counts.h5 \ --output cleaned_counts.h5 \ --cuda \ --expected-cells 5000 \ --total-droplets-included 20000 \ --fpr 0.01 \ --epochs 150 -
4
Nuclei with over 10,000 UMI counts were filtered.
$ Bash example
#!/bin/bash # Define input and output paths INPUT_SEURAT_OBJECT="input_seurat_object.rds" # Placeholder for input Seurat object OUTPUT_SEURAT_OBJECT="filtered_seurat_object.rds" # Placeholder for output Seurat object UMI_THRESHOLD=10000 # Check if input file exists if [ ! -f "$INPUT_SEURAT_OBJECT" ]; then echo "Error: Input Seurat object not found at $INPUT_SEURAT_OBJECT" exit 1 fi # R script to perform filtering R_SCRIPT=$(cat <<EOF # Load Seurat library # install.packages("Seurat") # Uncomment to install if not already installed library(Seurat) # Load the input Seurat object if (!file.exists("$INPUT_SEURAT_OBJECT")) { stop("Input Seurat object not found.") } pbmc <- readRDS("$INPUT_SEURAT_OBJECT") # Filter cells based on UMI counts # Description: "Nuclei with over 10,000 UMI counts were filtered." # This implies removing cells where nCount_RNA > 10000. # So, we retain cells where nCount_RNA <= 10000. # 'nCount_RNA' is the total UMI count for each cell, automatically calculated by Seurat. pbmc_filtered <- subset(pbmc, subset = nCount_RNA <= $UMI_THRESHOLD) # Save the filtered Seurat object saveRDS(pbmc_filtered, file = "$OUTPUT_SEURAT_OBJECT") message(paste("Original number of cells:", ncol(pbmc))) message(paste("Number of cells after filtering:", ncol(pbmc_filtered))) message(paste("Filtered out", ncol(pbmc) - ncol(pbmc_filtered), "cells with >", $UMI_THRESHOLD, "UMI counts.")) EOF ) # Execute the R script Rscript -e "$R_SCRIPT" echo "Filtering complete. Filtered Seurat object saved to $OUTPUT_SEURAT_OBJECT" -
5
UMI counts were normalized by the total number of UMIs per nucleus and converted to transcripts-per-10,000 (TP10K) as the final expression unit.
$ Bash example
# Install pandas (if not already installed) # pip install pandas # Create a dummy UMI count matrix for demonstration # In a real scenario, this would be loaded from a file (e.g., output from a UMI quantification tool) cat <<EOF > raw_umi_counts.tsv gene1 10 20 5 gene2 5 10 2 gene3 15 30 8 EOF # Python script to perform TP10K normalization # This script assumes an input file `raw_umi_counts.tsv` where rows are genes and columns are nuclei/cells. # The first column is assumed to be gene names, and subsequent columns are UMI counts. python -c ' import pandas as pd input_file = "raw_umi_counts.tsv" output_file = "normalized_tp10k_expression.tsv" # Read the UMI count matrix # Assuming the first column is gene names and subsequent columns are UMI counts for nuclei # header=None because our dummy file has no header row for counts # index_col=0 sets the first column (gene names) as the index umi_counts_df = pd.read_csv(input_file, sep="\t", header=None, index_col=0) # Assign generic column names (e.g., nucleus_1, nucleus_2, ...) if not present # This makes the output more readable if the input had no column headers umi_counts_df.columns = [f"nucleus_{i+1}" for i in range(umi_counts_df.shape[1])] # Calculate the total number of UMIs per nucleus (column sums) total_umis_per_nucleus = umi_counts_df.sum(axis=0) # Normalize UMI counts to Transcripts Per 10,000 (TP10K) # Formula: (UMI_count / total_UMIs_per_nucleus) * 10000 # This operation is performed column-wise (per nucleus) tp10k_df = (umi_counts_df / total_umis_per_nucleus) * 10000 # Print a preview of the normalized data print("Normalized TP10K expression matrix (first 5 rows, first 5 columns):") print(tp10k_df.head()) # Save the normalized TP10K expression matrix to a TSV file # index=True to write the gene names (index) as the first column # header=True to write the column names (nucleus IDs) as the first row tp10k_df.to_csv(output_file, sep="\t", index=True, header=True) print(f"Normalization complete. Output saved to {output_file}") '
Raw Source Text
BCL files were converted to FASTQ using bcl2fastq2-v2.20. CellRanger v3.0.2 was used to demultiplex the FASTQ reads, align them to the hg38 human transcriptome (pre-mRNA) reference and extract the UMI and nuclei barcodes. We used CellBender remove-background to remove ambient RNA and other technical artifacts from the count matrices. Nuclei with over 10,000 UMI counts were filtered. UMI counts were normalized by the total number of UMIs per nucleus and converted to transcripts-per-10,000 (TP10K) as the final expression unit. Assembly: Hg38 Supplementary files format and content: H5ad file of aggregated processed single-nuclei RNA sequencing data from all 45 human specimens; found in totaldata-final-toshare.h5ad Supplementary files format and content: H5ad file of processed single-nuclei RNA sequencing data from the patient-derived organoid (untreated); found in adata_010nuc_10x.h5ad Supplementary files format and content: H5ad file of processed single-nuclei RNA sequencing data from the patient-derived organoid (treated with 5-FU, SM38, and oxaliplatin); found in adata_010orgCRT_10x.h5ad