GSE202051 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

A super-enhancer-regulated RNA-binding protein cascade drives pancreatic cancer.

Nature communications (2023) — PMID 37673892

Dataset

GSE202051

Refined molecular taxonomy and treatment remodeling of pancreatic cancer using single-cell resolution

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

BCL files were converted to FASTQ using bcl2fastq2-v2.20.

bcl2fastq2 vv2.20 GitHub

$ Bash example

# Install bcl2fastq2 (example using conda)
# conda install -c bioconda bcl2fastq2

# Define input run directory and output directory
RUN_DIR="/path/to/your/bcl_run_folder"
OUTPUT_DIR="/path/to/your/fastq_output_folder"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute bcl2fastq2
bcl2fastq --runfolder-dir "${RUN_DIR}" --output-dir "${OUTPUT_DIR}"

View on GitHub

CellRanger v3.0.2 was used to demultiplex the FASTQ reads, align them to the hg38 human transcriptome (pre-mRNA) reference and extract the UMI and nuclei barcodes.

Cell Ranger v3.0.2

$ Bash example

# Install Cell Ranger (example using conda, adjust as needed)
# # Download and install Cell Ranger from 10x Genomics website
# # e.g., wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-3.0.2.tar.gz
# # tar -xzf cellranger-3.0.2.tar.gz
# # export PATH=/path/to/cellranger-3.0.2:$PATH

# Define variables for inputs and outputs
FASTQ_DIR="/path/to/your/fastq_files" # Directory containing FASTQ files (e.g., SampleName_S1_L001_R1_001.fastq.gz)
SAMPLE_ID="your_sample_name" # A unique ID for your sample, typically matching the prefix of your FASTQ files
TRANSCRIPTOME_REF="/path/to/cellranger_references/refdata-gex-GRCh38-2020-A" # Path to the pre-built Cell Ranger transcriptome reference (hg38, including pre-mRNA)
OUTPUT_DIR="./cellranger_output" # Output directory for Cell Ranger results

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run Cell Ranger count
cellranger count \
    --id="${SAMPLE_ID}" \
    --transcriptome="${TRANSCRIPTOME_REF}" \
    --fastqs="${FASTQ_DIR}" \
    --sample="${SAMPLE_ID}" \
    --localcores=8 \
    --localmem=64 \
    --output-directory="${OUTPUT_DIR}"

We used CellBender remove-background to remove ambient RNA and other technical artifacts from the count matrices.

CellBender v0.3.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install CellBender (if not already installed)
# pip install cellbender

# Example usage of CellBender remove-background to remove ambient RNA and other technical artifacts
# Replace input_counts.h5 and cleaned_counts.h5 with actual file paths for your count matrices.
# Adjust parameters like --expected-cells, --total-droplets-included, --fpr, and --epochs as needed for your dataset.
cellbender remove-background \
    --input input_counts.h5 \
    --output cleaned_counts.h5 \
    --cuda \
    --expected-cells 5000 \
    --total-droplets-included 20000 \
    --fpr 0.01 \
    --epochs 150

View on GitHub

Nuclei with over 10,000 UMI counts were filtered.

Seurat (Inferred with models/gemini-2.5-flash) v4.x GitHub

$ Bash example

#!/bin/bash

# Define input and output paths
INPUT_SEURAT_OBJECT="input_seurat_object.rds" # Placeholder for input Seurat object
OUTPUT_SEURAT_OBJECT="filtered_seurat_object.rds" # Placeholder for output Seurat object
UMI_THRESHOLD=10000

# Check if input file exists
if [ ! -f "$INPUT_SEURAT_OBJECT" ]; then
    echo "Error: Input Seurat object not found at $INPUT_SEURAT_OBJECT"
    exit 1
fi

# R script to perform filtering
R_SCRIPT=$(cat <<EOF
# Load Seurat library
# install.packages("Seurat") # Uncomment to install if not already installed
library(Seurat)

# Load the input Seurat object
if (!file.exists("$INPUT_SEURAT_OBJECT")) {
  stop("Input Seurat object not found.")
}
pbmc <- readRDS("$INPUT_SEURAT_OBJECT")

# Filter cells based on UMI counts
# Description: "Nuclei with over 10,000 UMI counts were filtered."
# This implies removing cells where nCount_RNA > 10000.
# So, we retain cells where nCount_RNA <= 10000.
# 'nCount_RNA' is the total UMI count for each cell, automatically calculated by Seurat.
pbmc_filtered <- subset(pbmc, subset = nCount_RNA <= $UMI_THRESHOLD)

# Save the filtered Seurat object
saveRDS(pbmc_filtered, file = "$OUTPUT_SEURAT_OBJECT")

message(paste("Original number of cells:", ncol(pbmc)))
message(paste("Number of cells after filtering:", ncol(pbmc_filtered)))
message(paste("Filtered out", ncol(pbmc) - ncol(pbmc_filtered), "cells with >", $UMI_THRESHOLD, "UMI counts."))
EOF
)

# Execute the R script
Rscript -e "$R_SCRIPT"

echo "Filtering complete. Filtered Seurat object saved to $OUTPUT_SEURAT_OBJECT"

View on GitHub

UMI counts were normalized by the total number of UMIs per nucleus and converted to transcripts-per-10,000 (TP10K) as the final expression unit.

pandas (Inferred with models/gemini-2.5-flash) v1.5.3 GitHub

$ Bash example

# Install pandas (if not already installed)
# pip install pandas

# Create a dummy UMI count matrix for demonstration
# In a real scenario, this would be loaded from a file (e.g., output from a UMI quantification tool)
cat <<EOF > raw_umi_counts.tsv
gene1	10	20	5
gene2	5	10	2
gene3	15	30	8
EOF

# Python script to perform TP10K normalization
# This script assumes an input file `raw_umi_counts.tsv` where rows are genes and columns are nuclei/cells.
# The first column is assumed to be gene names, and subsequent columns are UMI counts.
python -c '
import pandas as pd

input_file = "raw_umi_counts.tsv"
output_file = "normalized_tp10k_expression.tsv"

# Read the UMI count matrix
# Assuming the first column is gene names and subsequent columns are UMI counts for nuclei
# header=None because our dummy file has no header row for counts
# index_col=0 sets the first column (gene names) as the index
umi_counts_df = pd.read_csv(input_file, sep="\t", header=None, index_col=0)

# Assign generic column names (e.g., nucleus_1, nucleus_2, ...) if not present
# This makes the output more readable if the input had no column headers
umi_counts_df.columns = [f"nucleus_{i+1}" for i in range(umi_counts_df.shape[1])]

# Calculate the total number of UMIs per nucleus (column sums)
total_umis_per_nucleus = umi_counts_df.sum(axis=0)

# Normalize UMI counts to Transcripts Per 10,000 (TP10K)
# Formula: (UMI_count / total_UMIs_per_nucleus) * 10000
# This operation is performed column-wise (per nucleus)
tp10k_df = (umi_counts_df / total_umis_per_nucleus) * 10000

# Print a preview of the normalized data
print("Normalized TP10K expression matrix (first 5 rows, first 5 columns):")
print(tp10k_df.head())

# Save the normalized TP10K expression matrix to a TSV file
# index=True to write the gene names (index) as the first column
# header=True to write the column names (nucleus IDs) as the first row
tp10k_df.to_csv(output_file, sep="\t", index=True, header=True)
print(f"Normalization complete. Output saved to {output_file}")
'

View on GitHub

Raw Source Text

BCL files were converted to FASTQ using bcl2fastq2-v2.20.
CellRanger v3.0.2 was used to demultiplex the FASTQ reads, align them to the hg38 human transcriptome (pre-mRNA) reference and extract the UMI and nuclei barcodes.
We used CellBender remove-background to remove ambient RNA and other technical artifacts from the count matrices.
Nuclei with over 10,000 UMI counts were filtered.
UMI counts were normalized by the total number of UMIs per nucleus and converted to transcripts-per-10,000 (TP10K) as the final expression unit.
Assembly: Hg38
Supplementary files format and content: H5ad file of aggregated processed single-nuclei RNA sequencing data from all 45 human specimens; found in totaldata-final-toshare.h5ad
Supplementary files format and content: H5ad file of processed single-nuclei RNA sequencing data from the patient-derived organoid (untreated); found in adata_010nuc_10x.h5ad
Supplementary files format and content: H5ad file of processed single-nuclei RNA sequencing data from the patient-derived organoid (treated with 5-FU, SM38, and oxaliplatin); found in adata_010orgCRT_10x.h5ad

← Back to Analysis