GSE202051 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

A super-enhancer-regulated RNA-binding protein cascade drives pancreatic cancer.

Nature communications (2023) — PMID 37673892

Dataset

GSE202051

Refined molecular taxonomy and treatment remodeling of pancreatic cancer using single-cell resolution

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    BCL files were converted to FASTQ using bcl2fastq2-v2.20.

    bcl2fastq2 vv2.20 GitHub
    $ Bash example
    # Install bcl2fastq2 (example using conda)
    # conda install -c bioconda bcl2fastq2
    
    # Define input run directory and output directory
    RUN_DIR="/path/to/your/bcl_run_folder"
    OUTPUT_DIR="/path/to/your/fastq_output_folder"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute bcl2fastq2
    bcl2fastq --runfolder-dir "${RUN_DIR}" --output-dir "${OUTPUT_DIR}"
  2. 2

    CellRanger v3.0.2 was used to demultiplex the FASTQ reads, align them to the hg38 human transcriptome (pre-mRNA) reference and extract the UMI and nuclei barcodes.

    Cell Ranger v3.0.2
    $ Bash example
    # Install Cell Ranger (example using conda, adjust as needed)
    # # Download and install Cell Ranger from 10x Genomics website
    # # e.g., wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-3.0.2.tar.gz
    # # tar -xzf cellranger-3.0.2.tar.gz
    # # export PATH=/path/to/cellranger-3.0.2:$PATH
    
    # Define variables for inputs and outputs
    FASTQ_DIR="/path/to/your/fastq_files" # Directory containing FASTQ files (e.g., SampleName_S1_L001_R1_001.fastq.gz)
    SAMPLE_ID="your_sample_name" # A unique ID for your sample, typically matching the prefix of your FASTQ files
    TRANSCRIPTOME_REF="/path/to/cellranger_references/refdata-gex-GRCh38-2020-A" # Path to the pre-built Cell Ranger transcriptome reference (hg38, including pre-mRNA)
    OUTPUT_DIR="./cellranger_output" # Output directory for Cell Ranger results
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run Cell Ranger count
    cellranger count \
        --id="${SAMPLE_ID}" \
        --transcriptome="${TRANSCRIPTOME_REF}" \
        --fastqs="${FASTQ_DIR}" \
        --sample="${SAMPLE_ID}" \
        --localcores=8 \
        --localmem=64 \
        --output-directory="${OUTPUT_DIR}"
  3. 3

    We used CellBender remove-background to remove ambient RNA and other technical artifacts from the count matrices.

    CellBender v0.3.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install CellBender (if not already installed)
    # pip install cellbender
    
    # Example usage of CellBender remove-background to remove ambient RNA and other technical artifacts
    # Replace input_counts.h5 and cleaned_counts.h5 with actual file paths for your count matrices.
    # Adjust parameters like --expected-cells, --total-droplets-included, --fpr, and --epochs as needed for your dataset.
    cellbender remove-background \
        --input input_counts.h5 \
        --output cleaned_counts.h5 \
        --cuda \
        --expected-cells 5000 \
        --total-droplets-included 20000 \
        --fpr 0.01 \
        --epochs 150
  4. 4

    Nuclei with over 10,000 UMI counts were filtered.

    Seurat (Inferred with models/gemini-2.5-flash) v4.x GitHub
    $ Bash example
    #!/bin/bash
    
    # Define input and output paths
    INPUT_SEURAT_OBJECT="input_seurat_object.rds" # Placeholder for input Seurat object
    OUTPUT_SEURAT_OBJECT="filtered_seurat_object.rds" # Placeholder for output Seurat object
    UMI_THRESHOLD=10000
    
    # Check if input file exists
    if [ ! -f "$INPUT_SEURAT_OBJECT" ]; then
        echo "Error: Input Seurat object not found at $INPUT_SEURAT_OBJECT"
        exit 1
    fi
    
    # R script to perform filtering
    R_SCRIPT=$(cat <<EOF
    # Load Seurat library
    # install.packages("Seurat") # Uncomment to install if not already installed
    library(Seurat)
    
    # Load the input Seurat object
    if (!file.exists("$INPUT_SEURAT_OBJECT")) {
      stop("Input Seurat object not found.")
    }
    pbmc <- readRDS("$INPUT_SEURAT_OBJECT")
    
    # Filter cells based on UMI counts
    # Description: "Nuclei with over 10,000 UMI counts were filtered."
    # This implies removing cells where nCount_RNA > 10000.
    # So, we retain cells where nCount_RNA <= 10000.
    # 'nCount_RNA' is the total UMI count for each cell, automatically calculated by Seurat.
    pbmc_filtered <- subset(pbmc, subset = nCount_RNA <= $UMI_THRESHOLD)
    
    # Save the filtered Seurat object
    saveRDS(pbmc_filtered, file = "$OUTPUT_SEURAT_OBJECT")
    
    message(paste("Original number of cells:", ncol(pbmc)))
    message(paste("Number of cells after filtering:", ncol(pbmc_filtered)))
    message(paste("Filtered out", ncol(pbmc) - ncol(pbmc_filtered), "cells with >", $UMI_THRESHOLD, "UMI counts."))
    EOF
    )
    
    # Execute the R script
    Rscript -e "$R_SCRIPT"
    
    echo "Filtering complete. Filtered Seurat object saved to $OUTPUT_SEURAT_OBJECT"
  5. 5

    UMI counts were normalized by the total number of UMIs per nucleus and converted to transcripts-per-10,000 (TP10K) as the final expression unit.

    pandas (Inferred with models/gemini-2.5-flash) v1.5.3 GitHub
    $ Bash example
    # Install pandas (if not already installed)
    # pip install pandas
    
    # Create a dummy UMI count matrix for demonstration
    # In a real scenario, this would be loaded from a file (e.g., output from a UMI quantification tool)
    cat <<EOF > raw_umi_counts.tsv
    gene1	10	20	5
    gene2	5	10	2
    gene3	15	30	8
    EOF
    
    # Python script to perform TP10K normalization
    # This script assumes an input file `raw_umi_counts.tsv` where rows are genes and columns are nuclei/cells.
    # The first column is assumed to be gene names, and subsequent columns are UMI counts.
    python -c '
    import pandas as pd
    
    input_file = "raw_umi_counts.tsv"
    output_file = "normalized_tp10k_expression.tsv"
    
    # Read the UMI count matrix
    # Assuming the first column is gene names and subsequent columns are UMI counts for nuclei
    # header=None because our dummy file has no header row for counts
    # index_col=0 sets the first column (gene names) as the index
    umi_counts_df = pd.read_csv(input_file, sep="\t", header=None, index_col=0)
    
    # Assign generic column names (e.g., nucleus_1, nucleus_2, ...) if not present
    # This makes the output more readable if the input had no column headers
    umi_counts_df.columns = [f"nucleus_{i+1}" for i in range(umi_counts_df.shape[1])]
    
    # Calculate the total number of UMIs per nucleus (column sums)
    total_umis_per_nucleus = umi_counts_df.sum(axis=0)
    
    # Normalize UMI counts to Transcripts Per 10,000 (TP10K)
    # Formula: (UMI_count / total_UMIs_per_nucleus) * 10000
    # This operation is performed column-wise (per nucleus)
    tp10k_df = (umi_counts_df / total_umis_per_nucleus) * 10000
    
    # Print a preview of the normalized data
    print("Normalized TP10K expression matrix (first 5 rows, first 5 columns):")
    print(tp10k_df.head())
    
    # Save the normalized TP10K expression matrix to a TSV file
    # index=True to write the gene names (index) as the first column
    # header=True to write the column names (nucleus IDs) as the first row
    tp10k_df.to_csv(output_file, sep="\t", index=True, header=True)
    print(f"Normalization complete. Output saved to {output_file}")
    '
Raw Source Text
BCL files were converted to FASTQ using bcl2fastq2-v2.20.
CellRanger v3.0.2 was used to demultiplex the FASTQ reads, align them to the hg38 human transcriptome (pre-mRNA) reference and extract the UMI and nuclei barcodes.
We used CellBender remove-background to remove ambient RNA and other technical artifacts from the count matrices.
Nuclei with over 10,000 UMI counts were filtered.
UMI counts were normalized by the total number of UMIs per nucleus and converted to transcripts-per-10,000 (TP10K) as the final expression unit.
Assembly: Hg38
Supplementary files format and content: H5ad file of aggregated processed single-nuclei RNA sequencing data from all 45 human specimens; found in totaldata-final-toshare.h5ad
Supplementary files format and content: H5ad file of processed single-nuclei RNA sequencing data from the patient-derived organoid (untreated); found in adata_010nuc_10x.h5ad
Supplementary files format and content: H5ad file of processed single-nuclei RNA sequencing data from the patient-derived organoid (treated with 5-FU, SM38, and oxaliplatin); found in adata_010orgCRT_10x.h5ad
← Back to Analysis