GSE134809 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

RNA binding protein DDX5 restricts RORγt+ Treg suppressor function to promote intestine inflammation.

Science Advances (2023), PMID 36724232

Dataset

GSE134809

Single-cell analysis of Crohn's disease lesions identifies a pathogenic cellular module associated with resistance to anti-TNF therapy

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


FASTQ files were demultiplexed using Cell Ranger v2.0 and aligned to the GRCh38 human reference genome.

Cell Ranger v2.0

$ Bash example

# Install Cell Ranger (example, adjust for specific environment)
# Download Cell Ranger 2.0.0 from 10x Genomics website (requires registration)
# For example:
# wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-2.0.0.tar.gz
# tar -xzf cellranger-2.0.0.tar.gz
# export PATH=/path/to/cellranger-2.0.0:$PATH

# Define variables
SAMPLE_ID="sample_name" # Replace with actual sample identifier
FASTQ_DIR="/path/to/fastq_files" # Directory containing FASTQ files (e.g., from mkfastq or direct download)
REFERENCE_PATH="/path/to/cellranger_refdata/GRCh38_2.0.0" # Path to the pre-built Cell Ranger GRCh38 reference genome for v2.0

# Execute Cell Ranger count for demultiplexing (cell barcode) and alignment
# This command takes FASTQ files, performs cell barcode demultiplexing, aligns reads to the transcriptome,
# and generates gene-barcode matrices and other output files.
cellranger count \
    --id="${SAMPLE_ID}_output" \
    --transcriptome="${REFERENCE_PATH}" \
    --fastqs="${FASTQ_DIR}" \
    --sample="${SAMPLE_ID}" \
    --expect-cells=3000 # Example parameter, adjust based on expected cell count

# Note: If the initial input was BCL files, an additional 'cellranger mkfastq' step would precede this.
# Example mkfastq command (commented out as the description implies FASTQ were already available or processed):
# BCL_DIR="/path/to/bcl_directory"
# SAMPLE_SHEET="/path/to/sample_sheet.csv" # Required for mkfastq
# cellranger mkfastq --id="fastq_output" --run="${BCL_DIR}" --csv="${SAMPLE_SHEET}"
# FASTQ_DIR="fastq_output/outs/fastq_path" # Output directory for FASTQ files from mkfastq

Cell barcodes and unique molecular identifiers (UMIs) were extracted and a "Raw" UMI matrix was generated for each sample.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools
# conda install -c bioconda umi_tools

# NOTE: umi_tools is an inferred stand-in for this step. With 10x Genomics data, 'cellranger count'
# (previous step) already extracts cell barcodes and UMIs and writes the "Raw" UMI matrix.
# The commands below follow the UMI-tools single-cell workflow and are only a sketch.

# --- Step 1: Extract cell barcodes and UMIs from the raw FASTQ reads ---
# The barcode and UMI are moved from read 1 into the read names of both reads.
# In --bc-pattern, C marks a cell-barcode base and N marks a UMI base.
# The pattern below assumes a 16 bp barcode followed by a 10 bp UMI in R1 (10x 3' v2 layout); adjust as needed.
# umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
#                   --stdin=sample_R1.fastq.gz --stdout=sample_R1_extracted.fastq.gz \
#                   --read2-in=sample_R2.fastq.gz --read2-out=sample_R2_extracted.fastq.gz

# --- Step 2: Align the cDNA read (R2) to the genome with a splice-aware aligner such as STAR ---
# STAR --genomeDir /path/to/STAR_genome --readFilesIn sample_R2_extracted.fastq.gz \
#      --readFilesCommand zcat --outFileNamePrefix sample_ --outSAMtype BAM SortedByCoordinate --runThreadN 8

# --- Step 3: Assign aligned reads to genes ---
# umi_tools count reads the gene assignment from a BAM tag (XT), which featureCounts can write.
# Replace 'gencode.v38.annotation.gtf' with the annotation matching your reference genome (e.g., GRCh38).
# featureCounts -a gencode.v38.annotation.gtf -o sample_gene_assigned -R BAM -T 8 sample_Aligned.sortedByCoord.out.bam
# samtools sort sample_Aligned.sortedByCoord.out.bam.featureCounts.bam -o sample_assigned_sorted.bam
# samtools index sample_assigned_sorted.bam

# --- Step 4: Generate the "Raw" UMI matrix from the gene-assigned reads ---
# Counts unique UMIs per gene and per cell barcode.
umi_tools count --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell \
                -I sample_assigned_sorted.bam \
                -S sample_umi_counts.tsv.gz

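To make the barcode and UMI logic of this step concrete, here is a minimal, dependency-free Python sketch. It is an illustration under stated assumptions, not the pipeline's actual implementation: the 16 bp barcode and 10 bp UMI layout is assumed (10x 3' v2 style), the function names `split_read1` and `raw_umi_matrix` are hypothetical, and UMIs are deduplicated by exact match only, without the error correction that Cell Ranger or umi_tools may apply.

```python
from collections import defaultdict

BARCODE_LEN = 16  # assumed cell-barcode length (10x 3' v2 layout)
UMI_LEN = 10      # assumed UMI length (10x 3' v2 layout)


def split_read1(seq):
    """Split a read-1 sequence into (cell_barcode, umi)."""
    if len(seq) < BARCODE_LEN + UMI_LEN:
        raise ValueError("read 1 is shorter than barcode + UMI")
    return seq[:BARCODE_LEN], seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]


def raw_umi_matrix(assigned_reads):
    """Build {barcode: {gene: n_unique_umis}} from (read1_seq, gene) pairs.

    Reads sharing barcode, gene, and UMI are counted once
    (exact-match deduplication, no UMI error correction).
    """
    seen = defaultdict(set)  # (barcode, gene) -> set of observed UMIs
    for read1_seq, gene in assigned_reads:
        barcode, umi = split_read1(read1_seq)
        seen[(barcode, gene)].add(umi)

    matrix = defaultdict(dict)
    for (barcode, gene), umis in seen.items():
        matrix[barcode][gene] = len(umis)
    return dict(matrix)


# Toy input: the first two reads are PCR duplicates of the same molecule.
reads = [
    ("A" * 16 + "ACGTACGTAC", "CD3E"),
    ("A" * 16 + "ACGTACGTAC", "CD3E"),  # same barcode, gene, and UMI
    ("A" * 16 + "TTTTTTTTTT", "CD3E"),
    ("C" * 16 + "GGGGGGGGGG", "HBB"),
]
matrix = raw_umi_matrix(reads)
print(matrix)
```

In the toy input, the first barcode ends up with two CD3E molecules rather than three reads, which is the difference between a read count and a UMI count.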

Cell barcodes associated with at least 800 UMIs were extracted from the "Raw" output UMI matrices of Cell Ranger.

Cell Ranger v1.9.3 GitHub

$ Bash example

# Install scanpy and pandas if not already installed
# pip install scanpy pandas

# Define input and output paths
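# Replace with the actual path to the raw (unfiltered) matrix directory.
# Cell Ranger v2.x writes it to outs/raw_gene_bc_matrices/<genome>/; v3+ uses outs/raw_feature_bc_matrix/.
CELLRANGER_OUTPUT_DIR="path/to/cellranger_output/outs/raw_gene_bc_matrices/GRCh38"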
OUTPUT_BARCODES_FILE="filtered_barcodes_800_umis.tsv"
UMI_THRESHOLD=800

# Export variables for the Python script
export CELLRANGER_OUTPUT_DIR
export OUTPUT_BARCODES_FILE
export UMI_THRESHOLD

# Python script to load Cell Ranger raw matrix, calculate UMIs per cell, and filter
python -c "
import scanpy as sc
import pandas as pd
import os

input_dir = os.environ.get('CELLRANGER_OUTPUT_DIR')
output_file = os.environ.get('OUTPUT_BARCODES_FILE')
umi_threshold_raw = os.environ.get('UMI_THRESHOLD')

# Validate before converting, so a missing variable gives a clear message instead of a TypeError
if not input_dir or not output_file or not umi_threshold_raw:
    print('Error: Environment variables for input_dir, output_file, or umi_threshold are not set.')
    exit(1)
umi_threshold = int(umi_threshold_raw)

# Check if the input directory exists
if not os.path.isdir(input_dir):
    print(f'Error: Input directory not found: {input_dir}')
    exit(1)

try:
    # Load raw data from Cell Ranger output
    # The directory should contain matrix.mtx, barcodes.tsv, and genes.tsv (Cell Ranger v2) or features.tsv (v3+, gzipped)
    adata = sc.read_10x_mtx(
        input_dir,
        var_names='gene_symbols', # Or 'gene_ids' depending on the Cell Ranger version and preference
        cache=True
    )
    
    # Calculate total UMIs per cell; .A1 flattens the (n_cells, 1) matrix returned by the sparse sum into a 1-D array
    adata.obs['n_umis'] = adata.X.sum(axis=1).A1
    
    # Filter cells based on UMI count
    initial_cells = adata.n_obs
    filtered_adata = adata[adata.obs['n_umis'] >= umi_threshold, :]
    filtered_cells = filtered_adata.n_obs
    
    # Get filtered barcodes
    filtered_barcodes = filtered_adata.obs_names.tolist()
    
    # Save filtered barcodes to a file
    with open(output_file, 'w') as f:
        for barcode in filtered_barcodes:
            f.write(f'{barcode}\n')
    
    print(f'Successfully filtered {initial_cells} cells down to {filtered_cells} cells with >= {umi_threshold} UMIs.')
    print(f'Filtered barcodes saved to: {output_file}')

except Exception as e:
    print(f'An error occurred: {e}')
    exit(1)
"

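The same 800-UMI cutoff can also be sketched without scanpy, which may help when checking the logic on a small test file. The example below assumes the raw matrix is stored in Matrix Market coordinate format with genes as rows and barcodes as columns (the layout Cell Ranger writes to matrix.mtx); the function name `barcodes_passing` and the toy barcodes are hypothetical.

```python
def barcodes_passing(mtx_lines, barcodes, min_umis=800):
    """Return barcodes whose total UMI count is at least min_umis.

    mtx_lines: iterable of lines from a Matrix Market coordinate file
               (rows = genes, columns = barcodes, 1-based indices).
    barcodes:  list of barcode strings, in column order.
    """
    totals = [0] * len(barcodes)
    size_line_seen = False
    for line in mtx_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the header and comment lines
        if not size_line_seen:
            size_line_seen = True  # first data line is: n_genes n_barcodes n_entries
            continue
        _gene_idx, barcode_idx, count = line.split()
        totals[int(barcode_idx) - 1] += int(count)
    return [bc for bc, total in zip(barcodes, totals) if total >= min_umis]


# Toy matrix: barcode 1 has 801 UMIs in total, barcode 2 has 799.
toy_mtx = [
    "%%MatrixMarket matrix coordinate integer general",
    "3 2 4",
    "1 1 500",
    "2 1 300",
    "3 1 1",
    "1 2 799",
]
kept = barcodes_passing(toy_mtx, ["AAAC-1", "AAAG-1"])
print(kept)
```

With a real file, the lines could come from an open file handle (or `gzip.open(..., "rt")` for gzipped output), and the barcode list from barcodes.tsv.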

Epithelial cells, red blood cells, and cell barcodes associated with high mitochondrial mRNA content were excluded by an "in-silico" gating procedure, which computes the fraction of UMIs associated with each gene signature.

Seurat (Inferred with models/gemini-2.5-flash) v4.0.0 (Inferred with models/gemini-2.5-flash) GitHub

$ R example

# Install Seurat if not already installed
# install.packages("Seurat")
# install.packages("dplyr") # Often used with Seurat
# install.packages("Matrix") # Dependency
# install.packages("ggplot2") # Dependency for plotting

library(Seurat)
library(dplyr)

# --- Placeholder for loading your single-cell data into a Seurat object ---
# In a real pipeline, 'seurat_object' would be loaded from a previous step,
# e.g., from 10x Genomics CellRanger output using Read10X() and CreateSeuratObject(),
# or loaded from an RDS file:
# seurat_object <- readRDS("path/to/your/unfiltered_seurat_object.rds")

# For demonstration purposes, let's assume 'seurat_object' is already loaded.
# If you need to create a dummy object for testing:
# counts_matrix <- Matrix::Matrix(sample(0:10, 10000, replace = TRUE), nrow = 100, ncol = 100, sparse = TRUE)
# rownames(counts_matrix) <- paste0("gene", 1:100)
# colnames(counts_matrix) <- paste0("cell", 1:100)
# seurat_object <- CreateSeuratObject(counts = counts_matrix, project = "dummy_project")
# (CreateSeuratObject already fills in nFeature_RNA and nCount_RNA, so they need not be set by hand.)

# --- Calculate mitochondrial percentage ---
# Assuming human data with "MT-" prefix for mitochondrial genes.
# For mouse data, use pattern = "^mt-".
# This step adds 'percent.mt' to the Seurat object's metadata.
seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")

# --- Define filtering thresholds ---
# These thresholds are common starting points and should be adjusted
# based on the specific dataset, cell type, and expected data quality.
# - min_features: Minimum number of unique genes detected per cell.
# - max_features: Maximum number of unique genes detected per cell (to remove potential doublets).
# - max_percent_mt: Maximum percentage of mitochondrial reads per cell.
#   Cells with high mitochondrial content are often indicative of damaged or dying cells.

min_features <- 200
max_features <- 2500 # Example: adjust based on expected cell complexity
max_percent_mt <- 15 # Example: adjust based on data quality (e.g., 5-20%)

# --- Perform "in-silico" gating/filtering ---
# Exclude cells based on defined quality control metrics.
# This step filters out cells that are likely low quality, damaged, or potential doublets.
seurat_object_filtered <- subset(seurat_object, subset = nFeature_RNA > min_features &
                                                    nFeature_RNA < max_features &
                                                    percent.mt < max_percent_mt)

# --- Additional considerations for "Epithelial cells, Red blood cells" exclusion ---
# The description also mentions excluding specific cell types. This typically involves:
# 1. Identifying marker genes for these cell types (e.g., HBB/HBA1 for RBCs, KRTs for epithelial).
# 2. Scoring cells for these gene signatures (e.g., using Seurat's AddModuleScore).
# 3. Removing cells that highly express these signatures or fall into clusters identified as these cell types.
# This step is more complex and dataset-specific, requiring prior knowledge of cell type markers.
# Example (conceptual, requires actual marker gene expression in data):
# seurat_object_filtered <- subset(seurat_object_filtered, subset = HBB < 1 & HBA1 < 1) # To remove RBCs
# seurat_object_filtered <- subset(seurat_object_filtered, subset = Epithelial_Score < threshold) # If a score was calculated

# --- Save the filtered Seurat object (optional) ---
# saveRDS(seurat_object_filtered, file = "filtered_seurat_object.rds")

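The gating described in this step works on the fraction of a cell's UMIs that fall into each gene signature, which differs from the generic feature-count filter. A minimal Python sketch of that idea follows. The gene lists and thresholds here are hypothetical placeholders chosen for illustration; the actual values are given in the manuscript and are not reproduced here. The names `signature_fractions` and `gate` are also assumptions.

```python
# Hypothetical gene signatures and cutoffs, for illustration only.
# The real gene lists and thresholds are detailed in the manuscript.
SIGNATURES = {
    "mitochondrial": {"MT-CO1", "MT-ND1", "MT-ATP6"},
    "red_blood_cell": {"HBB", "HBA1", "HBA2"},
    "epithelial": {"EPCAM", "KRT8", "KRT18"},
}
MAX_FRACTION = {"mitochondrial": 0.25, "red_blood_cell": 0.10, "epithelial": 0.01}


def signature_fractions(gene_counts):
    """Return the fraction of a cell's UMIs that belong to each signature."""
    total = sum(gene_counts.values())
    if total == 0:
        return {name: 0.0 for name in SIGNATURES}
    return {
        name: sum(gene_counts.get(gene, 0) for gene in genes) / total
        for name, genes in SIGNATURES.items()
    }


def gate(cells):
    """Keep cells whose signature fractions are all at or below the cutoffs."""
    kept = {}
    for barcode, gene_counts in cells.items():
        fractions = signature_fractions(gene_counts)
        if all(fractions[name] <= MAX_FRACTION[name] for name in SIGNATURES):
            kept[barcode] = gene_counts
    return kept


# Toy cells as {barcode: {gene: umi_count}}; each has 100 UMIs in total.
cells = {
    "immune_cell": {"CD3E": 90, "MT-CO1": 10},               # 10% mitochondrial
    "dying_cell": {"CD3E": 50, "MT-CO1": 50},                # 50% mitochondrial
    "red_blood_cell": {"HBB": 80, "CD3E": 20},               # 80% hemoglobin
    "epithelial_cell": {"EPCAM": 5, "KRT8": 5, "CD3E": 90},  # 10% epithelial
}
kept = gate(cells)
print(sorted(kept))
```

Only the first toy cell stays below all three placeholder cutoffs. On real data the same fractions could be computed per barcode from the raw UMI matrix before any clustering.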

Thresholds and gene lists are detailed in the manuscript.

Not specified (Inferred with models/gemini-2.5-flash) vNot specified

$ Bash example

# The step description "Thresholds and genes lists are detailed in the manuscripts"
# indicates that specific parameters and outputs are documented externally.
# This step describes a reporting or documentation aspect rather than a direct computational execution.
# Therefore, no executable bash command can be inferred from this description.

Raw Source Text

FASTQ were demultiplexed using Cell Ranger v2.0 and aligned to the Grch38 human
Cell barcodes and unique molecular identifiers (UMIs) were extracted and "Raw" UMI matrix generated for each sample
extracted cell-barcodes associated with at least 800 UMIs from the "Raw" output UMI matrices of CellRanger
Epithelial cells, Red blood cells and cell-barcodess associated with high mitochondrial mRNA content were excluded by "in-silico" gating precedure which computes that fractoin of UMIs assocated with each gene signature. Thresholds and genes lists are detailed in the manuscripts.
Genome_build: Grch38
