GSE134809 Processing Pipeline
Publication
RNA binding protein DDX5 restricts RORγt<sup>+</sup> T<sub>reg</sub> suppressor function to promote intestine inflammation. Science Advances (2023), PMID 36724232
Dataset
GSE134809: Single-cell analysis of Crohn's disease lesions identifies a pathogenic cellular module associated with resistance to anti-TNF therapy
Processing Steps
1
FASTQ files were demultiplexed using Cell Ranger v2.0 and aligned to the GRCh38 human genome.
Cell Ranger v2.0
$ Bash example
# Install Cell Ranger (example, adjust for specific environment)
# Download Cell Ranger 2.0.0 from 10x Genomics website (requires registration)
# For example:
# wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-2.0.0.tar.gz
# tar -xzf cellranger-2.0.0.tar.gz
# export PATH=/path/to/cellranger-2.0.0:$PATH

# Define variables
SAMPLE_ID="sample_name"  # Replace with actual sample identifier
FASTQ_DIR="/path/to/fastq_files"  # Directory containing FASTQ files (e.g., from mkfastq or direct download)
REFERENCE_PATH="/path/to/cellranger_refdata/GRCh38_2.0.0"  # Path to the pre-built Cell Ranger GRCh38 reference genome for v2.0

# Execute Cell Ranger count for demultiplexing (cell barcode) and alignment
# This command takes FASTQ files, performs cell barcode demultiplexing, aligns reads to the transcriptome,
# and generates gene-barcode matrices and other output files.
cellranger count \
  --id="${SAMPLE_ID}_output" \
  --transcriptome="${REFERENCE_PATH}" \
  --fastqs="${FASTQ_DIR}" \
  --sample="${SAMPLE_ID}" \
  --expect-cells=3000  # Example parameter, adjust based on expected cell count

# Note: If the initial input was BCL files, an additional 'cellranger mkfastq' step would precede this.
# Example mkfastq command (commented out as the description implies FASTQ were already available or processed):
# BCL_DIR="/path/to/bcl_directory"
# SAMPLE_SHEET="/path/to/sample_sheet.csv"  # Required for mkfastq
# cellranger mkfastq --id="fastq_output" --run="${BCL_DIR}" --csv="${SAMPLE_SHEET}"
# FASTQ_DIR="fastq_output/outs/fastq_path"  # Output directory for FASTQ files from mkfastq
2
Cell barcodes and unique molecular identifiers (UMIs) were extracted, and a "Raw" UMI matrix was generated for each sample.
$ Bash example
# Note: in this pipeline, 'cellranger count' (step 1) already extracts cell barcodes and UMIs
# and writes the "Raw" UMI matrix. The commands below sketch a comparable route outside
# Cell Ranger using umi_tools, STAR, and featureCounts. All file names are placeholders.

# Install umi_tools
# conda install -c bioconda umi_tools

# --- Step 1: Extract UMIs and cell barcodes from raw FASTQ reads ---
# This step moves the cell barcode and UMI from read 1 into the read names.
# The pattern depends on the library preparation: C marks cell-barcode bases, N marks UMI bases.
# 16 C + 10 N is an assumption (10x 3' v2 layout); adjust as needed.
# umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
#   --stdin=sample_R1.fastq.gz --stdout=sample_R1_extracted.fastq.gz \
#   --read2-in=sample_R2.fastq.gz --read2-out=sample_R2_extracted.fastq.gz

# --- Step 2: Align the cDNA reads (read 2) to the genome ---
# A splice-aware aligner such as STAR is typical.
# STAR --genomeDir /path/to/STAR_genome --readFilesIn sample_R2_extracted.fastq.gz \
#   --readFilesCommand zcat --outFileNamePrefix sample_ \
#   --outSAMtype BAM SortedByCoordinate --runThreadN 8

# --- Step 3: Assign aligned reads to genes ---
# featureCounts writes the gene assignment into the XT tag and the status into the XS tag.
# Replace 'gencode.v38.annotation.gtf' with the annotation matching your reference (e.g., GRCh38).
# featureCounts -a gencode.v38.annotation.gtf -o sample_gene_assigned -R BAM \
#   sample_Aligned.sortedByCoord.out.bam -T 8
# samtools sort sample_Aligned.sortedByCoord.out.bam.featureCounts.bam -o sample_assigned_sorted.bam
# samtools index sample_assigned_sorted.bam

# --- Step 4: Generate the "Raw" UMI matrix from the gene-assigned reads ---
# Counts unique UMIs per gene and per cell barcode.
umi_tools count --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell \
  -I sample_assigned_sorted.bam -S sample_umi_counts.tsv.gz
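To make the idea of a "Raw" UMI matrix concrete, the following minimal Python sketch collapses read records into UMI counts. The function name `build_umi_matrix`, the record layout, and the toy barcodes are hypothetical and chosen for illustration; the sketch ignores the barcode and UMI error correction that real tools perform.

```python
from collections import defaultdict


def build_umi_matrix(records):
    """Collapse (cell_barcode, umi, gene) read records into UMI counts.

    Reads that share barcode, gene, and UMI are treated as copies of one
    molecule and counted once. No sequencing-error correction is applied.
    """
    molecules = defaultdict(set)  # (barcode, gene) -> distinct UMIs seen
    for barcode, umi, gene in records:
        molecules[(barcode, gene)].add(umi)

    matrix = defaultdict(dict)  # barcode -> {gene: UMI count}
    for (barcode, gene), umis in molecules.items():
        matrix[barcode][gene] = len(umis)
    return dict(matrix)


# Hypothetical toy reads: the second record duplicates the first.
reads = [
    ("AAAC", "UMI1", "CD3E"),
    ("AAAC", "UMI1", "CD3E"),
    ("AAAC", "UMI2", "CD3E"),
    ("AAAC", "UMI3", "HBB"),
    ("TTTG", "UMI1", "EPCAM"),
]
raw_matrix = build_umi_matrix(reads)
print(raw_matrix)
```

A nested dictionary is used here only to keep the sketch dependency-free; real matrices of this kind are stored in sparse formats.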
3
Cell barcodes associated with at least 800 UMIs were extracted from the "Raw" output UMI matrices of Cell Ranger.
$ Bash example
# Install scanpy if not already installed
# pip install scanpy

# Define input and output paths
# Cell Ranger v2.0 writes the raw matrix to outs/raw_gene_bc_matrices/<genome>/
# (matrix.mtx, genes.tsv, barcodes.tsv). Replace the placeholder path below.
CELLRANGER_OUTPUT_DIR="path/to/cellranger_output/outs/raw_gene_bc_matrices/GRCh38"
OUTPUT_BARCODES_FILE="filtered_barcodes_800_umis.tsv"
UMI_THRESHOLD=800

# Export variables for the Python script
export CELLRANGER_OUTPUT_DIR OUTPUT_BARCODES_FILE UMI_THRESHOLD

# Python script to load the Cell Ranger raw matrix, calculate UMIs per barcode, and filter
python - <<'EOF'
import os
import sys

import numpy as np
import scanpy as sc

input_dir = os.environ.get("CELLRANGER_OUTPUT_DIR")
output_file = os.environ.get("OUTPUT_BARCODES_FILE")
umi_threshold = int(os.environ.get("UMI_THRESHOLD", "800"))

if not input_dir or not output_file:
    sys.exit("Error: CELLRANGER_OUTPUT_DIR or OUTPUT_BARCODES_FILE is not set.")
if not os.path.isdir(input_dir):
    sys.exit(f"Error: Input directory not found: {input_dir}")

# Load the raw matrix (barcodes become observations, genes become variables)
adata = sc.read_10x_mtx(input_dir, var_names="gene_symbols")

# Total UMIs per barcode; flatten the (n_barcodes, 1) sparse sum to a 1-D array
adata.obs["n_umis"] = np.asarray(adata.X.sum(axis=1)).ravel()

# Keep barcodes with at least `umi_threshold` UMIs
keep = (adata.obs["n_umis"] >= umi_threshold).to_numpy()
filtered_barcodes = adata.obs_names[keep].tolist()

# Save filtered barcodes, one per line
with open(output_file, "w") as f:
    for barcode in filtered_barcodes:
        f.write(f"{barcode}\n")

print(f"Kept {len(filtered_barcodes)} of {adata.n_obs} barcodes with >= {umi_threshold} UMIs.")
print(f"Filtered barcodes saved to: {output_file}")
EOF
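Stripped of file handling, the selection in this step is a single comparison of each barcode's total UMI count against 800. The sketch below shows that logic on a toy, dictionary-based matrix; the function name `barcodes_passing_umi_threshold` and the example counts are hypothetical.

```python
def barcodes_passing_umi_threshold(umi_matrix, min_umis=800):
    """Return the sorted barcodes whose total UMI count is at least min_umis.

    umi_matrix maps each cell barcode to a {gene: UMI count} dictionary.
    """
    totals = {bc: sum(genes.values()) for bc, genes in umi_matrix.items()}
    return sorted(bc for bc, total in totals.items() if total >= min_umis)


# Hypothetical toy matrix: totals are 800, 799, and 1200 UMIs.
toy_matrix = {
    "AAAC": {"CD3E": 500, "CD4": 300},
    "CCCG": {"CD3E": 799},
    "TTTG": {"EPCAM": 1200},
}
kept = barcodes_passing_umi_threshold(toy_matrix)
print(kept)
```

The comparison is inclusive because the step description says "at least 800 UMIs", so a barcode with exactly 800 UMIs is retained.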
4
Epithelial cells, red blood cells, and cell barcodes associated with high mitochondrial mRNA content were excluded by an "in-silico" gating procedure that computes the fraction of UMIs associated with each gene signature.
Seurat v4.0.0 (tool and version inferred; not stated in the source)
$ R example
# Install Seurat if not already installed
# install.packages("Seurat")

library(Seurat)

# --- Placeholder for loading your single-cell data into a Seurat object ---
# In a real pipeline, 'seurat_object' would come from a previous step,
# e.g., from Cell Ranger output using Read10X() and CreateSeuratObject(),
# or from an RDS file:
# seurat_object <- readRDS("path/to/your/unfiltered_seurat_object.rds")

# --- Define gene signatures ---
# The gene lists below are illustrative placeholders only; the actual lists
# are detailed in the manuscript. Genes absent from the object are dropped.
# Human mitochondrial genes carry the "MT-" prefix (for mouse data, use "^mt-").
mito_genes <- grep("^MT-", rownames(seurat_object), value = TRUE)
rbc_genes  <- intersect(c("HBB", "HBA1", "HBA2"), rownames(seurat_object))
epi_genes  <- intersect(c("EPCAM", "KRT8", "KRT18"), rownames(seurat_object))

# --- Compute the percentage of UMIs assigned to each signature ---
# Each call adds one column to the Seurat object's metadata.
seurat_object[["percent.mt"]]  <- PercentageFeatureSet(seurat_object, features = mito_genes)
seurat_object[["percent.rbc"]] <- PercentageFeatureSet(seurat_object, features = rbc_genes)
seurat_object[["percent.epi"]] <- PercentageFeatureSet(seurat_object, features = epi_genes)

# --- Define gating thresholds ---
# Example values only; the thresholds actually used are detailed in the manuscript.
max_percent_mt  <- 15
max_percent_rbc <- 10
max_percent_epi <- 10

# --- Perform "in-silico" gating ---
# Exclude barcodes whose UMI fraction for any signature exceeds its threshold.
seurat_object_filtered <- subset(
  seurat_object,
  subset = percent.mt < max_percent_mt &
    percent.rbc < max_percent_rbc &
    percent.epi < max_percent_epi
)

# --- Save the filtered Seurat object (optional) ---
# saveRDS(seurat_object_filtered, file = "filtered_seurat_object.rds")
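The gating described in this step can be sketched without any single-cell library: for each barcode, compute the fraction of its UMIs that fall in each gene signature, then drop barcodes that exceed a per-signature cutoff. The gene lists, cutoffs, and function names below are hypothetical placeholders, not the values used in the study, which are detailed in the manuscript.

```python
def signature_fraction(cell_counts, signature_genes):
    """Fraction of a barcode's UMIs that belong to the given gene signature."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    return sum(cell_counts.get(gene, 0) for gene in signature_genes) / total


def gate_cells(umi_matrix, signatures, max_fraction):
    """Keep barcodes whose fraction stays at or below the cutoff for every signature."""
    kept = []
    for barcode, counts in umi_matrix.items():
        passes = all(
            signature_fraction(counts, genes) <= max_fraction[name]
            for name, genes in signatures.items()
        )
        if passes:
            kept.append(barcode)
    return sorted(kept)


# Hypothetical signatures and cutoffs, for illustration only.
signatures = {
    "mito": ["MT-CO1", "MT-ND1"],
    "rbc": ["HBB", "HBA1"],
    "epithelial": ["EPCAM", "KRT8"],
}
max_fraction = {"mito": 0.25, "rbc": 0.10, "epithelial": 0.10}

# Hypothetical toy matrix: one barcode per expected outcome.
toy_matrix = {
    "T_CELL": {"CD3E": 90, "MT-CO1": 5, "EPCAM": 5},
    "RBC": {"HBB": 80, "HBA1": 15, "CD3E": 5},
    "EPI": {"EPCAM": 40, "KRT8": 40, "CD3E": 20},
    "DYING": {"MT-CO1": 30, "MT-ND1": 30, "CD3E": 40},
}
kept_cells = gate_cells(toy_matrix, signatures, max_fraction)
print(kept_cells)
```

Whether the cutoff comparison is strict or inclusive is an assumption here; the source does not say.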
5
Thresholds and gene lists are detailed in the manuscripts.
Tool: not specified
$ Bash example
# The step description "Thresholds and gene lists are detailed in the manuscripts"
# indicates that specific parameters and outputs are documented externally.
# This step describes a reporting or documentation aspect rather than a direct computational execution.
# Therefore, no executable bash command can be inferred from this description.
Raw Source Text
FASTQ were demultiplexed using Cell Ranger v2.0 and aligned to the Grch38 human
Cell barcodes and unique molecular identifiers (UMIs) were extracted and "Raw" UMI matrix generated for each sample
extracted cell-barcodes associated with at least 800 UMIs from the "Raw" output UMI matrices of CellRanger
Epithelial cells, Red blood cells and cell-barcodess associated with high mitochondrial mRNA content were excluded by "in-silico" gating precedure which computes that fractoin of UMIs assocated with each gene signature.
Thresholds and genes lists are detailed in the manuscripts.
Genome_build: Grch38