GSE184564 Processing Pipeline — Yeo Lab Publications

Publication

RNA binding protein DDX5 directs tuft cell specification and function to regulate microbial repertoire and disease susceptibility in the intestine.

Gut (2022) — PMID 34853057

Dataset

GSE184564

Spatial transcriptomic studies of the small intestine from WT and DDX5ΔIEC (KO) mice

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

1

Space Ranger v3.1.0 was used to align reads to the mouse reference genome (mm10).

STAR (inferred with models/gemini-2.5-flash), version not specified

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

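# Define variables
# Assuming conventional paired-end reads; adjust if single-end.
# Caveat: in 10x Visium libraries, R1 carries the spatial barcode and UMI and only
# R2 carries cDNA, so this generic STAR call does not reproduce Space Ranger's spot-level output.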
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz"
GENOME_DIR="/path/to/STAR_index/mm10" # Placeholder for mouse (mm10) STAR index
OUTPUT_PREFIX="sample_aligned_"
NUM_THREADS=8 # Adjust based on available CPU cores
RAM_LIMIT_BAM_SORT=60000000000 # 60GB, adjust based on available RAM

# Note: The STAR genome index for mm10 must be pre-built. 
# Example command to build STAR index (run once per genome):
# STAR --runThreadN "${NUM_THREADS}" \
#      --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles /path/to/mm10.fa \
#      --sjdbGTFfile /path/to/mm10.gtf \
#      --sjdbOverhang 100 # Adjust based on read length

# Align reads to the mouse (mm10) reference genome using STAR
# --alignIntronMax is set to 1000000 (the ENCODE long-RNA setting); a value of 1
# effectively disables de novo spliced alignment, which is unsuitable for RNA-seq reads.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --readFilesCommand zcat \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --alignIntronMax 1000000 \
     --alignEndsType Local \
     --outSAMunmapped Within \
     --outSAMstrandField intronMotif \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66 \
     --limitBAMsortRAM "${RAM_LIMIT_BAM_SORT}"
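
Since the step description names Space Ranger rather than standalone STAR, the same step can also be sketched as a `spaceranger count` call, which handles alignment and spot-level counting in one run. The Python snippet below only assembles and prints the command; it does not execute anything. The sample ID, paths, slide serial number, and capture area are all hypothetical placeholders, and the flag names should be checked against `spaceranger count --help` for the installed version.

```python
import shlex


def build_spaceranger_count(sample_id, transcriptome, fastq_dir, image,
                            slide, area, cores=8, mem_gb=64):
    """Assemble a `spaceranger count` command as an argument list.

    Nothing is executed here; the caller decides whether to run it.
    All argument values are supplied by the caller and are not taken
    from the publication.
    """
    return [
        "spaceranger", "count",
        f"--id={sample_id}",                # name of the output directory
        f"--transcriptome={transcriptome}",  # pre-built mm10 reference
        f"--fastqs={fastq_dir}",            # directory holding the FASTQ files
        f"--sample={sample_id}",            # FASTQ sample-name prefix
        f"--image={image}",                 # brightfield tissue image
        f"--slide={slide}",                 # Visium slide serial number
        f"--area={area}",                   # capture area on the slide
        f"--localcores={cores}",
        f"--localmem={mem_gb}",
    ]


# Hypothetical placeholder values, for illustration only.
cmd = build_spaceranger_count(
    sample_id="WT_small_intestine",
    transcriptome="/path/to/refdata-mm10",
    fastq_dir="/path/to/fastqs",
    image="/path/to/tissue_image.tif",
    slide="V00X00-000",
    area="A1",
)
print(shlex.join(cmd))
```

Printing the command before running it makes it easy to review the paths and slide metadata, which are the inputs most likely to differ between samples.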

2

Loupe Browser was used to filter two sets of spots at a threshold of Log2 expression > 0.5: progenitor-containing spots (Atoh1, Sox4, Lgr5 high) and mature secretory lineage-containing spots (Dclk1, Klf4, Lyz1, Chga high).

Loupe Browser (inferred with models/gemini-2.5-flash), version not specified

$ Bash example

# Loupe Browser is primarily a GUI tool for interactive visualization and analysis.
# The filtering described is typically performed interactively within the Loupe Browser application.
# Below is a conceptual Python script that would achieve similar filtering programmatically
# if the expression data were available in a tabular format (e.g., CSV).

# Create a dummy expression_matrix.csv for demonstration purposes
# In a real scenario, this file would be generated by upstream steps (e.g., Space Ranger output)
echo "Spot_ID,Atoh1,Sox4,Lgr5,Dclk1,Klf4,Lyz1,Chga,OtherGene" > expression_matrix.csv
echo "Spot1,0.1,0.2,0.6,0.1,0.1,0.1,0.1,1.0" >> expression_matrix.csv # Progenitor
echo "Spot2,0.7,0.1,0.2,0.1,0.1,0.1,0.1,1.0" >> expression_matrix.csv # Progenitor
echo "Spot3,0.1,0.1,0.1,0.6,0.2,0.1,0.1,1.0" >> expression_matrix.csv # Mature Secretory
echo "Spot4,0.1,0.1,0.1,0.1,0.7,0.1,0.1,1.0" >> expression_matrix.csv # Mature Secretory
echo "Spot5,0.1,0.1,0.1,0.1,0.1,0.1,0.1,1.0" >> expression_matrix.csv # Neither
echo "Spot6,0.8,0.7,0.6,0.9,0.8,0.7,0.6,1.0" >> expression_matrix.csv # Both (will appear in both outputs)

# Ensure Python and pandas are installed
# conda install pandas numpy

# Execute the Python script for filtering
python -c '
import pandas as pd
import numpy as np

input_matrix = "expression_matrix.csv"
output_progenitor = "filtered_progenitor_spots.csv"
output_mature_secretory = "filtered_mature_secretory_spots.csv"

progenitor_genes = ["Atoh1", "Sox4", "Lgr5"]
mature_secretory_genes = ["Dclk1", "Klf4", "Lyz1", "Chga"]
log2_threshold = 0.5

try:
    df = pd.read_csv(input_matrix, index_col=0)

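    # any(axis=1) keeps a spot when at least one marker exceeds the threshold;
    # switch to all(axis=1) if every marker must be high.
    progenitor_filter = df[progenitor_genes].gt(log2_threshold).any(axis=1)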
    filtered_progenitor_spots = df[progenitor_filter]

    mature_secretory_filter = df[mature_secretory_genes].gt(log2_threshold).any(axis=1)
    filtered_mature_secretory_spots = df[mature_secretory_filter]

    filtered_progenitor_spots.to_csv(output_progenitor)
    filtered_mature_secretory_spots.to_csv(output_mature_secretory)

    print(f"Filtered progenitor spots saved to {output_progenitor}")
    print(f"Filtered mature secretory spots saved to {output_mature_secretory}")

except FileNotFoundError:
    print(f"Error: Input file {input_matrix} not found. Please ensure the file exists.")
except KeyError as e:
    print(f"Error: One or more specified genes not found in the expression matrix columns: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
'

# Clean up dummy file (optional, for a real pipeline this would be an actual input)
# rm expression_matrix.csv
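
The step description does not say whether a spot must be high for every listed marker or for at least one of them. The following standard-library sketch makes that choice explicit with a hypothetical `require_all` switch. The toy values and the treatment of a missing gene as 0.0 are assumptions for illustration, not part of the published method.

```python
PROGENITOR = ("Atoh1", "Sox4", "Lgr5")
MATURE_SECRETORY = ("Dclk1", "Klf4", "Lyz1", "Chga")
THRESHOLD = 0.5  # Log2 expression cutoff quoted in the step description


def select_spots(expr, markers, threshold=THRESHOLD, require_all=False):
    """Return the IDs of spots whose marker values exceed the threshold.

    expr maps spot ID -> {gene: log2 expression}. A gene missing from a
    spot is treated as 0.0 (an assumption made for this sketch).
    """
    combine = all if require_all else any
    return [
        spot
        for spot, genes in expr.items()
        if combine(genes.get(marker, 0.0) > threshold for marker in markers)
    ]


# Toy values only; real input would come from the Space Ranger output.
toy = {
    "Spot1": {"Atoh1": 0.1, "Sox4": 0.2, "Lgr5": 0.6},
    "Spot5": {"Atoh1": 0.1, "Sox4": 0.1, "Lgr5": 0.1},
    "Spot6": {"Atoh1": 0.8, "Sox4": 0.7, "Lgr5": 0.6},
}

print(select_spots(toy, PROGENITOR))                    # at least one marker high
print(select_spots(toy, PROGENITOR, require_all=True))  # every marker high
```

With these toy values the first call keeps Spot1 and Spot6, while the stricter call keeps only Spot6, so the two readings can give noticeably different spot sets.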

3

Differential gene expression analysis

DESeq2 (inferred with models/gemini-2.5-flash), v1.36.0 (R 4.2.0)

$ Bash example

# This script performs differential gene expression analysis using DESeq2.
# It assumes you have a gene count matrix (e.g., from featureCounts, Salmon, Kallisto, RSEM)
# and a sample metadata file with experimental design information.

# --- Installation (uncomment and run if DESeq2 is not installed) ---
# # Install R and Bioconductor if not already present
# # For example, using conda:
# # conda create -n deseq2_env r-base=4.2.0 bioconductor-deseq2=1.36.0 -c bioconda -c conda-forge -y
# # conda activate deseq2_env

# # Alternatively, within R:
# # if (!requireNamespace("BiocManager", quietly = TRUE))
# #     install.packages("BiocManager")
# # BiocManager::install("DESeq2")

# --- Prepare Input Files (replace with your actual data) ---
# Create a dummy design file (replace with your actual sample metadata)
# This file should contain sample names as row names and experimental conditions as columns.
# Example: sample,condition,batch
#          sample1,treated,A
#          sample2,treated,A
#          sample3,control,B
#          sample4,control,B
cat << EOF > design.csv
sample,condition,batch
sample1,treated,A
sample2,treated,A
sample3,control,B
sample4,control,B
EOF

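# Create a dummy count file (replace with your actual gene count matrix)
# This file should contain gene IDs in the first column and one column of raw counts per sample.
# A matrix this small only illustrates the format; DESeq2's dispersion fitting may fail on so few genes.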
# Example: gene_id,sample1,sample2,sample3,sample4
#          geneA,100,120,50,60
#          geneB,200,210,100,110
#          geneC,50,60,150,160
#          geneD,10,12,5,6
#          geneE,300,320,150,170
cat << EOF > counts.csv
gene_id,sample1,sample2,sample3,sample4
geneA,100,120,50,60
geneB,200,210,100,110
geneC,50,60,150,160
geneD,10,12,5,6
geneE,300,320,150,170
EOF

# --- R Script for DESeq2 Analysis ---
# This script will perform the differential expression analysis and save the results.
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# Load count data
# Ensure the first column is gene IDs and subsequent columns are raw counts for each sample.
count_data <- read.csv("counts.csv", row.names = 1)
count_data <- as.matrix(count_data)

# Load sample metadata
# Ensure the first column is sample IDs and subsequent columns are experimental factors.
sample_info <- read.csv("design.csv", row.names = 1)

# Ensure sample order in colData matches countData columns
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# Create DESeqDataSet object
# The 'design' formula specifies the experimental factors to model.
# Example: ~ condition (for a simple two-group comparison)
# Example: ~ batch + condition (to account for batch effects)
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ condition) # Adjust design formula as per your experiment

# Pre-filtering: remove genes with very low counts across all samples
# This can improve power by removing genes that are unlikely to be differentially expressed.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results for a specific contrast (e.g., 'treated' vs 'control')
# Replace 'condition', 'treated', 'control' with your actual factor name and levels.
res <- results(dds, contrast = c("condition", "treated", "control"))

# Order results by adjusted p-value
res <- res[order(res$padj), ]

# Save results to a CSV file
write.csv(as.data.frame(res), file = "deseq2_results.csv")

# Generate and save normalized counts
normalized_counts <- counts(dds, normalized = TRUE)
write.csv(as.data.frame(normalized_counts), file = "normalized_counts.csv")

message("Differential gene expression analysis complete. Results saved to deseq2_results.csv and normalized_counts.csv")

# Optional: Generate MA plot
# png("MA_plot.png")
# plotMA(res, main="MA plot of DESeq2 results")
# dev.off()

# Optional: Generate Volcano plot (requires EnhancedVolcano package)
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("EnhancedVolcano")
# library(EnhancedVolcano)
# png("Volcano_plot.png", width=800, height=800)
# EnhancedVolcano(res,
#                 lab = rownames(res),
#                 x = 'log2FoldChange',
#                 y = 'pvalue',
#                 title = 'Volcano Plot',
#                 pCutoff = 0.05,
#                 FCcutoff = 1.5,
#                 pointSize = 3.0,
#                 labSize = 6.0,
#                 colAlpha = 1,
#                 legendPosition = 'right',
#                 caption = bquote(~Log[2]~"FC cutoff, 1.5; p-value cutoff, 0.05"),
#                 gridlines.major = FALSE,
#                 gridlines.minor = FALSE,
#                 border = 'full',
#                 borderWidth = 1.5,
#                 borderColour = 'black')
# dev.off()
EOF

# --- Execute the R script ---
Rscript run_deseq2.R

echo "Differential gene expression analysis finished. Check deseq2_results.csv for results."
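
The R script reorders the sample metadata by the count-matrix column names, and a sample that is absent from the metadata would show up there as a row of missing values. A quick pre-flight check can catch this before DESeq2 runs. The sketch below uses only the Python standard library; the function name `check_sample_ids` is hypothetical, and it assumes the file layout of the dummy `counts.csv` and `design.csv` above (gene IDs in the first column of the counts, a `sample` column in the design).

```python
import csv
import io


def check_sample_ids(counts_fh, design_fh):
    """Compare sample IDs between a count matrix and a design table.

    Both arguments are open text handles for CSV data. Returns a tuple
    (missing_from_design, missing_from_counts) of sorted lists.
    """
    header = next(csv.reader(counts_fh))
    count_samples = header[1:]  # the first column holds gene IDs
    design_samples = [row["sample"] for row in csv.DictReader(design_fh)]
    missing_from_design = sorted(set(count_samples) - set(design_samples))
    missing_from_counts = sorted(set(design_samples) - set(count_samples))
    return missing_from_design, missing_from_counts


# Toy inputs mirroring the dummy files; sample4 is left out of the design on purpose.
counts_text = "gene_id,sample1,sample2,sample3,sample4\ngeneA,100,120,50,60\n"
design_text = "sample,condition,batch\nsample1,treated,A\nsample2,treated,A\nsample3,control,B\n"

not_in_design, not_in_counts = check_sample_ids(
    io.StringIO(counts_text), io.StringIO(design_text)
)
print("In counts but not in design:", not_in_design)
print("In design but not in counts:", not_in_counts)
```

For real files, pass handles opened with `open("counts.csv", newline="")` and `open("design.csv", newline="")` in place of the `io.StringIO` objects.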
