GSE147127 Processing Pipeline

RNA-Seq, 5 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

Single-nucleus RNA-seq identifies transcriptional heterogeneity in multinucleated skeletal myofibers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Raw sequencing data of all samples were processed using the Cell Ranger workflow (version 3.1.0), using a combined intron-exon reference produced following the vendor-provided “Generating a Cell Ranger compatible "pre-mRNA" Reference Package” guidelines (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references).

Cell Ranger v3.1.0

$ Bash example

# Install Cell Ranger (example for version 3.1.0)
# wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-3.1.0.tar.gz
# tar -xzf cellranger-3.1.0.tar.gz
# export PATH=/path/to/cellranger-3.1.0:$PATH

# Run cellranger count workflow
# This command assumes a custom pre-mRNA reference was built as described in the provided guidelines.
# Replace placeholders like /path/to/raw_fastqs, /path/to/custom_intron_exon_reference, sample_name, and sample_output_directory.
cellranger count \
    --id=sample_output_directory \
    --transcriptome=/path/to/custom_intron_exon_reference \
    --fastqs=/path/to/raw_fastqs \
    --sample=sample_name \
    --localcores=8 \
    --localmem=64
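
Because the source states that all samples were processed with the same workflow, the command above can be wrapped in a loop. This is a minimal sketch, not the authors' script: the sample names, the paths, the `run_count` helper, and the `DRY_RUN` switch are all assumptions introduced for illustration. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
set -eu

# Hypothetical sample names and paths; replace with the real ones.
SAMPLES="sample_A sample_B"
REFERENCE="/path/to/custom_intron_exon_reference"
FASTQ_DIR="/path/to/raw_fastqs"

# Build the cellranger count command for one sample.
# With DRY_RUN=1 (the default) the command is printed; with DRY_RUN=0 it is executed.
run_count() {
    id="$1"
    set -- cellranger count \
        --id="${id}" \
        --transcriptome="${REFERENCE}" \
        --fastqs="${FASTQ_DIR}" \
        --sample="${id}" \
        --localcores=8 \
        --localmem=64
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Intentionally unquoted so the list splits on whitespace.
for s in ${SAMPLES}; do
    run_count "${s}"
done
```

Running the script with `DRY_RUN=0` set in the environment would launch one `cellranger count` job per sample, sequentially.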

In brief, the “pre-mRNA” reference was derived using the default exon-level GTF file provided by 10x Genomics (http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz).

Cell Ranger v3.0.0

$ Bash example

# Download the 10x Genomics reference data tarball
wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz
tar -xzf refdata-cellranger-mm10-3.0.0.tar.gz

# Define paths to the extracted FASTA and GTF from the 10x Genomics tarball
GENOME_FASTA="refdata-cellranger-mm10-3.0.0/fasta/genome.fa"
GENES_GTF="refdata-cellranger-mm10-3.0.0/genes/genes.gtf"

# Confirm that both exon-level inputs were extracted.
# The pre-mRNA reference itself is built from a modified GTF in the following steps.
# cellranger mkref takes --genome, --fasta and --genes; it has no --include-introns option
# (that flag belongs to cellranger count in later Cell Ranger releases).
ls -lh "${GENOME_FASTA}" "${GENES_GTF}"

Using the below awk command, this exon-level GTF file was converted into a “pre-mRNA” GTF containing intron transcript definitions.

awk (version not specified)

$ Bash example

# Define input and output GTF files
# Input: exon-level GTF from the 10x mm10 3.0.0 reference. Output name matches the mkref step.
INPUT_GTF="refdata-cellranger-mm10-3.0.0/genes/genes.gtf"
OUTPUT_GTF="genes.premrna.gtf"

# Convert the exon-level GTF to a "pre-mRNA" GTF using awk.
# The source text does not reproduce the exact command. This one follows the approach in the
# 10x Genomics pre-mRNA reference guidelines: keep only 'transcript' lines and relabel them
# as 'exon', so that each transcript becomes a single exon spanning its introns.
awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript" {$3="exon"; print}' "${INPUT_GTF}" > "${OUTPUT_GTF}"
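
The effect of a pre-mRNA conversion is easiest to see on a toy annotation. The sketch below assumes the conversion follows the 10x Genomics pre-mRNA guidelines (transcript lines relabelled as exon lines, all other features dropped). The gene `G1`, transcript `T1`, source label `toy`, and the helper name `to_premrna` are hypothetical.

```shell
#!/bin/sh
set -eu

# Relabel 'transcript' lines as 'exon' and drop every other feature type.
to_premrna() {
    awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript" {$3="exon"; print}' "$1"
}

# Toy GTF: one hypothetical gene (G1) with one transcript (T1) and two exons.
tmpdir="$(mktemp -d)"
trap 'rm -rf "${tmpdir}"' EXIT
{
    printf 'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
    printf 'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\ttoy\texon\t700\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
} > "${tmpdir}/toy.gtf"

# Prints one exon line covering 100-900, so reads mapping to the
# former intron (301-699) now fall inside an annotated exon.
to_premrna "${tmpdir}/toy.gtf"
```

The four input lines collapse to a single `exon` record with the transcript's coordinates, which is what lets intronic reads from nuclear pre-mRNA be counted.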

Next, the below mkref command was run to produce the final “pre-mRNA” GTF and genome FASTA file ($ cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf).

Cell Ranger v3.1.0 (inferred; the source states version 3.1.0 for the workflow and gives no separate version for mkref)

$ Bash example

# Cell Ranger is typically downloaded as a tarball, extracted, and then added to the system's PATH.
# Example installation (version assumed to match the 3.1.0 workflow; adjust as needed):
# wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-3.1.0.tar.gz
# tar -xzf cellranger-3.1.0.tar.gz
# export PATH=/path/to/cellranger-3.1.0:$PATH

cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf
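
Before running mkref it can help to confirm that every sequence name used in the GTF also appears in the FASTA, since the two files must describe the same assembly. The following is a hedged pre-flight sketch, not part of the published pipeline: the helper name `missing_seqnames` and the toy files are assumptions made for illustration.

```shell
#!/bin/sh
set -eu

# Print each GTF sequence name (column 1) that has no matching FASTA header.
# Usage: missing_seqnames genome.fa genes.premrna.gtf
missing_seqnames() {
    fasta="$1"
    gtf="$2"
    grep '^>' "${fasta}" | awk '
        NR == FNR { sub(/^>/, "", $1); seen[$1] = 1; next }   # FASTA headers from stdin
        /^#/      { next }                                    # skip GTF comment lines
        !($1 in seen) && !reported[$1]++ { print $1 }         # report each missing name once
    ' - "${gtf}"
}

# Toy inputs: the FASTA has chr1 and chr2, the GTF refers to chr1 and chrM.
tmpdir="$(mktemp -d)"
trap 'rm -rf "${tmpdir}"' EXIT
printf '>chr1 toy\nACGT\n>chr2 toy\nACGT\n' > "${tmpdir}/genome.fa"
{
    printf '#!toy annotation\n'
    printf 'chr1\ttoy\texon\t1\t4\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chrM\ttoy\texon\t1\t4\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n'
} > "${tmpdir}/genes.gtf"

# Reports the name that is present in the GTF but absent from the FASTA.
missing_seqnames "${tmpdir}/genome.fa" "${tmpdir}/genes.gtf"
```

An empty result on the real `genome.fa` and `genes.premrna.gtf` suggests the two inputs are consistent; any printed name points to a mismatch worth resolving first.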

For each dataset, we corrected for ambient background RNA by filtering with the R package SoupX (version 0.3.0), using the inferNonExpressedGenes() function to determine which genes had the highest probability of being ambient mRNA, and the strainCells() function in order to transform count matrices.

SoupX v0.3.0 (R package)

$ Bash example

# Install R if not already installed
# sudo apt-get update && sudo apt-get install -y r-base

# Install SoupX. Note: CRAN provides the current release, not the version 0.3.0 named in the source.
# Rscript -e 'install.packages("SoupX")'

# Path to the Cell Ranger "outs" directory (it should contain raw_feature_bc_matrix,
# filtered_feature_bc_matrix and analysis/) and the output file.
# The defaults are placeholders used when no command-line arguments are given.
cellranger_outs_dir="${1:-sample_output_directory/outs}"
output_corrected_counts_mtx="${2:-corrected_counts.mtx}"

# Export variables so that Rscript can read them
export cellranger_outs_dir
export output_corrected_counts_mtx

Rscript -e '
    library(SoupX)
    library(Matrix) # For writeMM

    # Retrieve paths from environment variables
    cellranger_outs_dir <- Sys.getenv("cellranger_outs_dir")
    output_corrected_counts_mtx <- Sys.getenv("output_corrected_counts_mtx")

    # Load the raw and filtered matrices, plus the Cell Ranger cluster labels,
    # into a SoupChannel object.
    sc <- SoupX::load10X(cellranger_outs_dir)

    # NOTE: the source used SoupX 0.3.0 with inferNonExpressedGenes() and strainCells().
    # Those functions appear to be absent from current CRAN releases, so this sketch
    # substitutes the current workflow (autoEstCont followed by adjustCounts).
    # It is an approximation, not the exact method of the authors.

    # Estimate the contamination fraction (uses the cluster labels loaded above).
    sc <- SoupX::autoEstCont(sc, doPlot = FALSE)

    # Remove the estimated ambient RNA contribution from the count matrix.
    corrected_counts <- SoupX::adjustCounts(sc)

    # Save the corrected counts matrix in MatrixMarket format
    Matrix::writeMM(corrected_counts, file = output_corrected_counts_mtx)

    # Optionally, save gene and barcode names if needed for downstream analysis
    # write.table(rownames(corrected_counts), file=gsub(".mtx$", "_features.tsv", output_corrected_counts_mtx), sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
    # write.table(colnames(corrected_counts), file=gsub(".mtx$", "_barcodes.tsv", output_corrected_counts_mtx), sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
'
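
Ambient RNA correction changes count values but should leave the matrix shape untouched, so a quick sanity check is to compare the dimensions of the corrected matrix with those of the filtered input. This is an illustrative sketch only: the helper name `mtx_dims` and the toy files are assumptions, and it relies solely on the MatrixMarket convention that the first line not starting with `%` holds rows, columns, and non-zero entries.

```shell
#!/bin/sh
set -eu

# Print "rows cols" from a MatrixMarket file (plain or gzipped).
mtx_dims() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '!/^%/ { print $1, $2; exit }'
}

# Toy matrices standing in for the filtered input and the corrected output.
tmpdir="$(mktemp -d)"
trap 'rm -rf "${tmpdir}"' EXIT
cat > "${tmpdir}/filtered.mtx" <<'EOF'
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2}
3 2 1
1 1 5
EOF
cat > "${tmpdir}/corrected.mtx" <<'EOF'
%%MatrixMarket matrix coordinate real general
3 2 1
1 1 4.2
EOF

# Compare shapes; exit non-zero on a mismatch.
if [ "$(mtx_dims "${tmpdir}/filtered.mtx")" = "$(mtx_dims "${tmpdir}/corrected.mtx")" ]; then
    echo "dimensions match: $(mtx_dims "${tmpdir}/corrected.mtx")"
else
    echo "dimension mismatch" >&2
    exit 1
fi
```

On real data the same comparison would be made between the Cell Ranger `filtered_feature_bc_matrix/matrix.mtx.gz` file and the corrected matrix written in the step above.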

Raw Source Text

Raw sequencing data of all samples were processed using the cellRanger workflow (version 3.1.0), using a combined intron-exon reference produced as described using the vendor-provided “Generating a Cell Ranger compatible "pre-mRNA" Reference Package” guidelines (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references). In brief, the “pre-mRNA” reference was derived using the default exon-level GTF file provided by 10x Genomics (http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz). Using the below awk command, this exon-level GTF file into “pre-MRNA” GTF containing intron transcript definitions). Next, the below mkref command was run to produce the final “pre-MRNA” GTF and genome fasta file ($ cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf). For each dataset, we corrected for ambient background RNA by filtering with the R package SoupX (version 0.3.0), using the inferNonExpressedGenes() function to determine which genes had the highest probability of being ambient mRNA, and the strainCells() function in order to transform count matrices.
Genome_build: mm10
Supplementary_files_format_and_content: HDF5
