GSE147127 Processing Pipeline
Publication
Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.
Proceedings of the National Academy of Sciences of the United States of America (2024), PMID 39141348
Dataset
GSE147127: Single-nucleus RNA-seq identifies transcriptional heterogeneity in multinucleated skeletal myofibers
Processing Steps
1
Raw sequencing data from all samples were processed with the Cell Ranger workflow (version 3.1.0), using a combined intron-exon reference built according to the vendor-provided “Generating a Cell Ranger compatible "pre-mRNA" Reference Package” guidelines (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references).
Cell Ranger v3.1.0 (Bash example)
# Install Cell Ranger (example for version 3.1.0)
# wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-3.1.0.tar.gz
# tar -xzf cellranger-3.1.0.tar.gz
# export PATH=/path/to/cellranger-3.1.0:$PATH

# Run the cellranger count workflow.
# This command assumes a custom pre-mRNA reference was built as described in the provided guidelines.
# Replace placeholders like /path/to/raw_fastqs, /path/to/custom_intron_exon_reference,
# sample_name, and sample_output_directory.
cellranger count \
  --id=sample_output_directory \
  --transcriptome=/path/to/custom_intron_exon_reference \
  --fastqs=/path/to/raw_fastqs \
  --sample=sample_name \
  --localcores=8 \
  --localmem=64
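Because every sample goes through the same count workflow, the per-sample commands can be generated in a loop. The following is a minimal dry-run sketch that only prints the commands; the sample names, the FASTQ_DIR and REFERENCE paths, and the "_count" suffix are hypothetical placeholders, not values from the study.

```bash
# Dry run: print one `cellranger count` command per sample instead of executing it.
# FASTQ_DIR, REFERENCE, and the sample names below are hypothetical placeholders.
FASTQ_DIR="/path/to/raw_fastqs"
REFERENCE="/path/to/custom_intron_exon_reference"

build_count_cmd() {
  # $1 = sample name as it appears in the FASTQ file names
  printf 'cellranger count --id=%s --transcriptome=%s --fastqs=%s --sample=%s\n' \
    "${1}_count" "$REFERENCE" "$FASTQ_DIR" "$1"
}

for sample in sample_A sample_B; do
  build_count_cmd "$sample"
done
```

Once the printed commands look right, the loop output could be piped to bash to run them.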
2
In brief, the “pre-mRNA” reference was derived from the default exon-level GTF file provided by 10x Genomics (http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz).
Cell Ranger v3.0.0 (Bash example)
# Download and extract the 10x Genomics mm10 reference package
wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz
tar -xzf refdata-cellranger-mm10-3.0.0.tar.gz

# Paths to the genome FASTA and the exon-level GTF inside the extracted package
GENOME_FASTA="refdata-cellranger-mm10-3.0.0/fasta/genome.fa"
GENES_GTF="refdata-cellranger-mm10-3.0.0/genes/genes.gtf"

# Confirm that both files exist; they are the inputs to the GTF conversion and mkref steps
ls -l "${GENOME_FASTA}" "${GENES_GTF}"
3
Using the awk command below, this exon-level GTF file was converted into a “pre-mRNA” GTF containing intron transcript definitions.
awk (Bash example)
# Input: exon-level GTF from the 10x Genomics mm10 reference.
# Output: file name chosen to match the --genes argument of the mkref step.
INPUT_GTF="refdata-cellranger-mm10-3.0.0/genes/genes.gtf"
OUTPUT_GTF="genes.premrna.gtf"

# The source text does not reproduce the authors' awk command. The command below is the one
# given in the 10x Genomics "pre-mRNA" reference guidelines that the methods cite, and is
# assumed to match what was run.
# It keeps only 'transcript' records and relabels them as 'exon', so that each transcript
# becomes a single exon spanning its full genomic extent, introns included.
awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript" { $3 = "exon"; print }' "${INPUT_GTF}" > "${OUTPUT_GTF}"
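To make the effect of the pre-mRNA conversion concrete, here is a small sketch on a toy GTF. It assumes the transcript-to-exon relabelling described in the 10x Genomics guidelines; the file names, gene and transcript IDs, and coordinates are made up for illustration.

```bash
# Build a toy GTF with one gene, one transcript, and two exons (all values hypothetical).
printf 'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n' > toy.gtf
printf 'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> toy.gtf
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> toy.gtf
printf 'chr1\ttoy\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> toy.gtf

# Keep only transcript records and relabel them as exons.
awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript" { $3 = "exon"; print }' toy.gtf > toy.premrna.gtf

# The result is a single "exon" record covering 100-900, i.e. both exons plus the intron.
cat toy.premrna.gtf
```

Reads that fall in the intron (positions 201 to 799 in this toy case) would then overlap an annotated exon and can be counted, which is the point of the pre-mRNA reference for single-nucleus data.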
4
Next, the mkref command below was run to produce the final “pre-mRNA” reference (GTF and genome FASTA file) ($ cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf).
Cell Ranger v3.1.0 (Bash example)
# Cell Ranger is distributed as a tarball that is extracted and then added to the PATH.
# Example installation (version 3.1.0, assumed to be the same release used for the count workflow):
# wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-3.1.0.tar.gz
# tar -xzf cellranger-3.1.0.tar.gz
# export PATH=/path/to/cellranger-3.1.0:$PATH

# Build the pre-mRNA reference; the output is written to a directory named Mmpre
cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf
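Before pointing cellranger count at a newly built reference, it can help to confirm that the reference directory is complete. The check below is a sketch that assumes the layout written by Cell Ranger 3.x (reference.json, fasta/genome.fa, genes/genes.gtf); check_reference is a hypothetical helper, and other Cell Ranger versions may name or compress these files differently.

```bash
# Hypothetical helper: verify that a Cell Ranger reference directory has the expected files.
# The file list assumes the Cell Ranger 3.x layout and may need adjusting for other versions.
check_reference() {
  ref_dir="$1"
  for f in reference.json fasta/genome.fa genes/genes.gtf; do
    if [ ! -e "$ref_dir/$f" ]; then
      echo "missing: $ref_dir/$f" >&2
      return 1
    fi
  done
  echo "ok: $ref_dir"
}

# Example (Mmpre is the directory created by --genome=Mmpre):
# check_reference Mmpre
```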
5
For each dataset, we corrected for ambient background RNA by filtering with the R package SoupX (version 0.3.0), using the inferNonExpressedGenes() function to determine which genes had the highest probability of being ambient mRNA, and the strainCells() function in order to transform count matrices.
SoupX (Bash example)
# Install R and SoupX if not already installed (example for Debian/Ubuntu)
# sudo apt-get update && sudo apt-get install -y r-base
# Rscript -e 'install.packages("SoupX")'

# NOTE: the study used SoupX 0.3.0 with inferNonExpressedGenes() and strainCells().
# Those functions are, to our knowledge, not part of current CRAN releases (SoupX 1.x),
# so this sketch uses the current API (autoEstCont() and adjustCounts()) as an approximation.
# It is not the authors' exact code.

# Path to a Cell Ranger "outs" directory (containing raw and filtered matrices) and output file.
# Defaults are used when no command-line arguments are given.
cellranger_outs_dir="${1:-sample_output_directory/outs}"
output_corrected_counts_mtx="${2:-corrected_counts.mtx}"

# Export variables so that Rscript can read them
export cellranger_outs_dir
export output_corrected_counts_mtx

Rscript -e '
library(SoupX)
library(Matrix)  # provides writeMM

# Retrieve paths from environment variables
cellranger_outs_dir <- Sys.getenv("cellranger_outs_dir")
output_corrected_counts_mtx <- Sys.getenv("output_corrected_counts_mtx")

# Load the raw and filtered matrices (and the Cell Ranger clustering, if present)
sc <- SoupX::load10X(cellranger_outs_dir)

# Estimate the contamination fraction; this step needs cluster assignments
sc <- SoupX::autoEstCont(sc)

# Remove the estimated ambient counts from each cell
corrected_counts <- SoupX::adjustCounts(sc)

# Save the corrected counts matrix in MatrixMarket format
Matrix::writeMM(corrected_counts, file = output_corrected_counts_mtx)

# Optionally, save gene and barcode names for downstream analysis
# write.table(rownames(corrected_counts), file=gsub(".mtx$", "_features.tsv", output_corrected_counts_mtx), sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
# write.table(colnames(corrected_counts), file=gsub(".mtx$", "_barcodes.tsv", output_corrected_counts_mtx), sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
'
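The idea behind ambient RNA correction can be shown with simple arithmetic for a single cell. The sketch below is a deliberately simplified illustration and not the SoupX algorithm: it assumes a known contamination fraction (rho), a known share of the ambient profile per gene, and subtracts the expected ambient counts with a floor at zero. The gene names and all numbers are made up, and correct_counts is a hypothetical helper.

```bash
# Toy ambient-RNA subtraction for one cell (simplified illustration, not SoupX itself).
# Input rows: gene name, observed count, fraction of the ambient profile due to that gene.
rho=0.1          # assumed contamination fraction for this cell
total_umis=1000  # total UMIs observed in this cell

correct_counts() {
  # $1 = contamination fraction, $2 = total UMIs in the cell
  awk -v rho="$1" -v total="$2" '{
    expected = rho * total * $3        # expected ambient counts for this gene
    corrected = $2 - expected
    if (corrected < 0) corrected = 0   # corrected counts cannot be negative
    printf "%s\t%.1f\n", $1, corrected
  }'
}

printf 'GeneA 30 0.25\nGeneB 500 0.05\nGeneC 470 0.02\n' | correct_counts "$rho" "$total_umis"
```

In this toy case GeneA loses most of its counts (30 observed, 25 expected from the ambient pool), while GeneB and GeneC are barely changed, which mirrors how abundant ambient transcripts are affected most by the correction.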
Tools Used
Cell Ranger (version 3.1.0), awk, SoupX (version 0.3.0)
Raw Source Text
Raw sequencing data of all samples were processed using the cellRanger workflow (version 3.1.0), using a combined intron-exon reference produced as described using the vendor-provided “Generating a Cell Ranger compatible "pre-mRNA" Reference Package” guidelines (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references). In brief, the “pre-mRNA” reference was derived using the default exon-level GTF file provided by 10x Genomics (http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz). Using the below awk command, this exon-level GTF file into “pre-MRNA” GTF containing intron transcript definitions). Next, the below mkref command was run to produce the final “pre-MRNA” GTF and genome fasta file ($ cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf). For each dataset, we corrected for ambient background RNA by filtering with the R package SoupX (version 0.3.0), using the inferNonExpressedGenes() function to determine which genes had the highest probability of being ambient mRNA, and the strainCells() function in order to transform count matrices.
Genome_build: mm10
Supplementary_files_format_and_content: HDF5