GSE147127 Processing Pipeline

RNA-Seq · 5 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

GSE147127

Single-nucleus RNA-seq identifies transcriptional heterogeneity in multinucleated skeletal myofibers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Raw sequencing data from all samples were processed with the Cell Ranger workflow (version 3.1.0), using a combined intron-exon reference built according to the vendor-provided “Generating a Cell Ranger compatible "pre-mRNA" Reference Package” guidelines (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references).

    Cell Ranger v3.1.0
    $ Bash example
    # Install Cell Ranger (example for version 3.1.0)
    # wget https://cf.10xgenomics.com/releases/cell-exp/cellranger-3.1.0.tar.gz
    # tar -xzf cellranger-3.1.0.tar.gz
    # export PATH=/path/to/cellranger-3.1.0:$PATH
    
    # Run cellranger count workflow
    # This command assumes a custom pre-mRNA reference was built as described in the provided guidelines.
    # Replace placeholders like /path/to/raw_fastqs, /path/to/custom_intron_exon_reference, sample_name, and sample_output_directory.
    cellranger count \
        --id=sample_output_directory \
        --transcriptome=/path/to/custom_intron_exon_reference \
        --fastqs=/path/to/raw_fastqs \
        --sample=sample_name \
        --localcores=8 \
        --localmem=64
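
Because all samples went through the same workflow, the call above can be wrapped in a loop. The following is a minimal sketch, not the authors' script: the sample names, the paths, and the DRY_RUN switch are hypothetical placeholders. With DRY_RUN=1 (the default here) the commands are only printed, so the loop can be checked before any compute is spent.

```shell
#!/bin/sh
# Sketch: one `cellranger count` call per sample (all names below are placeholders).
set -eu

SAMPLES="sample_A sample_B"                            # hypothetical sample names
FASTQ_DIR="/path/to/raw_fastqs"                        # hypothetical FASTQ directory
TRANSCRIPTOME="/path/to/custom_intron_exon_reference"  # pre-mRNA reference package
DRY_RUN="${DRY_RUN:-1}"                                # 1 = print commands, 0 = run them

run_count() {
    sample="$1"
    # In dry-run mode, prefix the command with echo so it is printed, not executed.
    if [ "${DRY_RUN}" = "1" ]; then runner="echo"; else runner=""; fi
    ${runner} cellranger count \
        --id="${sample}" \
        --transcriptome="${TRANSCRIPTOME}" \
        --fastqs="${FASTQ_DIR}" \
        --sample="${sample}" \
        --localcores=8 \
        --localmem=64
}

for s in ${SAMPLES}; do
    run_count "${s}"
done
```

Setting DRY_RUN=0 in the environment would execute the commands instead of printing them; this assumes cellranger is already on the PATH.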
    
  2.

    In brief, the “pre-mRNA” reference was derived from the default exon-level GTF file provided by 10x Genomics (http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz).

    10x Genomics mm10 reference package v3.0.0
    $ Bash example
    # Download the 10x Genomics reference data tarball
    wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz
    tar -xzf refdata-cellranger-mm10-3.0.0.tar.gz
    
    # Locate the FASTA and the exon-level GTF inside the extracted 10x Genomics tarball
    GENOME_FASTA="refdata-cellranger-mm10-3.0.0/fasta/genome.fa"
    GENES_GTF="refdata-cellranger-mm10-3.0.0/genes/genes.gtf"
    
    # Confirm both inputs are present. No reference is built at this point:
    # cellranger mkref takes no intron-related option, so the pre-mRNA behaviour comes
    # entirely from the modified GTF produced in the following steps.
    ls -lh "${GENOME_FASTA}" "${GENES_GTF}"
  3.

    Using the awk command below, this exon-level GTF file was converted into a “pre-mRNA” GTF containing intron transcript definitions.

    awk (version not specified)
    $ Bash example
    # Define input and output GTF files
    INPUT_GTF="refdata-cellranger-mm10-3.0.0/genes/genes.gtf"
    OUTPUT_GTF="genes.premrna.gtf"
    
    # Convert the exon-level GTF to a "pre-mRNA" GTF using awk.
    # The source text does not reproduce the awk command; this is the form we understand the
    # 10x Genomics "pre-mRNA" guideline to use, so treat it as an assumption.
    # It keeps only the 'transcript' records and relabels them as 'exon', so each transcript
    # becomes a single exon spanning its full genomic extent (introns included), and reads
    # that map to introns are then counted.
    awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript" {$3 = "exon"; print}' "${INPUT_GTF}" > "${OUTPUT_GTF}"
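
To make the effect of a transcript-to-exon relabeling concrete, here is a small sketch on a toy annotation. It assumes the conversion works as in the 10x Genomics "pre-mRNA" guideline (keep only transcript records and relabel them as exons). The toy records and the helper names toy_gtf and to_premrna are hypothetical.

```shell
# Toy GTF: one gene, one transcript (100-900) with two exons (100-200 and 800-900).
toy_gtf() {
    printf 'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
    printf 'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\ttoy\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
}

# Keep only transcript records and relabel their feature type (column 3) as exon.
to_premrna() {
    awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript" {$3 = "exon"; print}'
}

premrna="$(toy_gtf | to_premrna)"
printf '%s\n' "${premrna}"
```

The four input records collapse to a single exon record covering positions 100 to 900, so the intron between the two original exons (201 to 799) now lies inside an annotated exon.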
  4.

    Next, the mkref command below was run to produce the final “pre-mRNA” GTF and genome FASTA file ($ cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf).

    Cell Ranger v3.1.0 (version inferred from step 1; not stated for this step)
    $ Bash example
    # Cell Ranger is distributed as a tarball: download it from the 10x Genomics website,
    # extract it, and add it to the system's PATH.
    # Example installation (adjust version and paths as needed):
    # tar -xzf cellranger-3.1.0.tar.gz
    # export PATH=/path/to/cellranger-3.1.0:$PATH
    
    cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf
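
Building a reference takes a while, so it can be worth checking the GTF before calling mkref. The helper below is a hypothetical pre-flight check, not part of Cell Ranger: it only verifies that the file is non-empty, that every non-comment line has the nine tab-separated GTF fields, and that at least one exon record is present. The last check rests on our understanding that Cell Ranger builds its annotation from exon records.

```shell
# Sketch: sanity-check a GTF before handing it to `cellranger mkref`.
# Returns 0 if the file looks usable, 1 otherwise. check_gtf is a hypothetical helper.
check_gtf() {
    gtf="$1"
    # The file must exist and be non-empty.
    [ -s "${gtf}" ] || { echo "missing or empty: ${gtf}" >&2; return 1; }
    awk -F'\t' '
        /^#/         { next }      # skip header/comment lines
        NF != 9      { bad++ }     # GTF records have exactly 9 tab-separated fields
        $3 == "exon" { exons++ }   # count exon records
        END          { exit (bad > 0 || exons == 0) }
    ' "${gtf}"
}

# Example use (file names taken from the mkref command in the text):
# check_gtf genes.premrna.gtf && \
#     cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf
```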
  5.

    For each dataset, we corrected for ambient background RNA by filtering with the R package SoupX (version 0.3.0), using the inferNonExpressedGenes() function to determine which genes had the highest probability of being ambient mRNA, and the strainCells() function in order to transform count matrices.

    $ Bash example
    # Install R if not already installed
    # sudo apt-get update && sudo apt-get install -y r-base
    
    # Install SoupX from CRAN.
    # Caveat: the source cites SoupX 0.3.0. To our knowledge the current CRAN release no longer
    # provides inferNonExpressedGenes() or strainCells(), so this example uses the current
    # workflow (autoEstCont() followed by adjustCounts()) as an approximation of that step.
    # Rscript -e 'install.packages("SoupX")'
    
    # Path to the Cell Ranger "outs" directory of one sample (placeholder). It is expected to
    # contain raw_feature_bc_matrix/, filtered_feature_bc_matrix/ and analysis/ (cluster labels).
    cellranger_outs_dir="${1:-sample_output_directory/outs}"
    output_corrected_counts_mtx="${2:-corrected_counts.mtx}"
    
    # Export variables to be accessible by Rscript
    export cellranger_outs_dir
    export output_corrected_counts_mtx
    
    Rscript -e '
        library(SoupX)
        library(Matrix) # For writeMM
    
        # Retrieve paths from environment variables
        cellranger_outs_dir <- Sys.getenv("cellranger_outs_dir")
        output_corrected_counts_mtx <- Sys.getenv("output_corrected_counts_mtx")
    
        # Load the raw (all droplets) and filtered (cell-containing) matrices,
        # together with the Cell Ranger clustering, into a SoupChannel object.
        sc <- SoupX::load10X(cellranger_outs_dir)
    
        # Estimate the ambient RNA contamination fraction from the data.
        sc <- SoupX::autoEstCont(sc)
    
        # Remove the estimated ambient counts from each cell.
        corrected_counts <- SoupX::adjustCounts(sc)
    
        # Save the corrected counts matrix in MatrixMarket format
        Matrix::writeMM(corrected_counts, file=output_corrected_counts_mtx)
    
        # Optionally, save gene and barcode names if needed for downstream analysis
        # write.table(rownames(corrected_counts), file=gsub(".mtx$", "_features.tsv", output_corrected_counts_mtx), sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
        # write.table(colnames(corrected_counts), file=gsub(".mtx$", "_barcodes.tsv", output_corrected_counts_mtx), sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
    '
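
Ambient RNA correction should only remove counts, so a quick sanity check is to compare the total counts before and after correction. The sketch below sums the value column of two MatrixMarket coordinate files and reports the fraction removed. The helper names mtx_total and removed_fraction and the file names in the usage comment are hypothetical. Cell Ranger 3 writes matrix.mtx.gz, so that file would need to be decompressed first.

```shell
# Sum the value column of a MatrixMarket coordinate (.mtx) file.
mtx_total() {
    # Skip "%" comment lines and the first remaining line (rows, columns, entries).
    awk '/^%/ { next } !seen { seen = 1; next } { total += $3 } END { printf "%.2f\n", total }' "$1"
}

# Fraction of counts removed between an uncorrected and a corrected matrix.
removed_fraction() {
    before="$(mtx_total "$1")"
    after="$(mtx_total "$2")"
    awk -v b="${before}" -v a="${after}" \
        'BEGIN { if (b > 0) printf "%.4f\n", (b - a) / b; else print "NA" }'
}

# Example use (hypothetical file names):
# gunzip -c filtered_feature_bc_matrix/matrix.mtx.gz > filtered.mtx
# removed_fraction filtered.mtx corrected_counts.mtx
```

A negative value, or a value close to 1, would suggest that the two files do not describe the same cells or that the correction step needs another look.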


Raw Source Text
Raw sequencing data of all samples were processed using the cellRanger workflow (version 3.1.0), using a combined intron-exon reference produced as described using the vendor-provided “Generating a Cell Ranger compatible "pre-mRNA" Reference Package” guidelines (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references). In brief, the “pre-mRNA” reference was derived using the default exon-level GTF file provided by 10x Genomics ( http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-3.0.0.tar.gz). Using the below awk command, this exon-level GTF file into “pre-MRNA” GTF containing intron transcript definitions). Next, the below mkref command was run to produce the final “pre-MRNA” GTF and genome fasta file ($ cellranger mkref --genome=Mmpre --fasta=genome.fa --genes=genes.premrna.gtf). For each dataset, we corrected for ambient background RNA by filtering with the R package SoupX (version 0.3.0), using the inferNonExpressedGenes() function to determine which genes had the highest probability of being ambient mRNA, and the strainCells() function in order to transform count matrices.
Genome_build: mm10
Supplementary_files_format_and_content: HDF5