GSE273092 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

GSE273092

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling (Zfp697 skeletal muscle knockout RNA-Seq of unloading-rel…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterScoreMinOverLread 0 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --seedSearchStartLmax 12 --alignSJoverhangMin 8 --alignEndsType Local

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    READS_R1="reads_R1.fastq.gz" # Placeholder for forward reads
    READS_R2="reads_R2.fastq.gz" # Placeholder for reverse reads (assuming paired-end)
    GENOME_DIR="/path/to/STAR_index/GRCm38" # Path to the STAR genome index for GRCm38
    OUTPUT_BAM="aligned.bam"
    OUTPUT_PREFIX="STAR_output_" # Prefix for STAR output files
    
    # Note: The description mentions trimming adaptor sequences and masking low-complexity/low-quality sequence.
    # This is typically done as a pre-processing step using tools like fastp or Trimmomatic before STAR alignment.
    # The STAR command below focuses on the alignment step with the specified parameters.
    
    # Run STAR alignment
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_R1}" "${READS_R2}" \
         --runThreadN 8 \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outFilterScoreMinOverLread 0 \
         --outFilterMatchNmin 0 \
         --outFilterMatchNminOverLread 0 \
         --outFilterMismatchNmax 33 \
         --seedSearchStartLmax 12 \
         --alignSJoverhangMin 8 \
         --alignEndsType Local \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --outSAMunmapped Within \
         --outSAMmapqUnique 60
    
    # Rename the output BAM file to the desired name
    mv "${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" "${OUTPUT_BAM}"
  2. 2

    Reads aligning to gene exons (Mus_musculus.GRCm38.94.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level.

    featureCounts v2.0.3
    $ Bash example
    # Install Subread (which includes featureCounts)
    # conda install -c bioconda subread
    
    # Define input and output files
    INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned BAM file
    GTF_FILE="annotation/Mus_musculus.GRCm38.94.gtf"
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Download reference GTF file (Mus_musculus.GRCm38.94.gtf from Ensembl release 94)
    # mkdir -p annotation
    # wget -O annotation/Mus_musculus.GRCm38.94.gtf.gz ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz
    # gunzip annotation/Mus_musculus.GRCm38.94.gtf.gz
    
    # Run featureCounts to count reads aligning to gene exons, summarized at gene level
    featureCounts -a "${GTF_FILE}" \
                  -o "${OUTPUT_COUNTS}" \
                  -F GTF \
                  -t exon \
                  -g gene_id \
                  -T 4 \
                  "${INPUT_BAM}"
  3. 3

    Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis

    DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Bioconductor and DESeq2 (if not already installed)
    # if (!require("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install("DESeq2")
    
    # Create a placeholder R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    library(DESeq2)
    
    # --- Configuration --- #
    # Input count matrix file (e.g., from featureCounts, HTSeq-count, RSEM)
    # Assumes genes in rows, samples in columns, first column is gene ID
    counts_file <- "counts.tsv"
    
    # Input sample metadata file
    # Assumes samples in rows, metadata in columns, first column is sample ID
    # Must contain a 'condition' column for differential expression
    coldata_file <- "coldata.tsv"
    
    # Output results file
    output_file <- "deseq2_results.tsv"
    
    # Design formula (e.g., ~ condition). Adjust if more complex designs are needed.
    design_formula <- ~ condition
    
    # FDR threshold for results filtering
    fdr_threshold <- 0.05
    
    # --- Load Data --- #
    # Load count data
    count_data <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)
    count_data <- as.matrix(count_data)
    
    # Load colData
    coldata <- read.table(coldata_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)
    
    # Ensure sample names match and are in the same order
    stopifnot(all(rownames(coldata) == colnames(count_data)))
    
    # --- Create DESeqDataSet --- #
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = coldata,
                                  design = design_formula)
    
    # --- Pre-filtering: Exclude genes with total raw counts less than 1 across all samples ---
    # This means excluding genes with zero total counts across all samples
    keep <- rowSums(counts(dds)) > 0
    dds <- dds[keep,]
    
    # --- Run DESeq2 analysis --- #
    dds <- DESeq(dds)
    
    # --- Extract results with specified FDR (alpha) --- #
    res <- results(dds, alpha = fdr_threshold)
    
    # --- Save results --- #
    write.table(as.data.frame(res), file = output_file, sep = "\t", quote = FALSE, col.names = NA)
    
    message(paste("DESeq2 analysis complete. Results saved to:", output_file))
    EOF
    
    # Placeholder for creating dummy input files
    # In a real pipeline, these would be generated by upstream steps
    
    # Create dummy counts.tsv
    cat << 'EOF' > counts.tsv	Sample1	Sample2	Sample3	Sample4
    GeneA	100	120	50	60
    GeneB	50	60	100	110
    GeneC	10	15	5	8
    GeneD	0	0	0	0
    GeneE	200	210	180	190
    EOF
    
    # Create dummy coldata.tsv
    cat << 'EOF' > coldata.tsv	condition
    Sample1	Control
    Sample2	Control
    Sample3	Treated
    Sample4	Treated
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R

Tools Used

Raw Source Text
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterScoreMinOverLread 0   --outFilterMatchNmin 0   --outFilterMatchNminOverLread 0   --outFilterMismatchNmax 33   --seedSearchStartLmax 12   --alignSJoverhangMin 8   --alignEndsType Local
Reads aligning to gene exons (Mus_musculus.GRCm38.94.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level.
Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis
Assembly: GRCm38.p6
Supplementary files format and content: tab-delimited text file includes raw read counts for all samples
Supplementary files format and content: tab-delimited text files includes differetial gene expression analysis output from DESeq2
← Back to Analysis