GSE273092 Processing Pipeline
Publication
Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348
Dataset
GSE273092Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling (Zfp697 skeletal muscle knockout RNA-Seq of unloading-rel…
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterScoreMinOverLread 0 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --seedSearchStartLmax 12 --alignSJoverhangMin 8 --alignEndsType Local
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables READS_R1="reads_R1.fastq.gz" # Placeholder for forward reads READS_R2="reads_R2.fastq.gz" # Placeholder for reverse reads (assuming paired-end) GENOME_DIR="/path/to/STAR_index/GRCm38" # Path to the STAR genome index for GRCm38 OUTPUT_BAM="aligned.bam" OUTPUT_PREFIX="STAR_output_" # Prefix for STAR output files # Note: The description mentions trimming adaptor sequences and masking low-complexity/low-quality sequence. # This is typically done as a pre-processing step using tools like fastp or Trimmomatic before STAR alignment. # The STAR command below focuses on the alignment step with the specified parameters. # Run STAR alignment STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outFilterScoreMinOverLread 0 \ --outFilterMatchNmin 0 \ --outFilterMatchNminOverLread 0 \ --outFilterMismatchNmax 33 \ --seedSearchStartLmax 12 \ --alignSJoverhangMin 8 \ --alignEndsType Local \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --outSAMunmapped Within \ --outSAMmapqUnique 60 # Rename the output BAM file to the desired name mv "${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" "${OUTPUT_BAM}" -
2
Reads aligning to gene exons (Mus_musculus.GRCm38.94.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level.
featureCounts v2.0.3$ Bash example
# Install Subread (which includes featureCounts) # conda install -c bioconda subread # Define input and output files INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned BAM file GTF_FILE="annotation/Mus_musculus.GRCm38.94.gtf" OUTPUT_COUNTS="gene_counts.txt" # Download reference GTF file (Mus_musculus.GRCm38.94.gtf from Ensembl release 94) # mkdir -p annotation # wget -O annotation/Mus_musculus.GRCm38.94.gtf.gz ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz # gunzip annotation/Mus_musculus.GRCm38.94.gtf.gz # Run featureCounts to count reads aligning to gene exons, summarized at gene level featureCounts -a "${GTF_FILE}" \ -o "${OUTPUT_COUNTS}" \ -F GTF \ -t exon \ -g gene_id \ -T 4 \ "${INPUT_BAM}" -
3
Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis
$ Bash example
# Install Bioconductor and DESeq2 (if not already installed) # if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("DESeq2") # Create a placeholder R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R library(DESeq2) # --- Configuration --- # # Input count matrix file (e.g., from featureCounts, HTSeq-count, RSEM) # Assumes genes in rows, samples in columns, first column is gene ID counts_file <- "counts.tsv" # Input sample metadata file # Assumes samples in rows, metadata in columns, first column is sample ID # Must contain a 'condition' column for differential expression coldata_file <- "coldata.tsv" # Output results file output_file <- "deseq2_results.tsv" # Design formula (e.g., ~ condition). Adjust if more complex designs are needed. design_formula <- ~ condition # FDR threshold for results filtering fdr_threshold <- 0.05 # --- Load Data --- # # Load count data count_data <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE) count_data <- as.matrix(count_data) # Load colData coldata <- read.table(coldata_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE) # Ensure sample names match and are in the same order stopifnot(all(rownames(coldata) == colnames(count_data))) # --- Create DESeqDataSet --- # dds <- DESeqDataSetFromMatrix(countData = count_data, colData = coldata, design = design_formula) # --- Pre-filtering: Exclude genes with total raw counts less than 1 across all samples --- # This means excluding genes with zero total counts across all samples keep <- rowSums(counts(dds)) > 0 dds <- dds[keep,] # --- Run DESeq2 analysis --- # dds <- DESeq(dds) # --- Extract results with specified FDR (alpha) --- # res <- results(dds, alpha = fdr_threshold) # --- Save results --- # write.table(as.data.frame(res), file = output_file, sep = "\t", quote = FALSE, col.names = NA) message(paste("DESeq2 analysis complete. Results saved to:", output_file)) EOF # Placeholder for creating dummy input files # In a real pipeline, these would be generated by upstream steps # Create dummy counts.tsv cat << 'EOF' > counts.tsv Sample1 Sample2 Sample3 Sample4 GeneA 100 120 50 60 GeneB 50 60 100 110 GeneC 10 15 5 8 GeneD 0 0 0 0 GeneE 200 210 180 190 EOF # Create dummy coldata.tsv cat << 'EOF' > coldata.tsv condition Sample1 Control Sample2 Control Sample3 Treated Sample4 Treated EOF # Execute the R script Rscript run_deseq2.R
Raw Source Text
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterScoreMinOverLread 0 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --seedSearchStartLmax 12 --alignSJoverhangMin 8 --alignEndsType Local Reads aligning to gene exons (Mus_musculus.GRCm38.94.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level. Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less then 1 across all samples were excluded from analysis Assembly: GRCm38.p6 Supplementary files format and content: tab-delimited text file includes raw read counts for all samples Supplementary files format and content: tab-delimited text files includes differetial gene expression analysis output from DESeq2