GSE273093 Processing Pipeline
Publication
Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling. Proceedings of the National Academy of Sciences of the United States of America (2024). PMID 39141348
Dataset
GSE273093: Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling (Zfp697 transduced primary mouse myotubes RNA-Seq)
Processing Steps
1
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterMultimapNmax 50 --outFilterMultimapScoreRange 3 --outFilterScoreMinOverLread 0.7 --outFilterMatchNminOverLread 0.7 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2
$ Bash example
# Install STAR if not already installed
# conda install -c bioconda star=2.7.3a

# Create STAR genome index (if not already done)
# This step needs to be run once for a given genome assembly.
# Replace /path/to/GRCm38_fasta with the actual path to your GRCm38 FASTA file(s).
# Replace /path/to/GRCm38_gtf with the actual path to your GRCm38 GTF annotation file.
# Replace /path/to/STAR_index/GRCm38 with your desired output directory for the STAR index.
# STAR --runThreadN 8 \
#      --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/GRCm38 \
#      --genomeFastaFiles /path/to/GRCm38_fasta/GRCm38.fa \
#      --sjdbGTFfile /path/to/GRCm38_gtf/GRCm38.gtf \
#      --sjdbOverhang 100  # Recommended: (ReadLength - 1)

# Align reads using STAR
# Replace /path/to/STAR_index/GRCm38 with the actual path to your STAR genome index.
# Replace read1.fastq.gz and read2.fastq.gz with your input FASTQ file(s).
# If single-end reads, remove 'read2.fastq.gz'.
# Replace /path/to/output_dir with your desired output directory.
# Replace sample_prefix with a meaningful prefix for your output files.
# --readFilesCommand zcat is required because the FASTQ files are gzip-compressed;
# drop it if your input files are uncompressed.
STAR --runThreadN 8 \
     --genomeDir /path/to/STAR_index/GRCm38 \
     --readFilesIn read1.fastq.gz read2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix /path/to/output_dir/sample_prefix_ \
     --outFilterMultimapNmax 50 \
     --outFilterMultimapScoreRange 3 \
     --outFilterScoreMinOverLread 0.7 \
     --outFilterMatchNminOverLread 0.7 \
     --outFilterMismatchNmax 10 \
     --alignIntronMax 500000 \
     --alignMatesGapMax 1000000 \
     --sjdbScore 2 \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outSAMunmapped Within
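When several samples are aligned, it may help to keep the extra arguments from the description in one place so every sample gets identical settings. The following is a minimal sketch under that assumption, not the authors' script: the helper name `star_cmd`, the sample name, and the paths are hypothetical, and the function only prints the command (a dry run) rather than running STAR.

```shell
#!/bin/sh
# Extra arguments exactly as listed in the processing description.
STAR_EXTRA_ARGS="--outFilterMultimapNmax 50 \
  --outFilterMultimapScoreRange 3 \
  --outFilterScoreMinOverLread 0.7 \
  --outFilterMatchNminOverLread 0.7 \
  --outFilterMismatchNmax 10 \
  --alignIntronMax 500000 \
  --alignMatesGapMax 1000000 \
  --sjdbScore 2"

# Hypothetical helper: print (not run) the STAR command for one sample.
# Arguments: genome index directory, output prefix, then one or two FASTQ files.
star_cmd() {
  index_dir=$1
  prefix=$2
  shift 2
  # STAR_EXTRA_ARGS is left unquoted on purpose so it splits into separate words.
  echo STAR --genomeDir "$index_dir" --readFilesIn "$@" \
    --outFileNamePrefix "$prefix" $STAR_EXTRA_ARGS
}

# Dry run for a hypothetical paired-end sample.
star_cmd /path/to/STAR_index/GRCm38 out/sampleA_ sampleA_R1.fastq sampleA_R2.fastq
```

Removing the `echo` inside `star_cmd` would run STAR directly; output options such as `--outSAMtype` can be appended to the same command as needed.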
2
Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using the featureCounts v2.0.3 program from the Subread package and summarized at the gene level with the extra parameter -s 0 (unstranded counting)
featureCounts v2.0.3
$ Bash example
# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Reference dataset: Mus_musculus.GRCm38.102.gtf
# This GTF file can be downloaded from Ensembl's FTP server.
# For Ensembl release 102, the path would be similar to:
# wget http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
# gunzip Mus_musculus.GRCm38.102.gtf.gz

GTF_FILE="Mus_musculus.GRCm38.102.gtf"
INPUT_BAM="input.bam"  # Placeholder for your alignment file (e.g., STAR output BAM)
OUTPUT_COUNTS="gene_counts.txt"

featureCounts -a "${GTF_FILE}" \
              -o "${OUTPUT_COUNTS}" \
              -s 0 \
              "${INPUT_BAM}"
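The featureCounts table is not directly in the gene-by-sample CSV layout that a DESeq2 script typically reads, so a small conversion step can bridge the two. The sketch below assumes the standard featureCounts output layout (a leading `#` comment line, then a header row with `Geneid` in column 1, annotation fields in columns 2 to 6, and one count column per BAM file from column 7 onward). The helper name `featurecounts_to_csv` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a featureCounts table to a gene-by-sample CSV.
# Assumes Geneid in column 1 and per-BAM counts from column 7 onward.
featurecounts_to_csv() {
  awk -F '\t' '
    /^#/ { next }          # skip the program/command comment line
    {
      line = $1            # gene identifier ("Geneid" on the header row)
      for (i = 7; i <= NF; i++) line = line "," $i
      print line
    }
  ' "$1"
}

# Example usage (file names are placeholders):
# featurecounts_to_csv gene_counts.txt > count_matrix.csv
```

The header row keeps the BAM file names as sample labels; these would need to match the sample identifiers used in the sample metadata before the matrix is loaded into R.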
3
Differential gene expression at FDR level 0.05 was obtained using the DESeq2 R package; genes with total raw counts less than 1 across all samples were excluded from the analysis
$ Bash example
# Install R and Bioconductor if not already installed
# For R:
# sudo apt-get update
# sudo apt-get install r-base
#
# For Bioconductor and DESeq2:
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Create a placeholder count matrix and sample metadata for demonstration
# In a real scenario, these files would already exist.
# Note: this toy matrix only illustrates the expected file layout; DESeq2's
# dispersion fitting may fail on so few genes, so use real counts to run it.
echo "gene_id,sample1,sample2,sample3,sample4" > count_matrix.csv
echo "geneA,10,12,5,7" >> count_matrix.csv
echo "geneB,0,0,0,0" >> count_matrix.csv
echo "geneC,20,25,10,15" >> count_matrix.csv
echo "geneD,5,6,1,2" >> count_matrix.csv
echo "geneE,0,1,0,0" >> count_matrix.csv  # This gene will be kept due to count >= 1

echo "sample_id,condition" > sample_metadata.csv
echo "sample1,control" >> sample_metadata.csv
echo "sample2,control" >> sample_metadata.csv
echo "sample3,treated" >> sample_metadata.csv
echo "sample4,treated" >> sample_metadata.csv

# R script for DESeq2 analysis
Rscript -e '
library(DESeq2)

# --- Configuration ---
# Define input and output file paths
count_matrix_file <- "count_matrix.csv"        # Replace with your actual count matrix file
sample_metadata_file <- "sample_metadata.csv"  # Replace with your actual sample metadata file
output_results_file <- "deseq2_results.csv"
design_formula <- "~ condition"  # Replace "condition" with your actual primary experimental variable

# --- Data Loading ---
# Load count data
# Assuming the first column is gene names and subsequent columns are sample counts
count_data <- read.csv(count_matrix_file, row.names = 1, check.names = FALSE)
# Ensure counts are integers
count_data <- round(count_data)

# Load sample metadata
# Assuming the first column is sample names matching count_data columns
sample_metadata <- read.csv(sample_metadata_file, row.names = 1)
# Ensure sample names match and are in the same order
sample_metadata <- sample_metadata[colnames(count_data), , drop = FALSE]

# --- DESeq2 Analysis ---
# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_metadata,
                              design = as.formula(design_formula))

# Filter out genes with total raw counts less than 1 across all samples
# This means genes with a sum of 0 counts across all samples are removed.
dds <- dds[rowSums(counts(dds)) >= 1, ]

# Run DESeq2 differential expression analysis
dds <- DESeq(dds)

# Extract results with FDR level 0.05 (alpha=0.05)
# Adjust contrast if needed, e.g., contrast=c("condition", "treated", "control")
res <- results(dds, alpha = 0.05)

# Order results by adjusted p-value
res <- res[order(res$padj), ]

# Save the results
write.csv(as.data.frame(res), file = output_results_file)

# Display summary of results
summary(res)
'

# Clean up placeholder files (optional)
# rm count_matrix.csv sample_metadata.csv
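To list the genes that pass the FDR level of 0.05, the results table can be filtered on its adjusted p-value column. This is a minimal sketch, assuming the column layout that `write.csv(as.data.frame(res))` produces for default DESeq2 results (gene, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) and gene identifiers that contain no commas. The helper name `significant_genes` is hypothetical, and treating `padj < 0.05` as the selection rule is an interpretation of the description, not something the source states explicitly.

```shell
#!/bin/sh
# Hypothetical helper: keep the header plus rows whose padj is below a cutoff.
# Arguments: results CSV, optional cutoff (defaults to 0.05).
# Assumes padj is column 7, as in default DESeq2 results written by write.csv.
significant_genes() {
  awk -F ',' -v cutoff="${2:-0.05}" '
    NR == 1 { print; next }                     # header row
    $7 != "NA" && ($7 + 0) < cutoff { print }   # drop genes with NA padj
  ' "$1"
}

# Example usage (file names are placeholders):
# significant_genes deseq2_results.csv > deseq2_significant.csv
```

Genes with an `NA` adjusted p-value (for example those removed by DESeq2's independent filtering) are skipped rather than counted as significant.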
Raw Source Text
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to GRCm38 whole genome using STAR 2.7.3a with extra arguments --outFilterMultimapNmax 50 --outFilterMultimapScoreRange 3 --outFilterScoreMinOverLread 0.7 --outFilterMatchNminOverLread 0.7 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2
Reads aligning to gene exons (Mus_musculus.GRCm38.102.gtf) were counted using featureCounts v2.0.3 program from Subread package, summarized at gene level with extra parameters -s 0
Differential gene expression at FDR level 0.05 was obtained using DESeq2 R package, genes with total raw counts less than 1 across all samples were excluded from analysis
Assembly: GRCm38.p6
Supplementary files format and content: tab-delimited text file includes raw read counts for all samples
Supplementary files format and content: tab-delimited text file includes differential gene expression analysis output from DESeq2