GSE114820 Processing Pipeline
RNA-Seq
code_examples
4 steps
Publication
Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348
Dataset
GSE114820A Transcriptomic Analysis of the Development of Skeletal Muscle Atrophy in Cancer-Cachexia in Tumor-Bearing Mice
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were performed quality assessment using FastQC.
$ Bash example
# Install FastQC via Conda # conda install -c bioconda fastqc # Run FastQC on input FASTQ files # Replace 'input_reads.fastq.gz' with your actual input file(s) # Replace 'output_directory' with your desired output location fastqc input_reads.fastq.gz -o output_directory
-
2
Sequenced reads were then aligend to GRCm38.p5 reference genome using Hisat2 tool.
$ Bash example
# Install HISAT2 (example using conda) # conda install -c bioconda hisat2 # Create a directory for reference genome and index mkdir -p reference_genome cd reference_genome # Download GRCm38.p5 reference genome from NCBI wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/24/GCF_000001635.24_GRCm38.p5/GCF_000001635.24_GRCm38.p5_genomic.fna.gz gunzip GCF_000001635.24_GRCm38.p5_genomic.fna.gz # Build HISAT2 index # The index name 'GRCm38_p5' is chosen to reflect the reference genome. hisat2-build GCF_000001635.24_GRCm38.p5_genomic.fna GRCm38_p5 cd .. # Define input and output files READS_R1="reads_R1.fastq.gz" # Placeholder for forward reads READS_R2="reads_R2.fastq.gz" # Placeholder for reverse reads (if paired-end) OUTPUT_SAM="aligned_reads.sam" OUTPUT_BAM="aligned_reads.bam" # Align sequenced reads to GRCm38.p5 reference genome using HISAT2 # Assuming paired-end reads and outputting SAM format, then converting to BAM and sorting. # --dta: Report alignments tailored for transcript assemblers like StringTie. # -p 8: Use 8 threads. hisat2 -x reference_genome/GRCm38_p5 -1 ${READS_R1} -2 ${READS_R2} --dta -p 8 -S ${OUTPUT_SAM} # Convert SAM to BAM and sort (requires samtools) # conda install -c bioconda samtools samtools view -bS ${OUTPUT_SAM} | samtools sort -o ${OUTPUT_BAM} # Index the BAM file samtools index ${OUTPUT_BAM} -
3
Raw read counts were then obtained from aligned files using HTSeq.
HTSeq vInferred with models/gemini-2.5-flash$ Bash example
# Install HTSeq (if not already installed) # conda install -c bioconda htseq # Placeholder for reference genome annotation (e.g., GENCODE for hg38) # Replace with the actual path to your GTF file GTF_FILE="/path/to/gencode.v38.annotation.gtf" # Placeholder for aligned BAM file # Replace with the actual path to your aligned BAM file ALIGNED_BAM="/path/to/aligned_reads.bam" # Output file for raw read counts OUTPUT_COUNTS="gene_counts.txt" # Obtain raw read counts using htseq-count # Parameters are common defaults for RNA-seq counting: # -f bam: Input file format is BAM # -r pos: Reads are sorted by position (common for aligners like STAR) # -s reverse: Strandedness (adjust based on library preparation, 'yes', 'no', or 'reverse') # -a 10: Minimum alignment quality score to consider (default is 10) # -t exon: Feature type to count (e.g., 'exon' for gene-level counts) # -i gene_id: Attribute in GTF file to use as feature ID (e.g., 'gene_id') # --mode union: How to handle reads overlapping multiple features (union is common) htseq-count -f bam -r pos -s reverse -a 10 -t exon -i gene_id --mode union "${ALIGNED_BAM}" "${GTF_FILE}" > "${OUTPUT_COUNTS}" -
4
Differential expression was analyzed by DESeq2
$ Bash example
# Install R if not already installed # sudo apt-get update # sudo apt-get install r-base # Install BiocManager if not already installed # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # Install DESeq2 # R -e 'BiocManager::install("DESeq2")' # Create dummy input files for demonstration # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto) mkdir -p data cat << EOF > data/counts.tsv gene_id\tsample1\tsample2\tsample3\tsample4\tsample5\tsample6 geneA\t100\t120\t110\t200\t210\t220 geneB\t50\t60\t55\t100\t110\t105 geneC\t200\t210\t205\t100\t90\t95 geneD\t10\t12\t11\t20\t21\t22 geneE\t5\t6\t5\t10\t11\t10 EOF cat << EOF > data/sample_info.tsv sample_id\tcondition sample1\tcontrol sample2\tcontrol sample3\tcontrol sample4\ttreated sample5\ttreated sample6\ttreated EOF # Define the R script for DESeq2 analysis DESEQ2_SCRIPT=" # Load DESeq2 library library(DESeq2) # --- Configuration --- # Input count matrix file (e.g., from featureCounts, Salmon, Kallisto) # This file should have genes/features as rows and samples as columns. count_matrix_file <- \"data/counts.tsv\" # Sample information file (e.g., CSV or TSV) # This file should have sample names matching column names in count_matrix_file, # and columns for experimental conditions (e.g., \"condition\", \"treatment\"). sample_info_file <- \"data/sample_info.tsv\" # Output file for differential expression results output_results_file <- \"deseq2_results.tsv\" # Design formula (e.g., \"~ condition\", \"~ treatment + condition\") # This specifies the variables to use in the differential expression model. design_formula_str <- \"~ condition\" # --- Load Data --- # Read count matrix count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = \"\t\") # Ensure counts are integers count_data <- round(count_data) # Read sample information sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = \"\t\") # Ensure sample names match and are in the same order sample_info <- sample_info[colnames(count_data), , drop = FALSE] # --- Create DESeqDataSet object --- dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = as.formula(design_formula_str)) # --- Run DESeq2 analysis --- dds <- DESeq(dds) # --- Extract Results --- # Example: comparing \"treated\" vs \"control\" if \"condition\" has these levels # Replace \"condition\", \"treated\", \"control\" with actual column name and levels from your sample_info results_obj <- results(dds, contrast = c(\"condition\", \"treated\", \"control\")) # Order results by adjusted p-value results_obj <- results_obj[order(results_obj$padj), ] # --- Save Results --- write.table(as.data.frame(results_obj), file = output_results_file, sep = \"\t\", quote = FALSE, row.names = TRUE) message(paste(\"Differential expression results saved to:\", output_results_file)) " # Execute the R script Rscript -e "$DESEQ2_SCRIPT"
Tools Used
Raw Source Text
Sequenced reads were performed quality assessment using FastQC. Sequenced reads were then aligend to GRCm38.p5 reference genome using Hisat2 tool. Raw read counts were then obtained from aligned files using HTSeq. Differential expression was analyzed by DESeq2 Genome_build: GRCm38.p5 Supplementary_files_format_and_content: Test files include raw read counts for each sample and expression data.