GSE114820 Processing Pipeline

RNA-Seq code_examples 4 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

GSE114820

A Transcriptomic Analysis of the Development of Skeletal Muscle Atrophy in Cancer-Cachexia in Tumor-Bearing Mice

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequenced reads were performed quality assessment using FastQC.

    FastQC v0.11.9 GitHub
    $ Bash example
    # Install FastQC via Conda
    # conda install -c bioconda fastqc
    
    # Run FastQC on input FASTQ files
    # Replace 'input_reads.fastq.gz' with your actual input file(s)
    # Replace 'output_directory' with your desired output location
    fastqc input_reads.fastq.gz -o output_directory
  2. 2

    Sequenced reads were then aligend to GRCm38.p5 reference genome using Hisat2 tool.

    HISAT2 v2.2.1 GitHub
    $ Bash example
    # Install HISAT2 (example using conda)
    # conda install -c bioconda hisat2
    
    # Create a directory for reference genome and index
    mkdir -p reference_genome
    cd reference_genome
    
    # Download GRCm38.p5 reference genome from NCBI
    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/24/GCF_000001635.24_GRCm38.p5/GCF_000001635.24_GRCm38.p5_genomic.fna.gz
    gunzip GCF_000001635.24_GRCm38.p5_genomic.fna.gz
    
    # Build HISAT2 index
    # The index name 'GRCm38_p5' is chosen to reflect the reference genome.
    hisat2-build GCF_000001635.24_GRCm38.p5_genomic.fna GRCm38_p5
    
    cd ..
    
    # Define input and output files
    READS_R1="reads_R1.fastq.gz" # Placeholder for forward reads
    READS_R2="reads_R2.fastq.gz" # Placeholder for reverse reads (if paired-end)
    OUTPUT_SAM="aligned_reads.sam"
    OUTPUT_BAM="aligned_reads.bam"
    
    # Align sequenced reads to GRCm38.p5 reference genome using HISAT2
    # Assuming paired-end reads and outputting SAM format, then converting to BAM and sorting.
    # --dta: Report alignments tailored for transcript assemblers like StringTie.
    # -p 8: Use 8 threads.
    hisat2 -x reference_genome/GRCm38_p5 -1 ${READS_R1} -2 ${READS_R2} --dta -p 8 -S ${OUTPUT_SAM}
    
    # Convert SAM to BAM and sort (requires samtools)
    # conda install -c bioconda samtools
    samtools view -bS ${OUTPUT_SAM} | samtools sort -o ${OUTPUT_BAM}
    
    # Index the BAM file
    samtools index ${OUTPUT_BAM}
  3. 3

    Raw read counts were then obtained from aligned files using HTSeq.

    HTSeq vInferred with models/gemini-2.5-flash
    $ Bash example
    # Install HTSeq (if not already installed)
    # conda install -c bioconda htseq
    
    # Placeholder for reference genome annotation (e.g., GENCODE for hg38)
    # Replace with the actual path to your GTF file
    GTF_FILE="/path/to/gencode.v38.annotation.gtf"
    
    # Placeholder for aligned BAM file
    # Replace with the actual path to your aligned BAM file
    ALIGNED_BAM="/path/to/aligned_reads.bam"
    
    # Output file for raw read counts
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Obtain raw read counts using htseq-count
    # Parameters are common defaults for RNA-seq counting:
    # -f bam: Input file format is BAM
    # -r pos: Reads are sorted by position (common for aligners like STAR)
    # -s reverse: Strandedness (adjust based on library preparation, 'yes', 'no', or 'reverse')
    # -a 10: Minimum alignment quality score to consider (default is 10)
    # -t exon: Feature type to count (e.g., 'exon' for gene-level counts)
    # -i gene_id: Attribute in GTF file to use as feature ID (e.g., 'gene_id')
    # --mode union: How to handle reads overlapping multiple features (union is common)
    htseq-count -f bam -r pos -s reverse -a 10 -t exon -i gene_id --mode union "${ALIGNED_BAM}" "${GTF_FILE}" > "${OUTPUT_COUNTS}"
  4. 4

    Differential expression was analyzed by DESeq2

    $ Bash example
    # Install R if not already installed
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Install BiocManager if not already installed
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    
    # Install DESeq2
    # R -e 'BiocManager::install("DESeq2")'
    
    # Create dummy input files for demonstration
    # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto)
    mkdir -p data
    cat << EOF > data/counts.tsv
    gene_id\tsample1\tsample2\tsample3\tsample4\tsample5\tsample6
    geneA\t100\t120\t110\t200\t210\t220
    geneB\t50\t60\t55\t100\t110\t105
    geneC\t200\t210\t205\t100\t90\t95
    geneD\t10\t12\t11\t20\t21\t22
    geneE\t5\t6\t5\t10\t11\t10
    EOF
    
    cat << EOF > data/sample_info.tsv
    sample_id\tcondition
    sample1\tcontrol
    sample2\tcontrol
    sample3\tcontrol
    sample4\ttreated
    sample5\ttreated
    sample6\ttreated
    EOF
    
    # Define the R script for DESeq2 analysis
    DESEQ2_SCRIPT="
    # Load DESeq2 library
    library(DESeq2)
    
    # --- Configuration ---
    # Input count matrix file (e.g., from featureCounts, Salmon, Kallisto)
    # This file should have genes/features as rows and samples as columns.
    count_matrix_file <- \"data/counts.tsv\"
    
    # Sample information file (e.g., CSV or TSV)
    # This file should have sample names matching column names in count_matrix_file,
    # and columns for experimental conditions (e.g., \"condition\", \"treatment\").
    sample_info_file <- \"data/sample_info.tsv\"
    
    # Output file for differential expression results
    output_results_file <- \"deseq2_results.tsv\"
    
    # Design formula (e.g., \"~ condition\", \"~ treatment + condition\")
    # This specifies the variables to use in the differential expression model.
    design_formula_str <- \"~ condition\"
    
    # --- Load Data ---
    # Read count matrix
    count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = \"\t\")
    # Ensure counts are integers
    count_data <- round(count_data)
    
    # Read sample information
    sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = \"\t\")
    
    # Ensure sample names match and are in the same order
    sample_info <- sample_info[colnames(count_data), , drop = FALSE]
    
    # --- Create DESeqDataSet object ---
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = as.formula(design_formula_str))
    
    # --- Run DESeq2 analysis ---
    dds <- DESeq(dds)
    
    # --- Extract Results ---
    # Example: comparing \"treated\" vs \"control\" if \"condition\" has these levels
    # Replace \"condition\", \"treated\", \"control\" with actual column name and levels from your sample_info
    results_obj <- results(dds, contrast = c(\"condition\", \"treated\", \"control\"))
    
    # Order results by adjusted p-value
    results_obj <- results_obj[order(results_obj$padj), ]
    
    # --- Save Results ---
    write.table(as.data.frame(results_obj), file = output_results_file, sep = \"\t\", quote = FALSE, row.names = TRUE)
    
    message(paste(\"Differential expression results saved to:\", output_results_file))
    "
    
    # Execute the R script
    Rscript -e "$DESEQ2_SCRIPT"

Tools Used

Raw Source Text
Sequenced reads were performed quality assessment using FastQC.
Sequenced reads were then aligend to GRCm38.p5 reference genome using Hisat2 tool.
Raw read counts were then obtained from aligned files using HTSeq.
Differential expression was analyzed by DESeq2
Genome_build: GRCm38.p5
Supplementary_files_format_and_content: Test files include raw read counts for each sample and expression data.
← Back to Analysis