GSE114820 Processing Pipeline

RNA-Seq code_examples 4 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

A Transcriptomic Analysis of the Development of Skeletal Muscle Atrophy in Cancer-Cachexia in Tumor-Bearing Mice

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequenced reads were performed quality assessment using FastQC.

FastQC v0.11.9 GitHub

$ Bash example

# Install FastQC via Conda
# conda install -c bioconda fastqc

# Run FastQC on input FASTQ files
# Replace 'input_reads.fastq.gz' with your actual input file(s)
# Replace 'output_directory' with your desired output location
fastqc input_reads.fastq.gz -o output_directory

View on GitHub

Sequenced reads were then aligend to GRCm38.p5 reference genome using Hisat2 tool.

HISAT2 v2.2.1 GitHub

$ Bash example

# Install HISAT2 (example using conda)
# conda install -c bioconda hisat2

# Create a directory for reference genome and index
mkdir -p reference_genome
cd reference_genome

# Download GRCm38.p5 reference genome from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/24/GCF_000001635.24_GRCm38.p5/GCF_000001635.24_GRCm38.p5_genomic.fna.gz
gunzip GCF_000001635.24_GRCm38.p5_genomic.fna.gz

# Build HISAT2 index
# The index name 'GRCm38_p5' is chosen to reflect the reference genome.
hisat2-build GCF_000001635.24_GRCm38.p5_genomic.fna GRCm38_p5

cd ..

# Define input and output files
READS_R1="reads_R1.fastq.gz" # Placeholder for forward reads
READS_R2="reads_R2.fastq.gz" # Placeholder for reverse reads (if paired-end)
OUTPUT_SAM="aligned_reads.sam"
OUTPUT_BAM="aligned_reads.bam"

# Align sequenced reads to GRCm38.p5 reference genome using HISAT2
# Assuming paired-end reads and outputting SAM format, then converting to BAM and sorting.
# --dta: Report alignments tailored for transcript assemblers like StringTie.
# -p 8: Use 8 threads.
hisat2 -x reference_genome/GRCm38_p5 -1 ${READS_R1} -2 ${READS_R2} --dta -p 8 -S ${OUTPUT_SAM}

# Convert SAM to BAM and sort (requires samtools)
# conda install -c bioconda samtools
samtools view -bS ${OUTPUT_SAM} | samtools sort -o ${OUTPUT_BAM}

# Index the BAM file
samtools index ${OUTPUT_BAM}

View on GitHub

Raw read counts were then obtained from aligned files using HTSeq.

HTSeq vInferred with models/gemini-2.5-flash

$ Bash example

# Install HTSeq (if not already installed)
# conda install -c bioconda htseq

# Placeholder for reference genome annotation (e.g., GENCODE for hg38)
# Replace with the actual path to your GTF file
GTF_FILE="/path/to/gencode.v38.annotation.gtf"

# Placeholder for aligned BAM file
# Replace with the actual path to your aligned BAM file
ALIGNED_BAM="/path/to/aligned_reads.bam"

# Output file for raw read counts
OUTPUT_COUNTS="gene_counts.txt"

# Obtain raw read counts using htseq-count
# Parameters are common defaults for RNA-seq counting:
# -f bam: Input file format is BAM
# -r pos: Reads are sorted by position (common for aligners like STAR)
# -s reverse: Strandedness (adjust based on library preparation, 'yes', 'no', or 'reverse')
# -a 10: Minimum alignment quality score to consider (default is 10)
# -t exon: Feature type to count (e.g., 'exon' for gene-level counts)
# -i gene_id: Attribute in GTF file to use as feature ID (e.g., 'gene_id')
# --mode union: How to handle reads overlapping multiple features (union is common)
htseq-count -f bam -r pos -s reverse -a 10 -t exon -i gene_id --mode union "${ALIGNED_BAM}" "${GTF_FILE}" > "${OUTPUT_COUNTS}"

Differential expression was analyzed by DESeq2

DESeq2 v1.44.0 GitHub

$ Bash example

# Install R if not already installed
# sudo apt-get update
# sudo apt-get install r-base

# Install BiocManager if not already installed
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'

# Install DESeq2
# R -e 'BiocManager::install("DESeq2")'

# Create dummy input files for demonstration
# In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon, Kallisto)
mkdir -p data
cat << EOF > data/counts.tsv
gene_id\tsample1\tsample2\tsample3\tsample4\tsample5\tsample6
geneA\t100\t120\t110\t200\t210\t220
geneB\t50\t60\t55\t100\t110\t105
geneC\t200\t210\t205\t100\t90\t95
geneD\t10\t12\t11\t20\t21\t22
geneE\t5\t6\t5\t10\t11\t10
EOF

cat << EOF > data/sample_info.tsv
sample_id\tcondition
sample1\tcontrol
sample2\tcontrol
sample3\tcontrol
sample4\ttreated
sample5\ttreated
sample6\ttreated
EOF

# Define the R script for DESeq2 analysis
DESEQ2_SCRIPT="
# Load DESeq2 library
library(DESeq2)

# --- Configuration ---
# Input count matrix file (e.g., from featureCounts, Salmon, Kallisto)
# This file should have genes/features as rows and samples as columns.
count_matrix_file <- \"data/counts.tsv\"

# Sample information file (e.g., CSV or TSV)
# This file should have sample names matching column names in count_matrix_file,
# and columns for experimental conditions (e.g., \"condition\", \"treatment\").
sample_info_file <- \"data/sample_info.tsv\"

# Output file for differential expression results
output_results_file <- \"deseq2_results.tsv\"

# Design formula (e.g., \"~ condition\", \"~ treatment + condition\")
# This specifies the variables to use in the differential expression model.
design_formula_str <- \"~ condition\"

# --- Load Data ---
# Read count matrix
count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = \"\t\")
# Ensure counts are integers
count_data <- round(count_data)

# Read sample information
sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = \"\t\")

# Ensure sample names match and are in the same order
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# --- Create DESeqDataSet object ---
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = as.formula(design_formula_str))

# --- Run DESeq2 analysis ---
dds <- DESeq(dds)

# --- Extract Results ---
# Example: comparing \"treated\" vs \"control\" if \"condition\" has these levels
# Replace \"condition\", \"treated\", \"control\" with actual column name and levels from your sample_info
results_obj <- results(dds, contrast = c(\"condition\", \"treated\", \"control\"))

# Order results by adjusted p-value
results_obj <- results_obj[order(results_obj$padj), ]

# --- Save Results ---
write.table(as.data.frame(results_obj), file = output_results_file, sep = \"\t\", quote = FALSE, row.names = TRUE)

message(paste(\"Differential expression results saved to:\", output_results_file))
"

# Execute the R script
Rscript -e "$DESEQ2_SCRIPT"

View on GitHub

Tools Used

DESeq2

Raw Source Text

Sequenced reads were performed quality assessment using FastQC.
Sequenced reads were then aligend to GRCm38.p5 reference genome using Hisat2 tool.
Raw read counts were then obtained from aligned files using HTSeq.
Differential expression was analyzed by DESeq2
Genome_build: GRCm38.p5
Supplementary_files_format_and_content: Test files include raw read counts for each sample and expression data.

← Back to Analysis