GSE145480 Processing Pipeline

RNA-Seq code_examples 11 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

Rodents as models of human sarcopenia: a comparative analysis reveals conserved modulators of aging-dependent muscle loss

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

RNA-seq reads were subjected to 3â adapter and poly(A)/poly(T) tail trimming using Cutadapt v1.9.1 from Martin et al., EMBnet.journal, 2011.

cutadapt v1.9.1 GitHub

$ Bash example

# Install Cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.9.1

# Run Cutadapt for 3' adapter and poly(A)/poly(T) tail trimming.
# The exact 3' adapter sequence should be known from the library preparation protocol.
# A common Illumina TruSeq 3' adapter is used as a placeholder.
# A{100} and T{100} are used to trim long poly(A) and poly(T) tails respectively.
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -a A{100} -a T{100} \
         -o trimmed_reads.fastq.gz \
         raw_reads.fastq.gz

View on GitHub

The following 3â adapter sequences were utilized for generating RNA-Seq libraries and subsequently trimmed: mate1 5â-AGATCGGAAGAGCACACGTC-3â, mate2 5â-AGATCGGAAGAGCGTCGTGT-3â.

RNA-seq v4.2 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

# Define input and output file names (placeholders)
READ1_IN="input_read1.fastq.gz"
READ2_IN="input_read2.fastq.gz"
READ1_OUT="trimmed_read1.fastq.gz"
READ2_OUT="trimmed_read2.fastq.gz"
UNPAIRED_READ1_OUT="unpaired_trimmed_read1.fastq.gz"
UNPAIRED_READ2_OUT="unpaired_trimmed_read2.fastq.gz"

# Define adapter sequences for 3' trimming
ADAPTER_MATE1="AGATCGGAAGAGCACACGTC"
ADAPTER_MATE2="AGATCGGAAGAGCGTCGTGT"

# Run cutadapt for paired-end trimming of 3' adapters
# -a ADAPTER_MATE1: 3' adapter for read 1
# -A ADAPTER_MATE2: 3' adapter for read 2
# -o READ1_OUT: Output file for trimmed read 1 (paired)
# -p READ2_OUT: Output file for trimmed read 2 (paired)
# --untrimmed-output: Optional, output reads that were not trimmed (read 1)
# --untrimmed-paired-output: Optional, output reads that were not trimmed (read 2)
# --minimum-length 20: Discard reads shorter than 20 bp after trimming (common practice)
# -j 4: Use 4 CPU cores for parallel processing (adjust as needed)
cutadapt \
  -a "${ADAPTER_MATE1}" \
  -A "${ADAPTER_MATE2}" \
  -o "${READ1_OUT}" \
  -p "${READ2_OUT}" \
  --untrimmed-output "${UNPAIRED_READ1_OUT}" \
  --untrimmed-paired-output "${UNPAIRED_READ2_OUT}" \
  --minimum-length 20 \
  -j 4 \
  "${READ1_IN}" "${READ2_IN}"

View on GitHub

Reads shorter than 30 nucleotides were discarded.

cutadapt (Inferred with models/gemini-2.5-flash) v4.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=4.1

# Define input and output file paths
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="output.fastq.gz"

# Discard reads shorter than 30 nucleotides
cutadapt -m 30 -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"

View on GitHub

The kallisto v0.43.1 software from Bray et al., Nature Biotechnology, 2016 was used for building transcriptome index and aligning filtered reads.

kallisto v0.43.1 GitHub

$ Bash example

# Install kallisto (if not already installed)
# conda install -c bioconda kallisto

# --- Placeholder Variables ---
# Replace with your actual transcriptome FASTA file (e.g., from Ensembl or GENCODE)
TRANSCRIPTOME_FASTA="Homo_sapiens.GRCh38.cdna.all.fa.gz" 
# Replace with your desired index output prefix
KALLISTO_INDEX_PREFIX="human_grch38_kallisto_index"
# Replace with your filtered read files (e.g., R1 and R2 for paired-end)
READS_R1="filtered_reads_R1.fastq.gz"
READS_R2="filtered_reads_R2.fastq.gz"
# Replace with your desired output directory for quantification results
QUANT_OUTPUT_DIR="kallisto_quant_output"

# --- Step 1: Build transcriptome index ---
# The -i flag specifies the output index file prefix.
# The input is the transcriptome FASTA file.
kallisto index -i "${KALLISTO_INDEX_PREFIX}" "${TRANSCRIPTOME_FASTA}"

# --- Step 2: Align/Quantify filtered reads ---
# The -i flag specifies the path to the kallisto index.
# The -o flag specifies the output directory for quantification results.
# Provide the filtered read files (space-separated for paired-end).
kallisto quant -i "${KALLISTO_INDEX_PREFIX}" -o "${QUANT_OUTPUT_DIR}" "${READS_R1}" "${READS_R2}"

View on GitHub

The default options of kallisto were utilized for building transcriptome index.

kallisto vNot specified GitHub

$ Bash example

# Install kallisto (if not already installed)
# conda install -c bioconda kallisto

# Placeholder for transcriptome FASTA file (e.g., human GRCh38/hg38 cDNA sequences)
# Replace with the actual path to your transcriptome FASTA file.
TRANSCRIPTOME_FASTA="/path/to/gencode.v44.transcripts.fa.gz"

# Output prefix for the kallisto index
INDEX_OUTPUT_PREFIX="kallisto_index"

# Build the transcriptome index using default options
kallisto index -i "${INDEX_OUTPUT_PREFIX}" "${TRANSCRIPTOME_FASTA}"

View on GitHub

For aligning filtered reads we used options â--rf-strandedâ and â--pseudobamâ.

Salmon (Inferred with models/gemini-2.5-flash) v1.10.0 GitHub

$ Bash example

# Install Salmon (example using conda)
# conda install -c bioconda salmon

# Example: Create a Salmon index using a reference transcriptome
# Replace 'human_gencode_v44_transcriptome.fa' with your actual transcriptome FASTA file (e.g., from GENCODE or Ensembl)
# salmon index -t human_gencode_v44_transcriptome.fa -i salmon_index --gencode

# Run Salmon quantification with specified options
# Replace 'filtered_reads.fastq.gz' with your actual input filtered reads file
# Replace 'salmon_quant_output' with your desired output directory
salmon quant -i salmon_index -l A --rf-stranded --pseudobam -r filtered_reads.fastq.gz -o salmon_quant_output

View on GitHub

Mapped reads were assigned to transcripts in a weighted manner: if a read was uniquely mapped to a transcript, then the transcriptâs read count was incremented by 1; if a read was mapped to n different transcripts, each transcriptâs read count was incremented by 1/n.

RSEM (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install RSEM
# conda install -c bioconda rsem

# Placeholder for reference genome and annotation files (e.g., GRCh38)
# Replace with actual paths to your genome FASTA and GTF files
GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
GTF_FILE="path/to/gencode.v38.annotation.gtf"
RSEM_REFERENCE_NAME="GRCh38_rsem_ref"

# Build RSEM reference index (if not already built)
# This step only needs to be run once per reference
# rsem-prepare-reference --gtf "${GTF_FILE}" "${GENOME_FASTA}" "${RSEM_REFERENCE_NAME}"

# Input BAM file (e.g., from STAR alignment)
INPUT_BAM="aligned_reads.bam"

# Output prefix for RSEM results
OUTPUT_PREFIX="transcript_quantification"

# Quantify transcript expression using RSEM
# RSEM uses an Expectation-Maximization (EM) algorithm to estimate transcript abundances,
# which inherently handles multi-mapping reads by distributing their counts probabilistically.
# This effectively assigns fractional counts similar to the 1/n description.
# The description implies single-end reads, as it talks about "a read" being mapped.
# If reads are paired-end, use --paired-end instead of --single-end.
rsem-calculate-expression \
    --bam \
    --single-end \
    "${INPUT_BAM}" \
    "${RSEM_REFERENCE_NAME}" \
    "${OUTPUT_PREFIX}"

View on GitHub

The expression of each transcript was estimated in transcripts per million (TPM) units by dividing its read count by the transcript length and normalizing to the library size.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem

# Download reference genome and annotation (example for human GRCh38, Gencode v38)
# mkdir -p references/rsem_index
# wget -P references/rsem_index https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa.gz
# wget -P references/rsem_index https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
# gunzip references/rsem_index/*.gz

# Build RSEM reference index (this step needs to be done once per reference)
# rsem-prepare-reference \
#   --gtf references/rsem_index/gencode.v38.annotation.gtf \
#   references/rsem_index/GRCh38.primary_assembly.genome.fa \
#   references/rsem_index/GRCh38_gencode_v38_rsem_index

# Define input and output paths
INPUT_READS_R1="path/to/your/sample_R1.fastq.gz" # Replace with actual input R1 file
INPUT_READS_R2="path/to/your/sample_R2.fastq.gz" # Replace with actual input R2 file (remove if single-end)
RSEM_INDEX="references/rsem_index/GRCh38_gencode_v38_rsem_index" # Path to the RSEM index created above
OUTPUT_PREFIX="results/sample_expression" # Prefix for output files

# Run RSEM to estimate transcript expression in TPM
# Using --paired-end for paired-end reads, adjust if single-end
# -p specifies number of threads
# --output-genome-bam generates a BAM file aligned to the genome, useful for downstream analysis
rsem-calculate-expression \
  --paired-end \
  -p 8 \
  --output-genome-bam \
  "${INPUT_READS_R1}" \
  "${INPUT_READS_R2}" \
  "${RSEM_INDEX}" \
  "${OUTPUT_PREFIX}"

View on GitHub

The expression of a gene was obtained by summing up the normalized expression of the transcripts associated with it.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install RSEM if not already installed
# conda install -c bioconda rsem

# --- Reference Preparation (if not already done) ---
# RSEM requires a reference transcriptome index. This step needs to be run once per reference.
# Assuming GRCh38 as the latest assembly placeholder.
# Example commands to download human genome FASTA and GTF (e.g., from Ensembl release 109):
# wget -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# wget -O Homo_sapiens.GRCh38.109.gtf.gz ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh38.109.gtf.gz

# Create RSEM reference index using the GTF annotation and genome FASTA:
# rsem-prepare-reference \
# --gtf Homo_sapiens.GRCh38.109.gtf \
# Homo_sapiens.GRCh38.dna.primary_assembly.fa \
# GRCh38_rsem_index

# --- Gene Expression Quantification ---
# This command calculates gene and transcript expression from aligned reads.
# The 'genes.results' output file will contain gene-level expression, which RSEM derives by aggregating transcript-level estimates.
# Input: Aligned reads (e.g., from STAR, HISAT2) in BAM format.
# '--bam' indicates the input is a BAM file.
# '--no-qualities' is often used when input BAM is from an external aligner and quality scores are not needed by RSEM's internal processing.
# '-p 8' specifies to use 8 threads for parallel processing.
# 'aligned_reads.bam' is a placeholder for your input BAM file.
# 'GRCh38_rsem_index' is the path to the RSEM reference index directory created in the previous step.
# 'sample_output_prefix' will be used to name the output files (e.g., sample_output_prefix.genes.results, sample_output_prefix.isoforms.results).
# For paired-end reads, add '--paired-end' after '--bam'.

rsem-calculate-expression \
--bam \
--no-qualities \
-p 8 \
aligned_reads.bam \
GRCh38_rsem_index \
sample_output_prefix

View on GitHub

For every gene, read counts of transcripts associated with this gene were also summed up and further used for the differential expression analysis.

tximport (Inferred with models/gemini-2.5-flash) v1.30.0 GitHub

$ Bash example

# --- Installation (commented out) ---
# Install R and Bioconductor (if not already installed)
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install(c("tximport", "readr"))'

# --- Configuration ---
# Define paths to transcript quantification results (e.g., Salmon output directories)
# Replace with actual paths to your sample quantification directories
# Example:
# QUANT_DIRS=(
#     "data/salmon_output/sample1"
#     "data/salmon_output/sample2"
#     "data/salmon_output/sample3"
# )
# For demonstration, using dummy paths:
QUANT_DIRS=(
    "path/to/sample1_quant_dir"
    "path/to/sample2_quant_dir"
    "path/to/sample3_quant_dir"
)

# Define the type of quantification output (e.g., "salmon", "kallisto", "rsem")
QUANT_TYPE="salmon"

# Define the path to the transcript-to-gene mapping file (tx2gene.tsv)
# This file typically has two columns: 'TXNAME' and 'GENEID'.
# It can be generated from your GTF/GFF annotation file (e.g., gencode.v44.annotation.gtf.gz for human GRCh38)
# using R packages like 'rtracklayer' and 'GenomicFeatures'.
# Example command to generate tx2gene.tsv (requires R, rtracklayer, GenomicFeatures, readr):
# R -e 'library(rtracklayer); library(GenomicFeatures); library(readr); \
#      gtf_path <- "path/to/gencode.v44.annotation.gtf.gz"; \
#      txdb <- makeTxDbFromGRanges(rtracklayer::import(gtf_path)); \
#      k <- keys(txdb, keytype = "TXNAME"); \
#      tx2gene_df <- select(txdb, k, "GENEID", "TXNAME"); \
#      colnames(tx2gene_df) <- c("TXNAME", "GENEID"); \
#      write_tsv(tx2gene_df, "path/to/tx2gene.tsv")'
TX2GENE_FILE="path/to/tx2gene.tsv"

# Define the output file for the gene-level count matrix
OUTPUT_GENE_COUNTS="gene_level_counts.tsv"

# --- Create R script for tximport ---
cat << 'EOF' > run_tximport.R
library(tximport)
library(readr)
library(dplyr) # For potential data manipulation, though not strictly needed for basic tximport

# Get variables from environment
quant_dirs_str <- Sys.getenv("QUANT_DIRS")
quant_type <- Sys.getenv("QUANT_TYPE")
tx2gene_file <- Sys.getenv("TX2GENE_FILE")
output_file <- Sys.getenv("OUTPUT_GENE_COUNTS")

# Parse quantification directories string into a vector
quant_dirs <- unlist(strsplit(quant_dirs_str, ","))

# Construct full paths to quantification files (e.g., quant.sf for Salmon)
# Adjust "quant.sf" if using a different quantification tool (e.g., "abundance.h5" for Kallisto)
files <- file.path(quant_dirs, "quant.sf")
names(files) <- basename(quant_dirs) # Use directory names as sample names

message(paste("Loading tx2gene mapping from:", tx2gene_file))
tx2gene <- read_tsv(tx2gene_file)

message(paste("Importing and aggregating counts using tximport (type:", quant_type, ")..."))
# ignoreTxVersion = TRUE is often useful for GTF files with version numbers in transcript IDs
txi <- tximport(files, type = quant_type, tx2gene = tx2gene, ignoreTxVersion = TRUE)

# Extract gene-level counts
gene_counts <- as.data.frame(txi$counts)

message(paste("Saving gene-level counts to:", output_file))
write.table(gene_counts, file = output_file, sep = "\t", quote = FALSE, col.names = NA)

message("Gene-level count aggregation complete.")
EOF

# --- Execution ---
# Export variables to be accessible by the R script
export QUANT_DIRS=$(IFS=,; echo "${QUANT_DIRS[*]}")
export QUANT_TYPE="${QUANT_TYPE}"
export TX2GENE_FILE="${TX2GENE_FILE}"
export OUTPUT_GENE_COUNTS="${OUTPUT_GENE_COUNTS}"

# Run the R script
Rscript run_tximport.R

# Clean up the temporary R script
rm run_tximport.R

View on GitHub

As the reference mouse transcriptome, we considered sequences of protein coding transcripts with the support level 1-3 based on genome assembly GRCm38 (release 92) and transcript annotations from Ensembl database (see Hubbard et al., Nucleic Acids Research, 2002).

Ensembl vrelease 92

$ Bash example

mkdir -p ensembl_grcm38_release92
cd ensembl_grcm38_release92

# Download mouse cDNA sequences (all transcripts) for GRCm38, Ensembl release 92
# This file contains both protein-coding and non-coding transcripts.
wget ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz

# Download mouse GTF annotation file for GRCm38, Ensembl release 92
# This file is necessary to identify protein-coding transcripts and their support levels.
wget ftp://ftp.ensembl.org/pub/release-92/gtf/mus_musculus/Mus_musculus.GRCm38.92.gtf.gz

# Unzip files
gunzip Mus_musculus.GRCm38.cdna.all.fa.gz
gunzip Mus_musculus.GRCm38.92.gtf.gz

# The description specifies "sequences of protein coding transcripts with the support level 1-3".
# This implies a post-download filtering step using the GTF annotation.
# For example, one would parse Mus_musculus.GRCm38.92.gtf to identify transcript IDs
# that are 'protein_coding' and have a Transcript Support Level (TSL) of 1, 2, or 3.
# Then, these filtered transcript IDs would be used to extract the corresponding sequences
# from Mus_musculus.GRCm38.cdna.all.fa.
# A specific tool for this filtering is not mentioned in the description.
# This bash block provides the raw Ensembl data from which such a reference would be derived.

Tools Used

RNA-seq

Raw Source Text

RNA-seq reads were subjected to 3â adapter and poly(A)/poly(T) tail trimming using Cutadapt v1.9.1 from Martin et al., EMBnet.journal, 2011. The following 3â adapter sequences were utilized for generating RNA-Seq libraries and subsequently trimmed: mate1 5â-AGATCGGAAGAGCACACGTC-3â, mate2 5â-AGATCGGAAGAGCGTCGTGT-3â.
Reads shorter than 30 nucleotides were discarded.
The kallisto v0.43.1 software from Bray et al., Nature Biotechnology, 2016 was used for building transcriptome index and aligning filtered reads. The default options of kallisto were utilized for building transcriptome index. For aligning filtered reads we used options â--rf-strandedâ and â--pseudobamâ.
Mapped reads were assigned to transcripts in a weighted manner: if a read was uniquely mapped to a transcript, then the transcriptâs read count was incremented by 1; if a read was mapped to n different transcripts, each transcriptâs read count was incremented by 1/n.
The expression of each transcript was estimated in transcripts per million (TPM) units by dividing its read count by the transcript length and normalizing to the library size. The expression of a gene was obtained by summing up the normalized expression of the transcripts associated with it. For every gene, read counts of transcripts associated with this gene were also summed up and further used for the differential expression analysis.
As the reference mouse transcriptome, we considered sequences of protein coding transcripts with the support level 1-3 based on genome assembly GRCm38 (release 92) and transcript annotations from Ensembl database (see Hubbard et al., Nucleic Acids Research, 2002).
Genome_build: GRCm38
Supplementary_files_format_and_content: counts_mouse_aging_timecourse.txt: Tab-delimited text file includes raw counts at the gene level for each sample.
Supplementary_files_format_and_content: tpms_mouse_aging_timecourse.txt: Tab-delimited text file includes TPM values at the gene level for each sample.

← Back to Analysis