GSE277160 Processing Pipeline

RNA-Seq code_examples 8 steps

Publication

Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation.

Molecular cell (2024) — PMID 39303722

Dataset

Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

We aligned reads to the human genome version GRCh38 with annotation version Gencode v40 using STAR (v2.7.1a).

STAR v2.7.1a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.1a

# Define paths and variables
# Replace with actual paths to your STAR index and FASTQ files
GENOME_DIR="/path/to/STAR_index/GRCh38_Gencode_v40" # Pre-built STAR index for GRCh38 and Gencode v40
READ1_FASTQ="sample_R1.fastq.gz" # Input FASTQ file (read 1)
READ2_FASTQ="sample_R2.fastq.gz" # Input FASTQ file (read 2), remove if single-end
OUTPUT_PREFIX="sample_aligned_" # Prefix for output files
NUM_THREADS=8 # Number of threads to use, adjust as needed

# Align reads to the human genome GRCh38 with Gencode v40 annotation
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.05 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --runThreadN "${NUM_THREADS}"

View on GitHub

Gene expression levels were quantified using RSEM (v1.3.0), in which the TPM values were used to calculate Pearson correlation coefficient between replicates.

RSEM v1.3.0 GitHub

$ Bash example

# Install RSEM (if not already installed)
# For example, using conda:
# conda install -c bioconda rsem=1.3.0

# --- IMPORTANT: RSEM requires a pre-built reference index. ---
# Example command to build a reference (replace with actual paths and genome/transcriptome files):
# rsem-prepare-reference --gtf genome.gtf --bowtie2 genome.fa rsem_reference_index_name

# Define variables for reference and input files
RSEM_REFERENCE_INDEX="path/to/rsem_indexed_reference/hg38_rsem_index" # Replace with actual path to RSEM indexed reference (e.g., built from GRCh38)
READ1_FASTQ="sample_R1.fastq.gz" # Replace with actual path to R1 FASTQ file
READ2_FASTQ="sample_R2.fastq.gz" # Replace with actual path to R2 FASTQ file
OUTPUT_SAMPLE_NAME="sample_output" # Prefix for output files (e.g., sample_output.genes.results)

# Quantify gene expression levels using RSEM
# This command assumes paired-end, gzipped FASTQ files and uses RSEM's default aligner (often Bowtie2).
# For single-end reads, remove --paired-end and provide only one FASTQ file.
# For using STAR as the aligner, add --star option and ensure STAR is in your PATH
# and the reference was built with the --star option.
rsem-calculate-expression \
  --paired-end \
  --gzip-compressed \
  "${READ1_FASTQ}" \
  "${READ2_FASTQ}" \
  "${RSEM_REFERENCE_INDEX}" \
  "${OUTPUT_SAMPLE_NAME}"

View on GitHub

Samples meeting specific criteria were selected for further analysis: only those with unique aligned reads greater than 10M and a Pearson correlation coefficient between replicates of 0.9 or higher were considered.

QC filtering script (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# This script filters samples based on pre-computed Quality Control (QC) metrics.
# It assumes an input file (e.g., 'qc_metrics.tsv') containing sample-specific
# metrics, including unique aligned reads and replicate correlation.

# Define filtering thresholds
MIN_UNIQUE_ALIGNED_READS=10000000 # 10 million unique aligned reads
MIN_PEARSON_CORRELATION=0.9      # Pearson correlation coefficient of 0.9 or higher

# Placeholder for the actual input QC metrics file.
# In a real bioinformatics pipeline, this file would be generated by
# previous steps (e.g., alignment, QC metric calculation).
INPUT_QC_FILE="qc_metrics.tsv"

# Output file to store the selected sample IDs or their full QC rows
OUTPUT_SELECTED_SAMPLES="selected_samples.tsv"

# Filter samples using awk.
# This example assumes the 'unique_aligned_reads' metric is in the 2nd column ($2)
# and the 'replicate_correlation' metric is in the 3rd column ($3) of the
# tab-separated INPUT_QC_FILE.
# Adjust column numbers ($2, $3) based on the actual format of your 'qc_metrics.tsv'.
# The first line (header) is printed without filtering.
awk -v min_reads="$MIN_UNIQUE_ALIGNED_READS" -v min_corr="$MIN_PEARSON_CORRELATION" '
BEGIN { FS="\t"; OFS="\t" }
NR==1 { print; next } # Print header line
$2 >= min_reads && $3 >= min_corr { print }
' "$INPUT_QC_FILE" > "$OUTPUT_SELECTED_SAMPLES"

View on GitHub

Samples with fewer reads were re-sequenced to ensure an adequate count, while those with lower correlation were repeated.

Custom Quality Control Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This step describes an experimental decision-making process based on quality control metrics (read counts and correlation), rather than a specific, universally recognized computational tool. Re-sequencing and repeating experiments are wet-lab actions.
# A hypothetical custom script or pipeline module would typically perform checks like:

# Example: Check read count (hypothetical threshold)
# READ_COUNT=$(samtools flagstat input.bam | grep "total" | awk '{print $1}')
# if [ "$READ_COUNT" -lt 10000000 ]; then
#     echo "Sample has fewer reads ($READ_COUNT), flagged for re-sequencing."
# fi

# Example: Check replicate correlation (hypothetical threshold, assuming deepTools or similar was used to generate metrics)
# # CORRELATION_VALUE=$(python -c "import numpy as np; matrix=np.load('correlation_matrix.npz')['correlations']; print(matrix[0,1])") # Simplified
# # if (( $(echo "$CORRELATION_VALUE < 0.8" | bc -l) )); then
# #     echo "Sample has lower correlation, flagged for repeating."
# # fi

# No direct execution command for the described 're-sequencing' or 'repeating' actions, as these are experimental decisions.

Plus and minus strand bigwig files were generated using bedGraphToBigWig (v2.9).

UCSC tools vv2.9 GitHub

$ Bash example

# Install UCSC tools (bedGraphToBigWig is part of this suite)
# conda install -c bioconda ucsc-bedgraphtobigwig
# conda install -c bioconda ucsc-fetchchromsizes

# Placeholder for reference genome assembly (e.g., hg38)
REFERENCE_GENOME="hg38"

# Generate chrom.sizes file for the reference genome
fetchChromSizes ${REFERENCE_GENOME} > ${REFERENCE_GENOME}.chrom.sizes

# Generate plus strand bigWig file
bedGraphToBigWig plus_strand.bedGraph ${REFERENCE_GENOME}.chrom.sizes plus_strand.bigWig

# Generate minus strand bigWig file
bedGraphToBigWig minus_strand.bedGraph ${REFERENCE_GENOME}.chrom.sizes minus_strand.bigWig

View on GitHub

Transcript abundance was quantified using Salmon (v1.9.0) with the âgcBias option.

Salmon v1.9.0 GitHub

$ Bash example

# Install Salmon (if not already installed)
# conda install -c bioconda salmon

# --- Placeholder for reference data ---
# Salmon requires a transcriptome index. 
# Replace <TRANSCRIPTOME_FASTA> with the path to your transcriptome FASTA file (e.g., from GENCODE, Ensembl).
# Replace <INDEX_DIR> with your desired output directory for the index.
# Example for human GRCh38:
# salmon index -t /path/to/human_grch38_transcriptome.fasta.gz -i /path/to/salmon_index/human_grch38

# --- Quantification command ---
# Replace <INDEX_DIR> with the actual path to your Salmon index.
# Replace <READ1_FASTQ> and <READ2_FASTQ> with your input FASTQ files. 
# If single-end reads, use -r <READS_FASTQ> instead of -1 and -2.
# Replace <OUTPUT_DIR> with your desired output directory for quantification results.
# -l A: Auto-detect library type (common default for RNA-seq). Adjust if a specific library type is known (e.g., -l ISR for paired-end stranded).
# -p: Number of threads to use (adjust based on available resources).

salmon quant \
  -i /path/to/salmon_index/human_grch38 \
  -l A \
  -1 /path/to/reads/sample_R1.fastq.gz \
  -2 /path/to/reads/sample_R2.fastq.gz \
  --gcBias \
  -o /path/to/output/sample_quant \
  -p 8

View on GitHub

CQN (v1.40.0) was employed to normalize gene-level GC content and length biases.

CQN v1.40.0 GitHub

$ Bash example

# Install R and Bioconductor packages if not already installed
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("cqn")'

# Placeholder for input files. Replace with actual paths.
# gene_counts.tsv: Tab-separated file with gene IDs as row names and sample counts as columns.
#   Example:
#   GeneID  Sample1 Sample2
#   GeneA   100     150
#   GeneB   200     220
#   ...
#
# gene_features.tsv: Tab-separated file with gene IDs, GC_content, and Length columns.
#   These features are typically derived from a reference genome (e.g., hg38)
#   and a gene annotation file (e.g., GENCODE GTF).
#   Example:
#   GeneID  GC_content  Length
#   GeneA   0.45        1500
#   GeneB   0.52        2000
#   ...

# Create an R script for CQN normalization
cat << 'EOF' > run_cqn_normalization.R
library(cqn)

# --- Configuration ---
counts_file <- "gene_counts.tsv" # Path to your raw gene counts matrix
features_file <- "gene_features.tsv" # Path to your gene features (GC content, length)
output_file <- "cqn_normalized_counts.tsv" # Output file for normalized counts

# --- Load Data ---
message(paste("Loading counts from:", counts_file))
counts_matrix <- read.delim(counts_file, row.names = 1, check.names = FALSE)
message(paste("Loading features from:", features_file))
features_df <- read.delim(features_file, row.names = 1, check.names = FALSE)

# Ensure gene IDs match and are in the same order
common_genes <- intersect(rownames(counts_matrix), rownames(features_df))
if (length(common_genes) == 0) {
  stop("No common genes found between counts and features files. Please check gene IDs.")
}
counts_matrix <- counts_matrix[common_genes, ]
features_df <- features_df[common_genes, ]
message(paste("Found", length(common_genes), "common genes for normalization."))

# Extract GC content and lengths
gc_content_vector <- features_df$GC_content
gene_length_vector <- features_df$Length

# Basic validation
if (any(is.na(gc_content_vector)) || any(is.na(gene_length_vector))) {
  stop("NA values found in GC content or gene lengths. Please check your features file.")
}
if (any(gene_length_vector <= 0)) {
  stop("Gene lengths must be positive. Please check your features file.")
}
if (any(counts_matrix < 0)) {
  warning("Negative counts found in the matrix. CQN expects non-negative counts.")
}

# --- Perform CQN Normalization ---
message("Performing CQN normalization...")
# Using total counts as size factors. For more advanced normalization,
# size factors from DESeq2/edgeR could be used for better library size adjustment.
size_factors <- colSums(counts_matrix)

# CQN function expects counts to be non-negative integers.
# If your counts are already normalized or transformed, adjust accordingly.
cqn_res <- cqn(counts = counts_matrix,
               x = gc_content_vector,
               lengths = gene_length_vector,
               sizeFactors = size_factors)

# Get normalized counts (log2 scale)
# The 'y' component is the log2(normalized_counts / sizeFactors)
# The 'offset' component accounts for GC content and length biases.
# Adding the offset to y gives the log2-transformed normalized expression values.
normalized_log2_counts <- cqn_res$y + cqn_res$offset

# Save results
message(paste("Saving normalized counts to:", output_file))
write.table(normalized_log2_counts, output_file, sep = "\t", quote = FALSE, col.names = NA)

message("CQN normalization complete.")
EOF

# Execute the R script
Rscript run_cqn_normalization.R

View on GitHub

Transcript merging to genes was performed using Tximport (v1.22.0).

Tximport v1.22.0 GitHub

$ Bash example

# --- Installation (commented out) ---
# Install R if not already installed
# sudo apt-get update && sudo apt-get install -y r-base

# Install BiocManager and tximport within R
# R -q -e 'install.packages("BiocManager", repos="https://cloud.r-project.org")'
# R -q -e 'BiocManager::install("tximport")'
# R -q -e 'install.packages("readr", repos="https://cloud.r-project.org")' # For write_tsv

# --- Placeholder Data Generation (for demonstration) ---
# In a real pipeline, these would be actual output from quantification tools (e.g., Salmon)
# and a pre-generated tx2gene mapping.

# Create dummy Salmon quantification files
mkdir -p sample1 sample2 sample3
echo -e "Name\tLength\tEffectiveLength\tTPM\tNumReads" > sample1/quant.sf
echo -e "ENST00000000001.1\t1000\t900\t10.5\t100" >> sample1/quant.sf
echo -e "ENST00000000002.1\t1200\t1100\t20.3\t250" >> sample1/quant.sf
echo -e "ENST00000000003.1\t800\t700\t5.1\t50" >> sample1/quant.sf

echo -e "Name\tLength\tEffectiveLength\tTPM\tNumReads" > sample2/quant.sf
echo -e "ENST00000000001.1\t1000\t900\t12.1\t120" >> sample2/quant.sf
echo -e "ENST00000000002.1\t1200\t1100\t18.7\t230" >> sample2/quant.sf
echo -e "ENST00000000003.1\t800\t700\t6.2\t60" >> sample2/quant.sf

echo -e "Name\tLength\tEffectiveLength\tTPM\tNumReads" > sample3/quant.sf
echo -e "ENST00000000001.1\t1000\t900\t9.8\t95" >> sample3/quant.sf
echo -e "ENST00000000002.1\t1200\t1100\t22.5\t270" >> sample3/quant.sf
echo -e "ENST00000000003.1\t800\t700\t4.9\t45" >> sample3/quant.sf

# Create a dummy tx2gene mapping file
# In a real scenario, this would be generated from a GTF/GFF annotation file
# (e.g., using GenomicFeatures::makeTx2gene() in R).
# Reference annotation source: Ensembl, GENCODE, UCSC (e.g., Homo_sapiens.GRCh38.109.gtf)
echo -e "transcript_id\tgene_id" > tx2gene.tsv
echo -e "ENST00000000001.1\tENSG00000000001.1" >> tx2gene.tsv
echo -e "ENST00000000002.1\tENSG00000000001.1" >> tx2gene.tsv # Two transcripts map to one gene
echo -e "ENST00000000003.1\tENSG00000000002.1" >> tx2gene.tsv

# --- R script for Tximport ---
# This script performs the transcript merging to genes
cat << 'EOF' > run_tximport.R
library(tximport)
library(readr) # For write_tsv

# --- Configuration ---
# List of paths to transcript quantification files (e.g., Salmon quant.sf files)
quant_files <- c(
  "sample1/quant.sf",
  "sample2/quant.sf",
  "sample3/quant.sf"
)
names(quant_files) <- c("sample1", "sample2", "sample3")

# Path to the transcript-to-gene mapping file
# This file is typically generated from a reference genome annotation (e.g., GTF/GFF)
# Example reference: Homo_sapiens.GRCh38.109.gtf from Ensembl
tx2gene_file <- "tx2gene.tsv"

# Output file for gene-level counts
output_gene_counts_file <- "gene_level_counts.tsv"

# --- Load tx2gene mapping ---
tx2gene <- read.delim(tx2gene_file, header = TRUE, stringsAsFactors = FALSE)

# --- Perform transcript merging to genes ---
# 'type' should match the quantification tool used (e.g., "salmon", "kallisto", "rsem")
# 'countsFromAbundance' specifies how to get gene-level counts from transcript abundances.
# "lengthScaledTPM" is often recommended for differential expression analysis.
txi <- tximport(
  files = quant_files,
  type = "salmon", # Assuming Salmon was used for quantification
  tx2gene = tx2gene,
  countsFromAbundance = "lengthScaledTPM",
  ignoreTxVersion = TRUE # Useful if transcript IDs in quant files have versions (e.g., ENST... .1)
)

# Extract gene-level counts
gene_counts <- as.data.frame(txi$counts)

# Write gene-level counts to a TSV file
write_tsv(gene_counts, output_gene_counts_file)

message(paste("Gene-level counts saved to:", output_gene_counts_file))
EOF

# --- Execution Command ---
Rscript run_tximport.R

# --- Clean up dummy data (optional) ---
# rm -rf sample1 sample2 sample3 tx2gene.tsv run_tximport.R

View on GitHub

Tools Used

STAR UCSC tools

Raw Source Text

We aligned reads to the human genome version GRCh38 with annotation version Gencode v40 using STAR (v2.7.1a). Gene expression levels were quantified using RSEM (v1.3.0), in which the TPM values were used to calculate Pearson correlation coefficient between replicates. Samples meeting specific criteria were selected for further analysis: only those with unique aligned reads greater than 10M and a Pearson correlation coefficient between replicates of 0.9 or higher were considered. Samples with fewer reads were re-sequenced to ensure an adequate count, while those with lower correlation were repeated. Plus and minus strand bigwig files were generated using bedGraphToBigWig (v2.9). Transcript abundance was quantified using Salmon (v1.9.0) with the âgcBias option. CQN (v1.40.0) was employed to normalize gene-level GC content and length biases. Transcript merging to genes was performed using Tximport (v1.22.0).
Assembly: GRCh38
Supplementary files format and content: Bam files and salmon quant files

← Back to Analysis