GSE175885 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Gain-of-function cardiomyopathic mutations in RBM20 rewire splicing regulation and re-distribute ribonucleoprotein granules within processing bodies.

Nature communications (2021) — PMID 34732726

Dataset

GSE175885

RNA-Seq of isogenic human iPS cell-derived cardiomyocytes with RBM20 mutations created by genome editing (RiboMinus RNA-Seq)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

To identify, quantify and annotate circular RNAs from back-splice junctions we first aligned the RNASeq results using TopHat2 on the RiboMinus depleted paired-end RNA-Seq FASTQ files, via the CIRCexplorer2 pipeline, on each individual sample (default options).

TopHat v2 GitHub

$ Bash example

# Install TopHat2 (e.g., via Bioconda)
# conda install -c bioconda tophat=2.1.1 # Or a specific version
# conda install -c bioconda bowtie2

# Define input and output paths
GENOME_INDEX="path/to/hg38_index/hg38" # Replace with your genome index path (e.g., from Ensembl or UCSC for Homo sapiens hg38)
GTF_FILE="path/to/hg38.gtf" # Replace with your GTF annotation file path (e.g., from Ensembl or UCSC for Homo sapiens hg38)
SAMPLE_ID="sample1" # Replace with actual sample ID
READ1_FASTQ="${SAMPLE_ID}_R1.fastq.gz" # Replace with actual R1 FASTQ file
READ2_FASTQ="${SAMPLE_ID}_R2.fastq.gz" # Replace with actual R2 FASTQ file
OUTPUT_DIR="${SAMPLE_ID}_tophat_out"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run TopHat2 for paired-end alignment with default options
tophat2 \
  -o "${OUTPUT_DIR}" \
  -G "${GTF_FILE}" \
  --num-threads 8 \
  "${GENOME_INDEX}" \
  "${READ1_FASTQ}" \
  "${READ2_FASTQ}"

View on GitHub

This workflow identified far greater putative circRNAs than using the same pipeline with STAR 2.4.0.

STAR v2.4.0 GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.4.0

# Placeholder for STAR genome index directory (e.g., for human GRCh38)
GENOME_DIR="/path/to/STAR_index/GRCh38"

# Placeholder for input FASTQ files
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz"

# Placeholder for output directory and prefix
OUTPUT_PREFIX="circRNA_alignment_output_"

# Create output directory if it doesn't exist
mkdir -p "$(dirname ${OUTPUT_PREFIX})"

# Run STAR alignment for circRNA detection
# Parameters like --chimSegmentMin and --chimJunctionOverhangMin are crucial for identifying chimeric junctions indicative of circRNAs.
STAR \
  --genomeDir ${GENOME_DIR} \
  --readFilesIn ${READ1} ${READ2} \
  --runThreadN 8 \
  --outFileNamePrefix ${OUTPUT_PREFIX} \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterMultimapNmax 20 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --chimSegmentMin 15 \
  --chimJunctionOverhangMin 15 \
  --outReadsUnmapped Fastx \
  --outSAMattributes NH HI AS NM MD jM jI XS

# Note: After STAR alignment, a dedicated circRNA calling tool (e.g., find_circ, CIRCexplorer2, CIRI2) 
# would typically be used to process the Chimeric.out.junction file generated by STAR 
# to identify and quantify putative circRNAs.
# Example (conceptual, tool not specified in description):
# find_circ.py ${OUTPUT_PREFIX}Chimeric.out.junction > circRNAs.bed

View on GitHub

For differential circRNA analysis we ran RUVSEQ from the CSBB pipeline (https://github.com/csbbcompbio/CSBB-v3.0) on the back-splice junction counts for each sample for all relevant comparison groups.

RUVSEQ vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Bioconductor if not already present
# sudo apt-get update
# sudo apt-get install r-base
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('RUVSeq')"

# Placeholder for input files:
# back_splice_junction_counts.tsv: Tab-separated file with circRNA counts per sample.
# sample_metadata.tsv: Tab-separated file with sample information, including comparison groups.

# Assuming the CSBB-v3.0 pipeline contains a script for differential circRNA analysis
# that incorporates RUVSEQ for normalization. The exact script name and parameters
# would be specific to the CSBB implementation. This command represents running
# such a script from the pipeline, processing back-splice junction counts for
# specified comparison groups.

Rscript /path/to/CSBB-v3.0/scripts/run_differential_circRNA_analysis_with_ruvseq.R \
  --counts back_splice_junction_counts.tsv \
  --metadata sample_metadata.tsv \
  --comparison-groups "Control_vs_Treatment" \
  --output-dir ruvseq_differential_results

View on GitHub

This workflow applies the software EdgeR to perform a differential expression analysis using a generalized linear model approach using the upper quartile normalization.

edgeR v3.44.1 GitHub

$ Bash example

# Install edgeR if not already installed
# conda install -c bioconda bioconductor-edger
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('edgeR')"

# Create a dummy R script for differential expression analysis with edgeR
# This script assumes 'counts.tsv' (gene counts) and 'sample_info.tsv' (design matrix) as input.
# Replace these with actual file paths.

cat << 'EOF' > run_edger_dea.R
library(edgeR)

# --- Configuration ---
# Input files (replace with actual paths)
counts_file <- "counts.tsv"
sample_info_file <- "sample_info.tsv"
output_file <- "edger_differential_expression_results.tsv"
design_formula_str <- "~ condition" # Example: adjust based on your sample_info_file columns

# --- Load Data ---
# Load count matrix
counts <- read.delim(counts_file, row.names = 1, check.names = FALSE)
# Ensure counts are integers
counts <- round(counts)

# Load sample information (design matrix)
sample_info <- read.delim(sample_info_file, row.names = 1, check.names = FALSE)

# Ensure sample order matches between counts and sample_info
sample_info <- sample_info[colnames(counts), , drop = FALSE]

# --- edgeR Analysis ---

# Create DGEList object
dge <- DGEList(counts = counts, group = sample_info$condition) # 'condition' is an example column

# Filter out lowly expressed genes (optional but recommended)
# Keep genes with at least 10 counts per million (CPM) in at least 2 samples
keep <- filterByExpr(dge, group = sample_info$condition)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# Upper quartile normalization
dge <- calcNormFactors(dge, method = "upperquartile")

# Create design matrix
design <- model.matrix(as.formula(design_formula_str), data = sample_info)

# Estimate dispersion
dge <- estimateDisp(dge, design)

# Fit generalized linear model (GLM) using quasi-likelihood F-tests
fit <- glmQLFit(dge, design)

# Perform differential expression testing
# For a simple two-group comparison (e.g., control vs. treated) with design `~ condition`,
# the last coefficient typically represents the logFC of the second level vs the first (reference) level.
lrt <- glmQLFTest(fit, coef = ncol(design)) # Test the last coefficient

# Get differential expression results
results <- topTags(lrt, n = Inf, sort.by = "PValue")$table

# Write results to file
write.table(results, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message("Differential expression analysis complete. Results saved to ", output_file)
EOF

# Create dummy input files for demonstration
echo -e "Gene\tSample1_Control\tSample2_Control\tSample3_Treated\tSample4_Treated" > counts.tsv
echo -e "GeneA\t100\t120\t500\t550" >> counts.tsv
echo -e "GeneB\t50\t60\t30\t35" >> counts.tsv
echo -e "GeneC\t200\t210\t220\t230" >> counts.tsv
echo -e "GeneD\t10\t12\t100\t110" >> counts.tsv

echo -e "Sample\tcondition" > sample_info.tsv
echo -e "Sample1_Control\tcontrol" >> sample_info.tsv
echo -e "Sample2_Control\tcontrol" >> sample_info.tsv
echo -e "Sample3_Treated\ttreated" >> sample_info.tsv
echo -e "Sample4_Treated\ttreated" >> sample_info.tsv

# Execute the R script
Rscript run_edger_dea.R

# Clean up dummy files (optional)
# rm counts.tsv sample_info.tsv run_edger_dea.R edger_differential_expression_results.tsv

View on GitHub

All together, we identified circRNAs for 448 genes, with evidence of at least two reads per circRNA. circRNAs with a non-adjusted EdgeR p<0.1 were considered differentially expressed.

edgeR v0.1 GitHub

$ Bash example

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('edgeR')"

# Create a dummy R script for edgeR differential expression
cat << 'EOF' > run_edger.R
library(edgeR)

# --- Configuration ---
counts_file <- "circrna_counts.tsv" # Placeholder for circRNA count matrix
sample_info_file <- "sample_info.tsv" # Placeholder for sample metadata (e.g., condition)
output_file <- "differentially_expressed_circRNAs.tsv"
p_value_threshold <- 0.1 # From description: "non-adjusted EdgeR p<0.1"

# --- Load Data ---
# Assuming counts_file is a tab-separated file with circRNA IDs as row names and sample IDs as column names
# The description implies circRNAs with at least two reads are already identified and included here.
counts <- read.delim(counts_file, row.names = 1, check.names = FALSE)

# Assuming sample_info_file has columns like 'SampleID', 'Condition'
sample_info <- read.delim(sample_info_file, row.names = 1, check.names = FALSE)

# Ensure sample order matches between counts and sample_info
sample_info <- sample_info[colnames(counts), , drop = FALSE]

# Create DGEList object
dge <- DGEList(counts = counts)

# --- Normalization ---
dge <- calcNormFactors(dge)

# --- Design Matrix ---
# Assuming 'Condition' is the column in sample_info that defines groups
design <- model.matrix(~0 + sample_info$Condition)
colnames(design) <- levels(sample_info$Condition)

# --- Estimate Dispersion ---
dge <- estimateDisp(dge, design)

# --- Fit GLM ---
fit <- glmQLFit(dge, design) # Using QLF for more robust error control

# --- Define Contrasts (Example: comparing the first two conditions found) ---
cond_levels <- levels(sample_info$Condition)
if (length(cond_levels) >= 2) {
    contrast_name <- paste0(cond_levels[2], "_vs_", cond_levels[1])
    contrast_formula <- as.formula(paste0(cond_levels[2], " - ", cond_levels[1]))
    contrast_matrix <- makeContrasts(contrast_formula, levels = design)
} else {
    stop("Not enough conditions for differential expression comparison.")
}

# --- Perform Test ---
qlf_test <- glmQLFTest(fit, contrast = contrast_matrix)

# --- Get Results ---
results <- topTags(qlf_test, n = Inf, adjust.method = "none")$table

# --- Filter by p-value ---
# "circRNAs with a non-adjusted EdgeR p<0.1 were considered differentially expressed."
deg_circRNAs <- results[results$PValue < p_value_threshold, ]

# --- Save Results ---
write.table(deg_circRNAs, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("Differentially expressed circRNAs (p <", p_value_threshold, ") saved to", output_file))
EOF

# Create dummy input files for demonstration
# circrna_counts.tsv: Assumes circRNAs with at least two reads are already included.
echo -e "circRNA_ID\tSample1_Control\tSample2_Control\tSample3_Treated\tSample4_Treated" > circrna_counts.tsv
echo -e "circRNA_1\t10\t12\t50\t55" >> circrna_counts.tsv
echo -e "circRNA_2\t100\t110\t105\t115" >> circrna_counts.tsv
echo -e "circRNA_3\t5\t6\t8\t9" >> circrna_counts.tsv
echo -e "circRNA_4\t20\t22\t10\t11" >> circrna_counts.tsv
echo -e "circRNA_5\t2\t3\t2\t3" >> circrna_counts.tsv # Example meeting "at least two reads"

echo -e "SampleID\tCondition" > sample_info.tsv
echo -e "Sample1_Control\tControl" >> sample_info.tsv
echo -e "Sample2_Control\tControl" >> sample_info.tsv
echo -e "Sample3_Treated\tTreated" >> sample_info.tsv
echo -e "Sample4_Treated\tTreated" >> sample_info.tsv

# Execute the R script
Rscript run_edger.R

View on GitHub

Tools Used

TopHat STAR

Raw Source Text

To identify, quantify and annotate circular RNAs from back-splice junctions we first aligned the RNASeq results using TopHat2 on the RiboMinus depleted paired-end RNA-Seq FASTQ files, via the CIRCexplorer2 pipeline, on each individual sample (default options). This workflow identified far greater putative circRNAs than using the same pipeline with STAR 2.4.0. For differential circRNA analysis we ran RUVSEQ from the CSBB pipeline (https://github.com/csbbcompbio/CSBB-v3.0) on the back-splice junction counts for each sample for all relevant comparison groups. This workflow applies the software EdgeR to perform a differential expression analysis using a generalized linear model approach using the upper quartile normalization. All together, we identified circRNAs for 448 genes, with evidence of at least two reads per circRNA. circRNAs with a non-adjusted EdgeR p<0.1 were considered differentially expressed.
Genome_build: hg19
Supplementary_files_format_and_content: tab-delimited-text
Supplementary_files_format_and_content: CIRCexplorer2 back-splice junction read counts

← Back to Analysis