GSE175885 Processing Pipeline
Publication
Gain-of-function cardiomyopathic mutations in RBM20 rewire splicing regulation and re-distribute ribonucleoprotein granules within processing bodies.Nature communications (2021) — PMID 34732726
Dataset
GSE175885RNA-Seq of isogenic human iPS cell-derived cardiomyocytes with RBM20 mutations created by genome editing (RiboMinus RNA-Seq)
Processing Steps
Generate Jupyter Notebook-
1
To identify, quantify and annotate circular RNAs from back-splice junctions we first aligned the RNASeq results using TopHat2 on the RiboMinus depleted paired-end RNA-Seq FASTQ files, via the CIRCexplorer2 pipeline, on each individual sample (default options).
$ Bash example
# Install TopHat2 (e.g., via Bioconda) # conda install -c bioconda tophat=2.1.1 # Or a specific version # conda install -c bioconda bowtie2 # Define input and output paths GENOME_INDEX="path/to/hg38_index/hg38" # Replace with your genome index path (e.g., from Ensembl or UCSC for Homo sapiens hg38) GTF_FILE="path/to/hg38.gtf" # Replace with your GTF annotation file path (e.g., from Ensembl or UCSC for Homo sapiens hg38) SAMPLE_ID="sample1" # Replace with actual sample ID READ1_FASTQ="${SAMPLE_ID}_R1.fastq.gz" # Replace with actual R1 FASTQ file READ2_FASTQ="${SAMPLE_ID}_R2.fastq.gz" # Replace with actual R2 FASTQ file OUTPUT_DIR="${SAMPLE_ID}_tophat_out" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run TopHat2 for paired-end alignment with default options tophat2 \ -o "${OUTPUT_DIR}" \ -G "${GTF_FILE}" \ --num-threads 8 \ "${GENOME_INDEX}" \ "${READ1_FASTQ}" \ "${READ2_FASTQ}" -
2
This workflow identified far greater putative circRNAs than using the same pipeline with STAR 2.4.0.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.4.0 # Placeholder for STAR genome index directory (e.g., for human GRCh38) GENOME_DIR="/path/to/STAR_index/GRCh38" # Placeholder for input FASTQ files READ1="sample_R1.fastq.gz" READ2="sample_R2.fastq.gz" # Placeholder for output directory and prefix OUTPUT_PREFIX="circRNA_alignment_output_" # Create output directory if it doesn't exist mkdir -p "$(dirname ${OUTPUT_PREFIX})" # Run STAR alignment for circRNA detection # Parameters like --chimSegmentMin and --chimJunctionOverhangMin are crucial for identifying chimeric junctions indicative of circRNAs. STAR \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1} ${READ2} \ --runThreadN 8 \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 20 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --chimSegmentMin 15 \ --chimJunctionOverhangMin 15 \ --outReadsUnmapped Fastx \ --outSAMattributes NH HI AS NM MD jM jI XS # Note: After STAR alignment, a dedicated circRNA calling tool (e.g., find_circ, CIRCexplorer2, CIRI2) # would typically be used to process the Chimeric.out.junction file generated by STAR # to identify and quantify putative circRNAs. # Example (conceptual, tool not specified in description): # find_circ.py ${OUTPUT_PREFIX}Chimeric.out.junction > circRNAs.bed -
3
For differential circRNA analysis we ran RUVSEQ from the CSBB pipeline (https://github.com/csbbcompbio/CSBB-v3.0) on the back-splice junction counts for each sample for all relevant comparison groups.
$ Bash example
# Install R and Bioconductor if not already present # sudo apt-get update # sudo apt-get install r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('RUVSeq')" # Placeholder for input files: # back_splice_junction_counts.tsv: Tab-separated file with circRNA counts per sample. # sample_metadata.tsv: Tab-separated file with sample information, including comparison groups. # Assuming the CSBB-v3.0 pipeline contains a script for differential circRNA analysis # that incorporates RUVSEQ for normalization. The exact script name and parameters # would be specific to the CSBB implementation. This command represents running # such a script from the pipeline, processing back-splice junction counts for # specified comparison groups. Rscript /path/to/CSBB-v3.0/scripts/run_differential_circRNA_analysis_with_ruvseq.R \ --counts back_splice_junction_counts.tsv \ --metadata sample_metadata.tsv \ --comparison-groups "Control_vs_Treatment" \ --output-dir ruvseq_differential_results -
4
This workflow applies the software EdgeR to perform a differential expression analysis using a generalized linear model approach using the upper quartile normalization.
$ Bash example
# Install edgeR if not already installed # conda install -c bioconda bioconductor-edger # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('edgeR')" # Create a dummy R script for differential expression analysis with edgeR # This script assumes 'counts.tsv' (gene counts) and 'sample_info.tsv' (design matrix) as input. # Replace these with actual file paths. cat << 'EOF' > run_edger_dea.R library(edgeR) # --- Configuration --- # Input files (replace with actual paths) counts_file <- "counts.tsv" sample_info_file <- "sample_info.tsv" output_file <- "edger_differential_expression_results.tsv" design_formula_str <- "~ condition" # Example: adjust based on your sample_info_file columns # --- Load Data --- # Load count matrix counts <- read.delim(counts_file, row.names = 1, check.names = FALSE) # Ensure counts are integers counts <- round(counts) # Load sample information (design matrix) sample_info <- read.delim(sample_info_file, row.names = 1, check.names = FALSE) # Ensure sample order matches between counts and sample_info sample_info <- sample_info[colnames(counts), , drop = FALSE] # --- edgeR Analysis --- # Create DGEList object dge <- DGEList(counts = counts, group = sample_info$condition) # 'condition' is an example column # Filter out lowly expressed genes (optional but recommended) # Keep genes with at least 10 counts per million (CPM) in at least 2 samples keep <- filterByExpr(dge, group = sample_info$condition) dge <- dge[keep, , keep.lib.sizes = FALSE] # Upper quartile normalization dge <- calcNormFactors(dge, method = "upperquartile") # Create design matrix design <- model.matrix(as.formula(design_formula_str), data = sample_info) # Estimate dispersion dge <- estimateDisp(dge, design) # Fit generalized linear model (GLM) using quasi-likelihood F-tests fit <- glmQLFit(dge, design) # Perform differential expression testing # For a simple two-group comparison (e.g., control vs. treated) with design `~ condition`, # the last coefficient typically represents the logFC of the second level vs the first (reference) level. lrt <- glmQLFTest(fit, coef = ncol(design)) # Test the last coefficient # Get differential expression results results <- topTags(lrt, n = Inf, sort.by = "PValue")$table # Write results to file write.table(results, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE) message("Differential expression analysis complete. Results saved to ", output_file) EOF # Create dummy input files for demonstration echo -e "Gene\tSample1_Control\tSample2_Control\tSample3_Treated\tSample4_Treated" > counts.tsv echo -e "GeneA\t100\t120\t500\t550" >> counts.tsv echo -e "GeneB\t50\t60\t30\t35" >> counts.tsv echo -e "GeneC\t200\t210\t220\t230" >> counts.tsv echo -e "GeneD\t10\t12\t100\t110" >> counts.tsv echo -e "Sample\tcondition" > sample_info.tsv echo -e "Sample1_Control\tcontrol" >> sample_info.tsv echo -e "Sample2_Control\tcontrol" >> sample_info.tsv echo -e "Sample3_Treated\ttreated" >> sample_info.tsv echo -e "Sample4_Treated\ttreated" >> sample_info.tsv # Execute the R script Rscript run_edger_dea.R # Clean up dummy files (optional) # rm counts.tsv sample_info.tsv run_edger_dea.R edger_differential_expression_results.tsv -
5
All together, we identified circRNAs for 448 genes, with evidence of at least two reads per circRNA. circRNAs with a non-adjusted EdgeR p<0.1 were considered differentially expressed.
$ Bash example
# Install R and Bioconductor if not already installed # sudo apt-get update # sudo apt-get install r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('edgeR')" # Create a dummy R script for edgeR differential expression cat << 'EOF' > run_edger.R library(edgeR) # --- Configuration --- counts_file <- "circrna_counts.tsv" # Placeholder for circRNA count matrix sample_info_file <- "sample_info.tsv" # Placeholder for sample metadata (e.g., condition) output_file <- "differentially_expressed_circRNAs.tsv" p_value_threshold <- 0.1 # From description: "non-adjusted EdgeR p<0.1" # --- Load Data --- # Assuming counts_file is a tab-separated file with circRNA IDs as row names and sample IDs as column names # The description implies circRNAs with at least two reads are already identified and included here. counts <- read.delim(counts_file, row.names = 1, check.names = FALSE) # Assuming sample_info_file has columns like 'SampleID', 'Condition' sample_info <- read.delim(sample_info_file, row.names = 1, check.names = FALSE) # Ensure sample order matches between counts and sample_info sample_info <- sample_info[colnames(counts), , drop = FALSE] # Create DGEList object dge <- DGEList(counts = counts) # --- Normalization --- dge <- calcNormFactors(dge) # --- Design Matrix --- # Assuming 'Condition' is the column in sample_info that defines groups design <- model.matrix(~0 + sample_info$Condition) colnames(design) <- levels(sample_info$Condition) # --- Estimate Dispersion --- dge <- estimateDisp(dge, design) # --- Fit GLM --- fit <- glmQLFit(dge, design) # Using QLF for more robust error control # --- Define Contrasts (Example: comparing the first two conditions found) --- cond_levels <- levels(sample_info$Condition) if (length(cond_levels) >= 2) { contrast_name <- paste0(cond_levels[2], "_vs_", cond_levels[1]) contrast_formula <- as.formula(paste0(cond_levels[2], " - ", cond_levels[1])) contrast_matrix <- makeContrasts(contrast_formula, levels = design) } else { stop("Not enough conditions for differential expression comparison.") } # --- Perform Test --- qlf_test <- glmQLFTest(fit, contrast = contrast_matrix) # --- Get Results --- results <- topTags(qlf_test, n = Inf, adjust.method = "none")$table # --- Filter by p-value --- # "circRNAs with a non-adjusted EdgeR p<0.1 were considered differentially expressed." deg_circRNAs <- results[results$PValue < p_value_threshold, ] # --- Save Results --- write.table(deg_circRNAs, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE) message(paste("Differentially expressed circRNAs (p <", p_value_threshold, ") saved to", output_file)) EOF # Create dummy input files for demonstration # circrna_counts.tsv: Assumes circRNAs with at least two reads are already included. echo -e "circRNA_ID\tSample1_Control\tSample2_Control\tSample3_Treated\tSample4_Treated" > circrna_counts.tsv echo -e "circRNA_1\t10\t12\t50\t55" >> circrna_counts.tsv echo -e "circRNA_2\t100\t110\t105\t115" >> circrna_counts.tsv echo -e "circRNA_3\t5\t6\t8\t9" >> circrna_counts.tsv echo -e "circRNA_4\t20\t22\t10\t11" >> circrna_counts.tsv echo -e "circRNA_5\t2\t3\t2\t3" >> circrna_counts.tsv # Example meeting "at least two reads" echo -e "SampleID\tCondition" > sample_info.tsv echo -e "Sample1_Control\tControl" >> sample_info.tsv echo -e "Sample2_Control\tControl" >> sample_info.tsv echo -e "Sample3_Treated\tTreated" >> sample_info.tsv echo -e "Sample4_Treated\tTreated" >> sample_info.tsv # Execute the R script Rscript run_edger.R
Raw Source Text
To identify, quantify and annotate circular RNAs from back-splice junctions we first aligned the RNASeq results using TopHat2 on the RiboMinus depleted paired-end RNA-Seq FASTQ files, via the CIRCexplorer2 pipeline, on each individual sample (default options). This workflow identified far greater putative circRNAs than using the same pipeline with STAR 2.4.0. For differential circRNA analysis we ran RUVSEQ from the CSBB pipeline (https://github.com/csbbcompbio/CSBB-v3.0) on the back-splice junction counts for each sample for all relevant comparison groups. This workflow applies the software EdgeR to perform a differential expression analysis using a generalized linear model approach using the upper quartile normalization. All together, we identified circRNAs for 448 genes, with evidence of at least two reads per circRNA. circRNAs with a non-adjusted EdgeR p<0.1 were considered differentially expressed. Genome_build: hg19 Supplementary_files_format_and_content: tab-delimited-text Supplementary_files_format_and_content: CIRCexplorer2 back-splice junction read counts