GSE145984 Processing Pipeline

RNA-Seq, code examples, 5 steps

Publication

Transcriptome-wide profiles of circular RNA and RNA-binding protein interactions reveal effects on circular RNA biogenesis and cancer pathway expression.

Genome Medicine (2020), PMID 33287884

Dataset

GSE145984

Total RNA-Seq of KHSRP knockdown (KD) and control samples in HepG2 and K562

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Reads were trimmed with Trim Galore v0.4.1 and cutadapt v1.15

cutadapt v1.15

$ Bash example

# Install cutadapt v1.15
# conda install -c bioconda cutadapt=1.15

# Example command for trimming paired-end reads: adapter removal plus quality filtering.
# The source does not state the parameters used; the values below are assumptions
# that match the common Illumina defaults.
# Replace 'reads_R1.fastq.gz' and 'reads_R2.fastq.gz' with your actual input files.
# Replace 'trimmed_reads_R1.fastq.gz' and 'trimmed_reads_R2.fastq.gz' with your desired output files.
# -a AGATCGGAAGAGC: 3' adapter sequence for read 1 (Illumina universal adapter)
# -A AGATCGGAAGAGC: 3' adapter sequence for read 2
# -q 20,20: Quality-trim the 5' and 3' ends with a threshold of 20
# --minimum-length 20: Discard read pairs if a read is shorter than 20 bp after trimming
# -o: Output file for read 1
# -p: Output file for read 2
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20,20 --minimum-length 20 -o trimmed_reads_R1.fastq.gz -p trimmed_reads_R2.fastq.gz reads_R1.fastq.gz reads_R2.fastq.gz
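
The source step also names Trim Galore v0.4.1, which wraps cutadapt. A minimal sketch of what an equivalent paired-end Trim Galore call might look like follows. The thresholds mirror the cutadapt example above and are assumptions, since the source does not state them. The file names, the output directory, and the `run` helper are hypothetical. By default the sketch only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Run a command, or only print it when DRY_RUN=1 (the default in this sketch).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

R1="reads_R1.fastq.gz"   # placeholder input, read 1
R2="reads_R2.fastq.gz"   # placeholder input, read 2
OUT_DIR="trimmed"        # placeholder output directory

# Create the output directory up front so the call does not depend on the tool doing it.
run mkdir -p "${OUT_DIR}"

# --paired keeps both mates in sync; --quality and --length are assumed thresholds.
run trim_galore --paired --quality 20 --length 20 --output_dir "${OUT_DIR}" "${R1}" "${R2}"
```

Trim Galore detects the adapter automatically when none is given, which is why no adapter sequence appears in this sketch.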

Mapping to hg19 was performed with bwa v0.7.17 and samtools v1.9

samtools v1.9

$ Bash example

# Define placeholder paths
REF_GENOME="path/to/hg19.fa" # Placeholder for hg19 reference genome
BWA_OUTPUT_SAM="bwa_aligned.sam" # Placeholder for SAM file output by bwa
OUTPUT_PREFIX="sample_aligned"
SORTED_BAM="${OUTPUT_PREFIX}.sorted.bam"

# Install samtools v1.9 (commented out)
# conda install -c bioconda samtools=1.9

# Convert SAM to BAM, sort, and save.
# This command assumes 'bwa_aligned.sam' is the SAM output from bwa.
# Keep the SAM file if you plan to run CIRI2, which reads SAM input directly.
# If bwa output were piped instead, it would look like this (paired-end reads assumed):
# bwa mem -t 8 "${REF_GENOME}" reads_R1.fastq.gz reads_R2.fastq.gz | samtools view -b - | samtools sort -o "${SORTED_BAM}"
samtools view -b "${BWA_OUTPUT_SAM}" | samtools sort -o "${SORTED_BAM}"

# Index the sorted BAM file
samtools index "${SORTED_BAM}"
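
The samtools example above starts from an existing SAM file. As a rough sketch of the bwa v0.7.17 step that could produce it, the block below builds a `bwa mem` command for paired-end reads. The `-T 19` score threshold is the setting commonly suggested for CIRI2 input and is an assumption here, because the source does not state the mapping parameters. All paths, the thread count, and the BWA_CMD_PREVIEW variable are hypothetical. The command is only printed unless DRY_RUN=0.

```shell
REF_GENOME="path/to/hg19.fa"        # placeholder hg19 FASTA
R1="trimmed_reads_R1.fastq.gz"      # placeholder trimmed read 1
R2="trimmed_reads_R2.fastq.gz"      # placeholder trimmed read 2
OUT_SAM="bwa_aligned.sam"           # SAM output, later read by samtools and CIRI2
THREADS=8                           # assumed thread count

# The reference must be indexed once before mapping:
# bwa index "${REF_GENOME}"

# Hold the command in the positional parameters so every argument stays quoted.
set -- bwa mem -T 19 -t "${THREADS}" "${REF_GENOME}" "${R1}" "${R2}"
BWA_CMD_PREVIEW="$*"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Print the command instead of running it.
  echo "${BWA_CMD_PREVIEW} > ${OUT_SAM}"
else
  "$@" > "${OUT_SAM}"
fi
```

Writing plain SAM rather than piping straight into samtools keeps a file that CIRI2 can read without conversion.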

We used the CIRI2 pipeline v2.0.6 to detect circRNAs.

CIRI2 v2.0.6

$ Bash example

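# Obtain CIRI2 v2.0.6 (if not already available).
# CIRI2 is distributed as a standalone Perl script (CIRI2.pl); download it from the
# CIRI project page and make sure Perl is installed.

# Example command for CIRI2 to detect circRNAs.
# CIRI2 takes a SAM file produced by BWA-MEM as input (not a BAM file from STAR or HISAT2),
# plus the reference FASTA used for mapping and, optionally, a GTF annotation.
# Replace 'bwa_aligned.sam', 'circrna_output.txt', 'genome.fasta', and 'genes.gtf' with your actual file paths.
# -I: Input SAM file
# -O: Output file listing the detected circRNAs
# -F: Reference genome FASTA
# -A: Gene annotation file (GTF)

perl CIRI2.pl -I bwa_aligned.sam -O circrna_output.txt -F genome.fasta -A genes.gtf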

The CIRI2 pipeline was run with a gene transfer format (GTF) file (hg19) to annotate the overlapping gene of the circRNAs

CIRI2 (inferred with models/gemini-2.5-flash; version not specified)

$ Bash example

# Obtain CIRI2 (a standalone Perl script, CIRI2.pl) from the CIRI project page.
# Perl is the only runtime it needs.

# Download hg19 GTF (GRCh37) if not already present
# mkdir -p ref_data
# wget -O ref_data/Homo_sapiens.GRCh37.75.gtf.gz ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip -f ref_data/Homo_sapiens.GRCh37.75.gtf.gz
GTF_FILE="ref_data/Homo_sapiens.GRCh37.75.gtf"

# Placeholders for the BWA-MEM SAM file that CIRI2 scans for back-splice junctions
# and for the reference FASTA used during mapping.
# Chromosome names in the FASTA and the GTF must match (Ensembl uses '1', UCSC hg19 uses 'chr1').
INPUT_SAM="bwa_aligned.sam"
REF_GENOME="path/to/hg19.fa"
OUTPUT_FILE="circrna_annotation_hg19.txt"

# Run CIRI2 to identify circRNAs and annotate their overlapping genes using the hg19 GTF.
# -I: Input SAM file from BWA-MEM
# -F: Reference genome FASTA
# -A: Gene annotation file (GTF format) for annotating circRNAs with overlapping genes
# -O: Output file
perl CIRI2.pl -I "${INPUT_SAM}" -F "${REF_GENOME}" -A "${GTF_FILE}" -O "${OUTPUT_FILE}"
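
To connect the annotated CIRI2 output to the normalization step, the per-circRNA junction read counts and gene IDs have to be pulled out of the CIRI2 result file. The sketch below does this with awk and looks columns up by header name. It assumes the header contains columns named circRNA_ID, #junction_reads, and gene_id, which should be checked against your own output. The function name and file names are hypothetical.

```shell
# Print circRNA ID, overlapping gene ID, and junction read count from a CIRI2 result file.
# Columns are located by header name, so their position in the file does not matter.
extract_junction_counts() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("circRNA_ID" in col) || !("#junction_reads" in col) || !("gene_id" in col)) {
        print "expected header columns not found" > "/dev/stderr"
        exit 1
      }
      print "circRNA_ID", "gene_id", "junction_reads"
      next
    }
    { print $(col["circRNA_ID"]), $(col["gene_id"]), $(col["#junction_reads"]) }
  ' "$1"
}

CIRI_OUT="circrna_annotation_hg19.txt"   # placeholder CIRI2 output file

# Only run when the placeholder file actually exists.
if [ -f "${CIRI_OUT}" ]; then
  extract_junction_counts "${CIRI_OUT}" > sample_junction_counts.tsv
fi
```

Running this once per sample and joining the results on circRNA_ID would give a count table of the shape the normalization step expects.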

Expression values of circRNAs were normalized to total reads

R v4.2.0 (inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and necessary packages if not already installed
# conda create -n r_env r-base=4.2.0 r-essentials r-recommended -y
# conda activate r_env
# R -e "install.packages(c('readr', 'dplyr', 'tibble'), repos='http://cran.us.r-project.org')"

# Input file with raw circRNA counts (e.g., from a circRNA quantification tool like CIRI2, DCC, etc.)
# This file should have circRNA IDs in the first column and raw counts for samples in subsequent columns.
# Example:
# circRNA_ID\tSample1\tSample2
# circ_1\t100\t150
# circ_2\t50\t75
INPUT_COUNTS="circRNA_raw_counts.tsv"
OUTPUT_NORMALIZED="circRNA_normalized_cpm.tsv"

# Reference genome used in the upstream circRNA detection (hg19 for this dataset).
# The normalization step itself does not use a reference genome.
REFERENCE_GENOME="hg19"

# R script to perform normalization to Counts Per Million (CPM).
# This implements "normalized to total reads" by dividing by a per-sample total and scaling.
# The heredoc delimiter is quoted so the shell does not expand anything inside the R code.
R_SCRIPT=$(cat <<'EOF'
library(readr)
library(dplyr)
library(tibble)

input_file <- Sys.getenv("INPUT_COUNTS")
output_file <- Sys.getenv("OUTPUT_NORMALIZED")

# Read counts data
counts_df <- read_tsv(input_file)

# Assuming the first column is circRNA ID and subsequent columns are raw counts for samples
circRNA_ids <- counts_df[[1]]
counts_matrix <- as.matrix(counts_df[,-1])
rownames(counts_matrix) <- circRNA_ids

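# Calculate a per-sample total (column sums of the circRNA count matrix).
# Note: this uses the summed circRNA counts as the library size. The source only says
# "normalized to total reads"; if that means total sequenced or mapped reads per sample,
# substitute those totals here instead.
total_reads_per_sample <- colSums(counts_matrix)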

# Normalize to total reads (CPM - Counts Per Million)
# Divide each count by the total reads for its sample, then multiply by 1,000,000
normalized_counts_cpm <- t(t(counts_matrix) / total_reads_per_sample) * 1e6

# Convert the normalized matrix back to a data frame and add circRNA IDs as a column
normalized_df <- as.data.frame(normalized_counts_cpm)
normalized_df <- normalized_df %>%
  rownames_to_column(var = "circRNA_ID")

# Write the normalized counts to an output file
write_tsv(normalized_df, output_file)
EOF
)

# Execute the R script
INPUT_COUNTS="${INPUT_COUNTS}" OUTPUT_NORMALIZED="${OUTPUT_NORMALIZED}" Rscript -e "${R_SCRIPT}"
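
The source says only that expression values were "normalized to total reads" and does not say whether that means sequenced or mapped reads. If the per-sample read totals are known (for example, taken from `samtools flagstat` on each BAM file), CPM can be computed against those totals rather than against the circRNA column sums. The sketch below is one way to do this with awk. The function name, both file names, and the two-column totals format (sample name, tab, total reads, no header) are assumptions made for illustration.

```shell
# normalize_cpm COUNTS_TSV TOTALS_TSV
# COUNTS_TSV: header row with an ID column followed by sample names, then raw counts.
# TOTALS_TSV: one line per sample, 'sample<TAB>total_reads', no header.
normalize_cpm() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { total[$1] = $2; next }      # first file: per-sample read totals
    FNR == 1 {                              # header of the counts table
      for (i = 2; i <= NF; i++) {
        if (!($i in total) || total[$i] <= 0) {
          print "no usable read total for sample " $i > "/dev/stderr"
          exit 1
        }
        name[i] = $i
      }
      print
      next
    }
    {
      # CPM = count / total reads of that sample * 1,000,000
      for (i = 2; i <= NF; i++) $i = sprintf("%.4f", $i / total[name[i]] * 1e6)
      print
    }
  ' "$2" "$1"
}

COUNTS="circRNA_raw_counts.tsv"          # placeholder raw count table
TOTALS="total_reads_per_sample.tsv"      # placeholder read totals

# Only run when both placeholder files exist.
if [ -f "${COUNTS}" ] && [ -f "${TOTALS}" ]; then
  normalize_cpm "${COUNTS}" "${TOTALS}" > circRNA_normalized_cpm.tsv
fi
```

Because the totals come from outside the count table, the CPM values stay comparable between samples even when the number of detected circRNAs differs widely.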

Raw Source Text

Reads were trimmed with Trim Galore v0.4.1 and cutadapt v1.15
Mapping to hg19 was performed with bwa v0.7.17 and samtools v1.9
We used the CIRI2 pipeline v2.0.6 to detect circRNAs. The CIRI2 pipeline was run with a gene transfer format (GTF) file (hg19) to annotate the overlapping gene of the circRNAs
Expression values of circRNAs were normalized to total reads
Genome_build: hg19
Supplementary_files_format_and_content: tab-delimited text file include counts per million (CPM) values for each circRNA in each sample
