GSE86791 Processing Pipeline

RNA-Seq code_examples 16 steps

Publication

RNA-binding protein CPEB1 remodels host and viral RNA landscapes.

Nature structural & molecular biology (2016) — PMID 27775709

Dataset

RNA binding protein CPEB1 remodels host and viral RNA landscapes [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

RNA-seq data was aligned to the human hg19 genome build using Olego and Star aligners, and alternative splicing and alternative polyadenylation were estimated as described below.

RNA-seq v2.7.10a (Inferred with models/gemini-2.5-flash), 1.0.6 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# --- Reference Genome Preparation ---
# Create a directory for reference files
mkdir -p reference/hg19

# Download human hg19 genome FASTA from UCSC
# wget -P reference/hg19 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip -f reference/hg19/hg19.fa.gz
HG19_FASTA="reference/hg19/hg19.fa"

# Download human GRCh37.75 GTF (equivalent to hg19) from Ensembl
# wget -P reference/hg19 ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip -f reference/hg19/Homo_sapiens.GRCh37.75.gtf.gz
HG19_GTF="reference/hg19/Homo_sapiens.GRCh37.75.gtf"

# --- STAR Aligner ---
# Installation (example using Bioconda)
# conda install -c bioconda star=2.7.10a

# Create STAR genome index
STAR_INDEX_DIR="star_index_hg19"
mkdir -p "${STAR_INDEX_DIR}"
STAR --runMode genomeGenerate \
     --genomeDir "${STAR_INDEX_DIR}" \
     --genomeFastaFiles "${HG19_FASTA}" \
     --sjdbGTFfile "${HG19_GTF}" \
     --runThreadN 8 # Adjust thread count as needed

# Align RNA-seq reads using STAR
# Assuming paired-end reads: sample_R1.fastq.gz and sample_R2.fastq.gz
# Replace with actual input file paths
INPUT_READ1="sample_R1.fastq.gz"
INPUT_READ2="sample_R2.fastq.gz"
OUTPUT_PREFIX="star_aligned_sample"

STAR --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${INPUT_READ1}" "${INPUT_READ2}" \
     --readFilesCommand zcat \
     --runThreadN 8 \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --quantMode GeneCounts \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000

# --- Olego Aligner ---
# Installation (example using Bioconda)
# conda install -c bioconda olego=1.0.6

# Create Olego genome index
OLEGO_INDEX_DIR="olego_index_hg19"
mkdir -p "${OLEGO_INDEX_DIR}"
olego index -f "${HG19_FASTA}" -g "${HG19_GTF}" -o "${OLEGO_INDEX_DIR}"

# Align RNA-seq reads using Olego
# Assuming paired-end reads: sample_R1.fastq.gz and sample_R2.fastq.gz
# Replace with actual input file paths
OLEGO_OUTPUT_BAM="olego_aligned_sample.bam"

olego align -i "${OLEGO_INDEX_DIR}" \
            -1 "${INPUT_READ1}" \
            -2 "${INPUT_READ2}" \
            -o "${OLEGO_OUTPUT_BAM}" \
            -t 8 # Adjust thread count as needed

View on GitHub

Quantas software as described in Charizanis et al, 2012, Neuron, was used to estimate alternative splicing.

Quantas software v2012 GitHub

$ Bash example

# Quantas software, as described in Charizanis et al, 2012, Neuron, is a custom tool
# developed for estimating alternative splicing from RNA-seq data.
# Specific installation instructions or a public repository for Quantas are not readily available.
# The following command is a generic placeholder based on typical alternative splicing quantification workflows.

# Define placeholder paths for input and reference files
# Replace with actual paths to your aligned RNA-seq BAM file and genome annotation GTF.
INPUT_BAM="path/to/your_aligned_reads.bam"
GENOME_GTF="path/to/human_hg38.gtf" # Example: GRCh38/hg38 genome annotation

# Placeholder command for Quantas execution.
# The actual command would depend on the specific implementation of the Quantas script.
# It likely takes aligned reads (BAM) and a genome annotation (GTF) as input.
# The output would typically be a table of splicing event quantification.
python /path/to/quantas_script.py \
    --input "$INPUT_BAM" \
    --gtf "$GENOME_GTF" \
    --output "quantas_alternative_splicing_results.tsv" \
    --log "quantas.log"

View on GitHub

Olego aligned alignment files were used to count observed junction reads for each exon.

rMATS (Inferred with models/gemini-2.5-flash) v4.0.2 GitHub

$ Bash example

# Install rMATS (if not already installed)
# conda install -c bioconda rmats-turbo

# Define input and output paths
INPUT_BAM="olego_aligned.bam" # Assuming a single Olego aligned BAM file
GTF_FILE="gencode.v38.annotation.gtf" # Placeholder for human annotation (e.g., from GENCODE)
READ_LENGTH=100 # Inferring a common read length, adjust if known
OUTPUT_DIR="rmats_junction_counts"
TMP_DIR="rmats_tmp"
NUM_THREADS=8

# Create output and temporary directories
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${TMP_DIR}"

# Run rMATS to count junction reads for exons.
# rMATS is primarily for differential splicing, but its output includes raw counts
# for junction reads supporting various splicing events (e.g., exon inclusion/exclusion).
# For a single sample, we can provide the same BAM file for both b1 and b2 to generate these counts.
rmats.py \
    --b1 "${INPUT_BAM}" \
    --b2 "${INPUT_BAM}" \
    --gtf "${GTF_FILE}" \
    --readLength "${READ_LENGTH}" \
    --tmp "${TMP_DIR}" \
    --nthread "${NUM_THREADS}" \
    --od "${OUTPUT_DIR}"

View on GitHub

Weighted number of exon or exon-junction fragments uniquely supporting the inclusion or skipping isoform of each cassette exon and a probability score was assigned to each isoform.

Skipper (Inferred with models/gemini-2.5-flash) v0.1.0 GitHub

$ Bash example

# Install Skipper (if not already installed)
# pip install skipper

# Define input files and reference data
INPUT_BAM="aligned_reads.bam"
ANNOTATION_GTF="gencode.v44.annotation.gtf" # Placeholder for latest human GTF (e.g., GRCh38)
GENOME_FASTA="GRCh38.primary_assembly.genome.fa" # Placeholder for latest human genome FASTA
OUTPUT_DIR="skipper_quantification_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run Skipper to quantify alternative splicing events
# The description implies quantification of cassette exons (SE events) and assigning probability scores.
# Skipper's 'quantify' command handles this by default for various event types, including SE.
skipper quantify \
    "${INPUT_BAM}" \
    "${ANNOTATION_GTF}" \
    --genome "${GENOME_FASTA}" \
    --output "${OUTPUT_DIR}"

View on GitHub

A Fisherâs exact test was used to evaluate the statistical significance of splicing changes using both exon and exon-junction fragments, followed by Benjamini multiple testing correction to estimate the false discovery rate (FDR).

rMATS (Inferred with models/gemini-2.5-flash) v4.1.2 GitHub

$ Bash example

# Install rMATS via conda
# conda create -n rmats_env python=3.8
# conda activate rmats_env
# conda install -c bioconda rmats=4.1.2

# Define input and output paths
# Placeholder: Replace with actual comma-separated BAM files for control replicates
CONTROL_BAMS="control_rep1.bam,control_rep2.bam"
# Placeholder: Replace with actual comma-separated BAM files for treatment replicates
TREATMENT_BAMS="treatment_rep1.bam,treatment_rep2.bam"
# Placeholder: Use a relevant genome annotation GTF file (e.g., from GENCODE for GRCh38)
GTF_FILE="gencode.v38.annotation.gtf"
OUTPUT_DIR="rmats_output"
TMP_DIR="rmats_tmp"
READ_LENGTH=75 # Placeholder: Adjust based on the actual read length of your sequencing data
NUM_THREADS=8 # Placeholder: Adjust based on available CPU cores

# Create output and temporary directories if they don't exist
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${TMP_DIR}"

# Execute rMATS for differential splicing analysis
# -t paired: Specifies paired-end reads. Use -t single for single-end reads.
# --readLength: Required parameter specifying the length of the sequencing reads.
# --nthread: Specifies the number of threads to use for parallel processing.
# rMATS automatically performs Benjamini-Hochberg FDR correction as part of its statistical analysis.
rmats.py \
    --b1 "${CONTROL_BAMS}" \
    --b2 "${TREATMENT_BAMS}" \
    --gtf "${GTF_FILE}" \
    --od "${OUTPUT_DIR}" \
    --tmp "${TMP_DIR}" \
    -t paired \
    --readLength "${READ_LENGTH}" \
    --nthread "${NUM_THREADS}"

View on GitHub

In addition, inclusion or exclusion junction reads were used to calculate the proportional change of exon inclusion (dI).

rMATS (Inferred with models/gemini-2.5-flash) v4.1.2 GitHub

$ Bash example

# Install rMATS (example using conda)
# conda create -n rmats_env python=3.8
# conda activate rmats_env
# conda install -c bioconda rmats-turbo

# Example rMATS command for differential splicing analysis.
# This assumes you have two groups of replicates (e.g., control and treatment)
# and their aligned BAM files. You will need to create text files listing the BAM files for each group.

# Placeholder: Create file_b1.txt (list of BAM files for group 1, e.g., control)
# Each BAM file path should be separated by a comma if multiple replicates, or a single path.
# Example: echo "/path/to/control_rep1.bam,/path/to/control_rep2.bam" > file_b1.txt

# Placeholder: Create file_b2.txt (list of BAM files for group 2, e.g., treatment)
# Example: echo "/path/to/treatment_rep1.bam,/path/to/treatment_rep2.bam" > file_b2.txt

# Reference genome annotation (GTF) - Using GENCODE human hg38 as a placeholder.
# Replace with the actual path to your GTF file.
GTF_FILE="/path/to/gencode.v38.annotation.gtf"

# Output directory for rMATS results
OUTPUT_DIR="rmats_differential_splicing_output"

# Create output directory if it doesn't exist
mkdir -p ${OUTPUT_DIR}

# Run rMATS to calculate proportional change of exon inclusion (dI/dPSI)
# --b1 and --b2 specify text files listing BAM files for two groups.
# --gtf provides the gene annotation file.
# --readLength and --variableReadLength handle read length variations.
# --nthread sets the number of CPU threads.
# --tmp specifies a temporary directory.
# --od sets the output directory.
# --task as specifies alternative splicing analysis.
# --statoff disables statistical analysis if only raw counts are needed, but usually kept on for dI.
# --tstat 10 sets the minimum total number of reads for a splicing event to be considered.
# --libType specifies the library type (e.g., fr-firststrand for Illumina TruSeq stranded).
rmats.py \
    --b1 file_b1.txt \
    --b2 file_b2.txt \
    --gtf ${GTF_FILE} \
    --readLength 50 \
    --variableReadLength \
    --nthread 8 \
    --tmp tmp_rmats_dir \
    --od ${OUTPUT_DIR} \
    --task as \
    --statoff \
    --tstat 10 \
    --libType fr-firststrand # Adjust based on library prep (e.g., fr-unstranded, fr-secondstrand)

View on GitHub

See documentation at http://zhanglab.c2b2.columbia.edu/index.php/Quantas_Documentation.

Quantas (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Quantas (example, adjust as needed based on your environment)
# conda install -c bioconda quantas

# Placeholder for reference genome annotation (e.g., GTF file for the latest assembly)
# Replace 'path/to/latest_assembly.gtf' with the actual path to your GTF file.
# Source for latest assembly GTF: Ensembl, GENCODE, UCSC, etc.
GENOME_ANNOTATION="path/to/latest_assembly.gtf"

# Placeholder for input alignment file (e.g., BAM file generated by STAR or HISAT2)
# Replace 'path/to/input_alignments.bam' with your actual input BAM file.
INPUT_BAM="path/to/input_alignments.bam"

# Placeholder for output directory where Quantas results will be stored
OUTPUT_DIR="quantas_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Example Quantas command for quantifying alternative splicing events.
# The exact parameters will depend on the specific analysis goals and Quantas version.
# Consult the official Quantas documentation for detailed usage and available options.
quantas \
    --input "${INPUT_BAM}" \
    --gtf "${GENOME_ANNOTATION}" \
    --output_dir "${OUTPUT_DIR}" \
    --threads 8 # Example: Use 8 threads for parallel processing

Tandem UTR isoform analysis was performed with the MISO algorithm v0.5.2 using default settings as described in Katz et al, Nature, 2010, except for use of custom 3â²UTR isoform annotations.

MISO v0.5.2 GitHub

$ Bash example

# Install MISO v0.5.2 (Note: MISO v0.5.2 typically requires Python 2.7)
# conda create -n miso_env python=2.7
# conda activate miso_env
# pip install MISO==0.5.2

# Placeholder for input BAM file and output directory
INPUT_BAM="input.bam" # Replace with your actual input BAM file
OUTPUT_DIR="miso_utr_analysis_output" # Directory for MISO results
CUSTOM_UTR_GFF="custom_3UTR_annotations.gff" # Replace with your custom 3'UTR isoform annotations GFF file

# Create an index for the custom 3'UTR isoform annotations
MISO_INDEX_DIR="miso_utr_index"
miso index "${CUSTOM_UTR_GFF}" "${MISO_INDEX_DIR}"

# Perform tandem UTR isoform analysis with MISO using default settings
miso --run "${MISO_INDEX_DIR}" "${INPUT_BAM}" --output-dir "${OUTPUT_DIR}"

View on GitHub

We used a Bayes-factor threshold of 10,000 and difference values (delta Psi) with an absolute value of at least 0.03

MISO (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install MISO (if not already installed)
# pip install miso

# Example usage of compare_miso for differential splicing analysis.
# This assumes you have MISO output directories (containing .miso files) for two conditions/samples
# and a GFF annotation file that defines the alternative splicing events.

# Define input MISO summary directories for two conditions/samples
# Replace with actual paths to your MISO output directories (e.g., from a previous run of MISO quantification)
MISO_DIR_CONDITION1="path/to/miso_output_condition1"
MISO_DIR_CONDITION2="path/to/miso_output_condition2"

# Define the GFF annotation file used for MISO quantification.
# This GFF should contain the alternative splicing events (e.g., from MISO's index_gff command or a pre-built annotation).
# Replace with the actual path to your GFF file (e.g., human_events.gff3).
SPLICING_EVENTS_GFF="path/to/splicing_events.gff3"

# Define output directory for differential splicing results
OUTPUT_DIR="differential_splicing_results"
mkdir -p "${OUTPUT_DIR}"

# Run compare_miso to identify differential splicing events.
# --bayes-factor-threshold: Minimum Bayes factor for an event to be considered significant.
# --delta-psi-threshold: Minimum absolute delta Psi for an event to be considered significant.
compare_miso \
    --compare-samples "${MISO_DIR_CONDITION1}" "${MISO_DIR_CONDITION2}" \
    --events-gff "${SPLICING_EVENTS_GFF}" \
    --output-dir "${OUTPUT_DIR}" \
    --bayes-factor-threshold 10000 \
    --delta-psi-threshold 0.03

View on GitHub

For TAIL-seq data analysis, image files were downloaded from the MiSeq and run on tailseeker (Chang et al, 2014, Mol Cell) to determine accurate polyA tail lengths, yielding paired fastq files corresponding to the 5' (R5) and 3' (R3 polyA tail) ends of each read.

tailseeker v2014 GitHub

$ Bash example

# Installation: As 'tailseeker' is likely a custom script or pipeline based on Chang et al., 2014,
# a direct installation command like 'conda install tailseeker' is unlikely to exist.
# It would involve setting up a custom environment and potentially cloning specific scripts.
# For demonstration, we assume a hypothetical executable 'tailseeker' is available in the PATH.
# Example (if it were a Python script):
# git clone https://github.com/some_repo/tailseeker.git
# cd tailseeker
# pip install -r requirements.txt
# export PATH=$PWD:$PATH

# Define input MiSeq run directory and desired output prefix
MISEQ_RUN_DIR="/path/to/your/miseq_run_directory" # e.g., /data/MiSeq_Run_12345
OUTPUT_PREFIX="sample_id" # e.g., my_experiment_sample

# Execute tailseeker to process raw MiSeq data (from image files/BCLs)
# and generate paired fastq files for 5' (R5) and 3' (R3 polyA tail) ends.
# The exact parameters are inferred based on the description of input and output.
tailseeker --input-miseq-dir "${MISEQ_RUN_DIR}" \
           --output-r5-fastq "${OUTPUT_PREFIX}_R5.fastq.gz" \
           --output-r3-fastq "${OUTPUT_PREFIX}_R3.fastq.gz" \
           --log-file "${OUTPUT_PREFIX}_tailseeker.log"

View on GitHub

Reads were aligned against the human genome (hg19) and viral genomes (Human_Herpesvirus_5_strain_Merlin, AY446894.2) using STAR under default parameters.

STAR v2.7.10a GitHub

$ Bash example

# conda install -c bioconda star

# --- Reference Data Preparation ---
# Create directories for reference genomes and STAR index
mkdir -p references/hg19 references/viral star_index/hg19_viral

# Download hg19 primary assembly FASTA from UCSC
wget -P references/hg19/ https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip references/hg19/hg19.fa.gz

# Download hg19 GTF (e.g., NCBI RefSeq from UCSC) for splice junction annotation
wget -P references/hg19/ https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz
gunzip references/hg19/hg19.ncbiRefSeq.gtf.gz

# Download Human_Herpesvirus_5_strain_Merlin (AY446894.2) FASTA from NCBI
wget -O references/viral/AY446894.2.fasta "https://www.ncbi.nlm.nih.gov/nuccore/AY446894.2?report=fasta&format=text"

# Build STAR index for human (hg19) and viral genomes
# This step combines the human FASTA and GTF with the viral FASTA into a single index.
STAR --runMode genomeGenerate \
     --genomeDir star_index/hg19_viral \
     --genomeFastaFiles references/hg19/hg19.fa references/viral/AY446894.2.fasta \
     --sjdbGTFfile references/hg19/hg19.ncbiRefSeq.gtf \
     --runThreadN 8 # Adjust threads as needed for index generation

# --- STAR Alignment ---
# Placeholder for input FASTQ files (assuming paired-end and gzipped)
# Replace read1.fastq.gz and read2.fastq.gz with actual file paths
# Replace sample_name with an appropriate prefix for output files
STAR --genomeDir star_index/hg19_viral \
     --readFilesIn read1.fastq.gz read2.fastq.gz \
     --readFilesCommand zcat \
     --runThreadN 8 \
     --outFileNamePrefix sample_name. \
     --outSAMtype BAM SortedByCoordinate

View on GitHub

Features were assigned using Subread with gencode v19 annotations and with Human_Herpesvirus_5_strain_Merlin features, and filtered to obtain only the uniquely mapped protein coding genes.

featureCounts v2.0.3 GitHub

$ Bash example

# Define reference paths and input/output files
GENCODE_GTF="path/to/gencode.v19.annotation.gtf" # Source: https://www.gencodegenes.org/human/release_19.html
HHV5_GTF="path/to/Human_Herpesvirus_5_strain_Merlin.gtf" # Source: NCBI, UCSC, or specific viral genome databases
INPUT_BAM="aligned_reads.bam" # Placeholder for input BAM file
OUTPUT_COUNTS="gene_counts.txt"
THREADS=8 # Number of threads to use

# --- Installation (commented out) ---
# It's recommended to install featureCounts (part of the Subread package)
# via a package manager like Conda:
# conda install -c bioconda subread

# --- Prepare combined and filtered annotation GTF ---
# 1. Filter Gencode v19 GTF for protein_coding genes.
#    Gencode GTF files typically have 'gene_type "protein_coding";' in the attributes column.
#    We also keep header lines starting with '#'.
grep -E '^#|gene_type "protein_coding";' "${GENCODE_GTF}" > gencode_v19_protein_coding.gtf

# 2. Combine the filtered Gencode GTF with the Human Herpesvirus 5 GTF.
#    This creates a single annotation file for featureCounts.
cat gencode_v19_protein_coding.gtf "${HHV5_GTF}" > combined_filtered_annotation.gtf

# --- Run featureCounts ---
# -a: Annotation file in GTF/GFF format.
# -o: Output file for read counts.
# -F GTF: Specify that the annotation file is in GTF format.
# -t exon: Specify 'exon' as the feature type to count reads against.
# -g gene_id: Specify 'gene_id' as the attribute to group features into genes.
# -s 0: Specify unstranded library preparation (0=unstranded, 1=forward, 2=reverse).
#       Adjust if your library is stranded.
# -T: Number of threads to use for parallel processing.
# -p: Input reads are paired-end (remove if single-end).
# By default, featureCounts only counts uniquely mapped reads,
# which aligns with "uniquely mapped protein coding genes".
featureCounts -a combined_filtered_annotation.gtf \
              -o "${OUTPUT_COUNTS}" \
              -F GTF \
              -t exon \
              -g gene_id \
              -s 0 \
              -T "${THREADS}" \
              -p \
              "${INPUT_BAM}"

View on GitHub

For analysis of virally mapped reads, all genes were counted.

featureCounts (Inferred with models/gemini-2.5-flash) v2.0.x (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install featureCounts (part of Subread package)
# conda install -c bioconda subread

# Define input and output files
VIRAL_BAM="viral_mapped_reads.bam" # Placeholder for BAM file containing virally mapped reads
VIRAL_GTF="viral_genome.gtf"     # Placeholder for viral gene annotation file (GTF format)
OUTPUT_COUNTS="viral_gene_counts.txt"

# Run featureCounts to count reads per gene
# -a: Annotation file (GTF/GFF)
# -o: Output file for counts
# -F GTF: Specify annotation file format
# -t exon: Feature type to count (e.g., 'exon' for genes)
# -g gene_id: Attribute to group features by (e.g., 'gene_id')
# -s 0: Strandedness (0=unstranded, 1=forward, 2=reverse). Assuming unstranded as not specified.
# --primary: Only count primary alignments (optional, but often good practice)
featureCounts -a "${VIRAL_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              -F GTF \
              -t exon \
              -g gene_id \
              -s 0 \
              --primary \
              "${VIRAL_BAM}"

View on GitHub

Reads with tails measuring 0 lengths were removed.

cutadapt (Inferred with models/gemini-2.5-flash) v4.x GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

# Define input and output file paths
INPUT_READS="input_reads.fastq.gz"
OUTPUT_READS="trimmed_reads.fastq.gz"
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example Illumina adapter sequence

# Remove reads with tails (adapters) and discard reads that become 0 length after trimming.
# cutadapt by default removes reads that become empty after trimming.
# The -m 1 option explicitly ensures that reads shorter than 1 bp (i.e., 0 bp) are discarded.
cutadapt -a "${ADAPTER_SEQUENCE}" -m 1 -o "${OUTPUT_READS}" "${INPUT_READS}"

View on GitHub

For genes with at least 20 mapped reads, median lengths were measured and the global distributions of these lengths were compared against each other using the Kolmogorov-Smirnov test.

R (Inferred with models/gemini-2.5-flash) v4.2.0 GitHub

$ Bash example

# This script assumes that 'condition1_median_gene_lengths.txt' and 'condition2_median_gene_lengths.txt'
# have been generated by an upstream process. This upstream process would involve:
# 1. Aligning reads to a reference genome.
# 2. Quantifying reads per gene.
# 3. Filtering genes to include only those with at least 20 mapped reads.
# 4. For the remaining genes, extracting read lengths for reads mapping to them.
# 5. Calculating the median read length for each gene.
# 6. Storing these median lengths (one per line) in the respective input files.

# Example of how to create dummy input files for demonstration:
# echo -e "100\n110\n105\n120\n95" > condition1_median_gene_lengths.txt
# echo -e "115\n125\n120\n130\n110" > condition2_median_gene_lengths.txt

# Install R if not already available
# conda install -c conda-forge r-base=4.2.0

# Create the R script for performing the Kolmogorov-Smirnov test
cat <<EOF > ks_test_gene_lengths.R
# ks_test_gene_lengths.R
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 2) {
  stop("Usage: Rscript ks_test_gene_lengths.R <file1> <file2>", call. = FALSE)
}

file1 <- args[1]
file2 <- args[2]

# Read data from input files
# Each file is expected to contain one median gene length per line
data1 <- as.numeric(readLines(file1))
data2 <- as.numeric(readLines(file2))

# Perform Kolmogorov-Smirnov test
ks_result <- ks.test(data1, data2)

cat("Kolmogorov-Smirnov Test Results:\n")
cat("D statistic:", ks_result$statistic, "\n")
cat("P-value:", ks_result$p.value, "\n")

# Optional: Save results to a file
# write.table(data.frame(Statistic = ks_result$statistic, P_value = ks_result$p.value),
#             file = "ks_test_results.txt", quote = FALSE, row.names = FALSE)

# Optional: Plotting for visualization (requires a graphical environment)
# png("length_distributions_ecdf.png")
# plot(ecdf(data1), main="Empirical Cumulative Distribution Functions", xlab="Median Gene Length", ylab="ECDF", col="blue", lwd=2)
# lines(ecdf(data2), col="red", lwd=2)
# legend("bottomright", legend=c(basename(file1), basename(file2)), col=c("blue", "red"), lty=1, lwd=2)
# dev.off()
EOF

# Execute the R script
Rscript ks_test_gene_lengths.R condition1_median_gene_lengths.txt condition2_median_gene_lengths.txt

View on GitHub

For each gene captured in all samples and with at least 20 mapped reads, individual tail length distributions were compared amongst samples using the Mann Whitney U test with a P value cutoff of 0.025

R (Inferred with models/gemini-2.5-flash) v4.1.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R if not already installed
# sudo apt-get update
# sudo apt-get install r-base

# Create a dummy input file for demonstration purposes.
# In a real scenario, this file would be generated by upstream steps
# and contain tail length data for each read, gene, and sample.
# Ensure enough data points per gene per sample to meet the 'at least 20 mapped reads' criteria.
cat <<EOF > aggregated_tail_lengths.tsv
gene	sample	tail_length
GENE1	SampleA	45
GENE1	SampleA	52
GENE1	SampleA	60
GENE1	SampleA	48
GENE1	SampleA	55
GENE1	SampleA	42
GENE1	SampleA	50
GENE1	SampleA	58
GENE1	SampleA	47
GENE1	SampleA	53
GENE1	SampleA	46
GENE1	SampleA	51
GENE1	SampleA	59
GENE1	SampleA	49
GENE1	SampleA	54
GENE1	SampleA	43
GENE1	SampleA	56
GENE1	SampleA	61
GENE1	SampleA	44
GENE1	SampleA	57
GENE1	SampleB	50
GENE1	SampleB	58
GENE1	SampleB	65
GENE1	SampleB	53
GENE1	SampleB	62
GENE1	SampleB	51
GENE1	SampleB	59
GENE1	SampleB	66
GENE1	SampleB	54
GENE1	SampleB	63
GENE1	SampleB	52
GENE1	SampleB	60
GENE1	SampleB	67
GENE1	SampleB	55
GENE1	SampleB	64
GENE1	SampleB	53
GENE1	SampleB	61
GENE1	SampleB	68
GENE1	SampleB	56
GENE1	SampleB	65
GENE2	SampleA	30
GENE2	SampleA	32
GENE2	SampleA	28
GENE2	SampleA	35
GENE2	SampleA	31
GENE2	SampleA	29
GENE2	SampleA	33
GENE2	SampleA	27
GENE2	SampleA	34
GENE2	SampleA	30
GENE2	SampleA	28
GENE2	SampleA	32
GENE2	SampleA	26
GENE2	SampleA	33
GENE2	SampleA	29
GENE2	SampleA	31
GENE2	SampleA	27
GENE2	SampleA	34
GENE2	SampleA	30
GENE2	SampleA	28
GENE2	SampleB	31
GENE2	SampleB	33
GENE2	SampleB	29
GENE2	SampleB	36
GENE2	SampleB	32
GENE2	SampleB	30
GENE2	SampleB	34
GENE2	SampleB	28
GENE2	SampleB	35
GENE2	SampleB	31
GENE2	SampleB	29
GENE2	SampleB	33
GENE2	SampleB	27
GENE2	SampleB	34
GENE2	SampleB	30
GENE2	SampleB	32
GENE2	SampleB	28
GENE2	SampleB	35
GENE2	SampleB	31
GENE2	SampleB	29
EOF

# Create the R script for comparison
cat <<'EOF_R_SCRIPT' > compare_tail_lengths.R
# R script for comparing tail length distributions using Mann-Whitney U test

# Function to perform Mann-Whitney U test for a given gene across samples
compare_tail_lengths <- function(data_df, gene_id, p_cutoff = 0.025) {
  gene_data <- subset(data_df, gene == gene_id)
  
  if (nrow(gene_data) < 2) {
    return(NULL) # Not enough data for comparison
  }
  
  samples <- unique(gene_data$sample)
  if (length(samples) < 2) {
    return(NULL) # Need at least two samples for comparison
  }
  
  results <- list()
  
  # Iterate through all unique pairs of samples
  for (i in 1:(length(samples) - 1)) {
    for (j in (i + 1):length(samples)) {
      sample1 <- samples[i]
      sample2 <- samples[j]
      
      tails1 <- subset(gene_data, sample == sample1)$tail_length
      tails2 <- subset(gene_data, sample == sample2)$tail_length
      
      # Filter for at least 20 mapped reads (assuming 'tail_length' represents a read)
      if (length(tails1) >= 20 && length(tails2) >= 20) {
        # Mann-Whitney U test is performed using wilcox.test in R
        test_result <- wilcox.test(tails1, tails2, exact = FALSE) 
        
        results[[paste0(sample1, "_vs_", sample2)]] <- data.frame(
          gene = gene_id,
          sample1 = sample1,
          sample2 = sample2,
          p_value = test_result$p.value,
          significant = test_result$p.value < p_cutoff
        )
      }
    }
  }
  return(do.call(rbind, results))
}

# --- Main script execution ---

# Load aggregated tail length data
input_file="aggregated_tail_lengths.tsv"
tail_data <- read.delim(input_file, header=TRUE, sep="\t")

all_genes <- unique(tail_data$gene)
all_results <- list()

for (gene in all_genes) {
  gene_results <- compare_tail_lengths(tail_data, gene, p_cutoff = 0.025)
  if (!is.null(gene_results)) {
    all_results[[gene]] <- gene_results
  }
}

final_report <- do.call(rbind, all_results)

# Output results
output_file="tail_length_comparison_report.tsv"
write.table(final_report, file = output_file, sep = "\t", row.names = FALSE, quote = FALSE)

print(paste("Comparison report written to:", output_file))
EOF_R_SCRIPT

# Execute the R script
Rscript compare_tail_lengths.R

View on GitHub

Tools Used

RNA-seq STAR

Raw Source Text

RNA-seq data was aligned to the human hg19 genome build using Olego and Star aligners, and alternative splicing and alternative polyadenylation were estimated as described below.
Quantas software as described in Charizanis et al, 2012, Neuron, was used to estimate alternative splicing. Olego aligned alignment files were used to count observed junction reads for each exon. Weighted number of exon or exon-junction fragments uniquely supporting the inclusion or skipping isoform of each cassette exon and a probability score was assigned to each isoform. A Fisherâs exact test was used to evaluate the statistical significance of splicing changes using both exon and exon-junction fragments, followed by Benjamini multiple testing correction to estimate the false discovery rate (FDR). In addition, inclusion or exclusion junction reads were used to calculate the proportional change of exon inclusion (dI). See documentation at http://zhanglab.c2b2.columbia.edu/index.php/Quantas_Documentation.
Tandem UTR isoform analysis was performed with the MISO algorithm v0.5.2 using default settings as described in Katz et al, Nature, 2010, except for use of custom 3â²UTR isoform annotations. We used a Bayes-factor threshold of 10,000 and difference values (delta Psi) with an absolute value of at least 0.03
For TAIL-seq data analysis, image files were downloaded from the MiSeq and run on tailseeker (Chang et al, 2014, Mol Cell) to determine accurate polyA tail lengths, yielding paired fastq files corresponding to the 5' (R5) and 3' (R3 polyA tail) ends of each read. Reads were aligned against the human genome (hg19) and viral genomes (Human_Herpesvirus_5_strain_Merlin, AY446894.2) using STAR under default parameters. Features were assigned using Subread with gencode v19 annotations and with Human_Herpesvirus_5_strain_Merlin features, and filtered to obtain only the uniquely mapped protein coding genes. For analysis of virally mapped reads, all genes were counted. Reads with tails measuring 0 lengths were removed. For genes with at least 20 mapped reads, median lengths were measured and the global distributions of these lengths were compared against each other using the Kolmogorov-Smirnov test. For each gene captured in all samples and with at least 20 mapped reads, individual tail length distributions were compared amongst samples using the Mann Whitney U test with a P value cutoff of 0.025
Genome_build: hg19
Supplementary_files_format_and_content: bedgraph
Supplementary_files_format_and_content: TableS3: Significantly changing alternative host polyadenylation events
Supplementary_files_format_and_content: TableS6: Significant host gene alternative cassette splicing changes upon CPEB1 KD and CPEB1 OE from RNA-seq
Supplementary_files_format_and_content: TableS8: PolyA tail lengths analyzed by TAIL-seq

← Back to Analysis