GSE86791 Processing Pipeline
Publication
RNA-binding protein CPEB1 remodels host and viral RNA landscapes.Nature structural & molecular biology (2016) — PMID 27775709
Processing Steps
Generate Jupyter Notebook-
1
RNA-seq data was aligned to the human hg19 genome build using Olego and Star aligners, and alternative splicing and alternative polyadenylation were estimated as described below.
RNA-seq v2.7.10a (Inferred with models/gemini-2.5-flash), 1.0.6 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# --- Reference Genome Preparation --- # Create a directory for reference files mkdir -p reference/hg19 # Download human hg19 genome FASTA from UCSC # wget -P reference/hg19 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip -f reference/hg19/hg19.fa.gz HG19_FASTA="reference/hg19/hg19.fa" # Download human GRCh37.75 GTF (equivalent to hg19) from Ensembl # wget -P reference/hg19 ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz # gunzip -f reference/hg19/Homo_sapiens.GRCh37.75.gtf.gz HG19_GTF="reference/hg19/Homo_sapiens.GRCh37.75.gtf" # --- STAR Aligner --- # Installation (example using Bioconda) # conda install -c bioconda star=2.7.10a # Create STAR genome index STAR_INDEX_DIR="star_index_hg19" mkdir -p "${STAR_INDEX_DIR}" STAR --runMode genomeGenerate \ --genomeDir "${STAR_INDEX_DIR}" \ --genomeFastaFiles "${HG19_FASTA}" \ --sjdbGTFfile "${HG19_GTF}" \ --runThreadN 8 # Adjust thread count as needed # Align RNA-seq reads using STAR # Assuming paired-end reads: sample_R1.fastq.gz and sample_R2.fastq.gz # Replace with actual input file paths INPUT_READ1="sample_R1.fastq.gz" INPUT_READ2="sample_R2.fastq.gz" OUTPUT_PREFIX="star_aligned_sample" STAR --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${INPUT_READ1}" "${INPUT_READ2}" \ --readFilesCommand zcat \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard \ --quantMode GeneCounts \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 1 \ --alignSJoverhangMin 8 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 # --- Olego Aligner --- # Installation (example using Bioconda) # conda install -c bioconda olego=1.0.6 # Create Olego genome index OLEGO_INDEX_DIR="olego_index_hg19" mkdir -p "${OLEGO_INDEX_DIR}" olego index -f "${HG19_FASTA}" -g "${HG19_GTF}" -o "${OLEGO_INDEX_DIR}" # Align RNA-seq reads using Olego # Assuming paired-end reads: sample_R1.fastq.gz and sample_R2.fastq.gz # Replace with actual input file paths OLEGO_OUTPUT_BAM="olego_aligned_sample.bam" olego align -i "${OLEGO_INDEX_DIR}" \ -1 "${INPUT_READ1}" \ -2 "${INPUT_READ2}" \ -o "${OLEGO_OUTPUT_BAM}" \ -t 8 # Adjust thread count as needed -
2
Quantas software as described in Charizanis et al, 2012, Neuron, was used to estimate alternative splicing.
$ Bash example
# Quantas software, as described in Charizanis et al, 2012, Neuron, is a custom tool # developed for estimating alternative splicing from RNA-seq data. # Specific installation instructions or a public repository for Quantas are not readily available. # The following command is a generic placeholder based on typical alternative splicing quantification workflows. # Define placeholder paths for input and reference files # Replace with actual paths to your aligned RNA-seq BAM file and genome annotation GTF. INPUT_BAM="path/to/your_aligned_reads.bam" GENOME_GTF="path/to/human_hg38.gtf" # Example: GRCh38/hg38 genome annotation # Placeholder command for Quantas execution. # The actual command would depend on the specific implementation of the Quantas script. # It likely takes aligned reads (BAM) and a genome annotation (GTF) as input. # The output would typically be a table of splicing event quantification. python /path/to/quantas_script.py \ --input "$INPUT_BAM" \ --gtf "$GENOME_GTF" \ --output "quantas_alternative_splicing_results.tsv" \ --log "quantas.log" -
3
Olego aligned alignment files were used to count observed junction reads for each exon.
$ Bash example
# Install rMATS (if not already installed) # conda install -c bioconda rmats-turbo # Define input and output paths INPUT_BAM="olego_aligned.bam" # Assuming a single Olego aligned BAM file GTF_FILE="gencode.v38.annotation.gtf" # Placeholder for human annotation (e.g., from GENCODE) READ_LENGTH=100 # Inferring a common read length, adjust if known OUTPUT_DIR="rmats_junction_counts" TMP_DIR="rmats_tmp" NUM_THREADS=8 # Create output and temporary directories mkdir -p "${OUTPUT_DIR}" mkdir -p "${TMP_DIR}" # Run rMATS to count junction reads for exons. # rMATS is primarily for differential splicing, but its output includes raw counts # for junction reads supporting various splicing events (e.g., exon inclusion/exclusion). # For a single sample, we can provide the same BAM file for both b1 and b2 to generate these counts. rmats.py \ --b1 "${INPUT_BAM}" \ --b2 "${INPUT_BAM}" \ --gtf "${GTF_FILE}" \ --readLength "${READ_LENGTH}" \ --tmp "${TMP_DIR}" \ --nthread "${NUM_THREADS}" \ --od "${OUTPUT_DIR}" -
4
Weighted number of exon or exon-junction fragments uniquely supporting the inclusion or skipping isoform of each cassette exon and a probability score was assigned to each isoform.
$ Bash example
# Install Skipper (if not already installed) # pip install skipper # Define input files and reference data INPUT_BAM="aligned_reads.bam" ANNOTATION_GTF="gencode.v44.annotation.gtf" # Placeholder for latest human GTF (e.g., GRCh38) GENOME_FASTA="GRCh38.primary_assembly.genome.fa" # Placeholder for latest human genome FASTA OUTPUT_DIR="skipper_quantification_output" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run Skipper to quantify alternative splicing events # The description implies quantification of cassette exons (SE events) and assigning probability scores. # Skipper's 'quantify' command handles this by default for various event types, including SE. skipper quantify \ "${INPUT_BAM}" \ "${ANNOTATION_GTF}" \ --genome "${GENOME_FASTA}" \ --output "${OUTPUT_DIR}" -
5
A Fisherâs exact test was used to evaluate the statistical significance of splicing changes using both exon and exon-junction fragments, followed by Benjamini multiple testing correction to estimate the false discovery rate (FDR).
$ Bash example
# Install rMATS via conda # conda create -n rmats_env python=3.8 # conda activate rmats_env # conda install -c bioconda rmats=4.1.2 # Define input and output paths # Placeholder: Replace with actual comma-separated BAM files for control replicates CONTROL_BAMS="control_rep1.bam,control_rep2.bam" # Placeholder: Replace with actual comma-separated BAM files for treatment replicates TREATMENT_BAMS="treatment_rep1.bam,treatment_rep2.bam" # Placeholder: Use a relevant genome annotation GTF file (e.g., from GENCODE for GRCh38) GTF_FILE="gencode.v38.annotation.gtf" OUTPUT_DIR="rmats_output" TMP_DIR="rmats_tmp" READ_LENGTH=75 # Placeholder: Adjust based on the actual read length of your sequencing data NUM_THREADS=8 # Placeholder: Adjust based on available CPU cores # Create output and temporary directories if they don't exist mkdir -p "${OUTPUT_DIR}" mkdir -p "${TMP_DIR}" # Execute rMATS for differential splicing analysis # -t paired: Specifies paired-end reads. Use -t single for single-end reads. # --readLength: Required parameter specifying the length of the sequencing reads. # --nthread: Specifies the number of threads to use for parallel processing. # rMATS automatically performs Benjamini-Hochberg FDR correction as part of its statistical analysis. rmats.py \ --b1 "${CONTROL_BAMS}" \ --b2 "${TREATMENT_BAMS}" \ --gtf "${GTF_FILE}" \ --od "${OUTPUT_DIR}" \ --tmp "${TMP_DIR}" \ -t paired \ --readLength "${READ_LENGTH}" \ --nthread "${NUM_THREADS}" -
6
In addition, inclusion or exclusion junction reads were used to calculate the proportional change of exon inclusion (dI).
$ Bash example
# Install rMATS (example using conda) # conda create -n rmats_env python=3.8 # conda activate rmats_env # conda install -c bioconda rmats-turbo # Example rMATS command for differential splicing analysis. # This assumes you have two groups of replicates (e.g., control and treatment) # and their aligned BAM files. You will need to create text files listing the BAM files for each group. # Placeholder: Create file_b1.txt (list of BAM files for group 1, e.g., control) # Each BAM file path should be separated by a comma if multiple replicates, or a single path. # Example: echo "/path/to/control_rep1.bam,/path/to/control_rep2.bam" > file_b1.txt # Placeholder: Create file_b2.txt (list of BAM files for group 2, e.g., treatment) # Example: echo "/path/to/treatment_rep1.bam,/path/to/treatment_rep2.bam" > file_b2.txt # Reference genome annotation (GTF) - Using GENCODE human hg38 as a placeholder. # Replace with the actual path to your GTF file. GTF_FILE="/path/to/gencode.v38.annotation.gtf" # Output directory for rMATS results OUTPUT_DIR="rmats_differential_splicing_output" # Create output directory if it doesn't exist mkdir -p ${OUTPUT_DIR} # Run rMATS to calculate proportional change of exon inclusion (dI/dPSI) # --b1 and --b2 specify text files listing BAM files for two groups. # --gtf provides the gene annotation file. # --readLength and --variableReadLength handle read length variations. # --nthread sets the number of CPU threads. # --tmp specifies a temporary directory. # --od sets the output directory. # --task as specifies alternative splicing analysis. # --statoff disables statistical analysis if only raw counts are needed, but usually kept on for dI. # --tstat 10 sets the minimum total number of reads for a splicing event to be considered. # --libType specifies the library type (e.g., fr-firststrand for Illumina TruSeq stranded). rmats.py \ --b1 file_b1.txt \ --b2 file_b2.txt \ --gtf ${GTF_FILE} \ --readLength 50 \ --variableReadLength \ --nthread 8 \ --tmp tmp_rmats_dir \ --od ${OUTPUT_DIR} \ --task as \ --statoff \ --tstat 10 \ --libType fr-firststrand # Adjust based on library prep (e.g., fr-unstranded, fr-secondstrand) -
7
See documentation at http://zhanglab.c2b2.columbia.edu/index.php/Quantas_Documentation.
Quantas (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash)$ Bash example
# Install Quantas (example, adjust as needed based on your environment) # conda install -c bioconda quantas # Placeholder for reference genome annotation (e.g., GTF file for the latest assembly) # Replace 'path/to/latest_assembly.gtf' with the actual path to your GTF file. # Source for latest assembly GTF: Ensembl, GENCODE, UCSC, etc. GENOME_ANNOTATION="path/to/latest_assembly.gtf" # Placeholder for input alignment file (e.g., BAM file generated by STAR or HISAT2) # Replace 'path/to/input_alignments.bam' with your actual input BAM file. INPUT_BAM="path/to/input_alignments.bam" # Placeholder for output directory where Quantas results will be stored OUTPUT_DIR="quantas_output" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Example Quantas command for quantifying alternative splicing events. # The exact parameters will depend on the specific analysis goals and Quantas version. # Consult the official Quantas documentation for detailed usage and available options. quantas \ --input "${INPUT_BAM}" \ --gtf "${GENOME_ANNOTATION}" \ --output_dir "${OUTPUT_DIR}" \ --threads 8 # Example: Use 8 threads for parallel processing -
8
Tandem UTR isoform analysis was performed with the MISO algorithm v0.5.2 using default settings as described in Katz et al, Nature, 2010, except for use of custom 3â²UTR isoform annotations.
$ Bash example
# Install MISO v0.5.2 (Note: MISO v0.5.2 typically requires Python 2.7) # conda create -n miso_env python=2.7 # conda activate miso_env # pip install MISO==0.5.2 # Placeholder for input BAM file and output directory INPUT_BAM="input.bam" # Replace with your actual input BAM file OUTPUT_DIR="miso_utr_analysis_output" # Directory for MISO results CUSTOM_UTR_GFF="custom_3UTR_annotations.gff" # Replace with your custom 3'UTR isoform annotations GFF file # Create an index for the custom 3'UTR isoform annotations MISO_INDEX_DIR="miso_utr_index" miso index "${CUSTOM_UTR_GFF}" "${MISO_INDEX_DIR}" # Perform tandem UTR isoform analysis with MISO using default settings miso --run "${MISO_INDEX_DIR}" "${INPUT_BAM}" --output-dir "${OUTPUT_DIR}" -
9
We used a Bayes-factor threshold of 10,000 and difference values (delta Psi) with an absolute value of at least 0.03
$ Bash example
# Install MISO (if not already installed) # pip install miso # Example usage of compare_miso for differential splicing analysis. # This assumes you have MISO output directories (containing .miso files) for two conditions/samples # and a GFF annotation file that defines the alternative splicing events. # Define input MISO summary directories for two conditions/samples # Replace with actual paths to your MISO output directories (e.g., from a previous run of MISO quantification) MISO_DIR_CONDITION1="path/to/miso_output_condition1" MISO_DIR_CONDITION2="path/to/miso_output_condition2" # Define the GFF annotation file used for MISO quantification. # This GFF should contain the alternative splicing events (e.g., from MISO's index_gff command or a pre-built annotation). # Replace with the actual path to your GFF file (e.g., human_events.gff3). SPLICING_EVENTS_GFF="path/to/splicing_events.gff3" # Define output directory for differential splicing results OUTPUT_DIR="differential_splicing_results" mkdir -p "${OUTPUT_DIR}" # Run compare_miso to identify differential splicing events. # --bayes-factor-threshold: Minimum Bayes factor for an event to be considered significant. # --delta-psi-threshold: Minimum absolute delta Psi for an event to be considered significant. compare_miso \ --compare-samples "${MISO_DIR_CONDITION1}" "${MISO_DIR_CONDITION2}" \ --events-gff "${SPLICING_EVENTS_GFF}" \ --output-dir "${OUTPUT_DIR}" \ --bayes-factor-threshold 10000 \ --delta-psi-threshold 0.03 -
10
For TAIL-seq data analysis, image files were downloaded from the MiSeq and run on tailseeker (Chang et al, 2014, Mol Cell) to determine accurate polyA tail lengths, yielding paired fastq files corresponding to the 5' (R5) and 3' (R3 polyA tail) ends of each read.
$ Bash example
# Installation: As 'tailseeker' is likely a custom script or pipeline based on Chang et al., 2014, # a direct installation command like 'conda install tailseeker' is unlikely to exist. # It would involve setting up a custom environment and potentially cloning specific scripts. # For demonstration, we assume a hypothetical executable 'tailseeker' is available in the PATH. # Example (if it were a Python script): # git clone https://github.com/some_repo/tailseeker.git # cd tailseeker # pip install -r requirements.txt # export PATH=$PWD:$PATH # Define input MiSeq run directory and desired output prefix MISEQ_RUN_DIR="/path/to/your/miseq_run_directory" # e.g., /data/MiSeq_Run_12345 OUTPUT_PREFIX="sample_id" # e.g., my_experiment_sample # Execute tailseeker to process raw MiSeq data (from image files/BCLs) # and generate paired fastq files for 5' (R5) and 3' (R3 polyA tail) ends. # The exact parameters are inferred based on the description of input and output. tailseeker --input-miseq-dir "${MISEQ_RUN_DIR}" \ --output-r5-fastq "${OUTPUT_PREFIX}_R5.fastq.gz" \ --output-r3-fastq "${OUTPUT_PREFIX}_R3.fastq.gz" \ --log-file "${OUTPUT_PREFIX}_tailseeker.log" -
11
Reads were aligned against the human genome (hg19) and viral genomes (Human_Herpesvirus_5_strain_Merlin, AY446894.2) using STAR under default parameters.
$ Bash example
# conda install -c bioconda star # --- Reference Data Preparation --- # Create directories for reference genomes and STAR index mkdir -p references/hg19 references/viral star_index/hg19_viral # Download hg19 primary assembly FASTA from UCSC wget -P references/hg19/ https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz gunzip references/hg19/hg19.fa.gz # Download hg19 GTF (e.g., NCBI RefSeq from UCSC) for splice junction annotation wget -P references/hg19/ https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz gunzip references/hg19/hg19.ncbiRefSeq.gtf.gz # Download Human_Herpesvirus_5_strain_Merlin (AY446894.2) FASTA from NCBI wget -O references/viral/AY446894.2.fasta "https://www.ncbi.nlm.nih.gov/nuccore/AY446894.2?report=fasta&format=text" # Build STAR index for human (hg19) and viral genomes # This step combines the human FASTA and GTF with the viral FASTA into a single index. STAR --runMode genomeGenerate \ --genomeDir star_index/hg19_viral \ --genomeFastaFiles references/hg19/hg19.fa references/viral/AY446894.2.fasta \ --sjdbGTFfile references/hg19/hg19.ncbiRefSeq.gtf \ --runThreadN 8 # Adjust threads as needed for index generation # --- STAR Alignment --- # Placeholder for input FASTQ files (assuming paired-end and gzipped) # Replace read1.fastq.gz and read2.fastq.gz with actual file paths # Replace sample_name with an appropriate prefix for output files STAR --genomeDir star_index/hg19_viral \ --readFilesIn read1.fastq.gz read2.fastq.gz \ --readFilesCommand zcat \ --runThreadN 8 \ --outFileNamePrefix sample_name. \ --outSAMtype BAM SortedByCoordinate -
12
Features were assigned using Subread with gencode v19 annotations and with Human_Herpesvirus_5_strain_Merlin features, and filtered to obtain only the uniquely mapped protein coding genes.
$ Bash example
# Define reference paths and input/output files GENCODE_GTF="path/to/gencode.v19.annotation.gtf" # Source: https://www.gencodegenes.org/human/release_19.html HHV5_GTF="path/to/Human_Herpesvirus_5_strain_Merlin.gtf" # Source: NCBI, UCSC, or specific viral genome databases INPUT_BAM="aligned_reads.bam" # Placeholder for input BAM file OUTPUT_COUNTS="gene_counts.txt" THREADS=8 # Number of threads to use # --- Installation (commented out) --- # It's recommended to install featureCounts (part of the Subread package) # via a package manager like Conda: # conda install -c bioconda subread # --- Prepare combined and filtered annotation GTF --- # 1. Filter Gencode v19 GTF for protein_coding genes. # Gencode GTF files typically have 'gene_type "protein_coding";' in the attributes column. # We also keep header lines starting with '#'. grep -E '^#|gene_type "protein_coding";' "${GENCODE_GTF}" > gencode_v19_protein_coding.gtf # 2. Combine the filtered Gencode GTF with the Human Herpesvirus 5 GTF. # This creates a single annotation file for featureCounts. cat gencode_v19_protein_coding.gtf "${HHV5_GTF}" > combined_filtered_annotation.gtf # --- Run featureCounts --- # -a: Annotation file in GTF/GFF format. # -o: Output file for read counts. # -F GTF: Specify that the annotation file is in GTF format. # -t exon: Specify 'exon' as the feature type to count reads against. # -g gene_id: Specify 'gene_id' as the attribute to group features into genes. # -s 0: Specify unstranded library preparation (0=unstranded, 1=forward, 2=reverse). # Adjust if your library is stranded. # -T: Number of threads to use for parallel processing. # -p: Input reads are paired-end (remove if single-end). # By default, featureCounts only counts uniquely mapped reads, # which aligns with "uniquely mapped protein coding genes". featureCounts -a combined_filtered_annotation.gtf \ -o "${OUTPUT_COUNTS}" \ -F GTF \ -t exon \ -g gene_id \ -s 0 \ -T "${THREADS}" \ -p \ "${INPUT_BAM}" -
13
For analysis of virally mapped reads, all genes were counted.
featureCounts (Inferred with models/gemini-2.5-flash) v2.0.x (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install featureCounts (part of Subread package) # conda install -c bioconda subread # Define input and output files VIRAL_BAM="viral_mapped_reads.bam" # Placeholder for BAM file containing virally mapped reads VIRAL_GTF="viral_genome.gtf" # Placeholder for viral gene annotation file (GTF format) OUTPUT_COUNTS="viral_gene_counts.txt" # Run featureCounts to count reads per gene # -a: Annotation file (GTF/GFF) # -o: Output file for counts # -F GTF: Specify annotation file format # -t exon: Feature type to count (e.g., 'exon' for genes) # -g gene_id: Attribute to group features by (e.g., 'gene_id') # -s 0: Strandedness (0=unstranded, 1=forward, 2=reverse). Assuming unstranded as not specified. # --primary: Only count primary alignments (optional, but often good practice) featureCounts -a "${VIRAL_GTF}" \ -o "${OUTPUT_COUNTS}" \ -F GTF \ -t exon \ -g gene_id \ -s 0 \ --primary \ "${VIRAL_BAM}" -
14
Reads with tails measuring 0 lengths were removed.
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt # Define input and output file paths INPUT_READS="input_reads.fastq.gz" OUTPUT_READS="trimmed_reads.fastq.gz" ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example Illumina adapter sequence # Remove reads with tails (adapters) and discard reads that become 0 length after trimming. # cutadapt by default removes reads that become empty after trimming. # The -m 1 option explicitly ensures that reads shorter than 1 bp (i.e., 0 bp) are discarded. cutadapt -a "${ADAPTER_SEQUENCE}" -m 1 -o "${OUTPUT_READS}" "${INPUT_READS}" -
15
For genes with at least 20 mapped reads, median lengths were measured and the global distributions of these lengths were compared against each other using the Kolmogorov-Smirnov test.
$ Bash example
# This script assumes that 'condition1_median_gene_lengths.txt' and 'condition2_median_gene_lengths.txt' # have been generated by an upstream process. This upstream process would involve: # 1. Aligning reads to a reference genome. # 2. Quantifying reads per gene. # 3. Filtering genes to include only those with at least 20 mapped reads. # 4. For the remaining genes, extracting read lengths for reads mapping to them. # 5. Calculating the median read length for each gene. # 6. Storing these median lengths (one per line) in the respective input files. # Example of how to create dummy input files for demonstration: # echo -e "100\n110\n105\n120\n95" > condition1_median_gene_lengths.txt # echo -e "115\n125\n120\n130\n110" > condition2_median_gene_lengths.txt # Install R if not already available # conda install -c conda-forge r-base=4.2.0 # Create the R script for performing the Kolmogorov-Smirnov test cat <<EOF > ks_test_gene_lengths.R # ks_test_gene_lengths.R args <- commandArgs(trailingOnly = TRUE) if (length(args) != 2) { stop("Usage: Rscript ks_test_gene_lengths.R <file1> <file2>", call. = FALSE) } file1 <- args[1] file2 <- args[2] # Read data from input files # Each file is expected to contain one median gene length per line data1 <- as.numeric(readLines(file1)) data2 <- as.numeric(readLines(file2)) # Perform Kolmogorov-Smirnov test ks_result <- ks.test(data1, data2) cat("Kolmogorov-Smirnov Test Results:\n") cat("D statistic:", ks_result$statistic, "\n") cat("P-value:", ks_result$p.value, "\n") # Optional: Save results to a file # write.table(data.frame(Statistic = ks_result$statistic, P_value = ks_result$p.value), # file = "ks_test_results.txt", quote = FALSE, row.names = FALSE) # Optional: Plotting for visualization (requires a graphical environment) # png("length_distributions_ecdf.png") # plot(ecdf(data1), main="Empirical Cumulative Distribution Functions", xlab="Median Gene Length", ylab="ECDF", col="blue", lwd=2) # lines(ecdf(data2), col="red", lwd=2) # legend("bottomright", legend=c(basename(file1), basename(file2)), col=c("blue", "red"), lty=1, lwd=2) # dev.off() EOF # Execute the R script Rscript ks_test_gene_lengths.R condition1_median_gene_lengths.txt condition2_median_gene_lengths.txt -
16
For each gene captured in all samples and with at least 20 mapped reads, individual tail length distributions were compared amongst samples using the Mann Whitney U test with a P value cutoff of 0.025
$ Bash example
# Install R if not already installed # sudo apt-get update # sudo apt-get install r-base # Create a dummy input file for demonstration purposes. # In a real scenario, this file would be generated by upstream steps # and contain tail length data for each read, gene, and sample. # Ensure enough data points per gene per sample to meet the 'at least 20 mapped reads' criteria. cat <<EOF > aggregated_tail_lengths.tsv gene sample tail_length GENE1 SampleA 45 GENE1 SampleA 52 GENE1 SampleA 60 GENE1 SampleA 48 GENE1 SampleA 55 GENE1 SampleA 42 GENE1 SampleA 50 GENE1 SampleA 58 GENE1 SampleA 47 GENE1 SampleA 53 GENE1 SampleA 46 GENE1 SampleA 51 GENE1 SampleA 59 GENE1 SampleA 49 GENE1 SampleA 54 GENE1 SampleA 43 GENE1 SampleA 56 GENE1 SampleA 61 GENE1 SampleA 44 GENE1 SampleA 57 GENE1 SampleB 50 GENE1 SampleB 58 GENE1 SampleB 65 GENE1 SampleB 53 GENE1 SampleB 62 GENE1 SampleB 51 GENE1 SampleB 59 GENE1 SampleB 66 GENE1 SampleB 54 GENE1 SampleB 63 GENE1 SampleB 52 GENE1 SampleB 60 GENE1 SampleB 67 GENE1 SampleB 55 GENE1 SampleB 64 GENE1 SampleB 53 GENE1 SampleB 61 GENE1 SampleB 68 GENE1 SampleB 56 GENE1 SampleB 65 GENE2 SampleA 30 GENE2 SampleA 32 GENE2 SampleA 28 GENE2 SampleA 35 GENE2 SampleA 31 GENE2 SampleA 29 GENE2 SampleA 33 GENE2 SampleA 27 GENE2 SampleA 34 GENE2 SampleA 30 GENE2 SampleA 28 GENE2 SampleA 32 GENE2 SampleA 26 GENE2 SampleA 33 GENE2 SampleA 29 GENE2 SampleA 31 GENE2 SampleA 27 GENE2 SampleA 34 GENE2 SampleA 30 GENE2 SampleA 28 GENE2 SampleB 31 GENE2 SampleB 33 GENE2 SampleB 29 GENE2 SampleB 36 GENE2 SampleB 32 GENE2 SampleB 30 GENE2 SampleB 34 GENE2 SampleB 28 GENE2 SampleB 35 GENE2 SampleB 31 GENE2 SampleB 29 GENE2 SampleB 33 GENE2 SampleB 27 GENE2 SampleB 34 GENE2 SampleB 30 GENE2 SampleB 32 GENE2 SampleB 28 GENE2 SampleB 35 GENE2 SampleB 31 GENE2 SampleB 29 EOF # Create the R script for comparison cat <<'EOF_R_SCRIPT' > compare_tail_lengths.R # R script for comparing tail length distributions using Mann-Whitney U test # Function to perform Mann-Whitney U test for a given gene across samples compare_tail_lengths <- function(data_df, gene_id, p_cutoff = 0.025) { gene_data <- subset(data_df, gene == gene_id) if (nrow(gene_data) < 2) { return(NULL) # Not enough data for comparison } samples <- unique(gene_data$sample) if (length(samples) < 2) { return(NULL) # Need at least two samples for comparison } results <- list() # Iterate through all unique pairs of samples for (i in 1:(length(samples) - 1)) { for (j in (i + 1):length(samples)) { sample1 <- samples[i] sample2 <- samples[j] tails1 <- subset(gene_data, sample == sample1)$tail_length tails2 <- subset(gene_data, sample == sample2)$tail_length # Filter for at least 20 mapped reads (assuming 'tail_length' represents a read) if (length(tails1) >= 20 && length(tails2) >= 20) { # Mann-Whitney U test is performed using wilcox.test in R test_result <- wilcox.test(tails1, tails2, exact = FALSE) results[[paste0(sample1, "_vs_", sample2)]] <- data.frame( gene = gene_id, sample1 = sample1, sample2 = sample2, p_value = test_result$p.value, significant = test_result$p.value < p_cutoff ) } } } return(do.call(rbind, results)) } # --- Main script execution --- # Load aggregated tail length data input_file="aggregated_tail_lengths.tsv" tail_data <- read.delim(input_file, header=TRUE, sep="\t") all_genes <- unique(tail_data$gene) all_results <- list() for (gene in all_genes) { gene_results <- compare_tail_lengths(tail_data, gene, p_cutoff = 0.025) if (!is.null(gene_results)) { all_results[[gene]] <- gene_results } } final_report <- do.call(rbind, all_results) # Output results output_file="tail_length_comparison_report.tsv" write.table(final_report, file = output_file, sep = "\t", row.names = FALSE, quote = FALSE) print(paste("Comparison report written to:", output_file)) EOF_R_SCRIPT # Execute the R script Rscript compare_tail_lengths.R
Raw Source Text
RNA-seq data was aligned to the human hg19 genome build using Olego and Star aligners, and alternative splicing and alternative polyadenylation were estimated as described below. Quantas software as described in Charizanis et al, 2012, Neuron, was used to estimate alternative splicing. Olego aligned alignment files were used to count observed junction reads for each exon. Weighted number of exon or exon-junction fragments uniquely supporting the inclusion or skipping isoform of each cassette exon and a probability score was assigned to each isoform. A Fisherâs exact test was used to evaluate the statistical significance of splicing changes using both exon and exon-junction fragments, followed by Benjamini multiple testing correction to estimate the false discovery rate (FDR). In addition, inclusion or exclusion junction reads were used to calculate the proportional change of exon inclusion (dI). See documentation at http://zhanglab.c2b2.columbia.edu/index.php/Quantas_Documentation. Tandem UTR isoform analysis was performed with the MISO algorithm v0.5.2 using default settings as described in Katz et al, Nature, 2010, except for use of custom 3â²UTR isoform annotations. We used a Bayes-factor threshold of 10,000 and difference values (delta Psi) with an absolute value of at least 0.03 For TAIL-seq data analysis, image files were downloaded from the MiSeq and run on tailseeker (Chang et al, 2014, Mol Cell) to determine accurate polyA tail lengths, yielding paired fastq files corresponding to the 5' (R5) and 3' (R3 polyA tail) ends of each read. Reads were aligned against the human genome (hg19) and viral genomes (Human_Herpesvirus_5_strain_Merlin, AY446894.2) using STAR under default parameters. Features were assigned using Subread with gencode v19 annotations and with Human_Herpesvirus_5_strain_Merlin features, and filtered to obtain only the uniquely mapped protein coding genes. For analysis of virally mapped reads, all genes were counted. Reads with tails measuring 0 lengths were removed. For genes with at least 20 mapped reads, median lengths were measured and the global distributions of these lengths were compared against each other using the Kolmogorov-Smirnov test. For each gene captured in all samples and with at least 20 mapped reads, individual tail length distributions were compared amongst samples using the Mann Whitney U test with a P value cutoff of 0.025 Genome_build: hg19 Supplementary_files_format_and_content: bedgraph Supplementary_files_format_and_content: TableS3: Significantly changing alternative host polyadenylation events Supplementary_files_format_and_content: TableS6: Significant host gene alternative cassette splicing changes upon CPEB1 KD and CPEB1 OE from RNA-seq Supplementary_files_format_and_content: TableS8: PolyA tail lengths analyzed by TAIL-seq