GSE86791 Processing Pipeline

RNA-Seq code_examples 16 steps

Publication

RNA-binding protein CPEB1 remodels host and viral RNA landscapes.

Nature structural & molecular biology (2016) — PMID 27775709

Dataset

GSE86791

RNA binding protein CPEB1 remodels host and viral RNA landscapes [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    RNA-seq data was aligned to the human hg19 genome build using Olego and Star aligners, and alternative splicing and alternative polyadenylation were estimated as described below.

    RNA-seq v2.7.10a (Inferred with models/gemini-2.5-flash), 1.0.6 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # --- Reference Genome Preparation ---
    # Create a directory for reference files
    mkdir -p reference/hg19
    
    # Download human hg19 genome FASTA from UCSC
    # wget -P reference/hg19 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    # gunzip -f reference/hg19/hg19.fa.gz
    HG19_FASTA="reference/hg19/hg19.fa"
    
    # Download human GRCh37.75 GTF (equivalent to hg19) from Ensembl
    # wget -P reference/hg19 ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
    # gunzip -f reference/hg19/Homo_sapiens.GRCh37.75.gtf.gz
    HG19_GTF="reference/hg19/Homo_sapiens.GRCh37.75.gtf"
    
    # --- STAR Aligner ---
    # Installation (example using Bioconda)
    # conda install -c bioconda star=2.7.10a
    
    # Create STAR genome index
    STAR_INDEX_DIR="star_index_hg19"
    mkdir -p "${STAR_INDEX_DIR}"
    STAR --runMode genomeGenerate \
         --genomeDir "${STAR_INDEX_DIR}" \
         --genomeFastaFiles "${HG19_FASTA}" \
         --sjdbGTFfile "${HG19_GTF}" \
         --runThreadN 8 # Adjust thread count as needed
    
    # Align RNA-seq reads using STAR
    # Assuming paired-end reads: sample_R1.fastq.gz and sample_R2.fastq.gz
    # Replace with actual input file paths
    INPUT_READ1="sample_R1.fastq.gz"
    INPUT_READ2="sample_R2.fastq.gz"
    OUTPUT_PREFIX="star_aligned_sample"
    
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${INPUT_READ1}" "${INPUT_READ2}" \
         --readFilesCommand zcat \
         --runThreadN 8 \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outSAMattributes Standard \
         --quantMode GeneCounts \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --alignSJDBoverhangMin 1 \
         --alignSJoverhangMin 8 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000
    
    # --- Olego Aligner ---
    # Installation (example using Bioconda)
    # conda install -c bioconda olego=1.0.6
    
    # Create Olego genome index
    OLEGO_INDEX_DIR="olego_index_hg19"
    mkdir -p "${OLEGO_INDEX_DIR}"
    olego index -f "${HG19_FASTA}" -g "${HG19_GTF}" -o "${OLEGO_INDEX_DIR}"
    
    # Align RNA-seq reads using Olego
    # Assuming paired-end reads: sample_R1.fastq.gz and sample_R2.fastq.gz
    # Replace with actual input file paths
    OLEGO_OUTPUT_BAM="olego_aligned_sample.bam"
    
    olego align -i "${OLEGO_INDEX_DIR}" \
                -1 "${INPUT_READ1}" \
                -2 "${INPUT_READ2}" \
                -o "${OLEGO_OUTPUT_BAM}" \
                -t 8 # Adjust thread count as needed
  2. 2

    Quantas software as described in Charizanis et al, 2012, Neuron, was used to estimate alternative splicing.

    Quantas software v2012 GitHub
    $ Bash example
    # Quantas software, as described in Charizanis et al, 2012, Neuron, is a custom tool
    # developed for estimating alternative splicing from RNA-seq data.
    # Specific installation instructions or a public repository for Quantas are not readily available.
    # The following command is a generic placeholder based on typical alternative splicing quantification workflows.
    
    # Define placeholder paths for input and reference files
    # Replace with actual paths to your aligned RNA-seq BAM file and genome annotation GTF.
    INPUT_BAM="path/to/your_aligned_reads.bam"
    GENOME_GTF="path/to/human_hg38.gtf" # Example: GRCh38/hg38 genome annotation
    
    # Placeholder command for Quantas execution.
    # The actual command would depend on the specific implementation of the Quantas script.
    # It likely takes aligned reads (BAM) and a genome annotation (GTF) as input.
    # The output would typically be a table of splicing event quantification.
    python /path/to/quantas_script.py \
        --input "$INPUT_BAM" \
        --gtf "$GENOME_GTF" \
        --output "quantas_alternative_splicing_results.tsv" \
        --log "quantas.log"
  3. 3

    Olego aligned alignment files were used to count observed junction reads for each exon.

    rMATS (Inferred with models/gemini-2.5-flash) v4.0.2 GitHub
    $ Bash example
    # Install rMATS (if not already installed)
    # conda install -c bioconda rmats-turbo
    
    # Define input and output paths
    INPUT_BAM="olego_aligned.bam" # Assuming a single Olego aligned BAM file
    GTF_FILE="gencode.v38.annotation.gtf" # Placeholder for human annotation (e.g., from GENCODE)
    READ_LENGTH=100 # Inferring a common read length, adjust if known
    OUTPUT_DIR="rmats_junction_counts"
    TMP_DIR="rmats_tmp"
    NUM_THREADS=8
    
    # Create output and temporary directories
    mkdir -p "${OUTPUT_DIR}"
    mkdir -p "${TMP_DIR}"
    
    # Run rMATS to count junction reads for exons.
    # rMATS is primarily for differential splicing, but its output includes raw counts
    # for junction reads supporting various splicing events (e.g., exon inclusion/exclusion).
    # For a single sample, we can provide the same BAM file for both b1 and b2 to generate these counts.
    rmats.py \
        --b1 "${INPUT_BAM}" \
        --b2 "${INPUT_BAM}" \
        --gtf "${GTF_FILE}" \
        --readLength "${READ_LENGTH}" \
        --tmp "${TMP_DIR}" \
        --nthread "${NUM_THREADS}" \
        --od "${OUTPUT_DIR}"
  4. 4

    Weighted number of exon or exon-junction fragments uniquely supporting the inclusion or skipping isoform of each cassette exon and a probability score was assigned to each isoform.

    Skipper (Inferred with models/gemini-2.5-flash) v0.1.0 GitHub
    $ Bash example
    # Install Skipper (if not already installed)
    # pip install skipper
    
    # Define input files and reference data
    INPUT_BAM="aligned_reads.bam"
    ANNOTATION_GTF="gencode.v44.annotation.gtf" # Placeholder for latest human GTF (e.g., GRCh38)
    GENOME_FASTA="GRCh38.primary_assembly.genome.fa" # Placeholder for latest human genome FASTA
    OUTPUT_DIR="skipper_quantification_output"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run Skipper to quantify alternative splicing events
    # The description implies quantification of cassette exons (SE events) and assigning probability scores.
    # Skipper's 'quantify' command handles this by default for various event types, including SE.
    skipper quantify \
        "${INPUT_BAM}" \
        "${ANNOTATION_GTF}" \
        --genome "${GENOME_FASTA}" \
        --output "${OUTPUT_DIR}"
  5. 5

    A Fisher’s exact test was used to evaluate the statistical significance of splicing changes using both exon and exon-junction fragments, followed by Benjamini multiple testing correction to estimate the false discovery rate (FDR).

    rMATS (Inferred with models/gemini-2.5-flash) v4.1.2 GitHub
    $ Bash example
    # Install rMATS via conda
    # conda create -n rmats_env python=3.8
    # conda activate rmats_env
    # conda install -c bioconda rmats=4.1.2
    
    # Define input and output paths
    # Placeholder: Replace with actual comma-separated BAM files for control replicates
    CONTROL_BAMS="control_rep1.bam,control_rep2.bam"
    # Placeholder: Replace with actual comma-separated BAM files for treatment replicates
    TREATMENT_BAMS="treatment_rep1.bam,treatment_rep2.bam"
    # Placeholder: Use a relevant genome annotation GTF file (e.g., from GENCODE for GRCh38)
    GTF_FILE="gencode.v38.annotation.gtf"
    OUTPUT_DIR="rmats_output"
    TMP_DIR="rmats_tmp"
    READ_LENGTH=75 # Placeholder: Adjust based on the actual read length of your sequencing data
    NUM_THREADS=8 # Placeholder: Adjust based on available CPU cores
    
    # Create output and temporary directories if they don't exist
    mkdir -p "${OUTPUT_DIR}"
    mkdir -p "${TMP_DIR}"
    
    # Execute rMATS for differential splicing analysis
    # -t paired: Specifies paired-end reads. Use -t single for single-end reads.
    # --readLength: Required parameter specifying the length of the sequencing reads.
    # --nthread: Specifies the number of threads to use for parallel processing.
    # rMATS automatically performs Benjamini-Hochberg FDR correction as part of its statistical analysis.
    rmats.py \
        --b1 "${CONTROL_BAMS}" \
        --b2 "${TREATMENT_BAMS}" \
        --gtf "${GTF_FILE}" \
        --od "${OUTPUT_DIR}" \
        --tmp "${TMP_DIR}" \
        -t paired \
        --readLength "${READ_LENGTH}" \
        --nthread "${NUM_THREADS}"
  6. 6

    In addition, inclusion or exclusion junction reads were used to calculate the proportional change of exon inclusion (dI).

    rMATS (Inferred with models/gemini-2.5-flash) v4.1.2 GitHub
    $ Bash example
    # Install rMATS (example using conda)
    # conda create -n rmats_env python=3.8
    # conda activate rmats_env
    # conda install -c bioconda rmats-turbo
    
    # Example rMATS command for differential splicing analysis.
    # This assumes you have two groups of replicates (e.g., control and treatment)
    # and their aligned BAM files. You will need to create text files listing the BAM files for each group.
    
    # Placeholder: Create file_b1.txt (list of BAM files for group 1, e.g., control)
    # Each BAM file path should be separated by a comma if multiple replicates, or a single path.
    # Example: echo "/path/to/control_rep1.bam,/path/to/control_rep2.bam" > file_b1.txt
    
    # Placeholder: Create file_b2.txt (list of BAM files for group 2, e.g., treatment)
    # Example: echo "/path/to/treatment_rep1.bam,/path/to/treatment_rep2.bam" > file_b2.txt
    
    # Reference genome annotation (GTF) - Using GENCODE human hg38 as a placeholder.
    # Replace with the actual path to your GTF file.
    GTF_FILE="/path/to/gencode.v38.annotation.gtf"
    
    # Output directory for rMATS results
    OUTPUT_DIR="rmats_differential_splicing_output"
    
    # Create output directory if it doesn't exist
    mkdir -p ${OUTPUT_DIR}
    
    # Run rMATS to calculate proportional change of exon inclusion (dI/dPSI)
    # --b1 and --b2 specify text files listing BAM files for two groups.
    # --gtf provides the gene annotation file.
    # --readLength and --variableReadLength handle read length variations.
    # --nthread sets the number of CPU threads.
    # --tmp specifies a temporary directory.
    # --od sets the output directory.
    # --task as specifies alternative splicing analysis.
    # --statoff disables statistical analysis if only raw counts are needed, but usually kept on for dI.
    # --tstat 10 sets the minimum total number of reads for a splicing event to be considered.
    # --libType specifies the library type (e.g., fr-firststrand for Illumina TruSeq stranded).
    rmats.py \
        --b1 file_b1.txt \
        --b2 file_b2.txt \
        --gtf ${GTF_FILE} \
        --readLength 50 \
        --variableReadLength \
        --nthread 8 \
        --tmp tmp_rmats_dir \
        --od ${OUTPUT_DIR} \
        --task as \
        --statoff \
        --tstat 10 \
        --libType fr-firststrand # Adjust based on library prep (e.g., fr-unstranded, fr-secondstrand)
    
  7. 7

    See documentation at http://zhanglab.c2b2.columbia.edu/index.php/Quantas_Documentation.

    Quantas (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Quantas (example, adjust as needed based on your environment)
    # conda install -c bioconda quantas
    
    # Placeholder for reference genome annotation (e.g., GTF file for the latest assembly)
    # Replace 'path/to/latest_assembly.gtf' with the actual path to your GTF file.
    # Source for latest assembly GTF: Ensembl, GENCODE, UCSC, etc.
    GENOME_ANNOTATION="path/to/latest_assembly.gtf"
    
    # Placeholder for input alignment file (e.g., BAM file generated by STAR or HISAT2)
    # Replace 'path/to/input_alignments.bam' with your actual input BAM file.
    INPUT_BAM="path/to/input_alignments.bam"
    
    # Placeholder for output directory where Quantas results will be stored
    OUTPUT_DIR="quantas_output"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Example Quantas command for quantifying alternative splicing events.
    # The exact parameters will depend on the specific analysis goals and Quantas version.
    # Consult the official Quantas documentation for detailed usage and available options.
    quantas \
        --input "${INPUT_BAM}" \
        --gtf "${GENOME_ANNOTATION}" \
        --output_dir "${OUTPUT_DIR}" \
        --threads 8 # Example: Use 8 threads for parallel processing
    
  8. 8

    Tandem UTR isoform analysis was performed with the MISO algorithm v0.5.2 using default settings as described in Katz et al, Nature, 2010, except for use of custom 3′UTR isoform annotations.

    MISO v0.5.2 GitHub
    $ Bash example
    # Install MISO v0.5.2 (Note: MISO v0.5.2 typically requires Python 2.7)
    # conda create -n miso_env python=2.7
    # conda activate miso_env
    # pip install MISO==0.5.2
    
    # Placeholder for input BAM file and output directory
    INPUT_BAM="input.bam" # Replace with your actual input BAM file
    OUTPUT_DIR="miso_utr_analysis_output" # Directory for MISO results
    CUSTOM_UTR_GFF="custom_3UTR_annotations.gff" # Replace with your custom 3'UTR isoform annotations GFF file
    
    # Create an index for the custom 3'UTR isoform annotations
    MISO_INDEX_DIR="miso_utr_index"
    miso index "${CUSTOM_UTR_GFF}" "${MISO_INDEX_DIR}"
    
    # Perform tandem UTR isoform analysis with MISO using default settings
    miso --run "${MISO_INDEX_DIR}" "${INPUT_BAM}" --output-dir "${OUTPUT_DIR}"
    
  9. 9

    We used a Bayes-factor threshold of 10,000 and difference values (delta Psi) with an absolute value of at least 0.03

    MISO (Inferred with models/gemini-2.5-flash) vNot specified GitHub
    $ Bash example
    # Install MISO (if not already installed)
    # pip install miso
    
    # Example usage of compare_miso for differential splicing analysis.
    # This assumes you have MISO output directories (containing .miso files) for two conditions/samples
    # and a GFF annotation file that defines the alternative splicing events.
    
    # Define input MISO summary directories for two conditions/samples
    # Replace with actual paths to your MISO output directories (e.g., from a previous run of MISO quantification)
    MISO_DIR_CONDITION1="path/to/miso_output_condition1"
    MISO_DIR_CONDITION2="path/to/miso_output_condition2"
    
    # Define the GFF annotation file used for MISO quantification.
    # This GFF should contain the alternative splicing events (e.g., from MISO's index_gff command or a pre-built annotation).
    # Replace with the actual path to your GFF file (e.g., human_events.gff3).
    SPLICING_EVENTS_GFF="path/to/splicing_events.gff3"
    
    # Define output directory for differential splicing results
    OUTPUT_DIR="differential_splicing_results"
    mkdir -p "${OUTPUT_DIR}"
    
    # Run compare_miso to identify differential splicing events.
    # --bayes-factor-threshold: Minimum Bayes factor for an event to be considered significant.
    # --delta-psi-threshold: Minimum absolute delta Psi for an event to be considered significant.
    compare_miso \
        --compare-samples "${MISO_DIR_CONDITION1}" "${MISO_DIR_CONDITION2}" \
        --events-gff "${SPLICING_EVENTS_GFF}" \
        --output-dir "${OUTPUT_DIR}" \
        --bayes-factor-threshold 10000 \
        --delta-psi-threshold 0.03
  10. 10

    For TAIL-seq data analysis, image files were downloaded from the MiSeq and run on tailseeker (Chang et al, 2014, Mol Cell) to determine accurate polyA tail lengths, yielding paired fastq files corresponding to the 5' (R5) and 3' (R3 polyA tail) ends of each read.

    tailseeker v2014 GitHub
    $ Bash example
    # Installation: As 'tailseeker' is likely a custom script or pipeline based on Chang et al., 2014,
    # a direct installation command like 'conda install tailseeker' is unlikely to exist.
    # It would involve setting up a custom environment and potentially cloning specific scripts.
    # For demonstration, we assume a hypothetical executable 'tailseeker' is available in the PATH.
    # Example (if it were a Python script):
    # git clone https://github.com/some_repo/tailseeker.git
    # cd tailseeker
    # pip install -r requirements.txt
    # export PATH=$PWD:$PATH
    
    # Define input MiSeq run directory and desired output prefix
    MISEQ_RUN_DIR="/path/to/your/miseq_run_directory" # e.g., /data/MiSeq_Run_12345
    OUTPUT_PREFIX="sample_id" # e.g., my_experiment_sample
    
    # Execute tailseeker to process raw MiSeq data (from image files/BCLs)
    # and generate paired fastq files for 5' (R5) and 3' (R3 polyA tail) ends.
    # The exact parameters are inferred based on the description of input and output.
    tailseeker --input-miseq-dir "${MISEQ_RUN_DIR}" \
               --output-r5-fastq "${OUTPUT_PREFIX}_R5.fastq.gz" \
               --output-r3-fastq "${OUTPUT_PREFIX}_R3.fastq.gz" \
               --log-file "${OUTPUT_PREFIX}_tailseeker.log"
  11. 11

    Reads were aligned against the human genome (hg19) and viral genomes (Human_Herpesvirus_5_strain_Merlin, AY446894.2) using STAR under default parameters.

    $ Bash example
    # conda install -c bioconda star
    
    # --- Reference Data Preparation ---
    # Create directories for reference genomes and STAR index
    mkdir -p references/hg19 references/viral star_index/hg19_viral
    
    # Download hg19 primary assembly FASTA from UCSC
    wget -P references/hg19/ https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    gunzip references/hg19/hg19.fa.gz
    
    # Download hg19 GTF (e.g., NCBI RefSeq from UCSC) for splice junction annotation
    wget -P references/hg19/ https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz
    gunzip references/hg19/hg19.ncbiRefSeq.gtf.gz
    
    # Download Human_Herpesvirus_5_strain_Merlin (AY446894.2) FASTA from NCBI
    wget -O references/viral/AY446894.2.fasta "https://www.ncbi.nlm.nih.gov/nuccore/AY446894.2?report=fasta&format=text"
    
    # Build STAR index for human (hg19) and viral genomes
    # This step combines the human FASTA and GTF with the viral FASTA into a single index.
    STAR --runMode genomeGenerate \
         --genomeDir star_index/hg19_viral \
         --genomeFastaFiles references/hg19/hg19.fa references/viral/AY446894.2.fasta \
         --sjdbGTFfile references/hg19/hg19.ncbiRefSeq.gtf \
         --runThreadN 8 # Adjust threads as needed for index generation
    
    # --- STAR Alignment ---
    # Placeholder for input FASTQ files (assuming paired-end and gzipped)
    # Replace read1.fastq.gz and read2.fastq.gz with actual file paths
    # Replace sample_name with an appropriate prefix for output files
    STAR --genomeDir star_index/hg19_viral \
         --readFilesIn read1.fastq.gz read2.fastq.gz \
         --readFilesCommand zcat \
         --runThreadN 8 \
         --outFileNamePrefix sample_name. \
         --outSAMtype BAM SortedByCoordinate
  12. 12

    Features were assigned using Subread with gencode v19 annotations and with Human_Herpesvirus_5_strain_Merlin features, and filtered to obtain only the uniquely mapped protein coding genes.

    featureCounts v2.0.3 GitHub
    $ Bash example
    # Define reference paths and input/output files
    GENCODE_GTF="path/to/gencode.v19.annotation.gtf" # Source: https://www.gencodegenes.org/human/release_19.html
    HHV5_GTF="path/to/Human_Herpesvirus_5_strain_Merlin.gtf" # Source: NCBI, UCSC, or specific viral genome databases
    INPUT_BAM="aligned_reads.bam" # Placeholder for input BAM file
    OUTPUT_COUNTS="gene_counts.txt"
    THREADS=8 # Number of threads to use
    
    # --- Installation (commented out) ---
    # It's recommended to install featureCounts (part of the Subread package)
    # via a package manager like Conda:
    # conda install -c bioconda subread
    
    # --- Prepare combined and filtered annotation GTF ---
    # 1. Filter Gencode v19 GTF for protein_coding genes.
    #    Gencode GTF files typically have 'gene_type "protein_coding";' in the attributes column.
    #    We also keep header lines starting with '#'.
    grep -E '^#|gene_type "protein_coding";' "${GENCODE_GTF}" > gencode_v19_protein_coding.gtf
    
    # 2. Combine the filtered Gencode GTF with the Human Herpesvirus 5 GTF.
    #    This creates a single annotation file for featureCounts.
    cat gencode_v19_protein_coding.gtf "${HHV5_GTF}" > combined_filtered_annotation.gtf
    
    # --- Run featureCounts ---
    # -a: Annotation file in GTF/GFF format.
    # -o: Output file for read counts.
    # -F GTF: Specify that the annotation file is in GTF format.
    # -t exon: Specify 'exon' as the feature type to count reads against.
    # -g gene_id: Specify 'gene_id' as the attribute to group features into genes.
    # -s 0: Specify unstranded library preparation (0=unstranded, 1=forward, 2=reverse).
    #       Adjust if your library is stranded.
    # -T: Number of threads to use for parallel processing.
    # -p: Input reads are paired-end (remove if single-end).
    # By default, featureCounts only counts uniquely mapped reads,
    # which aligns with "uniquely mapped protein coding genes".
    featureCounts -a combined_filtered_annotation.gtf \
                  -o "${OUTPUT_COUNTS}" \
                  -F GTF \
                  -t exon \
                  -g gene_id \
                  -s 0 \
                  -T "${THREADS}" \
                  -p \
                  "${INPUT_BAM}"
  13. 13

    For analysis of virally mapped reads, all genes were counted.

    featureCounts (Inferred with models/gemini-2.5-flash) v2.0.x (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install featureCounts (part of Subread package)
    # conda install -c bioconda subread
    
    # Define input and output files
    VIRAL_BAM="viral_mapped_reads.bam" # Placeholder for BAM file containing virally mapped reads
    VIRAL_GTF="viral_genome.gtf"     # Placeholder for viral gene annotation file (GTF format)
    OUTPUT_COUNTS="viral_gene_counts.txt"
    
    # Run featureCounts to count reads per gene
    # -a: Annotation file (GTF/GFF)
    # -o: Output file for counts
    # -F GTF: Specify annotation file format
    # -t exon: Feature type to count (e.g., 'exon' for genes)
    # -g gene_id: Attribute to group features by (e.g., 'gene_id')
    # -s 0: Strandedness (0=unstranded, 1=forward, 2=reverse). Assuming unstranded as not specified.
    # --primary: Only count primary alignments (optional, but often good practice)
    featureCounts -a "${VIRAL_GTF}" \
                  -o "${OUTPUT_COUNTS}" \
                  -F GTF \
                  -t exon \
                  -g gene_id \
                  -s 0 \
                  --primary \
                  "${VIRAL_BAM}"
    
  14. 14

    Reads with tails measuring 0 lengths were removed.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.x GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt
    
    # Define input and output file paths
    INPUT_READS="input_reads.fastq.gz"
    OUTPUT_READS="trimmed_reads.fastq.gz"
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example Illumina adapter sequence
    
    # Remove reads with tails (adapters) and discard reads that become 0 length after trimming.
    # cutadapt by default removes reads that become empty after trimming.
    # The -m 1 option explicitly ensures that reads shorter than 1 bp (i.e., 0 bp) are discarded.
    cutadapt -a "${ADAPTER_SEQUENCE}" -m 1 -o "${OUTPUT_READS}" "${INPUT_READS}"
  15. 15

    For genes with at least 20 mapped reads, median lengths were measured and the global distributions of these lengths were compared against each other using the Kolmogorov-Smirnov test.

    R (Inferred with models/gemini-2.5-flash) v4.2.0 GitHub
    $ Bash example
    # This script assumes that 'condition1_median_gene_lengths.txt' and 'condition2_median_gene_lengths.txt'
    # have been generated by an upstream process. This upstream process would involve:
    # 1. Aligning reads to a reference genome.
    # 2. Quantifying reads per gene.
    # 3. Filtering genes to include only those with at least 20 mapped reads.
    # 4. For the remaining genes, extracting read lengths for reads mapping to them.
    # 5. Calculating the median read length for each gene.
    # 6. Storing these median lengths (one per line) in the respective input files.
    
    # Example of how to create dummy input files for demonstration:
    # echo -e "100\n110\n105\n120\n95" > condition1_median_gene_lengths.txt
    # echo -e "115\n125\n120\n130\n110" > condition2_median_gene_lengths.txt
    
    # Install R if not already available
    # conda install -c conda-forge r-base=4.2.0
    
    # Create the R script for performing the Kolmogorov-Smirnov test
    cat <<EOF > ks_test_gene_lengths.R
    # ks_test_gene_lengths.R
    args <- commandArgs(trailingOnly = TRUE)
    if (length(args) != 2) {
      stop("Usage: Rscript ks_test_gene_lengths.R <file1> <file2>", call. = FALSE)
    }
    
    file1 <- args[1]
    file2 <- args[2]
    
    # Read data from input files
    # Each file is expected to contain one median gene length per line
    data1 <- as.numeric(readLines(file1))
    data2 <- as.numeric(readLines(file2))
    
    # Perform Kolmogorov-Smirnov test
    ks_result <- ks.test(data1, data2)
    
    cat("Kolmogorov-Smirnov Test Results:\n")
    cat("D statistic:", ks_result$statistic, "\n")
    cat("P-value:", ks_result$p.value, "\n")
    
    # Optional: Save results to a file
    # write.table(data.frame(Statistic = ks_result$statistic, P_value = ks_result$p.value),
    #             file = "ks_test_results.txt", quote = FALSE, row.names = FALSE)
    
    # Optional: Plotting for visualization (requires a graphical environment)
    # png("length_distributions_ecdf.png")
    # plot(ecdf(data1), main="Empirical Cumulative Distribution Functions", xlab="Median Gene Length", ylab="ECDF", col="blue", lwd=2)
    # lines(ecdf(data2), col="red", lwd=2)
    # legend("bottomright", legend=c(basename(file1), basename(file2)), col=c("blue", "red"), lty=1, lwd=2)
    # dev.off()
    EOF
    
    # Execute the R script
    Rscript ks_test_gene_lengths.R condition1_median_gene_lengths.txt condition2_median_gene_lengths.txt
  16. 16

    For each gene captured in all samples and with at least 20 mapped reads, individual tail length distributions were compared amongst samples using the Mann Whitney U test with a P value cutoff of 0.025

    R (Inferred with models/gemini-2.5-flash) v4.1.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R if not already installed
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Create a dummy input file for demonstration purposes.
    # In a real scenario, this file would be generated by upstream steps
    # and contain tail length data for each read, gene, and sample.
    # Ensure enough data points per gene per sample to meet the 'at least 20 mapped reads' criteria.
    cat <<EOF > aggregated_tail_lengths.tsv
    gene	sample	tail_length
    GENE1	SampleA	45
    GENE1	SampleA	52
    GENE1	SampleA	60
    GENE1	SampleA	48
    GENE1	SampleA	55
    GENE1	SampleA	42
    GENE1	SampleA	50
    GENE1	SampleA	58
    GENE1	SampleA	47
    GENE1	SampleA	53
    GENE1	SampleA	46
    GENE1	SampleA	51
    GENE1	SampleA	59
    GENE1	SampleA	49
    GENE1	SampleA	54
    GENE1	SampleA	43
    GENE1	SampleA	56
    GENE1	SampleA	61
    GENE1	SampleA	44
    GENE1	SampleA	57
    GENE1	SampleB	50
    GENE1	SampleB	58
    GENE1	SampleB	65
    GENE1	SampleB	53
    GENE1	SampleB	62
    GENE1	SampleB	51
    GENE1	SampleB	59
    GENE1	SampleB	66
    GENE1	SampleB	54
    GENE1	SampleB	63
    GENE1	SampleB	52
    GENE1	SampleB	60
    GENE1	SampleB	67
    GENE1	SampleB	55
    GENE1	SampleB	64
    GENE1	SampleB	53
    GENE1	SampleB	61
    GENE1	SampleB	68
    GENE1	SampleB	56
    GENE1	SampleB	65
    GENE2	SampleA	30
    GENE2	SampleA	32
    GENE2	SampleA	28
    GENE2	SampleA	35
    GENE2	SampleA	31
    GENE2	SampleA	29
    GENE2	SampleA	33
    GENE2	SampleA	27
    GENE2	SampleA	34
    GENE2	SampleA	30
    GENE2	SampleA	28
    GENE2	SampleA	32
    GENE2	SampleA	26
    GENE2	SampleA	33
    GENE2	SampleA	29
    GENE2	SampleA	31
    GENE2	SampleA	27
    GENE2	SampleA	34
    GENE2	SampleA	30
    GENE2	SampleA	28
    GENE2	SampleB	31
    GENE2	SampleB	33
    GENE2	SampleB	29
    GENE2	SampleB	36
    GENE2	SampleB	32
    GENE2	SampleB	30
    GENE2	SampleB	34
    GENE2	SampleB	28
    GENE2	SampleB	35
    GENE2	SampleB	31
    GENE2	SampleB	29
    GENE2	SampleB	33
    GENE2	SampleB	27
    GENE2	SampleB	34
    GENE2	SampleB	30
    GENE2	SampleB	32
    GENE2	SampleB	28
    GENE2	SampleB	35
    GENE2	SampleB	31
    GENE2	SampleB	29
    EOF
    
    # Create the R script for comparison
    cat <<'EOF_R_SCRIPT' > compare_tail_lengths.R
    # R script for comparing tail length distributions using Mann-Whitney U test
    
    # Function to perform Mann-Whitney U test for a given gene across samples
    compare_tail_lengths <- function(data_df, gene_id, p_cutoff = 0.025) {
      gene_data <- subset(data_df, gene == gene_id)
      
      if (nrow(gene_data) < 2) {
        return(NULL) # Not enough data for comparison
      }
      
      samples <- unique(gene_data$sample)
      if (length(samples) < 2) {
        return(NULL) # Need at least two samples for comparison
      }
      
      results <- list()
      
      # Iterate through all unique pairs of samples
      for (i in 1:(length(samples) - 1)) {
        for (j in (i + 1):length(samples)) {
          sample1 <- samples[i]
          sample2 <- samples[j]
          
          tails1 <- subset(gene_data, sample == sample1)$tail_length
          tails2 <- subset(gene_data, sample == sample2)$tail_length
          
          # Filter for at least 20 mapped reads (assuming 'tail_length' represents a read)
          if (length(tails1) >= 20 && length(tails2) >= 20) {
            # Mann-Whitney U test is performed using wilcox.test in R
            test_result <- wilcox.test(tails1, tails2, exact = FALSE) 
            
            results[[paste0(sample1, "_vs_", sample2)]] <- data.frame(
              gene = gene_id,
              sample1 = sample1,
              sample2 = sample2,
              p_value = test_result$p.value,
              significant = test_result$p.value < p_cutoff
            )
          }
        }
      }
      return(do.call(rbind, results))
    }
    
    # --- Main script execution ---
    
    # Load aggregated tail length data
    input_file="aggregated_tail_lengths.tsv"
    tail_data <- read.delim(input_file, header=TRUE, sep="\t")
    
    all_genes <- unique(tail_data$gene)
    all_results <- list()
    
    for (gene in all_genes) {
      gene_results <- compare_tail_lengths(tail_data, gene, p_cutoff = 0.025)
      if (!is.null(gene_results)) {
        all_results[[gene]] <- gene_results
      }
    }
    
    final_report <- do.call(rbind, all_results)
    
    # Output results
    output_file="tail_length_comparison_report.tsv"
    write.table(final_report, file = output_file, sep = "\t", row.names = FALSE, quote = FALSE)
    
    print(paste("Comparison report written to:", output_file))
    EOF_R_SCRIPT
    
    # Execute the R script
    Rscript compare_tail_lengths.R

Tools Used

Raw Source Text
RNA-seq data was aligned to the human hg19 genome build using Olego and Star aligners, and alternative splicing and alternative polyadenylation were estimated as described below.
Quantas software as described in Charizanis et al, 2012, Neuron, was used to estimate alternative splicing. Olego aligned alignment files were used to count observed junction reads for each exon. Weighted number of exon or exon-junction fragments uniquely supporting the inclusion or skipping isoform of each cassette exon and a probability score was assigned to each isoform. A Fisher’s exact test was used to evaluate the statistical significance of splicing changes using both exon and exon-junction fragments, followed by Benjamini multiple testing correction to estimate the false discovery rate (FDR). In addition, inclusion or exclusion junction reads were used to calculate the proportional change of exon inclusion (dI). See documentation at http://zhanglab.c2b2.columbia.edu/index.php/Quantas_Documentation.
Tandem UTR isoform analysis was performed with the MISO algorithm v0.5.2 using default settings as described in Katz et al, Nature, 2010, except for use of custom 3′UTR isoform annotations. We used a Bayes-factor threshold of 10,000 and difference values (delta Psi) with an absolute value of at least 0.03
For TAIL-seq data analysis, image files were downloaded from the MiSeq and run on tailseeker (Chang et al, 2014, Mol Cell) to determine accurate polyA tail lengths, yielding paired fastq files corresponding to the 5' (R5) and 3' (R3 polyA tail) ends of each read. Reads were aligned against the human genome (hg19) and viral genomes (Human_Herpesvirus_5_strain_Merlin, AY446894.2) using STAR under default parameters. Features were assigned using Subread with gencode v19 annotations and with Human_Herpesvirus_5_strain_Merlin features, and filtered to obtain only the uniquely mapped protein coding genes. For analysis of virally mapped reads, all genes were counted. Reads with tails measuring 0 lengths were removed. For genes with at least 20 mapped reads, median lengths were measured and the global distributions of these lengths were compared against each other using the Kolmogorov-Smirnov test. For each gene captured in all samples and with at least 20 mapped reads, individual tail length distributions were compared amongst samples using the Mann Whitney U test with a P value cutoff of 0.025
Genome_build: hg19
Supplementary_files_format_and_content: bedgraph
Supplementary_files_format_and_content: TableS3: Significantly changing alternative host polyadenylation events
Supplementary_files_format_and_content: TableS6: Significant host gene alternative cassette splicing changes upon CPEB1 KD and CPEB1 OE from RNA-seq
Supplementary_files_format_and_content: TableS8: PolyA tail lengths analyzed by TAIL-seq
← Back to Analysis