GSE123613 Processing Pipeline

OTHER code_examples 9 steps

Publication

Pseudouridine synthases modify human pre-mRNA co-transcriptionally and affect pre-mRNA processing.

Molecular cell (2022) — PMID 35051350

Dataset

GSE123613

Pseudouridine synthases modify human pre-mRNA co-transcriptionally and affect splicingÂ

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: Pseduo-seq

    N/A (Library strategy, not a computational tool) (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # The "Library strategy: Pseudo-seq" describes a wet-lab protocol for RNA modification quantification,
    # not a computational analysis step. Therefore, there is no direct bash command associated with
    # "performing" Pseudo-seq computationally.
    # Subsequent computational steps would involve processing sequencing data generated from Pseudo-seq libraries,
    # such as quality control, alignment, and quantification of modifications.
  2. 2

    Reverse reads were trimmed of 5′ amplicon sequence using cutadapt (-G)

    cutadapt v1.18 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=1.18
    
    # Trim 5' amplicon sequence from reverse reads (R2)
    # The amplicon sequence 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA' is a common 5' adapter for reverse reads in eCLIP.
    # -G: Trim 5' adapter from the reverse read.
    # -e 0.1: Maximum 10% error rate.
    # -m 15: Discard reads shorter than 15 bp after trimming.
    cutadapt -G "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
             -e 0.1 \
             -m 15 \
             -o output_R2_trimmed.fastq.gz \
             input_R2.fastq.gz
  3. 3

    Trimmed paired end reads collapsed using PEAR (default settings)

    PEAR v0.9.11 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install PEAR (Paired-End reAd merger) using Bioconda
    # conda install -c bioconda pear
    
    # Define input and output file paths
    # Replace 'input_R1.fastq.gz' and 'input_R2.fastq.gz' with your actual trimmed paired-end read files.
    # Replace 'output_prefix' with your desired output file prefix.
    # PEAR will generate multiple output files: output_prefix.assembled.fastq, output_prefix.unassembled.forward.fastq, etc.
    INPUT_FORWARD_READS="input_R1.fastq.gz"
    INPUT_REVERSE_READS="input_R2.fastq.gz"
    OUTPUT_PREFIX="collapsed_reads"
    
    # Collapse trimmed paired-end reads using PEAR with default settings
    # -f: Specify the input forward reads file
    # -r: Specify the input reverse reads file
    # -o: Specify the output file prefix
    pear -f "${INPUT_FORWARD_READS}" -r "${INPUT_REVERSE_READS}" -o "${OUTPUT_PREFIX}"
  4. 4

    PCR duplicates collapsed for in vitro Pseudo-seq libraries using fastx_collapser

    fastx_collapser vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install fastx_toolkit (which includes fastx_collapser)
    # conda install -c bioconda fastx_toolkit
    
    # Collapse PCR duplicates for in vitro Pseudo-seq libraries
    # Replace 'pseudo_seq_reads.fastq' with your actual input file
    # Replace 'pseudo_seq_reads_collapsed.fastq' with your desired output file
    fastx_collapser -i pseudo_seq_reads.fastq -o pseudo_seq_reads_collapsed.fastq
  5. 5

    Reads were mapped to bowtie indices of the pool or GRCh37 for chromatin-associated RNA Pseudo-seq with tophat2.

    Bowtie v2.1.1 GitHub
    $ Bash example
    # Install tophat2 (example using conda)
    # conda install -c bioconda tophat2
    
    # Install bowtie (tophat2 dependency, example using conda)
    # conda install -c bioconda bowtie
    
    # Placeholder for reference genome and annotation
    # Download GRCh37 (hg19) reference genome FASTA and build Bowtie index
    # mkdir -p reference/GRCh37
    # wget -P reference/GRCh37 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    # gunzip reference/GRCh37/hg19.fa.gz
    # bowtie-build reference/GRCh37/hg19.fa reference/GRCh37/hg19_bowtie_index
    
    # Download GRCh37 (hg19) GTF annotation (Ensembl release 75 is a common choice for GRCh37)
    # wget -P reference/GRCh37 ftp://ftp.ensembl.org/pub/grch37/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
    # gunzip reference/GRCh37/Homo_sapiens.GRCh37.75.gtf.gz
    
    # Define variables
    BOWTIE_INDEX="reference/GRCh37/hg19_bowtie_index" # Path to the Bowtie index for GRCh37
    GTF_FILE="reference/GRCh37/Homo_sapiens.GRCh37.75.gtf" # Path to the GTF annotation file for GRCh37
    INPUT_READS="reads.fastq" # Placeholder for input FASTQ file(s). For paired-end, use "reads_R1.fastq reads_R2.fastq"
    OUTPUT_DIR="tophat_output" # Directory for TopHat2 output
    
    # Create output directory
    mkdir -p "${OUTPUT_DIR}"
    
    # Run tophat2 for single-end reads
    # For paired-end reads, the command would be: tophat2 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${BOWTIE_INDEX}" "reads_R1.fastq" "reads_R2.fastq"
    tophat2 \
        -G "${GTF_FILE}" \
        -o "${OUTPUT_DIR}" \
        "${BOWTIE_INDEX}" \
        "${INPUT_READS}"
    
  6. 6

    Reads from PUS1 knockout and wild type mRNA-seq were mapped to GRCh37 using STAR

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Note: A STAR genome index for GRCh37 must be pre-built. 
    # Example command to build index (if needed, replace with actual paths and parameters):
    # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index/GRCh37 --genomeFastaFiles /path/to/GRCh37.fa --sjdbGTFfile /path/to/GRCh37.gtf --runThreadN 8
    
    # Align PUS1 knockout mRNA-seq reads
    STAR \
      --genomeDir /path/to/STAR_index/GRCh37 \
      --readFilesIn PUS1_knockout_R1.fastq.gz PUS1_knockout_R2.fastq.gz \
      --runThreadN 8 \
      --outFileNamePrefix PUS1_knockout_ \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes NH HI AS nM MD \
      --readFilesCommand zcat
    
    # Align wild type mRNA-seq reads
    STAR \
      --genomeDir /path/to/STAR_index/GRCh37 \
      --readFilesIn wild_type_R1.fastq.gz wild_type_R2.fastq.gz \
      --runThreadN 8 \
      --outFileNamePrefix wild_type_ \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes NH HI AS nM MD \
      --readFilesCommand zcat
  7. 7

    Pseudouridylation was assessed using cutom python scripts (see text)

    Python vNot specified
    $ Bash example
    # Installation (if needed):
    # conda install python
    
    # Reference genome placeholder: hg38 (or specific to organism, if applicable to pseudouridylation assessment)
    
    # This command is a placeholder for custom Python scripts. 
    # The actual script name and parameters would be detailed in the full text of the study.
    # Replace 'custom_pseudouridylation_script.py' with the actual script name.
    # Replace '<input_data>' with the relevant input file(s) (e.g., aligned sequencing data).
    # Replace '<output_file>' with the desired output file name.
    python custom_pseudouridylation_script.py --input_data <input_data> --output_results <output_file> --genome_assembly hg38
  8. 8

    Differential mRNA level analysis of PUS1 knockout and wildtype mRNA-seq using DEseq2

    DESeq2 vR package (Bioconductor) GitHub
    $ Bash example
    # Create dummy input files for demonstration
    # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon)
    mkdir -p deseq2_inputs
    echo -e "gene_id\tWT_rep1\tWT_rep2\tKO_rep1\tKO_rep2" > deseq2_inputs/count_matrix.tsv
    echo -e "geneA\t100\t120\t50\t60" >> deseq2_inputs/count_matrix.tsv
    echo -e "geneB\t200\t210\t300\t320" >> deseq2_inputs/count_matrix.tsv
    echo -e "geneC\t50\t60\t20\t25" >> deseq2_inputs/count_matrix.tsv
    echo -e "geneD\t10\t12\t15\t18" >> deseq2_inputs/count_matrix.tsv
    echo -e "geneE\t5\t8\t2\t3" >> deseq2_inputs/count_matrix.tsv
    
    echo -e "sample_id\tcondition" > deseq2_inputs/sample_info.tsv
    echo -e "WT_rep1\twildtype" >> deseq2_inputs/sample_info.tsv
    echo -e "WT_rep2\twildtype" >> deseq2_inputs/sample_info.tsv
    echo -e "KO_rep1\tknockout" >> deseq2_inputs/sample_info.tsv
    echo -e "KO_rep2\tknockout" >> deseq2_inputs/sample_info.tsv
    
    # Save the R script
    cat << 'EOF' > deseq2_analysis.R
    # Load DESeq2 library
    library(DESeq2)
    
    # --- Configuration ---
    # Define input files (replace with actual paths)
    count_matrix_file <- "deseq2_inputs/count_matrix.tsv" # e.g., output from featureCounts, HTSeq, or aggregated Salmon/Kallisto counts
    sample_info_file <- "deseq2_inputs/sample_info.tsv"   # Tab-separated file with sample_id, condition (e.g., WT, KO)
    
    # Define output directory
    output_dir <- "deseq2_results"
    dir.create(output_dir, showWarnings = FALSE)
    
    # --- Load Data ---
    # Load count matrix
    # Assuming the first column is gene_id and subsequent columns are sample counts
    count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = "\t")
    # Ensure counts are integers
    count_data <- round(count_data)
    
    # Load sample information
    sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = "\t")
    
    # Ensure sample names match and are in the same order
    # It's crucial that the column names of count_data match the row names of sample_info
    # and that they are in the same order.
    sample_info <- sample_info[colnames(count_data), , drop = FALSE]
    
    # --- Create DESeqDataSet object ---
    # Design formula: ~ condition (assuming 'condition' is a column in sample_info)
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = ~ condition)
    
    # Pre-filtering: remove genes with very low counts
    # Keep genes that have at least 10 counts in total across all samples
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # Set the reference level for the 'condition' factor (e.g., 'wildtype' as reference)
    # Make sure 'wildtype' and 'knockout' match your actual condition names
    dds$condition <- relevel(dds$condition, ref = "wildtype")
    
    # --- Run DESeq2 analysis ---
    dds <- DESeq(dds)
    
    # --- Extract results ---
    # Compare PUS1 knockout vs wildtype
    # Make sure 'knockout' and 'wildtype' match your actual condition names
    res <- results(dds, contrast = c("condition", "knockout", "wildtype"))
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj),]
    
    # Summarize results
    summary(res_ordered)
    
    # Save results to a CSV file
    write.csv(as.data.frame(res_ordered), file = file.path(output_dir, "deseq2_PUS1_KO_vs_WT_results.csv"))
    
    # Optional: Save normalized counts
    normalized_counts <- counts(dds, normalized = TRUE)
    write.csv(as.data.frame(normalized_counts), file = file.path(output_dir, "deseq2_normalized_counts.csv"))
    
    # Optional: Generate MA plot
    pdf(file.path(output_dir, "MA_plot_PUS1_KO_vs_WT.pdf"))
    plotMA(res, main="MA plot: PUS1 KO vs WT")
    dev.off()
    
    # Optional: Generate dispersion plot
    pdf(file.path(output_dir, "dispersion_plot.pdf"))
    plotDispEsts(dds)
    dev.off()
    
    message("DESeq2 analysis complete. Results saved in ", output_dir)
    EOF
    
    # Installation instructions (commented out)
    # For R and Bioconductor packages, you typically install them within an R session.
    # # Install R if not already installed (e.g., on Ubuntu/Debian)
    # # sudo apt update
    # # sudo apt install r-base
    #
    # # Install Bioconductor and DESeq2 within R
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("DESeq2")'
    
    # Execute the DESeq2 R script
    Rscript deseq2_analysis.R
  9. 9

    Differential splicing analysis of PUS1 knockout and wildtype mRNA-seq using rMATS

    rMATS v4.1.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install rMATS (example using conda)
    # conda create -n rmats_env python=3.8
    # conda activate rmats_env
    # conda install -c bioconda rmats-turbo
    
    # Define input files and parameters
    # Placeholder for PUS1 knockout and wildtype mRNA-seq BAM files
    # Replace with actual paths to your BAM files
    WT_BAM_LIST="wt_samples.txt"
    KO_BAM_LIST="ko_samples.txt"
    
    # Create placeholder BAM list files (replace with your actual BAM paths)
    # echo "/path/to/wt_rep1.bam,/path/to/wt_rep2.bam" > ${WT_BAM_LIST}
    # echo "/path/to/ko_rep1.bam,/path/to/ko_rep2.bam" > ${KO_BAM_LIST}
    
    # Placeholder for GTF annotation file (e.g., human hg38 GENCODE v38)
    # Download from: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
    GTF_FILE="gencode.v38.annotation.gtf"
    
    OUTPUT_DIR="rmats_differential_splicing_output"
    TEMP_DIR="rmats_tmp"
    READ_TYPE="paired" # Assuming paired-end mRNA-seq reads
    READ_LENGTH=100    # Placeholder read length, adjust as per your data
    NUM_THREADS=8      # Number of threads to use
    
    # Create output and temporary directories
    mkdir -p ${OUTPUT_DIR}
    mkdir -p ${TEMP_DIR}
    
    # Execute rMATS for differential splicing analysis
    # Ensure 'rmats.py' is in your PATH or provide the full path to the script
    rmats.py \
      --b1 ${WT_BAM_LIST} \
      --b2 ${KO_BAM_LIST} \
      --gtf ${GTF_FILE} \
      --od ${OUTPUT_DIR} \
      --tmp ${TEMP_DIR} \
      -t ${READ_TYPE} \
      -len ${READ_LENGTH} \
      --nthread ${NUM_THREADS}
    

Tools Used

Raw Source Text
Library strategy: Pseduo-seq
Reverse reads were trimmed of 5′ amplicon sequence using cutadapt (-G)
Trimmed paired end reads collapsed using PEAR (default settings)
PCR duplicates collapsed for in vitro Pseudo-seq libraries using fastx_collapser
Reads were mapped to bowtie indices of the pool or GRCh37 for chromatin-associated RNA Pseudo-seq with tophat2. Reads from PUS1 knockout and wild type mRNA-seq were mapped to GRCh37 using STAR
Pseudouridylation was assessed using cutom python scripts (see text)
Differential mRNA level analysis of PUS1 knockout and wildtype mRNA-seq using DEseq2
Differential splicing analysis of PUS1 knockout and wildtype mRNA-seq using rMATS
Genome_build: GRCh37 for sequencing experiments from cellular RNA and Bowtie indices made from fasta files of pool sequences
Supplementary_files_format_and_content: Tab delimited text files of peak values for pseudouridylation (see text)
Supplementary_files_format_and_content: rMATS output
Supplementary_files_format_and_content: DEseq2 output
← Back to Analysis