GSE123613 Processing Pipeline

OTHER code_examples 9 steps

Publication

Pseudouridine synthases modify human pre-mRNA co-transcriptionally and affect pre-mRNA processing.

Molecular cell (2022) — PMID 35051350

Dataset

Pseudouridine synthases modify human pre-mRNA co-transcriptionally and affect splicingÂ

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Library strategy: Pseduo-seq

N/A (Library strategy, not a computational tool) (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# The "Library strategy: Pseudo-seq" describes a wet-lab protocol for RNA modification quantification,
# not a computational analysis step. Therefore, there is no direct bash command associated with
# "performing" Pseudo-seq computationally.
# Subsequent computational steps would involve processing sequencing data generated from Pseudo-seq libraries,
# such as quality control, alignment, and quantification of modifications.

View on GitHub

Reverse reads were trimmed of 5â² amplicon sequence using cutadapt (-G)

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.18

# Trim 5' amplicon sequence from reverse reads (R2)
# The amplicon sequence 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA' is a common 5' adapter for reverse reads in eCLIP.
# -G: Trim 5' adapter from the reverse read.
# -e 0.1: Maximum 10% error rate.
# -m 15: Discard reads shorter than 15 bp after trimming.
cutadapt -G "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
         -e 0.1 \
         -m 15 \
         -o output_R2_trimmed.fastq.gz \
         input_R2.fastq.gz

View on GitHub

Trimmed paired end reads collapsed using PEAR (default settings)

PEAR v0.9.11 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install PEAR (Paired-End reAd merger) using Bioconda
# conda install -c bioconda pear

# Define input and output file paths
# Replace 'input_R1.fastq.gz' and 'input_R2.fastq.gz' with your actual trimmed paired-end read files.
# Replace 'output_prefix' with your desired output file prefix.
# PEAR will generate multiple output files: output_prefix.assembled.fastq, output_prefix.unassembled.forward.fastq, etc.
INPUT_FORWARD_READS="input_R1.fastq.gz"
INPUT_REVERSE_READS="input_R2.fastq.gz"
OUTPUT_PREFIX="collapsed_reads"

# Collapse trimmed paired-end reads using PEAR with default settings
# -f: Specify the input forward reads file
# -r: Specify the input reverse reads file
# -o: Specify the output file prefix
pear -f "${INPUT_FORWARD_READS}" -r "${INPUT_REVERSE_READS}" -o "${OUTPUT_PREFIX}"

PCR duplicates collapsed for in vitro Pseudo-seq libraries using fastx_collapser

fastx_collapser vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install fastx_toolkit (which includes fastx_collapser)
# conda install -c bioconda fastx_toolkit

# Collapse PCR duplicates for in vitro Pseudo-seq libraries
# Replace 'pseudo_seq_reads.fastq' with your actual input file
# Replace 'pseudo_seq_reads_collapsed.fastq' with your desired output file
fastx_collapser -i pseudo_seq_reads.fastq -o pseudo_seq_reads_collapsed.fastq

View on GitHub

Reads were mapped to bowtie indices of the pool or GRCh37 for chromatin-associated RNA Pseudo-seq with tophat2.

Bowtie v2.1.1 GitHub

$ Bash example

# Install tophat2 (example using conda)
# conda install -c bioconda tophat2

# Install bowtie (tophat2 dependency, example using conda)
# conda install -c bioconda bowtie

# Placeholder for reference genome and annotation
# Download GRCh37 (hg19) reference genome FASTA and build Bowtie index
# mkdir -p reference/GRCh37
# wget -P reference/GRCh37 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip reference/GRCh37/hg19.fa.gz
# bowtie-build reference/GRCh37/hg19.fa reference/GRCh37/hg19_bowtie_index

# Download GRCh37 (hg19) GTF annotation (Ensembl release 75 is a common choice for GRCh37)
# wget -P reference/GRCh37 ftp://ftp.ensembl.org/pub/grch37/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip reference/GRCh37/Homo_sapiens.GRCh37.75.gtf.gz

# Define variables
BOWTIE_INDEX="reference/GRCh37/hg19_bowtie_index" # Path to the Bowtie index for GRCh37
GTF_FILE="reference/GRCh37/Homo_sapiens.GRCh37.75.gtf" # Path to the GTF annotation file for GRCh37
INPUT_READS="reads.fastq" # Placeholder for input FASTQ file(s). For paired-end, use "reads_R1.fastq reads_R2.fastq"
OUTPUT_DIR="tophat_output" # Directory for TopHat2 output

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# Run tophat2 for single-end reads
# For paired-end reads, the command would be: tophat2 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${BOWTIE_INDEX}" "reads_R1.fastq" "reads_R2.fastq"
tophat2 \
    -G "${GTF_FILE}" \
    -o "${OUTPUT_DIR}" \
    "${BOWTIE_INDEX}" \
    "${INPUT_READS}"

View on GitHub

Reads from PUS1 knockout and wild type mRNA-seq were mapped to GRCh37 using STAR

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Note: A STAR genome index for GRCh37 must be pre-built. 
# Example command to build index (if needed, replace with actual paths and parameters):
# STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index/GRCh37 --genomeFastaFiles /path/to/GRCh37.fa --sjdbGTFfile /path/to/GRCh37.gtf --runThreadN 8

# Align PUS1 knockout mRNA-seq reads
STAR \
  --genomeDir /path/to/STAR_index/GRCh37 \
  --readFilesIn PUS1_knockout_R1.fastq.gz PUS1_knockout_R2.fastq.gz \
  --runThreadN 8 \
  --outFileNamePrefix PUS1_knockout_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes NH HI AS nM MD \
  --readFilesCommand zcat

# Align wild type mRNA-seq reads
STAR \
  --genomeDir /path/to/STAR_index/GRCh37 \
  --readFilesIn wild_type_R1.fastq.gz wild_type_R2.fastq.gz \
  --runThreadN 8 \
  --outFileNamePrefix wild_type_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes NH HI AS nM MD \
  --readFilesCommand zcat

View on GitHub

Pseudouridylation was assessed using cutom python scripts (see text)

Python vNot specified

$ Bash example

# Installation (if needed):
# conda install python

# Reference genome placeholder: hg38 (or specific to organism, if applicable to pseudouridylation assessment)

# This command is a placeholder for custom Python scripts. 
# The actual script name and parameters would be detailed in the full text of the study.
# Replace 'custom_pseudouridylation_script.py' with the actual script name.
# Replace '<input_data>' with the relevant input file(s) (e.g., aligned sequencing data).
# Replace '<output_file>' with the desired output file name.
python custom_pseudouridylation_script.py --input_data <input_data> --output_results <output_file> --genome_assembly hg38

Differential mRNA level analysis of PUS1 knockout and wildtype mRNA-seq using DEseq2

DESeq2 vR package (Bioconductor) GitHub

$ Bash example

# Create dummy input files for demonstration
# In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon)
mkdir -p deseq2_inputs
echo -e "gene_id\tWT_rep1\tWT_rep2\tKO_rep1\tKO_rep2" > deseq2_inputs/count_matrix.tsv
echo -e "geneA\t100\t120\t50\t60" >> deseq2_inputs/count_matrix.tsv
echo -e "geneB\t200\t210\t300\t320" >> deseq2_inputs/count_matrix.tsv
echo -e "geneC\t50\t60\t20\t25" >> deseq2_inputs/count_matrix.tsv
echo -e "geneD\t10\t12\t15\t18" >> deseq2_inputs/count_matrix.tsv
echo -e "geneE\t5\t8\t2\t3" >> deseq2_inputs/count_matrix.tsv

echo -e "sample_id\tcondition" > deseq2_inputs/sample_info.tsv
echo -e "WT_rep1\twildtype" >> deseq2_inputs/sample_info.tsv
echo -e "WT_rep2\twildtype" >> deseq2_inputs/sample_info.tsv
echo -e "KO_rep1\tknockout" >> deseq2_inputs/sample_info.tsv
echo -e "KO_rep2\tknockout" >> deseq2_inputs/sample_info.tsv

# Save the R script
cat << 'EOF' > deseq2_analysis.R
# Load DESeq2 library
library(DESeq2)

# --- Configuration ---
# Define input files (replace with actual paths)
count_matrix_file <- "deseq2_inputs/count_matrix.tsv" # e.g., output from featureCounts, HTSeq, or aggregated Salmon/Kallisto counts
sample_info_file <- "deseq2_inputs/sample_info.tsv"   # Tab-separated file with sample_id, condition (e.g., WT, KO)

# Define output directory
output_dir <- "deseq2_results"
dir.create(output_dir, showWarnings = FALSE)

# --- Load Data ---
# Load count matrix
# Assuming the first column is gene_id and subsequent columns are sample counts
count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = "\t")
# Ensure counts are integers
count_data <- round(count_data)

# Load sample information
sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = "\t")

# Ensure sample names match and are in the same order
# It's crucial that the column names of count_data match the row names of sample_info
# and that they are in the same order.
sample_info <- sample_info[colnames(count_data), , drop = FALSE]

# --- Create DESeqDataSet object ---
# Design formula: ~ condition (assuming 'condition' is a column in sample_info)
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ condition)

# Pre-filtering: remove genes with very low counts
# Keep genes that have at least 10 counts in total across all samples
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Set the reference level for the 'condition' factor (e.g., 'wildtype' as reference)
# Make sure 'wildtype' and 'knockout' match your actual condition names
dds$condition <- relevel(dds$condition, ref = "wildtype")

# --- Run DESeq2 analysis ---
dds <- DESeq(dds)

# --- Extract results ---
# Compare PUS1 knockout vs wildtype
# Make sure 'knockout' and 'wildtype' match your actual condition names
res <- results(dds, contrast = c("condition", "knockout", "wildtype"))

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# Summarize results
summary(res_ordered)

# Save results to a CSV file
write.csv(as.data.frame(res_ordered), file = file.path(output_dir, "deseq2_PUS1_KO_vs_WT_results.csv"))

# Optional: Save normalized counts
normalized_counts <- counts(dds, normalized = TRUE)
write.csv(as.data.frame(normalized_counts), file = file.path(output_dir, "deseq2_normalized_counts.csv"))

# Optional: Generate MA plot
pdf(file.path(output_dir, "MA_plot_PUS1_KO_vs_WT.pdf"))
plotMA(res, main="MA plot: PUS1 KO vs WT")
dev.off()

# Optional: Generate dispersion plot
pdf(file.path(output_dir, "dispersion_plot.pdf"))
plotDispEsts(dds)
dev.off()

message("DESeq2 analysis complete. Results saved in ", output_dir)
EOF

# Installation instructions (commented out)
# For R and Bioconductor packages, you typically install them within an R session.
# # Install R if not already installed (e.g., on Ubuntu/Debian)
# # sudo apt update
# # sudo apt install r-base
#
# # Install Bioconductor and DESeq2 within R
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Execute the DESeq2 R script
Rscript deseq2_analysis.R

View on GitHub

Differential splicing analysis of PUS1 knockout and wildtype mRNA-seq using rMATS

rMATS v4.1.2 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install rMATS (example using conda)
# conda create -n rmats_env python=3.8
# conda activate rmats_env
# conda install -c bioconda rmats-turbo

# Define input files and parameters
# Placeholder for PUS1 knockout and wildtype mRNA-seq BAM files
# Replace with actual paths to your BAM files
WT_BAM_LIST="wt_samples.txt"
KO_BAM_LIST="ko_samples.txt"

# Create placeholder BAM list files (replace with your actual BAM paths)
# echo "/path/to/wt_rep1.bam,/path/to/wt_rep2.bam" > ${WT_BAM_LIST}
# echo "/path/to/ko_rep1.bam,/path/to/ko_rep2.bam" > ${KO_BAM_LIST}

# Placeholder for GTF annotation file (e.g., human hg38 GENCODE v38)
# Download from: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
GTF_FILE="gencode.v38.annotation.gtf"

OUTPUT_DIR="rmats_differential_splicing_output"
TEMP_DIR="rmats_tmp"
READ_TYPE="paired" # Assuming paired-end mRNA-seq reads
READ_LENGTH=100    # Placeholder read length, adjust as per your data
NUM_THREADS=8      # Number of threads to use

# Create output and temporary directories
mkdir -p ${OUTPUT_DIR}
mkdir -p ${TEMP_DIR}

# Execute rMATS for differential splicing analysis
# Ensure 'rmats.py' is in your PATH or provide the full path to the script
rmats.py \
  --b1 ${WT_BAM_LIST} \
  --b2 ${KO_BAM_LIST} \
  --gtf ${GTF_FILE} \
  --od ${OUTPUT_DIR} \
  --tmp ${TEMP_DIR} \
  -t ${READ_TYPE} \
  -len ${READ_LENGTH} \
  --nthread ${NUM_THREADS}

View on GitHub

Tools Used

STAR Python DESeq2

Raw Source Text

Library strategy: Pseduo-seq
Reverse reads were trimmed of 5â² amplicon sequence using cutadapt (-G)
Trimmed paired end reads collapsed using PEAR (default settings)
PCR duplicates collapsed for in vitro Pseudo-seq libraries using fastx_collapser
Reads were mapped to bowtie indices of the pool or GRCh37 for chromatin-associated RNA Pseudo-seq with tophat2. Reads from PUS1 knockout and wild type mRNA-seq were mapped to GRCh37 using STAR
Pseudouridylation was assessed using cutom python scripts (see text)
Differential mRNA level analysis of PUS1 knockout and wildtype mRNA-seq using DEseq2
Differential splicing analysis of PUS1 knockout and wildtype mRNA-seq using rMATS
Genome_build: GRCh37 for sequencing experiments from cellular RNA and Bowtie indices made from fasta files of pool sequences
Supplementary_files_format_and_content: Tab delimited text files of peak values for pseudouridylation (see text)
Supplementary_files_format_and_content: rMATS output
Supplementary_files_format_and_content: DEseq2 output

← Back to Analysis