GSE123613 Processing Pipeline
Publication
Pseudouridine synthases modify human pre-mRNA co-transcriptionally and affect pre-mRNA processing.Molecular cell (2022) — PMID 35051350
Dataset
GSE123613Pseudouridine synthases modify human pre-mRNA co-transcriptionally and affect splicingÂ
Processing Steps
Generate Jupyter Notebook-
1
Library strategy: Pseduo-seq
N/A (Library strategy, not a computational tool) (Inferred with models/gemini-2.5-flash) vN/A GitHub$ Bash example
# The "Library strategy: Pseudo-seq" describes a wet-lab protocol for RNA modification quantification, # not a computational analysis step. Therefore, there is no direct bash command associated with # "performing" Pseudo-seq computationally. # Subsequent computational steps would involve processing sequencing data generated from Pseudo-seq libraries, # such as quality control, alignment, and quantification of modifications.
-
2
Reverse reads were trimmed of 5â² amplicon sequence using cutadapt (-G)
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=1.18 # Trim 5' amplicon sequence from reverse reads (R2) # The amplicon sequence 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA' is a common 5' adapter for reverse reads in eCLIP. # -G: Trim 5' adapter from the reverse read. # -e 0.1: Maximum 10% error rate. # -m 15: Discard reads shorter than 15 bp after trimming. cutadapt -G "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \ -e 0.1 \ -m 15 \ -o output_R2_trimmed.fastq.gz \ input_R2.fastq.gz -
3
Trimmed paired end reads collapsed using PEAR (default settings)
PEAR v0.9.11 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install PEAR (Paired-End reAd merger) using Bioconda # conda install -c bioconda pear # Define input and output file paths # Replace 'input_R1.fastq.gz' and 'input_R2.fastq.gz' with your actual trimmed paired-end read files. # Replace 'output_prefix' with your desired output file prefix. # PEAR will generate multiple output files: output_prefix.assembled.fastq, output_prefix.unassembled.forward.fastq, etc. INPUT_FORWARD_READS="input_R1.fastq.gz" INPUT_REVERSE_READS="input_R2.fastq.gz" OUTPUT_PREFIX="collapsed_reads" # Collapse trimmed paired-end reads using PEAR with default settings # -f: Specify the input forward reads file # -r: Specify the input reverse reads file # -o: Specify the output file prefix pear -f "${INPUT_FORWARD_READS}" -r "${INPUT_REVERSE_READS}" -o "${OUTPUT_PREFIX}" -
4
PCR duplicates collapsed for in vitro Pseudo-seq libraries using fastx_collapser
$ Bash example
# Install fastx_toolkit (which includes fastx_collapser) # conda install -c bioconda fastx_toolkit # Collapse PCR duplicates for in vitro Pseudo-seq libraries # Replace 'pseudo_seq_reads.fastq' with your actual input file # Replace 'pseudo_seq_reads_collapsed.fastq' with your desired output file fastx_collapser -i pseudo_seq_reads.fastq -o pseudo_seq_reads_collapsed.fastq
-
5
Reads were mapped to bowtie indices of the pool or GRCh37 for chromatin-associated RNA Pseudo-seq with tophat2.
$ Bash example
# Install tophat2 (example using conda) # conda install -c bioconda tophat2 # Install bowtie (tophat2 dependency, example using conda) # conda install -c bioconda bowtie # Placeholder for reference genome and annotation # Download GRCh37 (hg19) reference genome FASTA and build Bowtie index # mkdir -p reference/GRCh37 # wget -P reference/GRCh37 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip reference/GRCh37/hg19.fa.gz # bowtie-build reference/GRCh37/hg19.fa reference/GRCh37/hg19_bowtie_index # Download GRCh37 (hg19) GTF annotation (Ensembl release 75 is a common choice for GRCh37) # wget -P reference/GRCh37 ftp://ftp.ensembl.org/pub/grch37/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz # gunzip reference/GRCh37/Homo_sapiens.GRCh37.75.gtf.gz # Define variables BOWTIE_INDEX="reference/GRCh37/hg19_bowtie_index" # Path to the Bowtie index for GRCh37 GTF_FILE="reference/GRCh37/Homo_sapiens.GRCh37.75.gtf" # Path to the GTF annotation file for GRCh37 INPUT_READS="reads.fastq" # Placeholder for input FASTQ file(s). For paired-end, use "reads_R1.fastq reads_R2.fastq" OUTPUT_DIR="tophat_output" # Directory for TopHat2 output # Create output directory mkdir -p "${OUTPUT_DIR}" # Run tophat2 for single-end reads # For paired-end reads, the command would be: tophat2 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${BOWTIE_INDEX}" "reads_R1.fastq" "reads_R2.fastq" tophat2 \ -G "${GTF_FILE}" \ -o "${OUTPUT_DIR}" \ "${BOWTIE_INDEX}" \ "${INPUT_READS}" -
6
Reads from PUS1 knockout and wild type mRNA-seq were mapped to GRCh37 using STAR
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Note: A STAR genome index for GRCh37 must be pre-built. # Example command to build index (if needed, replace with actual paths and parameters): # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index/GRCh37 --genomeFastaFiles /path/to/GRCh37.fa --sjdbGTFfile /path/to/GRCh37.gtf --runThreadN 8 # Align PUS1 knockout mRNA-seq reads STAR \ --genomeDir /path/to/STAR_index/GRCh37 \ --readFilesIn PUS1_knockout_R1.fastq.gz PUS1_knockout_R2.fastq.gz \ --runThreadN 8 \ --outFileNamePrefix PUS1_knockout_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes NH HI AS nM MD \ --readFilesCommand zcat # Align wild type mRNA-seq reads STAR \ --genomeDir /path/to/STAR_index/GRCh37 \ --readFilesIn wild_type_R1.fastq.gz wild_type_R2.fastq.gz \ --runThreadN 8 \ --outFileNamePrefix wild_type_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes NH HI AS nM MD \ --readFilesCommand zcat
-
7
Pseudouridylation was assessed using cutom python scripts (see text)
Python vNot specified$ Bash example
# Installation (if needed): # conda install python # Reference genome placeholder: hg38 (or specific to organism, if applicable to pseudouridylation assessment) # This command is a placeholder for custom Python scripts. # The actual script name and parameters would be detailed in the full text of the study. # Replace 'custom_pseudouridylation_script.py' with the actual script name. # Replace '<input_data>' with the relevant input file(s) (e.g., aligned sequencing data). # Replace '<output_file>' with the desired output file name. python custom_pseudouridylation_script.py --input_data <input_data> --output_results <output_file> --genome_assembly hg38
-
8
Differential mRNA level analysis of PUS1 knockout and wildtype mRNA-seq using DEseq2
$ Bash example
# Create dummy input files for demonstration # In a real scenario, these would be generated by upstream steps (e.g., featureCounts, Salmon) mkdir -p deseq2_inputs echo -e "gene_id\tWT_rep1\tWT_rep2\tKO_rep1\tKO_rep2" > deseq2_inputs/count_matrix.tsv echo -e "geneA\t100\t120\t50\t60" >> deseq2_inputs/count_matrix.tsv echo -e "geneB\t200\t210\t300\t320" >> deseq2_inputs/count_matrix.tsv echo -e "geneC\t50\t60\t20\t25" >> deseq2_inputs/count_matrix.tsv echo -e "geneD\t10\t12\t15\t18" >> deseq2_inputs/count_matrix.tsv echo -e "geneE\t5\t8\t2\t3" >> deseq2_inputs/count_matrix.tsv echo -e "sample_id\tcondition" > deseq2_inputs/sample_info.tsv echo -e "WT_rep1\twildtype" >> deseq2_inputs/sample_info.tsv echo -e "WT_rep2\twildtype" >> deseq2_inputs/sample_info.tsv echo -e "KO_rep1\tknockout" >> deseq2_inputs/sample_info.tsv echo -e "KO_rep2\tknockout" >> deseq2_inputs/sample_info.tsv # Save the R script cat << 'EOF' > deseq2_analysis.R # Load DESeq2 library library(DESeq2) # --- Configuration --- # Define input files (replace with actual paths) count_matrix_file <- "deseq2_inputs/count_matrix.tsv" # e.g., output from featureCounts, HTSeq, or aggregated Salmon/Kallisto counts sample_info_file <- "deseq2_inputs/sample_info.tsv" # Tab-separated file with sample_id, condition (e.g., WT, KO) # Define output directory output_dir <- "deseq2_results" dir.create(output_dir, showWarnings = FALSE) # --- Load Data --- # Load count matrix # Assuming the first column is gene_id and subsequent columns are sample counts count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = "\t") # Ensure counts are integers count_data <- round(count_data) # Load sample information sample_info <- read.table(sample_info_file, header = TRUE, row.names = 1, sep = "\t") # Ensure sample names match and are in the same order # It's crucial that the column names of count_data match the row names of sample_info # and that they are in the same order. sample_info <- sample_info[colnames(count_data), , drop = FALSE] # --- Create DESeqDataSet object --- # Design formula: ~ condition (assuming 'condition' is a column in sample_info) dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = ~ condition) # Pre-filtering: remove genes with very low counts # Keep genes that have at least 10 counts in total across all samples keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # Set the reference level for the 'condition' factor (e.g., 'wildtype' as reference) # Make sure 'wildtype' and 'knockout' match your actual condition names dds$condition <- relevel(dds$condition, ref = "wildtype") # --- Run DESeq2 analysis --- dds <- DESeq(dds) # --- Extract results --- # Compare PUS1 knockout vs wildtype # Make sure 'knockout' and 'wildtype' match your actual condition names res <- results(dds, contrast = c("condition", "knockout", "wildtype")) # Order results by adjusted p-value res_ordered <- res[order(res$padj),] # Summarize results summary(res_ordered) # Save results to a CSV file write.csv(as.data.frame(res_ordered), file = file.path(output_dir, "deseq2_PUS1_KO_vs_WT_results.csv")) # Optional: Save normalized counts normalized_counts <- counts(dds, normalized = TRUE) write.csv(as.data.frame(normalized_counts), file = file.path(output_dir, "deseq2_normalized_counts.csv")) # Optional: Generate MA plot pdf(file.path(output_dir, "MA_plot_PUS1_KO_vs_WT.pdf")) plotMA(res, main="MA plot: PUS1 KO vs WT") dev.off() # Optional: Generate dispersion plot pdf(file.path(output_dir, "dispersion_plot.pdf")) plotDispEsts(dds) dev.off() message("DESeq2 analysis complete. Results saved in ", output_dir) EOF # Installation instructions (commented out) # For R and Bioconductor packages, you typically install them within an R session. # # Install R if not already installed (e.g., on Ubuntu/Debian) # # sudo apt update # # sudo apt install r-base # # # Install Bioconductor and DESeq2 within R # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("DESeq2")' # Execute the DESeq2 R script Rscript deseq2_analysis.R -
9
Differential splicing analysis of PUS1 knockout and wildtype mRNA-seq using rMATS
$ Bash example
# Install rMATS (example using conda) # conda create -n rmats_env python=3.8 # conda activate rmats_env # conda install -c bioconda rmats-turbo # Define input files and parameters # Placeholder for PUS1 knockout and wildtype mRNA-seq BAM files # Replace with actual paths to your BAM files WT_BAM_LIST="wt_samples.txt" KO_BAM_LIST="ko_samples.txt" # Create placeholder BAM list files (replace with your actual BAM paths) # echo "/path/to/wt_rep1.bam,/path/to/wt_rep2.bam" > ${WT_BAM_LIST} # echo "/path/to/ko_rep1.bam,/path/to/ko_rep2.bam" > ${KO_BAM_LIST} # Placeholder for GTF annotation file (e.g., human hg38 GENCODE v38) # Download from: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz GTF_FILE="gencode.v38.annotation.gtf" OUTPUT_DIR="rmats_differential_splicing_output" TEMP_DIR="rmats_tmp" READ_TYPE="paired" # Assuming paired-end mRNA-seq reads READ_LENGTH=100 # Placeholder read length, adjust as per your data NUM_THREADS=8 # Number of threads to use # Create output and temporary directories mkdir -p ${OUTPUT_DIR} mkdir -p ${TEMP_DIR} # Execute rMATS for differential splicing analysis # Ensure 'rmats.py' is in your PATH or provide the full path to the script rmats.py \ --b1 ${WT_BAM_LIST} \ --b2 ${KO_BAM_LIST} \ --gtf ${GTF_FILE} \ --od ${OUTPUT_DIR} \ --tmp ${TEMP_DIR} \ -t ${READ_TYPE} \ -len ${READ_LENGTH} \ --nthread ${NUM_THREADS}
Raw Source Text
Library strategy: Pseduo-seq Reverse reads were trimmed of 5â² amplicon sequence using cutadapt (-G) Trimmed paired end reads collapsed using PEAR (default settings) PCR duplicates collapsed for in vitro Pseudo-seq libraries using fastx_collapser Reads were mapped to bowtie indices of the pool or GRCh37 for chromatin-associated RNA Pseudo-seq with tophat2. Reads from PUS1 knockout and wild type mRNA-seq were mapped to GRCh37 using STAR Pseudouridylation was assessed using cutom python scripts (see text) Differential mRNA level analysis of PUS1 knockout and wildtype mRNA-seq using DEseq2 Differential splicing analysis of PUS1 knockout and wildtype mRNA-seq using rMATS Genome_build: GRCh37 for sequencing experiments from cellular RNA and Bowtie indices made from fasta files of pool sequences Supplementary_files_format_and_content: Tab delimited text files of peak values for pseudouridylation (see text) Supplementary_files_format_and_content: rMATS output Supplementary_files_format_and_content: DEseq2 output