GSE255844 Processing Pipeline
Publication
Long-read Ribo-STAMP simultaneously measures transcription and translation with isoform resolution. Genome Research (2024). PMID 38906680
Dataset
GSE255844: Long-read Ribo-STAMP simultaneously measures transcription and translation at full length isoform resolution
Processing Steps
1
Sequencing data was processed using the Isoseq v4 pipeline with lima (parameter: --isoseq) to generate full-length non-concatemer reads and isoseq refine (parameter: --require-polya) to generate refined reads.
$ Bash example
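# Install lima and isoseq from Bioconda if not already installed
# conda create -n isoseq_env -c conda-forge -c bioconda lima isoseq
# conda activate isoseq_env

# Assumes 'input.ccs.bam' holds circular consensus (CCS) reads and
# 'primers.fasta' holds the Iso-Seq cDNA primer sequences (placeholder names).

# Remove primers and orient reads with lima in Iso-Seq mode.
# lima usually adds the detected primer pair to the output file name,
# so adjust the input of the next command to the file it actually wrote.
lima --isoseq input.ccs.bam primers.fasta output.lima.bam

# Refine the reads, keeping only those with a poly(A) tail.
# isoseq refine takes the primer FASTA as its second positional argument.
isoseq refine --require-polya output.lima.bam primers.fasta output.refined.bam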
2
HEK293T APOBEC1-only and Ribo-STAMP data were aligned to hg19 reference and MDA-MB-231 Ribo-STAMP data (NT and CoCl2) were aligned to hg38 reference using pbmm2 align (parameter: --preset ISOSEQ).
$ Bash example
# Install pbmm2 from Bioconda
# conda install -c bioconda pbmm2

# Define reference genomes
# Download hg19 reference FASTA from UCSC
# wget -O hg19.fasta.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fasta.gz
HG19_REF="hg19.fasta"  # Path to hg19 reference FASTA

# Download hg38 reference FASTA from UCSC
# wget -O hg38.fasta.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip hg38.fasta.gz
HG38_REF="hg38.fasta"  # Path to hg38 reference FASTA

# Input data placeholders (FASTQ assumed here; pbmm2 also accepts BAM input,
# such as the refined reads from step 1)
HEK293T_INPUT="hek293t_apobec1_ribostamp.fastq"
MDAMB231_NT_INPUT="mdamb231_ribostamp_nt.fastq"
MDAMB231_COCL2_INPUT="mdamb231_ribostamp_cocl2.fastq"

# Align HEK293T APOBEC1-only and Ribo-STAMP data to hg19
pbmm2 align "$HG19_REF" "$HEK293T_INPUT" "hek293t_apobec1_ribostamp_hg19.bam" --preset ISOSEQ

# Align MDA-MB-231 Ribo-STAMP NT data to hg38
pbmm2 align "$HG38_REF" "$MDAMB231_NT_INPUT" "mdamb231_ribostamp_nt_hg38.bam" --preset ISOSEQ

# Align MDA-MB-231 Ribo-STAMP CoCl2 data to hg38
pbmm2 align "$HG38_REF" "$MDAMB231_COCL2_INPUT" "mdamb231_ribostamp_cocl2_hg38.bam" --preset ISOSEQ
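Because the two cell lines are aligned to different assemblies, it can help to keep the sample-to-reference mapping in one place. The following is a minimal sketch of that idea, not the authors' code: the function names, sample names, and file names are hypothetical, and the pbmm2 command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: choose the reference assembly from the cell line in the sample name.
# Sample names and FASTA paths are hypothetical placeholders.

# Map a sample name to the assembly named in the processing notes.
reference_for_sample() {
    case "$1" in
        HEK293T*)    echo "hg19.fasta" ;;
        MDA-MB-231*) echo "hg38.fasta" ;;
        *)           echo "unknown sample: $1" >&2; return 1 ;;
    esac
}

# Build the pbmm2 command line for one sample (printed as a dry run).
pbmm2_command() {
    local sample="$1" input="$2" ref
    ref="$(reference_for_sample "$sample")" || return 1
    echo "pbmm2 align --preset ISOSEQ $ref $input ${sample}.aligned.bam"
}

pbmm2_command "HEK293T_RiboSTAMP" "hek293t_ribostamp.refined.bam"
pbmm2_command "MDA-MB-231_CoCl2" "mdamb231_cocl2.refined.bam"
```

Printing the command first makes it easy to check that each sample is paired with the intended assembly before any alignment is run.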
3
QC was completed using NanoPlot (parameters: --raw and --tsv_stats).
$ Bash example
# Install NanoPlot (example using conda)
# conda create -n nanoplot_env -c conda-forge -c bioconda nanoplot
# conda activate nanoplot_env

# Example NanoPlot command for read-level QC of the PacBio reads
# Assumes 'reads.fastq' is the input FASTQ file (placeholder name)
# Output is written to the 'nanoplot_output' directory
mkdir -p nanoplot_output
NanoPlot --raw --tsv_stats --fastq reads.fastq --outdir nanoplot_output
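With --tsv_stats, NanoPlot writes its summary statistics as a tab-separated file, which makes individual metrics easy to pull out in a script. The sketch below assumes a two-column layout (metric name, value) and uses a toy file with made-up values; the helper name get_metric is hypothetical, and the real file name and metric names should be checked against the NanoPlot output directory.

```shell
#!/usr/bin/env bash
# Sketch: read one metric from a two-column, tab-separated stats file.
# The layout (metric name, then value) is an assumption about the stats file.

# Print the value of the requested metric; exit non-zero if it is absent.
get_metric() {
    local stats_file="$1" metric="$2"
    awk -F'\t' -v m="$metric" '
        $1 == m { print $2; found = 1 }
        END     { exit !found }
    ' "$stats_file"
}

# Toy stats file with made-up values, for illustration only.
stats_file="$(mktemp)"
printf 'Metrics\tdataset\nnumber_of_reads\t1000\nmean_read_length\t2500.0\nn50\t3000.0\n' > "$stats_file"

get_metric "$stats_file" number_of_reads
get_metric "$stats_file" n50
```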
4
Reads were filtered for uniquely mapped reads and read counts obtained using IsoQuant (parameters: --data_type pacbio, --transcript_quantification unique_only, and --gene_quantification unique_only).
$ Bash example
# Install IsoQuant (example using conda)
# conda create -n isoquant_env -c conda-forge -c bioconda isoquant
# conda activate isoquant_env

# Placeholder for input PacBio aligned reads (sorted and indexed BAM)
INPUT_BAM="path/to/your/pacbio_aligned_reads.bam"

# Placeholder for reference genome FASTA (e.g., hg38.fa)
GENOME_FASTA="path/to/your/reference_genome.fasta"

# Placeholder for gene annotation GTF/GFF3 (e.g., gencode.v38.annotation.gtf)
ANNOTATION_GTF="path/to/your/annotation.gtf"

# Output directory for IsoQuant results
OUTPUT_DIR="isoquant_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute IsoQuant for read quantification
# (the reference is passed with --reference and the annotation with --genedb)
isoquant.py \
    --data_type pacbio \
    --transcript_quantification unique_only \
    --gene_quantification unique_only \
    --reference "${GENOME_FASTA}" \
    --genedb "${ANNOTATION_GTF}" \
    --bam "${INPUT_BAM}" \
    --output "${OUTPUT_DIR}"
5
Edits were identified and filtered using custom scripts.
$ Bash example
# This step identifies and filters RNA editing sites using custom scripts.
# The specific script and parameters would depend on the custom implementation.
# Input is typically a VCF file containing potential RNA editing sites identified from aligned RNA-seq data.
# Output is a filtered VCF file with high-confidence RNA editing sites.

# Define input and output files (placeholders)
INPUT_VCF="identified_rna_edits.vcf"
OUTPUT_VCF="filtered_rna_edits.vcf"

# Define reference genome (using hg38 as a common latest assembly placeholder)
# The actual reference genome used should match the one used for alignment.
REFERENCE_GENOME="/path/to/human_genome/hg38.fa"

# Placeholder for custom script execution.
# The script would typically take parameters for filtering criteria such as:
# - Minimum read depth at the editing site
# - Minimum allele frequency of the edited base
# - Exclusion of known SNPs (e.g., from dbSNP)
# - Exclusion of sites in repetitive regions or low-complexity regions

# Example command for a hypothetical custom script:
# Replace 'custom_rna_edit_filter.sh' with the actual script name.
# Replace parameters with those used in the specific custom script.
custom_rna_edit_filter.sh \
    --input "${INPUT_VCF}" \
    --output "${OUTPUT_VCF}" \
    --reference "${REFERENCE_GENOME}" \
    --min_depth 10 \
    --min_allele_frequency 0.1 \
    --exclude_snps "/path/to/dbSNP/common_snps.vcf.gz" \
    --filter_criteria "custom_filter_settings.txt"
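The custom scripts are not described, so the following is only a sketch of what a depth and edit-fraction filter could look like, not the authors' method. It assumes a tab-delimited table with a hypothetical column order (chromosome, position, edited reads, total reads); the function name filter_edits, the thresholds, and the toy rows are all illustrative.

```shell
#!/usr/bin/env bash
# Sketch: keep candidate edit sites that pass a minimum depth and a minimum
# edit fraction. Column order is an assumption:
#   1 = chromosome, 2 = position, 3 = edited reads, 4 = total reads

# Read the table on stdin; print passing rows with the edit fraction appended.
filter_edits() {
    local min_depth="$1" min_fraction="$2"
    awk -F'\t' -v d="$min_depth" -v f="$min_fraction" '
        $4 > 0 && $4 >= d && ($3 / $4) >= f {
            printf "%s\t%s\t%d\t%d\t%.3f\n", $1, $2, $3, $4, $3 / $4
        }
    '
}

# Toy rows with made-up counts, for illustration only.
printf 'chr1\t100\t5\t20\nchr1\t200\t1\t50\nchr2\t300\t3\t5\nchr2\t400\t10\t10\n' \
    | filter_edits 10 0.1
```

In this toy input the second row fails the fraction threshold and the third fails the depth threshold, so only the first and last rows are printed.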
Tools Used
Raw Source Text
Sequencing data was processed using the Isoseq v4 pipeline with lima (parameter: --isoseq) to generate full-length non-concatemer reads and isoseq refine (parameter: --require-polya) to generate refined reads.
HEK293T APOBEC1-only and Ribo-STAMP data were aligned to hg19 reference and MDA-MB-231 Ribo-STAMP data (NT and CoCl2) were aligned to hg38 reference using pbmm2 align (parameter: --preset ISOSEQ).
QC was completed using NanoPlot (parameters: --raw and --tsv_stats).
Reads were filtered for uniquely mapped reads and read counts obtained using IsoQuant (parameters: --data_type pacbio, --transcript_quantification unique_only, and --gene_quantification unique_only)
Edits were identified and filtered using custom scripts.
Assembly: hg19, hg38
Supplementary files format and content: tab-delimited file containing edited positions, the number of reads with C-to-U edits at each positions (conversion), the total number of reads at each position, and each edit's assignment to a gene and isoform (BED)
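Since the supplementary files report edited and total read counts per position along with an isoform assignment, a pooled edit fraction per isoform can be computed with a short awk script. This is an illustrative summary rather than a metric taken from the publication. The column order (chromosome, position, conversion, total, gene, isoform), the function name summarize_by_isoform, and the toy rows are assumptions; check the real files for their actual layout.

```shell
#!/usr/bin/env bash
# Sketch: pooled edit fraction per isoform from a tab-delimited edit table.
# Assumed columns:
#   1 = chromosome, 2 = position, 3 = conversion (edited reads),
#   4 = total reads, 5 = gene, 6 = isoform

# Sum edited and total reads per isoform, then print
# isoform, edited reads, total reads, pooled fraction (sorted by isoform).
summarize_by_isoform() {
    awk -F'\t' '
        { conv[$6] += $3; total[$6] += $4 }
        END {
            for (iso in conv)
                if (total[iso] > 0)
                    printf "%s\t%d\t%d\t%.3f\n", iso, conv[iso], total[iso], conv[iso] / total[iso]
        }
    ' | LC_ALL=C sort
}

# Toy rows with made-up counts, for illustration only.
printf 'chr1\t100\t5\t20\tGENE_A\tISO_A1\nchr1\t150\t3\t30\tGENE_A\tISO_A1\nchr1\t900\t2\t10\tGENE_A\tISO_A2\nchr2\t400\t10\t40\tGENE_B\tISO_B1\n' \
    | summarize_by_isoform
```

Pooling the counts before dividing weights each position by its coverage, which differs from averaging the per-site fractions.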