Carnegie Mellon University

Discovering Novel Polyketide and Saccharide Natural Products through Integrated Computational Mass Spectrometry and Microbial Genome Mining

thesis
posted on 2024-09-20, 19:40 authored by Donghui Yan

Natural products are critical sources of new drug candidates, with 34% of FDA-approved small molecules derived from them. These compounds are essential in treating a wide range of diseases, including infections, cancers, and immunological disorders. The discovery of new natural products has been revolutionized by advances in high-throughput technologies, which enable the acquisition of vast amounts of tandem mass spectra and sequencing data from microbial isolates and environmental/host-associated microbial communities. This wealth of data is stored in rapidly expanding public databases such as the Integrated Microbial Genomes & Microbiomes Atlas of Biosynthetic Gene Clusters (IMG-ABC) and the Global Natural Products Social (GNPS) Molecular Networking platform. These repositories have become invaluable resources for natural product discovery.

Despite the availability of extensive genomic and metabolomic data, there is a significant methodological gap in integrating these data types for the discovery of novel natural products. Existing approaches often treat genomics and metabolomics in isolation, failing to leverage the combined power of these datasets. To address this gap, this thesis introduces two novel machine-learning methods: Seq2PKS and Seq2Saccharide.

Seq2PKS is designed to streamline the discovery of novel modular polyketides. It uses a machine learning model to predict the monomers recruited in biosynthetic gene clusters (BGCs) and a rule-based approach to further identify the mature monomers. These monomers are then assembled into predicted sequences to construct initial backbones, which are further modified to produce mature structures. The method is validated using a mass spectral search, which checks the accuracy of the predicted structures. Seq2PKS has demonstrated higher accuracy in predicting polyketide structures than existing tools, making it a powerful tool for natural product discovery.
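The monomer-to-backbone assembly step can be illustrated with a minimal Python sketch. This is not the Seq2PKS implementation: the function name `enumerate_backbones`, the per-module probability dictionaries, and the scoring by a simple product of probabilities are all assumptions made for illustration. It only shows how per-module monomer predictions could be combined into a ranked list of candidate backbones.

```python
from itertools import product
from math import prod


def enumerate_backbones(module_predictions, top_k=2, max_candidates=10):
    """Combine per-module monomer predictions into ranked candidate backbones.

    module_predictions: one dict per module, mapping a monomer name to a
    predicted probability (hypothetical values, not Seq2PKS output).
    Returns a list of (monomer_tuple, score) sorted by descending score.
    """
    # Keep only the top_k most probable monomers for each module.
    per_module = [
        sorted(preds.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for preds in module_predictions
    ]
    candidates = []
    # Every combination of one monomer per module is one candidate backbone.
    for combo in product(*per_module):
        monomers = tuple(name for name, _ in combo)
        # Assumption: modules are scored independently, so scores multiply.
        score = prod(p for _, p in combo)
        candidates.append((monomers, score))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:max_candidates]


if __name__ == "__main__":
    # Made-up predictions for a two-module cluster.
    predictions = [
        {"malonyl": 0.7, "methylmalonyl": 0.3},
        {"methylmalonyl": 0.6, "malonyl": 0.4},
    ]
    for monomers, score in enumerate_backbones(predictions):
        print(monomers, round(score, 3))
```

Restricting each module to its top few monomers keeps the number of combinations manageable, since the candidate count grows exponentially with the number of modules.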

In addition to Seq2PKS, this thesis presents Seq2Saccharide, a novel method aimed at discovering new aminoglycosides and oligosaccharide products. It uses a probabilistic model to predict primary monomers and assembles these into candidate backbones, which are then modified to produce mature structures. The final structures are matched against mass spectral data using an error-tolerant database search. Seq2Saccharide has been benchmarked against known saccharides and has outperformed existing methods in predicting saccharide structures.
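The idea behind an error-tolerant search can be sketched in Python at the precursor-mass level only. This is a simplified illustration, not the search used in the thesis: the function name `error_tolerant_matches`, the tolerance values, and the candidate masses are all hypothetical, and a real search would also compare fragment peaks. A candidate counts as an exact hit when its mass agrees with the observed mass within a small tolerance, and as a modified hit when the difference could be explained by a single unknown mass offset.

```python
def error_tolerant_matches(precursor_mass, candidates, tolerance=0.01, max_offset=100.0):
    """Match an observed precursor mass against predicted structure masses.

    candidates: dict mapping a candidate name to its predicted mass in Da
    (hypothetical values). Returns (name, offset, kind) tuples sorted by
    absolute offset, where kind is "exact" or "modified".
    """
    hits = []
    for name, mass in candidates.items():
        offset = precursor_mass - mass
        if abs(offset) <= tolerance:
            # Mass agrees within instrument tolerance.
            hits.append((name, offset, "exact"))
        elif abs(offset) <= max_offset:
            # Mass differs, but an unknown modification of this size is allowed.
            hits.append((name, offset, "modified"))
        # Larger differences are treated as non-matches and dropped.
    hits.sort(key=lambda h: abs(h[1]))
    return hits


if __name__ == "__main__":
    # Made-up candidate masses for illustration only.
    predicted = {"candidate_A": 500.0, "candidate_B": 514.0, "candidate_C": 700.0}
    for hit in error_tolerant_matches(514.005, predicted):
        print(hit)
```

Allowing a bounded offset lets a predicted structure still be linked to a spectrum when the true molecule carries a modification the prediction missed, at the cost of more candidate matches to score.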

By integrating computational mass spectrometry with microbial genome mining, Seq2PKS and Seq2Saccharide significantly enhance the discovery of new natural products. As a result, several natural products and novel BGCs for existing natural products have been identified.

Funding

Discovering novel small molecule antibiotics from complex microbial communities by integrating computational mass spectrometry and metagenome mining

National Institute of General Medical Sciences


Computational methods for nonribosomal peptide discovery

Directorate for Biological Sciences


History

Date

2024-08-01

Degree Type

  • Dissertation

Department

  • Computational Biology

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Hosein Mohimani
