Discovering Novel Polyketide and Saccharide Natural Products through Integrated Computational Mass Spectrometry and Microbial Genome Mining
Natural products are critical sources of new drug candidates, with 34% of FDA-approved small molecules derived from them. These compounds are essential in treating a wide range of diseases, including infections, cancers, and immunological disorders. The discovery of new natural products has been revolutionized by advances in high-throughput technologies, which enable the acquisition of vast amounts of tandem mass spectra and sequencing data from microbial isolates and from environmental and host-associated microbial communities. This wealth of data is stored in rapidly expanding public databases such as the Integrated Microbial Genomes & Microbiomes Atlas of Biosynthetic Gene Clusters (IMG-ABC) and the Global Natural Products Social (GNPS) Molecular Networking platform. These repositories have become invaluable resources for natural product discovery.
Despite the availability of extensive genomic and metabolomic data, there is a significant methodological gap in integrating these data types for the discovery of novel natural products. Existing approaches often treat genomics and metabolomics in isolation, failing to leverage the combined power of these datasets. To address this gap, this thesis introduces two novel machine-learning methods: Seq2PKS and Seq2Saccharide.
Seq2PKS is designed to streamline the discovery of novel modular polyketides. It uses a machine learning model to predict the monomers recruited by biosynthetic gene clusters (BGCs) and a rule-based approach to further identify the mature monomers. These monomers are then assembled into predicted sequences to construct initial backbones, which are further modified to produce mature structures. The predicted structures are validated by a mass spectral search. Seq2PKS has demonstrated higher accuracy in predicting polyketide structures than existing tools, making it a powerful tool for natural product discovery.
In addition to Seq2PKS, this thesis presents Seq2Saccharide, a novel method aimed at discovering new aminoglycosides and oligosaccharide products. It uses a probabilistic model to predict primary monomers and assembles these into candidate backbones, which are then modified to produce mature structures. The final structures are matched against mass spectral data using an error-tolerant database search. Seq2Saccharide has been benchmarked against known saccharides and has outperformed existing methods in predicting saccharide structures.
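To illustrate the general idea behind an error-tolerant database search, the following is a minimal sketch that works at the precursor-mass level only. It is not the Seq2Saccharide implementation: the `Candidate` type, the function names, the tolerance values, and the example masses are all hypothetical, and a real search would also score fragment peaks. A candidate is reported as an exact match when its mass agrees with the observed precursor mass within a ppm tolerance, and as a modified match when the two differ by a bounded mass offset that an unpredicted modification could explain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A predicted structure, reduced to a name and a neutral mass (hypothetical)."""
    name: str
    mass: float  # monoisotopic neutral mass in Da


def within_ppm(observed: float, theoretical: float, ppm: float) -> bool:
    """Return True if the two masses agree within the given ppm tolerance."""
    return abs(observed - theoretical) <= theoretical * ppm * 1e-6


def error_tolerant_search(precursor_mass, candidates, ppm=10.0, max_offset=100.0):
    """Match an observed precursor mass against candidate masses.

    Returns (name, kind, offset) tuples sorted by absolute offset, where kind is
    "exact" (within ppm tolerance) or "modified" (within max_offset Da).
    The default tolerances are illustrative assumptions, not published settings.
    """
    hits = []
    for cand in candidates:
        delta = precursor_mass - cand.mass
        if within_ppm(precursor_mass, cand.mass, ppm):
            hits.append((cand.name, "exact", 0.0))
        elif abs(delta) <= max_offset:
            # The unexplained mass difference is attributed to an unknown modification.
            hits.append((cand.name, "modified", round(delta, 4)))
    hits.sort(key=lambda hit: abs(hit[2]))
    return hits


# Usage with made-up candidate masses.
example_candidates = [
    Candidate("candidate_A", 500.0),
    Candidate("candidate_B", 514.0157),
    Candidate("candidate_C", 900.0),
]
example_hits = error_tolerant_search(514.0157, example_candidates)
```

Allowing a bounded offset lets a spectrum still be linked to a predicted structure when one tailoring step was missed or mispredicted, at the cost of more candidate matches to score.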
By integrating computational mass spectrometry with microbial genome mining, Seq2PKS and Seq2Saccharide significantly enhance the discovery of new natural products. As a result, several natural products and novel BGCs for existing natural products have been identified.
Funding
Discovering novel small molecule antibiotics from complex microbial communities by integrating computational mass spectrometry and metagenome mining
National Institute of General Medical Sciences
Computational methods for nonribosomal peptide discovery
Directorate for Biological Sciences
History
Date
2024-08-01
Degree Type
- Dissertation
Department
- Computational Biology
Degree Name
- Doctor of Philosophy (PhD)