Self-Supervised Representation Learning for Molecular Property Predictions
Deep learning (DL) has been widely applied in molecular modeling for property prediction. However, DL for molecules faces two major challenges: (1) the chemical space of potentially active molecules is enormous, and (2) labeled molecular property data are scarce because simulations and experiments are expensive and time-consuming. DL models trained on such limited data in a supervised manner struggle to generalize to novel molecules. Recently, self-supervised learning (SSL) has attracted growing attention for learning representations from unlabeled data by deriving supervisory objectives from the data itself. Unlike supervised learning, SSL can leverage massive amounts of data without manually annotated labels, which holds the promise of learning generic molecular representations for a variety of applications.
In this dissertation, we study self-supervised molecular representation learning that exploits large unlabeled datasets for better molecular property prediction. The dissertation consists of three parts, each investigating SSL with a different representation of molecules for different applications. In Part I, we introduce contrastive learning (CL) to learn representations from 2D molecular graphs with graph neural networks (GNNs). We further improve the CL framework by mitigating faulty negatives with molecular fingerprints and by contrasting at the fragment level between decomposed molecular motifs. This part investigates a wide variety of property prediction tasks for small organic molecules, spanning physiology, biophysics, physical chemistry, and quantum mechanics. In Part II, we investigate SSL methods that leverage 3D molecular geometries. In particular, we propose denoising pre-training, which significantly improves the accuracy of molecular potential predictions with equivariant GNNs. Notably, our models pre-trained on small molecules transfer remarkably well, improving performance when fine-tuned on diverse molecular systems, including systems with different elements, charged molecules, biomolecules, and larger structures. Lastly, in Part III, we develop structure-agnostic language models, especially Transformers, for chemical science. We propose chemistry-aware tokenization and adapt masked language modeling for polymer property prediction. Moreover, we exploit the multiple modalities of metal-organic frameworks (MOFs) by jointly training and aligning two branches: string representations encoded by Transformers and 3D geometric representations. Overall, our research advances self-supervised molecular representation learning toward more accurate prediction of various molecular properties, with potential implications for accelerating drug and materials discovery.
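To make the contrastive objective of Part I concrete, below is a minimal sketch of an NT-Xent-style (InfoNCE) loss, assuming a GNN encoder has already produced embeddings z1 and z2 for two augmented views of each molecule in a batch. The encoder, augmentations, batch size, and temperature are stand-in assumptions, not the dissertation's exact setup.

```python
# A minimal NT-Xent (InfoNCE) sketch for graph contrastive learning.
# z1, z2 are hypothetical embeddings of two views of the same molecules.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Pull the two views of each molecule together; treat all other
    pairs in the batch as negatives. (Faulty-negative mitigation, as in
    Part I, could additionally mask out pairs whose fingerprints are
    highly similar before computing the loss.)"""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d), unit norm
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    # positives: row i should match row i+n (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage with random stand-in embeddings; a real pipeline would produce
# z1, z2 from a GNN over two augmentations of the same molecular graph.
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = nt_xent_loss(z1, z2)
```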
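Similarly, the denoising pre-training of Part II can be sketched as regressing noise that was added to equilibrium 3D coordinates. The toy MLP below merely stands in for an equivariant GNN, and the noise scale is an assumed hyperparameter.

```python
# A minimal coordinate-denoising pre-training sketch: perturb atomic
# positions with Gaussian noise and train the model to predict that noise.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for an equivariant GNN: maps (atomic number, position)
    per atom to a predicted 3D noise vector."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, hidden), nn.SiLU(), nn.Linear(hidden, 3))

    def forward(self, atom_z: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([atom_z.unsqueeze(-1).float(), pos], dim=-1)  # (n_atoms, 4)
        return self.net(feats)                                          # predicted noise

model = ToyDenoiser()
atom_z = torch.tensor([6, 6, 8, 1, 1])   # e.g. a small C-C-O fragment
pos = torch.randn(5, 3)                  # equilibrium-like coordinates
sigma = 0.05                             # assumed noise scale
noise = sigma * torch.randn_like(pos)
pred = model(atom_z, pos + noise)        # denoise the perturbed geometry
loss = ((pred - noise) ** 2).mean()      # regress the added noise
loss.backward()
```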
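For Part III, the sketch below illustrates chemistry-aware tokenization plus masking for masked language modeling on a polymer string: multi-character atoms such as Cl and Br are kept as single tokens rather than split character by character. The regex vocabulary and the 15% mask rate are illustrative assumptions, not the dissertation's exact scheme.

```python
# Chemistry-aware tokenization and random masking for MLM pre-training.
import random
import re

# Bracketed atoms and common two-letter elements first, then single symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|Si|[BCNOSPFIbcnops0-9()=#*+\-/\\]")

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES-like polymer string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)

def mask_tokens(tokens: list[str], rate: float = 0.15) -> tuple[list[str], list[str]]:
    """Replace a fraction of tokens with [MASK]; return corrupted tokens
    and the labels the model is trained to recover."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < rate:
            corrupted.append("[MASK]")
            labels.append(tok)   # predict the original token here
        else:
            corrupted.append(tok)
            labels.append("-")   # ignored position
    return corrupted, labels

random.seed(0)
tokens = tokenize("*CC(*)c1ccccc1")  # polystyrene repeat unit, * as endpoints
print(mask_tokens(tokens))
```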
History
Date
- 2023-05-01
Degree Type
- Dissertation
Department
- Mechanical Engineering
Degree Name
- Doctor of Philosophy (PhD)