Novel Persistent Embeddings of Protein Pockets and their Evaluation on Proper Train-Test Splits for Ligand Binding Affinity
Computational drug discovery offers significant speed-ups in the discovery phase of the drug development pipeline. A key theme in drug discovery is repurposing: the reuse of either drugs or receptor pockets to develop new therapeutics. To accomplish this, descriptors are used to create a fixed-size vector describing the protein pocket or small molecule of interest, and the distance between two such vectors indicates how similar or dissimilar the pair is. In this work, we describe novel protein pocket descriptors based on ideas taken from ligand binding affinity tasks, and evaluate those descriptors across several datasets. These descriptors borrow from persistent homology and introduce new embedding creation methods, including gradient binning, to produce higher-resolution embeddings. In addition, the descriptors were used in ligand binding affinity tasks, in conjunction with a Geometric Vector Perceptron graph neural network, to assess their ability to support downstream prediction tasks and to evaluate information-leakage-free train-test splits. Persistence embeddings for protein pockets, derived from ligand-binding complex descriptors, contain enough information to perform classification using learned models. However, they display mixed results when Euclidean distance is used to find similar and dissimilar pairs. Additionally, we find that our descriptors can support downstream learning tasks. Although these persistent homology models perform well on datasets like PDBBind, they fail to generalize properly on splits that minimize information leakage.
History
Date
2024-12-01
Degree Type
- Master's Thesis
Department
- Biological Sciences
Degree Name
- Master of Science (MS)