Carnegie Mellon University

Generative Models for Structured Discrete Data with Application to Drug Discovery

thesis
posted on 2025-07-10, 16:53 authored by Chenghui Zhou
My thesis focuses on generative models and their applications to discrete data. We propose novel algorithms that integrate insights from state-of-the-art generative models with domain-specific knowledge of discrete data types. These algorithms aim to enhance property similarity to the training data, improve data validity, and raise the overall quality of generated outputs. The first part of my thesis investigates converting geometric images into a discrete representation using a context-free grammar. We discuss effective and scalable techniques for identifying suitable representations in a large search space. The second part of my thesis examines the behavior of Variational Autoencoders (VAEs) on high-dimensional data embedded in lower-dimensional manifolds, assessing their ability to recover both the manifold and the data density over it. Extending our exploration of VAEs to discrete data domains, particularly molecular data generation, we found that a method that enhances VAEs' manifold recovery for continuous data also significantly improves discrete data generation. We study its benefits and limitations using the ChEMBL dataset and two smaller datasets of active molecules for protein targets. Lastly, to address the challenge of generating stable 3D molecules, the thesis incorporates a non-differentiable chemistry oracle, GFN2-xTB, into the denoising process to improve geometry and stability. This approach is validated on datasets such as QM9 and GEOM, demonstrating higher stability rates among generated molecules.

Funding

INTELLIGENT MODEL-BASED ADAPTATION FOR MOBILE ROBOTICS

United States Department of the Air Force

ACCESSIBLE MACHINE LEARNING

United States Department of the Air Force

NRI: A Cognitive Navigation Assistant for the Blind

Directorate for Computer & Information Science & Engineering

History

Date

2024-08-24

Degree Type

  • Dissertation

Thesis Department

  • Machine Learning

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Barnabas Poczos
