Learning the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) from data is an important and classical problem in machine learning, with prominent applications in causal inference, fairness, interpretability, and biology. It is a challenging problem because the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches often rely on various local heuristics for enforcing the acyclicity constraint. By contrast, structure learning for undirected graphical models (e.g., Gaussian MRFs) is now recognized as a tractable optimization problem and has achieved great success in practical domains such as bioinformatics.

In this thesis, we take a first step towards bridging this gap between directed and undirected graphical models. We begin by introducing a fundamentally different strategy for Bayesian network structure learning: we formulate the problem as a purely continuous optimization program over real matrices that avoids the combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.
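To make the idea of a smooth, exact acyclicity characterization concrete, the sketch below assumes the trace-of-matrix-exponential form h(W) = tr(exp(W ∘ W)) − d, which vanishes exactly when the weighted adjacency matrix W encodes a DAG; the function names are illustrative rather than a fixed API.

```python
# A minimal sketch, assuming the characterization h(W) = tr(exp(W ∘ W)) - d.
import numpy as np
from scipy.linalg import expm

def h(W: np.ndarray) -> float:
    """Smooth acyclicity measure: h(W) = 0 if and only if W is a DAG.

    tr(exp(W ∘ W)) accumulates weighted closed walks of every length,
    so it exceeds d (the number of nodes) exactly when a cycle exists.
    """
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the elementwise square

def h_grad(W: np.ndarray) -> np.ndarray:
    """Gradient for first-order solvers: grad h(W) = exp(W ∘ W)^T ∘ 2W."""
    return expm(W * W).T * 2 * W

# A 2-cycle is penalized; an upper-triangular (acyclic) matrix is not.
W_cycle = np.array([[0.0, 0.8], [0.5, 0.0]])
W_dag = np.array([[0.0, 0.8], [0.0, 0.0]])
print(h(W_cycle) > 0)           # True
print(np.isclose(h(W_dag), 0))  # True
```

An equality constraint of this form can then be driven to zero by standard machinery, such as an augmented Lagrangian or penalty method.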
We then study the generalization of this continuous algorithm to learning nonparametric DAGs. We extend the algebraic characterization of acyclicity to nonparametric structural equation models (SEMs) by leveraging nonparametric sparsity based on partial derivatives, resulting in a continuous optimization problem that applies to a variety of nonparametric and semiparametric models, including GLMs, additive noise models, and index models as special cases.
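As one concrete illustration of the partial-derivative idea, the sketch below assumes each structural equation f_j is parameterized by an MLP, a common instantiation of the nonparametric SEM; the helper name and array shapes are assumptions for illustration only.

```python
# A minimal sketch, assuming each structural equation f_j is an MLP.
import numpy as np

def mlp_adjacency(first_layers: list) -> np.ndarray:
    """Reduce nonparametric sparsity to a weighted adjacency matrix.

    If every first-layer weight of f_j attached to input x_k is zero,
    the partial derivative of f_j with respect to x_k vanishes
    identically, i.e. there is no edge k -> j. The L2 norm of that
    weight group is therefore a smooth proxy for the edge:
        W[k, j] = || first-layer weights of f_j into x_k ||_2
    first_layers[j] is an (m_j, d) array for the MLP computing f_j.
    """
    d = len(first_layers)
    W = np.zeros((d, d))
    for j, A in enumerate(first_layers):
        W[:, j] = np.linalg.norm(A, axis=0)  # one norm per input x_k
    return W
```

Applying the acyclicity function from the previous sketch to this induced matrix turns nonparametric DAG learning into the same kind of continuous program.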
Lastly, we introduce a unified view of score-based and ICA-based methods through the proposed continuous optimization framework. In particular, we show that popular ICA-based methods, which exploit the non-Gaussianity of the independent noise distributions, can be handled by the continuous optimization framework; this formulation is conceptually clearer, makes it easier to incorporate prior knowledge, and has the potential to be generalized to models with hidden confounders and feedback loops.