Algorithms for Interpretable High-Dimensional Regression
The success of machine learning has had an unquestionable impact on nearly every domain of quantitative research and on society at large. A substantial portion of this success is due to advances in deep learning, which have enabled technology that was hardly imaginable a generation ago.
However, the black-box nature of neural networks remains a barrier to the adoption of machine learning models in many domains. One issue is that the opacity of black-box models makes them difficult to trust, especially in situations where incorrect predictions can be catastrophic. Another issue is that the complex models they provide may be impractical in many cases—for example, for use in mathematical optimization frameworks. In this context, it is critically important that researchers develop machine learning methods that are interpretable—that is, whose predictions and mechanisms are understandable to human stakeholders.
Much recent work has sought to apply mathematical programming (MP) techniques to construct interpretable machine learning models. In particular, the work of Cozad et al. [1–4] laid the foundation for mathematical programming approaches in three broad areas of interpretable machine learning: (1) sparse linear regression, (2) symbolic regression, and (3) constrained regression. However, the approaches in [1–4] were all limited to problems of fairly small size. In this thesis, we highlight the effectiveness of interpretable machine learning methods, and we apply ideas from mathematical programming to build upon the work of Cozad et al. in these three areas by focusing on MP-based methods for high-dimensional problems.
We start by illustrating the advantages of interpretable machine learning methods over black-box methods in the context of optimization solver tuning. Traditional approaches to solver tuning seek to design algorithms that can efficiently search the space of hyperparameters. Instead, the novel approach in our work is to first identify a set of features that describe a nonlinear program (NLP). We learn a predictive black-box ML model that identifies settings of the chosen parameters which reduce IPOPT iteration counts by 15% on average, and up to 100%, across many NLPs. We also show that an interpretable ML model works nearly as well as the black-box approach and has the advantage of relying only on a small feature space that is very easy to compute.
Sparse linear regression is a vast field, and many different algorithms are available for building models. We begin with a meta-analysis of the literature to provide recommendations to users about when to use which regression algorithm.
We then consider Best Subset Selection (BSS), which constructs sparse linear regression models by choosing a subset of candidate regression variables to include in the model, minimizing prediction error while also minimizing complexity. The combinatorial nature of the problem means that BSS may be impractical to solve exactly in many instances. We propose new approximate methods that round the quadratic programming (QP) relaxation of the mixed-integer quadratic programming (MIQP) formulation of BSS to construct good approximate solutions for large problems. We provide theoretical analysis of our algorithms in special cases to give insight into why they perform well. Empirically, our rounding algorithms outperform other popular approaches to subset selection.
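As a minimal illustration of the relax-and-round idea (a simplified stand-in, not the thesis algorithm itself), one can replace the cardinality constraint with a convex surrogate, keep the k largest-magnitude coefficients of the relaxed solution, and refit least squares on that support. The sketch below uses a ridge-regularized relaxation as a hypothetical proxy for the QP relaxation of the MIQP:

```python
import numpy as np

def bss_round(X, y, k, lam=1e-3):
    """Approximate best subset selection by rounding a convex relaxation.

    Illustrative sketch: solve a ridge-regularized least-squares problem
    (the convex relaxation), "round" by keeping the k coefficients of
    largest magnitude, then refit ordinary least squares on that support.
    """
    n, p = X.shape
    # Convex relaxation: ridge regression has a closed-form solution
    beta_relax = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    # Rounding step: select the k largest-magnitude coefficients
    support = np.sort(np.argsort(-np.abs(beta_relax))[:k])
    # Refit OLS restricted to the selected support
    beta = np.zeros(p)
    beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return beta, support
```

On well-conditioned problems with a strong sparse signal, this cheap heuristic often recovers the true support without enumerating subsets.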
Building on our work on subset selection, we propose the flexible Regularized Subsets framework, which performs variable selection and coefficient shrinkage in two stages: first, any subset selection method is used to select variables; then, a convex penalty is used to estimate the coefficients. This approach encourages both sparsity and robustness. We show that Regularized Subsets with rounding for variable selection outperforms other linear model-building approaches in a variety of problem settings. In addition, Regularized Subsets can build models from a candidate set of thousands of variables in seconds.
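The two-stage structure can be sketched as follows. The selector function and the ridge penalty here are illustrative choices only; the framework admits any subset selection method in the first stage and any convex penalty in the second:

```python
import numpy as np

def regularized_subsets(X, y, select, lam=1.0):
    """Two-stage sketch of a Regularized Subsets-style fit.

    Stage 1: `select(X, y)` is any subset selection routine returning
             an index array of chosen columns (a pluggable choice).
    Stage 2: shrink coefficients on the selected columns only, here
             with a ridge penalty as one example of a convex penalty.
    """
    support = np.asarray(select(X, y))
    Xs = X[:, support]
    # Ridge estimate restricted to the selected variables
    coef = np.linalg.solve(Xs.T @ Xs + lam * np.eye(len(support)), Xs.T @ y)
    beta = np.zeros(X.shape[1])
    beta[support] = coef
    return beta

# Example stage-1 selector: keep the two columns most correlated with y
top2 = lambda X, y: np.argsort(-np.abs(X.T @ y))[:2]
```

Separating selection from shrinkage means the expensive combinatorial step touches only the support, while the convex second stage stabilizes the coefficient estimates.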
Next, we consider the constrained regression problem. Modelers would often like to further improve accuracy by imposing constraints on the regression coefficients, thereby incorporating a priori knowledge into the linear regression algorithm. We propose the flexible Linearly-Constrained Regularized Subsets (LC-RS) framework for this task. LC-RS can easily handle dozens of regression variables and many linear constraints. In addition, LC-RS leads to significantly sparser models than the current standard in sparse constrained regression—the Constrained Lasso—with similar predictive performance. Finally, we provide LC-RS as an open-source tool.
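For intuition about the constrained-fit building block, a least-squares problem with linear equality constraints can be solved exactly through its KKT system. This toy sketch covers only that constrained fit; it omits the variable selection and shrinkage that the LC-RS framework layers on top:

```python
import numpy as np

def constrained_ls(X, y, A, b):
    """Least squares subject to linear equality constraints A @ beta = b.

    Solves the KKT system of the equality-constrained problem
        min ||X beta - y||^2  s.t.  A beta = b
    i.e.  [X'X  A'] [beta]   [X'y]
          [A    0 ] [nu  ] = [b  ]
    (Illustrative only; the full LC-RS method also selects variables.)
    """
    p, m = X.shape[1], A.shape[0]
    K = np.block([[X.T @ X, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([X.T @ y, b])
    sol = np.linalg.solve(K, rhs)
    return sol[:p]  # discard the Lagrange multipliers nu
```

For example, the constraint that coefficients sum to one (A a row of ones, b = [1]) encodes prior knowledge that the fit is a weighted average of the regressors.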
We then investigate symbolic regression (SR), a method popular for its ability to construct equations of arbitrary mathematical form. We introduce STAR (Symbolic regression Through Algebraic Representations), an algorithm that uses ideas from MP to build symbolic regression models with relatively low computational burden. STAR uses nonlinear programming and integer programming techniques to define the structure of the symbolic expression and to optimize its parameters. Further leveraging the MP approach, STAR can be used to provably enhance the constant structure of an arbitrary symbolic expression, as well as to optimize the features of the expression. STAR's empirical performance exceeds that of some top SR algorithms in terms of training error, testing error, and computational time, and does so across a very wide variety of test cases. STAR is the first algorithm to show that mathematical programming is a viable paradigm for symbolic regression on real problems with dozens of regressors and hundreds of data points.
Finally, we extend STAR into a novel mathematical programming approach to constrained symbolic regression that incorporates a priori knowledge by enforcing constraints in either a hard or a soft fashion. In certain cases, our algorithm can guarantee that constraints are met at points outside the input set. Our approach is the first symbolic regression algorithm that can enforce general algebraic constraints outside the domain of the input variables.
- Doctor of Philosophy (PhD)