Carnegie Mellon University

Gradient Descent for Non-convex Problems in Modern Machine Learning

thesis
posted on 2019-06-27, 15:57, authored by Simon Du
Machine learning has become an important tool set for artificial intelligence and data science across many fields. A modern machine learning method can often be reduced to a mathematical optimization problem. Among the algorithms for solving such problems, gradient descent and its variants, such as stochastic gradient descent and momentum methods, are the most popular. The optimization problems induced by classical machine learning methods are often convex and smooth, and gradient descent is guaranteed to solve them efficiently. Modern machine learning methods, like deep neural networks, on the other hand, often require solving a non-smooth and non-convex problem. In theory, general non-convex optimization problems cannot be solved efficiently; in practice, however, gradient descent and its variants often find a global optimum efficiently. These competing facts suggest that the optimization problems arising in practice often have special structure that makes gradient descent succeed.
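As a rough illustration of the classical guarantee (not an example from the thesis), the sketch below runs gradient descent on a convex, smooth least-squares objective; the data, step size, and iteration count are arbitrary choices.

    import numpy as np

    # Gradient descent on the convex, smooth objective f(w) = ||X w - y||^2 / (2 n);
    # with a small enough constant step size it converges to the global minimum.
    rng = np.random.default_rng(0)
    n, d = 200, 10
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true                        # noiseless labels, so the optimum is w_true

    def grad(w):
        return X.T @ (X @ w - y) / n

    w = np.zeros(d)
    eta = 0.1                             # constant step size (arbitrary choice)
    for _ in range(500):
        w = w - eta * grad(w)             # the update w <- w - eta * grad f(w)

    print(np.linalg.norm(w - w_true))     # close to zero: global minimum reached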
This thesis presents technical contributions to fill the gap between theory and practice for the gradient descent algorithm. The outline of the thesis is as follows.

In the first part, we consider applying gradient descent to minimize the empirical risk of a neural network. We will show that if a multi-layer neural network with smooth activation functions is sufficiently wide, then randomly initialized gradient descent can efficiently find a global minimum of the empirical risk. We will also show the same result for two-layer neural networks with the Rectified Linear Unit (ReLU) activation function. It is quite surprising that although the objective function of a neural network is non-convex, gradient descent can still find its global minimum. Lastly, we will study structural properties of the trajectory induced by the gradient descent algorithm.
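The sketch below illustrates this over-parameterization phenomenon under assumptions chosen only for the demo (dataset, width, step size, and iteration count are not the conditions established in the thesis): a wide two-layer ReLU network trained by randomly initialized gradient descent drives the empirical squared risk on a small dataset toward zero.

    import numpy as np

    # Randomly initialized gradient descent on a wide two-layer ReLU network
    # f(x) = a^T relu(W x) / sqrt(m); despite non-convexity, the training loss
    # on a small dataset is driven toward zero when the width m is large.
    rng = np.random.default_rng(1)
    n, d, m = 10, 10, 2000                # samples, input dimension, width
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.normal(size=n)                # arbitrary labels to fit

    W = rng.normal(size=(m, d))           # hidden weights, trained
    a = rng.choice([-1.0, 1.0], size=m)   # output weights, kept fixed here

    def loss(W):
        pred = np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)
        return 0.5 * np.sum((pred - y) ** 2)

    eta = 0.2
    for _ in range(2000):
        H = np.maximum(X @ W.T, 0.0)      # ReLU activations, shape (n, m)
        resid = H @ a / np.sqrt(m) - y
        # gradient of the squared risk with respect to the hidden weights W
        G = (resid[:, None] * (H > 0) * a[None, :] / np.sqrt(m)).T @ X
        W -= eta * G

    print(loss(W))                        # near zero despite non-convexity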
In the second part, we assume the labels are generated by a two-layer teacher convolutional neural network, and we consider using gradient descent to recover this teacher network. We will show that if the input distribution is Gaussian, then gradient descent can recover a one-hidden-layer convolutional neural network in which both the convolutional weights and the output weights are unknown parameters to be recovered. We will also show that the Gaussian input assumption can be relaxed to a general structural assumption if we only need to recover a single convolutional filter.
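The sketch below gives a toy version of this teacher-student setting (patch layout, sample size, step size, and initialization are illustrative choices, not the settings analyzed in the thesis): labels come from a teacher filter applied to Gaussian patches with average pooling, and gradient descent on the empirical squared loss is used to recover the filter.

    import numpy as np

    # Teacher-student recovery of a single convolutional filter: labels come from
    # f(x; w_star) = mean_j relu(w_star^T x_j) over k non-overlapping Gaussian
    # patches x_j (average pooling); gradient descent on the empirical squared
    # loss is then used to recover w_star.
    rng = np.random.default_rng(2)
    n, k, d = 10000, 4, 16                # samples, patches per input, filter size
    X = rng.normal(size=(n, k, d))        # Gaussian input patches
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)

    def predict(w):
        return np.maximum(X @ w, 0.0).mean(axis=1)   # average-pooled ReLU responses

    y = predict(w_star)                   # labels generated by the teacher filter

    w = 0.1 * rng.normal(size=d)          # random initialization
    eta = 0.5
    for _ in range(1000):
        act = X @ w                       # pre-activations, shape (n, k)
        resid = np.maximum(act, 0.0).mean(axis=1) - y
        grad = np.einsum('nj,njd->d', resid[:, None] * (act > 0), X) / (n * k)
        w -= eta * grad

    print(np.linalg.norm(w - w_star))     # typically close to zero: filter recovered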
In the third part, we study conditions under which gradient descent fails. We will show that gradient descent can take exponential time to optimize a smooth function satisfying the strict saddle point property, a function that noise-injected gradient descent can optimize in polynomial time. While our focus is theoretical, whenever possible we also present experiments that illustrate our theoretical findings.
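The toy sketch below shows only the qualitative difference, not the thesis's construction (which establishes an exponential-time lower bound from random initialization): on a simple strict-saddle function, gradient descent started on the saddle's stable manifold stalls at the saddle, while the same iteration with small injected noise escapes to a local minimum.

    import numpy as np

    # A strict-saddle function f(x, y) = x^2/2 + y^4/4 - y^2/2 with a saddle at the
    # origin and minima at (0, +1) and (0, -1). Plain gradient descent started on
    # the saddle's stable manifold (y = 0) converges to the saddle; the same
    # iteration with small injected noise escapes to a minimum.
    def grad(p):
        x, y = p
        return np.array([x, y ** 3 - y])

    def run(noise_scale, eta=0.1, steps=2000, seed=0):
        rng = np.random.default_rng(seed)
        p = np.array([1.0, 0.0])          # initialized on the stable manifold
        for _ in range(steps):
            p = p - eta * grad(p) + noise_scale * rng.normal(size=2)
        return p

    print("gradient descent:      ", run(noise_scale=0.0))    # stays near (0, 0)
    print("noise-injected descent:", run(noise_scale=1e-3))   # reaches (0, +-1)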

Funding

Department of Energy award DEAR0000596

Department of the Interior award D17AP00001

Air Force Research Laboratory award FA87501720212

Foxconn Technology Group

History

Date

2019-05-13

Degree Type

  • Dissertation

Department

  • Machine Learning

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

  • Barnabas Poczos
  • Aarti Singh
