Adaptive Efficient Histograms
Nonparametric density estimation is a fundamental task in statistics. In many applications, such as clustering, non-parametric testing, classi?cation, anomaly detection
and topological data analysis, density estimation is performed as a necessary preliminary step to solve more complex problems. Unfortunately, both the theoretical
guarantees and the computational costs of virtually all commonly used density estimators are impacted signi?cantly by the dimensionality of the data. This is due to the
intrinsic hardness of the problem: minimax optimal density estimation is computationally infeasible even in small dimensions. In order to overcome the computational and
theoretical limitations of the classical density estimations framework, in this thesis we study computationally e?cient methods for obtaining adaptive histograms, piecewise
constant density estimators over data adaptive partitions de?ned by axis-aligned hyperrectangles. We consider a variant of the density estimator tree (DET) method of Ram and Gray (2011), a very fast, fully data-driven greedy CART-like procedure that returns a tree-structured density estimator. We show through extensive simulations that
our modi?ed DET procedure outperforms the standard version of DET, as well as traditional density estimators under a variety of scenarios and metrics, and is highly
interpretable. Inference for classical density estimation methods comes at a price, generally requiring certain smoothness conditions on the underlying density and computationally intensive procedures such as the bootstrap. In addition, the greedy nature of the DET method and the lack of theoretical understanding of its properties makes the task even more formidable. The second contribution of this thesis is the development of computationally feasible procedures to construct con?dence sets for high-dimensional
distributions based on the DET estimator. Our methodology relies on sample splitting and harness the explicit form of the upper level sets of the DET estimations, which can
be easily evaluated as a union of axes-aligned hyper-rectangles. The con?dence sets we develop come with coverage guarantees that are ?nite-sample, dimension free and do not require any smoothness on the data generating distribution. We propose three ways for building such con?dence sets, and provide efficient algorithms that return
density functions guaranteed to belong to the con?dence set. We explore various applications of our methods, including classi?cation, clustering, anomaly detection, and background subtraction.
History
Date
2021-12-16Degree Type
- Dissertation
Department
- Statistics and Data Science
Degree Name
- Doctor of Philosophy (PhD)