Carnegie Mellon University
xgu1_pdf_sds_2021.pdf (2.98 MB)

Adaptive Efficient Histograms

Download (2.98 MB)
posted on 2022-05-04, 18:54 authored by Xiaoyi GuXiaoyi Gu

Nonparametric density estimation is a fundamental task in statistics. In many applications, such as clustering, non-parametric testing, classi?cation, anomaly detection

and topological data analysis, density estimation is performed as a necessary preliminary step to solve more complex problems. Unfortunately, both the theoretical

guarantees and the computational costs of virtually all commonly used density estimators are impacted signi?cantly by the dimensionality of the data. This is due to the

intrinsic hardness of the problem: minimax optimal density estimation is computationally infeasible even in small dimensions. In order to overcome the computational and

theoretical limitations of the classical density estimations framework, in this thesis we study computationally e?cient methods for obtaining adaptive histograms, piecewise

constant density estimators over data adaptive partitions de?ned by axis-aligned hyperrectangles. We consider a variant of the density estimator tree (DET) method of Ram and Gray (2011), a very fast, fully data-driven greedy CART-like procedure that returns a tree-structured density estimator. We show through extensive simulations that

our modi?ed DET procedure outperforms the standard version of DET, as well as traditional density estimators under a variety of scenarios and metrics, and is highly

interpretable. Inference for classical density estimation methods comes at a price, generally requiring certain smoothness conditions on the underlying density and computationally intensive procedures such as the bootstrap. In addition, the greedy nature of the DET method and the lack of theoretical understanding of its properties makes the task even more formidable. The second contribution of this thesis is the development of computationally feasible procedures to construct con?dence sets for high-dimensional

distributions based on the DET estimator. Our methodology relies on sample splitting and harness the explicit form of the upper level sets of the DET estimations, which can

be easily evaluated as a union of axes-aligned hyper-rectangles. The con?dence sets we develop come with coverage guarantees that are ?nite-sample, dimension free and do not require any smoothness on the data generating distribution. We propose three ways for building such con?dence sets, and provide efficient algorithms that return

density functions guaranteed to belong to the con?dence set. We explore various applications of our methods, including classi?cation, clustering, anomaly detection, and background subtraction.




Degree Type

  • Dissertation


  • Statistics and Data Science

Degree Name

  • Doctor of Philosophy (PhD)


Alessandro Rinaldo Arun Kuchibhotla

Usage metrics



    Ref. manager