High-dimensional Change Point Analysis for Categorical Data
Statistical change point analysis is concerned with identifying abrupt changes in the data, generally observed as a time series, that are due to changes in the underlying distribution and not to random fluctuations. Though the statistical literature on change point analysis is vast, with a multitude of theories and methods, the traditional, prevailing framework for change point problems is inadequate to capture the complexity of modern high-dimensional data. Motivated by the increasing demands in applications and by the gap in theoretical and methodological research, in this thesis we develop novel procedures and theoretical results for change point detection with high-dimensional categorical data. The thesis consists of three major parts.
First, we establish high probability bounds for localization error of change point estimators under the general setting of multiple change points in categorical responses with high-dimensional covariates. Our theory includes popular models, such as Poisson models for count response, logit models for nominal responses, and Bradley-Terry models for ranking. As an application, we study novel change point problems for high-dimensional ranking models.
Second, we develop a novel, general, and computationally efficient framework called Divide and Conquer Dynamic Programming (DCDP) for localizing change points in time series data with highdimensional features. DCDP is applicable to a broad variety of high-dimensional statistical models and can enjoy almost linear computational complexity. We investigate the performance of DCDP in three commonly studied change point settings in high dimensions: the mean model, the Gaussian graphical model, and the linear regression model. In all three cases, we derive non-asymptotic bounds for the accuracy of the DCDP change point estimators, which match existing bounds established in the literature and are, in some cases, minimax rate optimal.
Last, we study the change point localization problem for directed multilayer random dot product networks with latent positions drawn at each time point. We deploy a state-of-the-art tensor algorithm to provide an estimator of the expected value of the adjacency matrix that provably outperforms existing procedures. We further develop a novel online nonparametric change point detection algorithm based on kernel density estimators and bound its detection delay. Our bound depends explicitly on all the model parameters, including the number of nodes, the magnitude and time of the change, the number of layers, and the dimension of the latent position.
History
Date
2023-04-20Degree Type
- Dissertation
Department
- Statistics and Data Science
Degree Name
- Doctor of Philosophy (PhD)