Clustering Problems for High Dimensional Data
We consider a clustering problem where we observe feature vectors Xi ∈ Rp, i = 1, 2, ..., n, from several possible classes. The class labels are unknown and the main interest is to estimate these labels. We propose a three-step clustering procedure where we first evaluate the significance of each feature by the Kolmogorov-Smirnov statistic, then we select the small fraction of features for which the Kolmogorov-Smirnov scores exceed a preselected threshold t > 0, and then use only the selected features for clustering by one version of the Principal Component Analysis (PCA). In this procedure, one of the main challenges is how to set the threshold t. We propose a new approach to set the threshold, where the core is the so-called Signal-to-Noise Ratio (SNR) in post-selection PCA. SNR is reminiscent of the recent innovation of Higher Criticism; for this reason, we call the proposed threshold the Higher Criticism Threshold (HCT), despite that it is significantly different from the HCT proposed earlier by [Donoho 2008] in the context of classification. Motivated by many examples in Big Data, we study the spectral clustering with HCT for a model where the signals are both rare and weak for two-classes clustering case. Through delicate PCA, we forge a close link between the HCT and the ideal threshold choice, and show that the HCT yields optimal results in the spectral clustering approach. The approach is successfully applied to three gene microarray data sets, where it compares favorably with existing clustering methods. Our analysis is subtle and requires new development in the Random Matrix Theory (RMT). One challenge we face is that most results in the RMT can not be applied directly to our case: existing results are usually for matrices with i.i.d. entries, but the object of interest in the current case is the post-selection data matrix, where (due to feature selection) the columns are non-independent and have hard-to-track distributions. We develop intricate new RMT to overcome this problem. We also find the theoretical approximation for the tail distribution of Kolmogorov-Smirnov Statistic under null hypothesis and alternative hypothesis. With the theoretical approximation, we can claim the effectiveness of KS statistic. Besides, we also find the fundamental limits for clustering problem, signal recovery problem, and detection problem under the Asymptotic Rare and Weak model. We find the boundary such that when the model parameters are beyond the boundary, then the inference is unavailable, otherwise there are some methods (usually exhausted search) to achieve the inference.
History
Date
2014-06-01Degree Type
- Dissertation
Department
- Statistics
Degree Name
- Doctor of Philosophy (PhD)