Carnegie Mellon University

Clustering Problems for High Dimensional Data

posted on 2014-06-01, 00:00 authored by Wangie Wang

We consider a clustering problem where we observe feature vectors X_i ∈ R^p, i = 1, 2, ..., n, from several possible classes. The class labels are unknown, and the main interest is to estimate them. We propose a three-step clustering procedure: first, we evaluate the significance of each feature by the Kolmogorov-Smirnov (KS) statistic; second, we select the small fraction of features whose KS scores exceed a preselected threshold t > 0; third, we cluster using only the selected features, by one version of Principal Component Analysis (PCA). One of the main challenges in this procedure is how to set the threshold t. We propose a new approach to setting the threshold, the core of which is the so-called Signal-to-Noise Ratio (SNR) in post-selection PCA. The SNR is reminiscent of the recent innovation of Higher Criticism; for this reason, we call the proposed threshold the Higher Criticism Threshold (HCT), although it differs significantly from the HCT proposed earlier by [Donoho 2008] in the context of classification.

Motivated by many examples in Big Data, we study spectral clustering with HCT in the two-class case, for a model where the signals are both rare and weak. Through delicate PCA, we forge a close link between the HCT and the ideal threshold choice, and show that the HCT yields optimal results in the spectral clustering approach. The approach is successfully applied to three gene microarray data sets, where it compares favorably with existing clustering methods.

Our analysis is subtle and requires new developments in Random Matrix Theory (RMT). One challenge is that most results in RMT cannot be applied directly to our case: existing results are usually for matrices with i.i.d. entries, but the object of interest here is the post-selection data matrix, whose columns (due to feature selection) are not independent and have hard-to-track distributions. We develop intricate new RMT to overcome this problem. We also derive theoretical approximations for the tail distribution of the Kolmogorov-Smirnov statistic under both the null and the alternative hypotheses; these approximations support the effectiveness of the KS statistic. In addition, we establish the fundamental limits for the clustering, signal recovery, and detection problems under the Asymptotic Rare and Weak model. We identify a boundary in the parameter space: when the model parameters fall beyond it, successful inference is impossible; otherwise, some method (usually exhaustive search) achieves it.
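
The three-step procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function names (ks_scores, screen_and_cluster, simulate_two_class) are hypothetical, the KS statistic is computed against a standard normal after standardizing each feature, the threshold t is supplied by hand rather than chosen by HCT, and the two clusters are read off the sign of the leading left singular vector as one simple version of PCA clustering. The simulated signals are deliberately strong so that the effect is easy to see; the rare and weak regime studied in the dissertation is much harder.

```python
import math
import numpy as np


def ks_scores(X):
    """KS statistic of each standardized column of X against N(0, 1)."""
    n, _ = X.shape
    # Standardize each feature so a pure-noise column looks roughly N(0, 1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Z = np.sort(Z, axis=0)
    # Standard normal CDF evaluated at every sorted value.
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(Z / math.sqrt(2.0)))
    upper = np.arange(1, n + 1)[:, None] / n  # empirical CDF just after each point
    lower = np.arange(0, n)[:, None] / n      # empirical CDF just before each point
    return np.maximum(upper - cdf, cdf - lower).max(axis=0)


def screen_and_cluster(X, t):
    """Keep features with KS score above t, then split samples into two groups."""
    scores = ks_scores(X)
    selected = np.flatnonzero(scores > t)
    if selected.size == 0:
        raise ValueError("no feature exceeds the threshold t")
    # Post-selection data matrix, centered column by column.
    W = X[:, selected] - X[:, selected].mean(axis=0)
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    # Assumption: two classes, separated by the sign of the leading left singular vector.
    labels = (U[:, 0] > 0).astype(int)
    return labels, selected


def simulate_two_class(n, p, n_useful, shift, seed):
    """Toy data: only the first n_useful features carry a class-dependent mean shift."""
    rng = np.random.default_rng(seed)
    truth = np.repeat([0, 1], [n // 2, n - n // 2])
    X = rng.standard_normal((n, p))
    X[:, :n_useful] += shift * (2 * truth[:, None] - 1)
    return X, truth


if __name__ == "__main__":
    X, truth = simulate_two_class(n=200, p=500, n_useful=20, shift=3.0, seed=0)
    labels, selected = screen_and_cluster(X, t=0.1)
    agree = float(np.mean(labels == truth))
    # Cluster labels are only defined up to swapping 0 and 1.
    print("features kept:", selected.size)
    print("clustering accuracy:", max(agree, 1.0 - agree))
```

In this sketch the hand-picked value t = 0.1 is arbitrary; choosing t in a data-driven way is exactly the problem the HCT is proposed to address.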




Degree Type

  • Dissertation


Department

  • Statistics

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Jiashun Jin
