Clustering Problems for High Dimensional Data

Wang, Wangie

doi:10.1184/R1/6715118.v1

Clustering Problems for High Dimensional Data.pdf (1.67 MB)

Clustering Problems for High Dimensional Data

thesis

posted on 2014-06-01, 00:00 authored by Wangie Wang

We consider a clustering problem where we observe feature vectors X_i ∈ R^p, i = 1, 2, ..., n, from several possible classes. The class labels are unknown and the main interest is to estimate these labels. We propose a three-step clustering procedure where we first evaluate the significance of each feature by the Kolmogorov-Smirnov statistic, then we select the small fraction of features for which the Kolmogorov-Smirnov scores exceed a preselected threshold t > 0, and then use only the selected features for clustering by one version of the Principal Component Analysis (PCA). In this procedure, one of the main challenges is how to set the threshold t. We propose a new approach to set the threshold, where the core is the so-called Signal-to-Noise Ratio (SNR) in post-selection PCA. SNR is reminiscent of the recent innovation of Higher Criticism; for this reason, we call the proposed threshold the Higher Criticism Threshold (HCT), despite that it is significantly different from the HCT proposed earlier by [Donoho 2008] in the context of classification. Motivated by many examples in Big Data, we study the spectral clustering with HCT for a model where the signals are both rare and weak for two-classes clustering case. Through delicate PCA, we forge a close link between the HCT and the ideal threshold choice, and show that the HCT yields optimal results in the spectral clustering approach. The approach is successfully applied to three gene microarray data sets, where it compares favorably with existing clustering methods. Our analysis is subtle and requires new development in the Random Matrix Theory (RMT). One challenge we face is that most results in the RMT can not be applied directly to our case: existing results are usually for matrices with i.i.d. entries, but the object of interest in the current case is the post-selection data matrix, where (due to feature selection) the columns are non-independent and have hard-to-track distributions. We develop intricate new RMT to overcome this problem. We also find the theoretical approximation for the tail distribution of Kolmogorov-Smirnov Statistic under null hypothesis and alternative hypothesis. With the theoretical approximation, we can claim the effectiveness of KS statistic. Besides, we also find the fundamental limits for clustering problem, signal recovery problem, and detection problem under the Asymptotic Rare and Weak model. We find the boundary such that when the model parameters are beyond the boundary, then the inference is unavailable, otherwise there are some methods (usually exhausted search) to achieve the inference.

History

Date

2014-06-01

Degree Type

Dissertation

Department

Statistics

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Jiashun Jin

Usage metrics

Keywords

Statistics

Licence

In Copyright

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Clustering Problems for High Dimensional Data

History

Date

Degree Type

Department

Degree Name

Advisor(s)

Usage metrics

Categories

Keywords

Licence

Exports