This thesis develops scalable clustering and anomaly detection methods for analyzing high-dimensional data, under realistic assumptions and with theoretically sound guarantees. It also studies the theory behind the performance of the proposed methods. Specifically, this thesis takes an inferential approach to searching for evidence that a single data set contains two or more collections of data with different distributions. It addresses two fundamental questions: (a) How can we perform clustering that results in statistically significant clusters? (b) In high-energy physics, how can we detect new signals in experimental data that are not explained by known physics models, without assuming a model for the new signal?

To answer the first question, we consider clustering based on significance tests for Gaussian Mixture Models (GMMs). Our starting point is the SigClust method developed by Liu et al. (2008), which introduces a test based on the k-means objective (with k = 2) to decide whether the data should be split into
two clusters. When applied recursively, this test yields a method for hierarchical clustering that is equipped
with significance guarantees. We study the limiting distribution and power of this approach in some examples
and show that there are large regions of the parameter space where the power is low. We then introduce a new
test based on the idea of relative fit. Unlike prior work, we test whether a mixture of Gaussians provides a
better fit than a single Gaussian, without assuming that either model is correct. The proposed test has a simple critical value and provides provable error control. We then develop several versions of the test, one of which provides exact type I error control without requiring any asymptotic approximations. We show how the test can be applied recursively to obtain a hierarchical clustering of the data with significance guarantees. We also construct a sequential, non-hierarchical version of the approach that can additionally be used for model selection. We conclude with an extensive simulation study and a cluster analysis of a gene expression data set.

To answer the second question, we search for new signals that appear as deviations from known Standard Model physics in experimental particle physics data. To do this, we determine whether there is any significant difference between the distribution of background samples alone (generated from an assumed Monte Carlo
model according to the Standard Model) and the distribution of the actual experimental observations, which could be a mixture of background and signal samples. Traditionally, model-dependent methods train a supervised classifier to detect hypothesized signals expected under specific models of new physics. In this thesis, we propose a model-independent method that makes no assumptions about the signal and uses a semi-supervised classifier to detect the presence of a signal in the experimental data. We use a test based on the likelihood ratio test statistic as well as one based on the area under the curve (AUC)
statistic. The second test is based on the assumption that if the experimental data do not contain any signal, then the classifier should find the experimental data indistinguishable from the background data. Additionally, we explore active subspace methods to interpret the proposed semi-supervised classifier tests in order to understand properties of the signal detected in the experimental data. We conclude by studying the performance of the methods on a data set, provided by the
ATLAS experiment at the Large Hadron Collider (LHC) at CERN, that relates to the search for the Higgs boson.
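
As an illustration of the idea behind the AUC-based test, the following minimal Python sketch trains a classifier to separate background samples from experimental samples and then checks, by permutation, whether its held-out AUC is significantly above 1/2. It is a toy under stated assumptions, not the procedure developed in the thesis: the mean-difference scorer stands in for the semi-supervised classifier, the permutation calibration is one possible choice, and the two-dimensional Gaussian data and all function names are hypothetical.

```python
import random


def average_ranks(values):
    """Return 1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auc_from_ranks(ranks, is_exp):
    """AUC via the Mann-Whitney U statistic (ties count as 1/2)."""
    m = sum(is_exp)
    n = len(is_exp) - m
    rank_sum = sum(r for r, e in zip(ranks, is_exp) if e)
    return (rank_sum - m * (m + 1) / 2.0) / (m * n)


def fit_mean_difference_scorer(train_bg, train_exp):
    """Toy stand-in classifier (an assumption): project onto the mean difference."""
    d = len(train_bg[0])
    mean_bg = [sum(x[j] for x in train_bg) / len(train_bg) for j in range(d)]
    mean_exp = [sum(x[j] for x in train_exp) / len(train_exp) for j in range(d)]
    w = [me - mb for me, mb in zip(mean_exp, mean_bg)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def auc_permutation_test(background, experimental, n_perm=500, seed=0):
    """Fit on the first halves, score the held-out halves, permute labels."""
    rng = random.Random(seed)
    half_bg, half_exp = len(background) // 2, len(experimental) // 2
    score = fit_mean_difference_scorer(background[:half_bg], experimental[:half_exp])
    test_bg, test_exp = background[half_bg:], experimental[half_exp:]
    scores = [score(x) for x in test_bg] + [score(x) for x in test_exp]
    labels = [0] * len(test_bg) + [1] * len(test_exp)
    ranks = average_ranks(scores)  # scores are fixed, so ranks are computed once
    observed = auc_from_ranks(ranks, labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # under the null, held-out labels are exchangeable
        if auc_from_ranks(ranks, labels) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (n_perm + 1)


def sample_background(n, rng):
    """Hypothetical background model: standard Gaussian in two dimensions."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


def sample_experimental(n, signal_fraction, rng):
    """Hypothetical experimental data: background mixed with a shifted signal."""
    out = []
    for _ in range(n):
        if rng.random() < signal_fraction:
            out.append((rng.gauss(3, 1), rng.gauss(3, 1)))
        else:
            out.append((rng.gauss(0, 1), rng.gauss(0, 1)))
    return out


rng = random.Random(1)
background = sample_background(400, rng)
with_signal = sample_experimental(400, 0.3, rng)
auc_value, p_value = auc_permutation_test(background, with_signal)
print(f"AUC = {auc_value:.3f}, permutation p-value = {p_value:.4f}")
```

In this sketch the classifier is fit on one half of each sample and evaluated on the other, so that permuting the held-out labels gives a valid reference distribution for the AUC when no signal is present. An AUC close to 1/2 indicates that the experimental data look like background.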