Statistical Theory and Methods for Comparing Distributions
2020-05-19T19:53:30Z (GMT) by
With the recent advancement of data collection techniques, there has been an explosive growth in the size
and complexity of data sets in many application domains. The rise of such unprecedented data has posed new
challenges as well as new opportunities to researchers in statistics and data science. Traditional methods,
tailored to static and low-dimensional data, perform poorly on or are no longer applicable to modern high-dimensional
data with complex structures. Moreover, classical asymptotic theory easily breaks down in non-traditional settings where numerous parameters can interact in dynamic ways. Motivated by these new challenges, this dissertation aims to develop novel methods and technical tools suitable for modern high-dimensional data, with particular emphasis on three types of testing problems: (i) one-sample testing, (ii)
two-sample testing, and (iii) independence testing.
One of the major contributions of this thesis is to introduce a
flexible two-sample testing framework that can leverage any existing classification or regression method. By taking advantage of state-of-the-art algorithms in machine learning, the proposed method can efficiently handle different types of variables and various structures in high-dimensional data with competitive power under a variety of practical scenarios. To justify our approach, we provide rigorous theoretical and empirical analysis of its performance. With a specific focus on Fisher's linear discriminant analysis, we prove more sophisticated results including minimax
optimality under common regularity conditions. In addition to supervised learning approaches, we also contribute to the literature by proposing goodness-of-fit tests for high-dimensional multinomials as well as multivariate generalizations of classical rank-based tests.
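The idea of building a two-sample test on top of a classifier can be sketched as follows. This is only an illustrative sketch, not the dissertation's procedure: it assumes a sample-splitting design, uses a simple nearest-centroid classifier as a stand-in for any classification method, and calibrates held-out accuracy by permuting the held-out labels. The function names (`centroid`, `sq_dist`, `classifier_two_sample_test`) and the parameters `n_perm` and `seed` are hypothetical.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    d = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(d)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classifier_two_sample_test(xs, ys, n_perm=500, seed=0):
    """Return (held-out accuracy, permutation p-value).

    Assumption: a nearest-centroid rule stands in for "any classifier".
    If the two samples share a distribution, held-out accuracy should be
    close to chance; accuracy well above chance is evidence against the null.
    """
    rng = random.Random(seed)
    xs, ys = list(xs), list(ys)
    rng.shuffle(xs)
    rng.shuffle(ys)

    # Split each sample into a training half and a held-out half.
    hx, hy = len(xs) // 2, len(ys) // 2
    train_x, test_x = xs[:hx], xs[hx:]
    train_y, test_y = ys[:hy], ys[hy:]

    # "Fit" the classifier: one centroid per sample.
    cx, cy = centroid(train_x), centroid(train_y)

    # Predict labels (0 = first sample, 1 = second sample) on held-out points.
    test_points = test_x + test_y
    labels = [0] * len(test_x) + [1] * len(test_y)
    preds = [0 if sq_dist(p, cx) <= sq_dist(p, cy) else 1 for p in test_points]

    def accuracy(lbls):
        return sum(int(a == b) for a, b in zip(preds, lbls)) / len(lbls)

    observed = accuracy(labels)

    # Calibrate by permuting held-out labels while keeping predictions fixed.
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if accuracy(shuffled) >= observed:
            exceed += 1
    p_value = (1 + exceed) / (1 + n_perm)
    return observed, p_value
```

Any other classifier could replace the centroid rule without changing the calibration step, which is the sense in which such a framework is flexible.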
Another theme of this dissertation is concerned with permutation tests. Although the permutation
approach is standard in practical implementations of two-sample and independence testing, its theoretical
properties, especially power, have not been explored beyond simple cases. A major challenge in analyzing the
permutation test is that it depends on a random critical value that is a function of the observations. We study
how to overcome this challenge and demonstrate that the permutation test has competitive power properties
for many interesting problems under non-traditional settings. In particular, we use the minimax perspective
to evaluate the performance of a test and show that the permutation test is optimal for the problems where
minimax lower bounds are available.
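To make the "random critical value" concrete, here is a minimal sketch of a two-sample permutation test. It is an illustration under stated assumptions rather than the dissertation's analysis: the test statistic (absolute difference in sample means), the Monte Carlo approximation with `n_perm` random permutations, and the function name `permutation_test` are all choices made for this example.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(xs, ys, alpha=0.05, n_perm=999, seed=0):
    """Return (observed statistic, critical value, p-value).

    The critical value is an empirical (1 - alpha) quantile of the statistic
    recomputed on random relabelings of the pooled data, so it depends on the
    observations themselves rather than on a fixed reference distribution.
    """
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    n = len(xs)
    observed = abs(mean(xs) - mean(ys))

    # Recompute the statistic on randomly relabeled data.
    perm_stats = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_stats.append(abs(mean(pooled[:n]) - mean(pooled[n:])))

    # Data-dependent critical value: quantile over permutations plus the
    # observed statistic (the identity relabeling).
    all_stats = sorted(perm_stats + [observed])
    k = math.ceil((1 - alpha) * len(all_stats))
    critical_value = all_stats[k - 1]

    # Monte Carlo permutation p-value with the usual +1 correction.
    p_value = (1 + sum(s >= observed for s in perm_stats)) / (1 + n_perm)
    return observed, critical_value, p_value
```

In this sketch the null hypothesis would be rejected when the observed statistic exceeds `critical_value`. Because `critical_value` changes with every data set, a power analysis has to control it jointly with the statistic.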