Uncertainty Quantification under Distribution Shifts
In both scientific discovery and practical applications, reliable use of statistical methodology requires a careful examination of potential failure modes. A major concern is the robustness of deployed methods to changes in the data distribution. The classical assumption that the collected data consist of observations drawn independently from the same unknown distribution (the i.i.d. assumption) is frequently violated in real-world scenarios. It is therefore essential to design statistical methods that are either inherently robust to, or capable of effectively handling, violations of conventional assumptions.
The first part of this thesis is devoted to topics in sequential testing, a complementary approach to traditional batch testing. Unlike batch tests, where the sample size is specified before collecting data, sequential tests process data online and update inference on the fly. Specifically, we consider two closely related problems of sequential nonparametric two-sample and independence testing, which have extensive applications across machine learning and statistics, often involving high-dimensional observation spaces such as images or text. One major drawback of batch nonparametric two-sample and independence tests is that, in general composite nonparametric settings, even if the null hypothesis is false, it is impossible to determine before collecting data how large a sample suffices to reject the null. If an analyst strongly believes that the null is false but specified a sample size that was too small, nothing can rescue the situation: the error budget has been fully spent. Conversely, collecting excessive data and then running a batch test is highly sub-optimal from several standpoints, including memory and computation. To address these limitations, we develop consistent sequential tests for both problems and justify their excellent empirical performance.
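A common way to build such anytime-valid sequential tests is testing by betting: a bettor's wealth forms a nonnegative martingale under the null, and Ville's inequality licenses rejection the moment wealth crosses 1/alpha, at any data-dependent stopping time. The sketch below illustrates this idea for two-sample testing with observations in [0, 1]; the simple payoff x - y and the fixed betting fraction `lam` are illustrative assumptions for exposition, not the actual test construction developed in the thesis.

```python
import numpy as np

def sequential_two_sample_test(xs, ys, lam=0.2, alpha=0.05):
    """Toy sequential two-sample test by betting (illustrative sketch).

    Observations are assumed to lie in [0, 1]. Under H0: P = Q, the payoff
    x_t - y_t has mean zero, so the wealth process is a nonnegative
    martingale; by Ville's inequality, rejecting once wealth >= 1/alpha
    controls the type-I error at level alpha at any stopping time.
    """
    wealth = 1.0
    for t, (x, y) in enumerate(zip(xs, ys), start=1):
        # payoff in [-1, 1]; |lam| <= 1 keeps the wealth nonnegative
        wealth *= 1.0 + lam * (x - y)
        if wealth >= 1.0 / alpha:
            return t  # stop early and reject H0 at time t
    return None  # evidence never sufficed to reject

rng = np.random.default_rng(0)
x_alt = rng.beta(6, 4, 2000)  # mean 0.6
y_alt = rng.beta(4, 6, 2000)  # mean 0.4: the null is false
print(sequential_two_sample_test(x_alt, y_alt))
```

Under the alternative, the wealth grows exponentially and the test stops long before the 2000 available pairs are exhausted, which is exactly the data-efficiency argument made above.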
In addition, we consider the problem of detecting harmful distribution shifts. In practical settings, the assumption that the test data observed during model deployment follow the same distribution as the training data, independently of it, is often violated. It is therefore essential to augment a learned model with tools that raise alerts whenever critical changes occur. Naively testing for the mere presence of a distribution shift is not fully practical, as it fails to account for the malignancy of a shift: raising unnecessary alarms in benign scenarios can lead to delays and a substantial increase in deployment costs. In this work, we define a harmful shift as one characterized by a significant drop in model performance according to pre-specified metrics, and we develop sequential tests to detect the presence of such harmful distribution shifts.
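The distinction between any shift and a harmful one can be made concrete with a small betting-style monitor. Below is a hypothetical sketch, not the thesis's actual procedure: the null states that the deployed model's risk stays at or below a tolerance `r0`, so an alarm fires only when performance degrades, while benign shifts that leave the error rate low trigger nothing. The payoff and betting fraction `lam` are again illustrative choices.

```python
import numpy as np

def monitor_harmful_shift(losses, r0=0.2, lam=0.5, alpha=0.05):
    """Hypothetical sketch: alarm only when model risk exceeds r0.

    losses are 0/1 prediction errors observed during deployment. Under the
    benign null (true risk <= r0), the payoff loss - r0 has nonpositive
    mean, so the wealth process is a nonnegative supermartingale and the
    threshold 1/alpha bounds the false-alarm probability by alpha.
    """
    wealth = 1.0
    for t, loss in enumerate(losses, start=1):
        wealth *= 1.0 + lam * (loss - r0)
        if wealth >= 1.0 / alpha:
            return t  # alarm: evidence of a harmful performance drop
    return None  # no alarm raised

rng = np.random.default_rng(1)
benign = rng.binomial(1, 0.15, 1000)   # error rate below the tolerance
harmful = rng.binomial(1, 0.40, 1000)  # error rate well above it
print(monitor_harmful_shift(benign), monitor_harmful_shift(harmful))
```

On the benign stream the wealth tends to shrink and no alarm is raised, whereas the harmful stream is flagged quickly; this is the behavior the paragraph above asks of a practical monitor.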
The second part of this thesis is devoted to topics in predictive uncertainty quantification. For a test point, classification models usually output a vector of scores between zero and one, and it is natural to interpret these scores in a frequentist way, as probabilities of belonging to each of the classes. However, without additional (strong) assumptions, such an interpretation does not hold. The discrepancy between the forecasts and long-run label frequencies is called model miscalibration. As an alternative way of communicating uncertainty, set-valued prediction returns a set of labels in classification, or an interval (or a collection of intervals) in regression. Among the various tools for performing set-valued prediction, conformal prediction has become popular due to its reliable reflection of uncertainty under minimal assumptions.
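The "minimal assumptions" claim can be made precise with the standard split-conformal recipe for regression: given absolute residuals of any fitted model on a held-out calibration set, a slightly inflated empirical quantile yields intervals with guaranteed marginal coverage, assuming only exchangeability of calibration and test points. The sketch below uses the simple absolute-residual score; function and variable names are illustrative.

```python
import numpy as np

def split_conformal_halfwidth(cal_residuals, alpha=0.1):
    """Split conformal prediction (sketch): from absolute residuals
    |y_i - f(x_i)| on a calibration set, return the half-width q such that
    f(x) +/- q covers a fresh label with probability >= 1 - alpha,
    assuming only exchangeability of calibration and test points."""
    n = len(cal_residuals)
    # conformal correction: ceil((n + 1)(1 - alpha)) / n empirical quantile
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_residuals, level, method="higher")

rng = np.random.default_rng(2)
cal_resid = np.abs(rng.normal(0.0, 1.0, 999))  # residuals of some model f
q = split_conformal_halfwidth(cal_resid, alpha=0.1)
print(q)  # interval: [f(x) - q, f(x) + q]
```

Note that the guarantee is distribution-free but marginal: coverage holds on average over test points, without any claim about conditional coverage at a fixed x.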
One problem we consider is distribution-free post-hoc recalibration in the context of binary classification. We establish a connection between calibration and alternative methods for quantifying predictive uncertainty and use it to derive an impossibility result for distribution-free recalibration via popular scaling-based recalibration methods. In a separate project, we consider assumption-light ways of quantifying predictive uncertainty under label shift, where class label proportions change at deployment time (common in medical settings). We analyze strategies for handling label shift without labeled data from the target domain.
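The label-shift setting admits a simple closed-form correction once the target class proportions are known: since p(x | y) is assumed unchanged while the priors move, Bayes' rule gives p_new(y | x) proportional to p_old(y | x) * pi_target(y) / pi_source(y). The sketch below applies this reweighting to a classifier's posteriors; in the unlabeled-target setting studied in the thesis, `pi_target` would itself have to be estimated (e.g., via confusion-matrix-based methods), which this sketch takes as given.

```python
import numpy as np

def adjust_for_label_shift(probs, pi_source, pi_target):
    """Sketch: reweight classifier posteriors under label shift.

    Label shift assumes p(x | y) is unchanged while class priors move from
    pi_source to pi_target, so Bayes' rule yields
    p_new(y | x) proportional to p_old(y | x) * pi_target[y] / pi_source[y].
    """
    w = np.asarray(pi_target) / np.asarray(pi_source)  # per-class weights
    adjusted = probs * w
    return adjusted / adjusted.sum(axis=1, keepdims=True)  # renormalize

# two test points, two classes; class 1 becomes far more prevalent
probs = np.array([[0.7, 0.3], [0.4, 0.6]])
adjusted = adjust_for_label_shift(probs, [0.5, 0.5], [0.2, 0.8])
print(adjusted)
```

As expected, the adjusted posteriors shift probability mass toward the class whose prevalence increased, while each row still sums to one.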
- Statistics and Data Science
- Doctor of Philosophy (PhD)