Modern perspectives for low-dimensional models, with applications to single-cell RNA transcriptome analysis
An evolving line of work in machine learning reports empirical evidence that interpolating estimators (those that achieve zero training error) are not necessarily harmful; in particular, several learning algorithms have been observed to favor low ℓ1-norm solutions in the over-parameterized regime. This motivates our interest in the minimum ℓ1-norm interpolator and, more broadly, in low-dimensional methods for high-dimensional data.
In the first part, concretely, we consider the noisy sparse regression model under Gaussian design, focusing on linear sparsity and high-dimensional asymptotics (so that both the number of features and the sparsity level scale proportionally with the sample size). We observe, and provide rigorous theoretical justification for, a curious multi-descent phenomenon; that is, the generalization risk of the minimum ℓ1-norm interpolator undergoes multiple (and possibly more than two) phases of descent and ascent as one increases the model capacity. Our finding is built upon an exact characterization of the risk behavior, which is governed by a system of two non-linear equations with two unknowns.
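For concreteness, the minimum ℓ1-norm interpolator can be computed as a linear program (basis pursuit). The following is a minimal sketch, not the thesis's own code: the function name `min_l1_interpolator`, the problem sizes, the noise level, and the 1/sqrt(n) column scaling are all illustrative assumptions. It writes β = u − v with u, v ≥ 0 and minimizes the sum of u and v subject to Xβ = y.

```python
import numpy as np
from scipy.optimize import linprog


def min_l1_interpolator(X, y):
    """Solve min ||beta||_1 subject to X @ beta = y via a linear program.

    Hypothetical helper for illustration: beta is split as u - v with
    u, v >= 0, so that ||beta||_1 = sum(u) + sum(v) at the optimum.
    """
    n, p = X.shape
    c = np.ones(2 * p)              # objective: sum(u) + sum(v)
    A_eq = np.hstack([X, -X])       # constraint: X @ (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]


# Illustrative over-parameterized setup (all sizes are assumptions).
rng = np.random.default_rng(0)
n, p, k = 50, 200, 10                       # samples, features, sparsity
X = rng.standard_normal((n, p)) / np.sqrt(n)  # Gaussian design
beta_star = np.zeros(p)
beta_star[:k] = 1.0                         # sparse ground truth
y = X @ beta_star + 0.1 * rng.standard_normal(n)  # noisy responses

beta_hat = min_l1_interpolator(X, y)
train_residual = np.linalg.norm(X @ beta_hat - y)   # should be numerically zero
sq_error = np.sum((beta_hat - beta_star) ** 2)      # squared estimation error

# Minimum l2-norm interpolator, for comparison of l1 norms.
beta_l2 = np.linalg.pinv(X) @ y
```

Sweeping `p` for fixed `n` and plotting `sq_error` is one way to explore empirically how the risk changes with model capacity; this sketch makes no claim about reproducing the exact asymptotic characterization described above.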
This phenomenon inspires us to explore low-dimensional latent structures in real-world high-dimensional data. Specifically, we focus on the transfer/meta-learning task for single-cell RNA transcriptome data, where one core technical challenge is the tradeoff between overcorrection and alignment. We address it by establishing shared low-dimensional factors across multiple data sources (technologies/batches/regions). We propose two algorithms for different aims: (1) a feature-wise shift model for integration and cell clustering across different technologies/batches; (2) a region-specific sparse perturbation model to extract the areal signature from whole-brain data. Despite the simplicity of the model setup, our algorithms produce robust and highly accurate results, as well as novel biological implications, compared with existing methods.
Specifically, the regional-perturbation model is applied to two large fetal human brain datasets to investigate the expression pattern of autism spectrum disorder (ASD) risk genes. We present a detailed analysis divided into three phases. In the first phase, we compare ASD risk gene expression across developmental stages to identify the active periods of risk gene expression. In the second phase, we divide the cells by their cell type and region labels to locate the important regions. These analyses lead to the ultimate goal: discovering cell subgroups that are truly significant in the development of ASD after accounting for batch and regional variation. Discovery of such subgroups is an important task in the understanding of ASD. We highlight that the whole analysis procedure is made possible by introducing regularized region- and age-specific subspace perturbations under the low-dimensional model framework.
- Statistics and Data Science
- Doctor of Philosophy (PhD)