Carnegie Mellon University
yue-li-dissertation.pdf (25.75 MB)

Modern perspectives for low-dimensional models, with applications to single-cell RNA transcriptomes analysis

thesis
posted on 2022-10-07, 20:25 authored by Yue Li

An evolving line of machine learning work reports empirical evidence suggesting that interpolating estimators (those that achieve zero training error) are not necessarily harmful; specifically, several learning algorithms have been observed to favor low ℓ1-norm solutions in the over-parameterized regime. This motivates our interest in the minimum ℓ1-norm interpolator and, more broadly, in low-dimensional methods for high-dimensional data.

In the first part, concretely, we consider the noisy sparse regression model under Gaussian design, focusing on linear sparsity and high-dimensional asymptotics (so that both the number of features and the sparsity level scale proportionally with the sample size). We observe, and provide rigorous theoretical justification for, a curious multi-descent phenomenon; that is, the generalization risk of the minimum ℓ1-norm interpolator undergoes multiple (and possibly more than two) phases of descent and ascent as one increases the model capacity. Our finding is built upon an exact characterization of the risk behavior, which is governed by a system of two non-linear equations with two unknowns.
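For readers unfamiliar with the estimator, the following is a minimal sketch of a minimum ℓ1-norm interpolator on simulated sparse-regression data with a Gaussian design. It is an illustration only and is not taken from the thesis: the function names `min_l1_interpolator` and `simulate`, the linear-programming formulation, and all parameter values (`n`, `p`, `k`, `sigma`) are assumptions chosen for the example.

```python
import numpy as np
from scipy.optimize import linprog


def min_l1_interpolator(X, y):
    """Solve min ||b||_1 subject to X b = y as a linear program.

    Write b = u - v with u, v >= 0; at the optimum, sum(u) + sum(v) = ||b||_1.
    """
    n, p = X.shape
    c = np.ones(2 * p)            # objective: sum(u) + sum(v)
    A_eq = np.hstack([X, -X])     # equality constraint: X u - X v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    u, v = res.x[:p], res.x[p:]
    return u - v


def simulate(n=50, p=200, k=10, sigma=0.5, seed=0):
    """Draw a noisy sparse regression problem (illustrative values only)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p)) / np.sqrt(n)   # Gaussian design
    beta = np.zeros(p)
    beta[:k] = rng.standard_normal(k)              # k nonzero coefficients
    y = X @ beta + sigma * rng.standard_normal(n)  # noisy responses
    beta_hat = min_l1_interpolator(X, y)
    return X, y, beta, beta_hat


if __name__ == "__main__":
    X, y, beta, beta_hat = simulate()
    print("max training residual:", np.max(np.abs(X @ beta_hat - y)))
    print("estimation error (squared l2):", np.sum((beta_hat - beta) ** 2))
```

Re-running `simulate` while varying `p` for a fixed `n` gives one simple, non-rigorous way to observe how the error of the interpolator changes with the over-parameterization ratio.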

This phenomenon inspires us to explore the low-dimensional latent structures in real-world high-dimensional data. Specifically, we focus on the transfer/meta-learning task for single-cell RNA transcriptome data, for which one core technical challenge is the tradeoff between overcorrection and alignment, by establishing shared low-dimensional factors across multiple data sources (technologies/batches/regions). We propose two algorithms for different aims: (1) a feature-wise shift model for integration and cell clustering across different technologies/batches; (2) a region-specific sparse perturbation model to extract the areal signature from whole-brain data. Despite the simplicity of the model setup, our algorithms produce robust and highly accurate results, as well as novel biological implications, compared with existing methods.

Specifically, the regional-perturbation model is applied to two large fetal human brain datasets to investigate the expression patterns of ASD risk genes. We provide a detailed analysis divided into three phases. In the first phase, we compare ASD risk gene expression across developmental stages to identify the active periods of risk gene expression. In the second phase, we divide the cells by their cell type and region labels to locate the important regions. These analyses lead to the ultimate goal: discovering cell subgroups that are truly significant in the development of ASD after accounting for batch and regional variance; discovering such subgroups is an important task in understanding ASD. We highlight that the whole analysis procedure is made possible by introducing regularized region- and age-specific subspace perturbations under the low-dimensional model framework.


Funding

NIH-R01MH123184

History

Date

2022-08-01

Degree Type

  • Dissertation

Department

  • Statistics and Data Science

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Kathryn Roeder
