High-Dimensional Statistical Methods to Model Heterogeneity in Genomic Data

Posted on 29.05.2020, 21:33 by Kevin Lin
In genomic studies, understanding the heterogeneity among samples can help address scientific questions directly and can also inform how the data should be modeled in downstream analyses. As an example of the former, geneticists are interested in understanding which regions of the genome of tumor cells are erroneously too long or too short compared to their control-cell counterparts, a phenomenon known as copy number variation (CNV). Geneticists use comparative genomic hybridization (CGH) methods to collect data, which are then analyzed with changepoint methods that detect heterogeneity among segments of the genome, directly addressing this scientific question. As an example of the latter, single-cell RNA-sequencing (RNA-seq) data give geneticists new opportunities to understand how individual cells express different genes at different intensities. In these studies, capturing the heterogeneity among cells is often the first step toward improved downstream analyses.
In this thesis, we design various high-dimensional statistical methods to address the types of heterogeneity often found in genomic data. We provide a high-level overview of genomics in the first chapter. In the second chapter, we develop a method to determine, among a collection of microarray expression datasets, a large subset of datasets that have similar covariance matrices; this method is applied in an analysis pipeline to help detect genes associated with autism spectrum disorder (ASD). In the third and fourth chapters, we develop a theoretical understanding of changepoint detection methods and quantify the statistical significance of their detected changepoints; these results are applied to CGH data to infer which segments of the genome display copy number variation. In the fifth chapter, we develop a non-linear dimension reduction method based on matrix factorization for one-parameter exponential-family distributions and study its theoretical properties. Our method is applied to study the developmental trajectories of oligodendrocytes, a particular cell type that plays an important role in the central nervous system.




Degree Name

  • Doctor of Philosophy (PhD)


Kathryn Roeder, Jing Lei
