%0 Thesis %A Xie, Pengtao %D 2018 %T Diversity-Promoting and Large-Scale Machine Learning for Healthcare %U https://kilthub.cmu.edu/articles/thesis/Diversity-Promoting_and_Large-Scale_Machine_Learning_for_Healthcare/7553468 %R 10.1184/R1/7553468.v1 %2 https://kilthub.cmu.edu/ndownloader/files/14038481 %K Diversity-promoting Learning %K Large-scale Distributed Learning %K Machine Learning for Healthcare %K Regularization %K Bayesian Priors %K Generalization Error Analysis %K System and Algorithm Co-design %X In healthcare, a tsunami of medical data has emerged, including electronic health
records, images, and literature. These data are heterogeneous and noisy, which renders
clinical decision-making time-consuming, error-prone, and suboptimal. In this thesis, we develop machine learning (ML) models and systems for distilling high-value patterns from unstructured clinical data and making informed and real-time
medical predictions and recommendations, to aid physicians in improving workflow efficiency
and the quality of patient care. When developing these models, we encounter several challenges: (1) How to better capture infrequent clinical patterns,
such as rare subtypes of diseases? (2) How to make the models generalize well to unseen patients? (3) How to promote the interpretability of the decisions? (4)
How to improve the timeliness of decision-making without sacrificing its quality?
(5) How to efficiently discover massive clinical patterns from large-scale data?
To address challenges (1)-(4), we systematically study diversity-promoting learning, which encourages the components of ML models (1) to spread out diversely so that
infrequent patterns receive broader coverage, (2) to obey structured constraints for better generalization performance, (3) to be mutually complementary for
a more compact representation of information, and (4) to be less redundant for better interpretability. The study is performed in both frequentist statistics and Bayesian
statistics. In the former, we develop diversity-promoting regularizers that are empirically effective, theoretically analyzable, and computationally efficient, and propose
a rich set of optimization algorithms to solve the regularized problems. In the latter, we propose Bayesian priors that effectively encode an inductive bias toward “diversity”
among a finite or infinite number of components, and develop efficient posterior inference algorithms. We provide a theoretical analysis of why promoting diversity can
better capture infrequent patterns and improve generalization. The developed regularizers and priors are demonstrated to be effective in a wide range of ML models.
To address challenge (5), we study large-scale learning. Specifically, we design efficient distributed ML systems by adopting a system-and-algorithm co-design
approach. Inspired by the sufficient factor property of many ML models, we design a peer-to-peer system, Orpheus, that significantly reduces communication and
fault-tolerance costs. We also provide a theoretical analysis showing that algorithms executed on Orpheus are guaranteed to converge. The efficiency of our system is
demonstrated in several large-scale applications.
We apply the proposed diversity-promoting learning (DPL) techniques and the distributed ML system to solve healthcare problems. In a similar-patient retrieval
application, DPL substantially improves retrieval performance on infrequent diseases, enables fast and accurate retrieval, and reduces overfitting.
In a medical-topic discovery task, our Orpheus system extracts tens of thousands of topics from millions of documents in a few hours. Besides these two
applications, we also design effective ML models for hierarchical multi-label tagging
of medical images and automated ICD coding. %I Carnegie Mellon University