Carnegie Mellon University

Large-Scale Machine Learning over Streaming Data

thesis
posted on 2023-01-27, 19:30 authored by Ellango Jothimurugesan

This thesis introduces new techniques for efficiently training machine learning models over continuously arriving data to achieve high accuracy, even when the data distribution changes over time, a phenomenon known as concept drift. First, we address the case of IID data with STRSAGA, an optimization algorithm based on variance-reduced stochastic gradient descent that can incorporate incrementally arriving data and efficiently converges to statistical accuracy. Second, we address the case of data that is non-IID over time with DriftSurf. Previous work on drift detection generally relies on threshold parameters that are difficult to set, making it less practical without prior knowledge of the magnitude and rate of change. DriftSurf improves the robustness of traditional drift detection tests through a stable-state/reactive-state process, and attains higher statistical accuracy whenever an efficient optimizer like STRSAGA is used. Third, we address the case of data that is non-IID both over time and across space, in the federated learning setting, with FedDrift. We empirically show that previous centralized drift adaptation methods and previous personalized federated learning methods are ill-suited to staggered drifts. FedDrift is the first algorithm explicitly designed for both dimensions of heterogeneity, and it accurately identifies distinct concepts by learning a time-varying clustering, which enables collaborative training despite drifts. We show that the presented algorithms are effective through theoretical competitive analyses and experimental studies that demonstrate higher accuracy than the prior state of the art on benchmark datasets.

History

Date

2022-11-30

Degree Type

  • Dissertation

Department

  • Computer Science

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Phillip B. Gibbons
