Carnegie Mellon University

Data Sharing with Generative Adversarial Networks: From Theory to Practice

thesis
posted on 2023-03-27, 21:10 authored by Zinan Lin

In today’s age of big data, data sharing among companies, customers, and researchers has become a critical activity that drives advancements across industry and academia. In these data sharing scenarios, stakeholders want the shared data to have high fidelity, meaning that it accurately reflects the important properties of the original data for downstream applications. At the same time, the shared data must be privacy-preserving, so that sensitive business and personal information from the data holder is not disclosed during the data sharing process. Unfortunately, achieving both of these goals simultaneously is challenging with existing data sharing techniques such as anonymization, simulation, or simple generative models. 

Recent advances in generative adversarial networks (GANs) offer a new opportunity to tackle this long-standing challenge. Given a dataset of images, GANs can synthesize new, random images drawn from the same distribution as the original images. Their impressive results in synthesizing photorealistic, high-resolution images suggest the potential of GANs as a building block for a data sharing tool. However, notable challenges remain. On the fidelity front, GANs’ generated samples often lack diversity, and GANs are notoriously unstable to train: small changes to hyperparameters can lead to poor sample fidelity. Moreover, the real-world data required in data sharing applications (e.g., long and multi-dimensional time series) has unique characteristics that differ from those of images, which creates additional fidelity challenges. On the privacy front, the privacy properties of GANs are not well understood, and making them privacy-preserving is still an open question.

In this dissertation, we explore how to build a high-fidelity and privacy-preserving data sharing tool with GANs. We tackle this question in a full-stack fashion, from studying and improving the theoretical foundations of GANs to applying these insights in practical data sharing applications. On the fidelity front, we propose theoretical frameworks for analyzing GANs’ sample diversity and training stability problems. Based on these insights, we propose simple and effective fixes to boost GANs’ sample diversity and training stability, resulting in better sample fidelity. On the privacy front, we analyze the fundamental privacy properties of GANs, identify the privacy issues, and design new frameworks and approaches to protect sensitive business and personal information in the original data. Finally, based on these insights, we build a practical GAN-based data sharing tool for time series data and demonstrate its fidelity across applications in systems and networking domains. We also package the algorithmic contributions in this dissertation in a modular library for future applications.

History

Date

2023-01-11

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Giulia Fanti and Vyas Sekar
