Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
Abstract
Limited data access is a longstanding barrier to data-driven research and development in the networked systems community. In this work, we explore if and how generative adversarial networks (GANs) can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge. As a specific target, our focus in this paper is on time series datasets with metadata (e.g., packet loss rate measurements with corresponding ISPs). We identify key challenges of existing GAN approaches for such workloads with respect to fidelity (e.g., long-term dependencies, complex multidimensional relationships, mode collapse) and privacy (i.e., existing guarantees are poorly understood and can sacrifice fidelity). To improve fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate that across diverse real-world datasets (e.g., bandwidth measurements, cluster requests, web sessions) and use cases (e.g., structural characterization, predictive modeling, algorithm comparison), DG achieves up to 43% better fidelity than baseline models. Although we do not resolve the privacy problem in this work, we identify fundamental challenges with both classical notions of privacy and recent advances to improve the privacy properties of GANs, and suggest a potential roadmap for addressing these challenges. By shedding light on the promise and challenges, we hope our work can rekindle the conversation on workflows for data sharing.