Martinez Duarte, D. (2024). Federated Generation of Synthetic Tabular Data [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.112561
Machine learning (ML) models have proven beneficial in various domains. However, their application remains severely limited by concerns about (1) using personal data to train ML models and (2) exchanging data between organizations such as hospitals and banks. Both cases might lead to privacy breaches and disclosure of sensitive information. In this work, we tackle both problems simultaneously by generating synthetic data in a federated learning manner. Previous work in this field primarily addresses image data generation, while we focus on tabular data, which is more relevant for sensitive data domains. In particular, we propose adapting two centralized tabular data generation methods, Bayesian Networks and Variational Autoencoders, to the federated setting, with a novel aggregation approach applied specifically to Bayesian Networks. We perform an exhaustive evaluation of the generated synthetic data on three datasets in terms of fidelity, utility, and privacy. Further, we demonstrate how the performance of the synthetic data changes depending on how the data is partitioned among the clients participating in federated learning, and how the number of clients impacts the results. Our results suggest that, in many cases, the proposed methods in federated settings perform similarly to those in centralized settings and outperform local data generation. However, imbalance among clients significantly affects the synthetic data generated by Variational Autoencoders.
en
Additional information:
Thesis not yet received by the library - data not verified