Maksan, T. (2026). Enhancing Privacy in Machine Learning through Teacher-Guided Synthetic Data Generation: A Modified PATE Framework [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.133330
E194 - Institut für Information Systems Engineering
Date (published):
2026
Number of Pages:
83
Keywords:
Machine Learning, Federated Learning, Privacy, Synthetic Data
Abstract:
Privacy-preserving machine learning aims to enable the training of high-utility models without exposing sensitive training data. The Private Aggregation of Teacher Ensembles (PATE) framework addresses this challenge by transferring knowledge from multiple teacher models, each trained on a disjoint private dataset, to a student model through differentially private aggregation of teacher predictions on auxiliary data. While effective, this approach relies on repeated access to the teacher ensemble and on the availability of suitable unlabeled data, which may be impractical in certain application settings.
This thesis investigates a modification of the PATE framework in which teacher-generated synthetic data replaces real auxiliary query data. Each teacher model is trained on a disjoint private shard and then independently generates a synthetic dataset reflecting its local training distribution. The student model is trained using differentially private aggregation of teacher votes on these synthetic datasets. The work focuses on quantifying the resulting privacy-utility trade-off in terms of the differential privacy parameter $\varepsilon$ and on analyzing the data efficiency of synthetic queries in comparison to standard PATE.
An empirical evaluation is conducted on the MNIST benchmark dataset. Experiments systematically vary the number of synthetic aggregation queries and the aggregation noise scale under both Laplace and Gaussian mechanisms. Results show that student performance improves monotonically with the number of aggregation queries, while the privacy budget increases linearly under basic composition. Compared to real-query PATE, synthetic-query PATE requires substantially larger privacy budgets to achieve comparable predictive performance, indicating reduced privacy efficiency. An upper-bound baseline that uses a pretrained reference decoder (with no differential privacy applied to generation) is evaluated to isolate the limits of generative capacity.
The findings demonstrate that replacing real auxiliary data with teacher-generated synthetic queries preserves the structural properties of PATE but introduces significant data-efficiency constraints. While the approach eliminates reliance on external query datasets, it shifts the primary bottleneck to the volume and representational quality of synthetic queries. Overall, this thesis provides a quantitative analysis of synthetic-query PATE and characterizes its limitations within a measurable differential privacy framework.
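The aggregation step and the linear growth of the privacy budget described in the abstract can be illustrated with a minimal sketch. It assumes the standard Laplace noisy-max aggregator from the original PATE analysis, in which noise of scale 1/gamma on each vote count costs 2*gamma per query; the thesis's own accounting and its Gaussian variant may differ. The function names, the parameter `gamma`, and the example values are hypothetical and not taken from the thesis.

```python
import numpy as np


def noisy_max_label(teacher_votes, num_classes, gamma, rng):
    """Return the label with the highest Laplace-perturbed vote count.

    teacher_votes: 1-D integer array, one predicted label per teacher.
    gamma: inverse noise scale (assumed parameterization, Lap(1/gamma)).
    """
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))


def basic_composition_epsilon(num_queries, gamma):
    """Total epsilon under basic composition.

    Assumes each noisy-max query is (2 * gamma)-differentially private,
    so the budget grows linearly with the number of queries.
    """
    return num_queries * 2.0 * gamma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical votes from 10 teachers on one synthetic query.
    votes = np.array([3, 3, 3, 3, 3, 3, 3, 3, 1, 1])
    label = noisy_max_label(votes, num_classes=10, gamma=0.5, rng=rng)
    print("aggregated label:", label)
    print("epsilon after 100 queries:", basic_composition_epsilon(100, 0.05))
```

Under these assumptions, doubling the number of aggregation queries doubles the total budget, which matches the linear increase under basic composition that the abstract reports.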
Additional information:
Thesis not yet received by the library; data not verified. Title differs according to the author's translation.