<div class="csl-bib-body">
<div class="csl-entry">Vystaukin, D. (2026). <i>A Controlled Virtual Environment for High-Quality, Realistic, and Accurately Labeled Data Generation in Network Security</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.133086</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2026.133086
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/228528
-
dc.description
Thesis not yet received by the library - data not verified
-
dc.description
Alternative title as translated by the author
-
dc.description.abstract
Modern network intrusion detection research relies on the availability of high-quality datasets for the development and validation of detection algorithms. However, data quality issues currently permeate such network traffic datasets, most notably a lack of representative benign traffic, poor data labeling, and the inability to reproduce and amend datasets. While recording real-world network traffic is the gold standard for high-quality data, poor reproducibility and costly labeling efforts limit its practicality as a data generation technique. Virtualized environments offer a cost-effective alternative in which network hosts generate traffic according to scripted behavior profiles, with data collection and labeling being fully automated. In this thesis, we create a host orchestration framework within a Linux-based virtual machine, which we use to generate network traces and labeling metadata for small-scale, clearly defined network environments consisting of multiple benign hosts on an internal network and multiple external attackers. We develop custom networking scenarios which detail each host’s role on the network, and implement scripted host behavior profiles that specify malicious host-to-host and benign host-to-Internet-service interactions. We execute these networking scenarios using our host orchestration framework, and select the most diverse dataset in the generated collection for quality analysis. We evaluate the challenge posed by this dataset to unsupervised Machine Learning (ML)-based anomaly detection algorithms. We extract network flows from the dataset’s raw packet capture using several feature representations, and perform streaming-based and static analysis on these flows.
Detection results are evaluated using algorithm performance scores and visual time-series-based analysis. Our dataset proves extremely challenging for fully unsupervised streaming-based algorithms due to the high percentage of malicious flows in the dataset (81%) and their high spatial and temporal density relative to sparse benign traffic; most algorithms adapt to attack traffic as the baseline for normal behavior and rank benign traffic as anomalous. An important exception is the Sparse Data Observers (SDO) algorithm, which successfully detects malicious traffic because it leverages a fully benign (semi-supervised) training phase to learn normality before anomalies are introduced. This shows that our datasets, despite being analytically demanding, are potentially solvable, presenting an attractive and necessary challenge for research and training in network security. In summary, our framework is capable of generating diverse, clearly attributable network traffic, which is useful for investigating and explaining failure modes of ML-based detection approaches in cybersecurity research. This addresses a known gap in the technical-scientific community that has been repeatedly identified by experts.
en
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
Erkennung von Netzwerkangriffen
de
dc.subject
Hochwertige Datensätze
de
dc.subject
Bewertung von Machine-Learning-Verfahren
de
dc.subject
Network Intrusion Detection
en
dc.subject
Quality Data
en
dc.subject
Machine Learning Algorithm Evaluation
en
dc.title
A Controlled Virtual Environment for High-Quality, Realistic, and Accurately Labeled Data Generation in Network Security
en
dc.title.alternative
Eine kontrollierte virtuelle Umgebung zur Generierung hochwertiger, realistischer und präzise annotierter Daten in der Netzwerksicherheit
de
dc.type
Thesis
en
dc.type
Hochschulschrift
de
dc.rights.license
In Copyright
en
dc.rights.license
Urheberrechtsschutz
de
dc.identifier.doi
10.34726/hss.2026.133086
-
dc.contributor.affiliation
TU Wien, Österreich
-
dc.rights.holder
Denis Vystaukin
-
dc.publisher.place
Wien
-
tuw.version
vor
-
tuw.thesisinformation
Technische Universität Wien
-
dc.contributor.assistant
Zseby, Tanja
-
tuw.publication.orgunit
E389 - Institute of Telecommunications
-
dc.type.qualificationlevel
Diploma
-
dc.identifier.libraryid
AC17890507
-
dc.description.numberOfPages
89
-
dc.thesistype
Diplomarbeit
de
dc.thesistype
Diploma Thesis
en
dc.rights.identifier
In Copyright
en
dc.rights.identifier
Urheberrechtsschutz
de
tuw.advisor.staffStatus
staff
-
tuw.assistant.staffStatus
staff
-
tuw.advisor.orcid
0000-0001-6081-969X
-
tuw.assistant.orcid
0000-0002-5391-467X
-
item.languageiso639-1
en
-
item.openairecristype
http://purl.org/coar/resource_type/c_bdcc
-
item.fulltext
with Fulltext
-
item.mimetype
application/pdf
-
item.grantfulltext
open
-
item.openairetype
master thesis
-
item.cerifentitytype
Publications
-
item.openaccessfulltext
Open Access
-
crisitem.author.dept
E384-01 - Forschungsbereich Software-intensive Systems