A Controlled Virtual Environment for High-Quality, Realistic, and Accurately Labeled Data Generation in Network Security

Vystaukin, Denis

doi:10.34726/hss.2026.133086

DC Field

Value

Language

dc.contributor.advisor

Iglesias Vazquez, Felix

dc.contributor.author

Vystaukin, Denis

dc.date.accessioned

2026-06-08T08:42:22Z

dc.date.issued

2026

dc.date.submitted

2026-05

dc.identifier.citation

Vystaukin, D. (2026). A Controlled Virtual Environment for High-Quality, Realistic, and Accurately Labeled Data Generation in Network Security [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.133086

dc.identifier.uri

https://doi.org/10.34726/hss.2026.133086

dc.identifier.uri

http://hdl.handle.net/20.500.12708/228528

dc.description

Thesis not yet received by the library - data not verified

dc.description

Title differs according to the author's translation

dc.description.abstract

Modern network intrusion detection research relies on high-quality datasets for developing and validating detection algorithms. However, data quality issues currently permeate such network traffic datasets: most notably, a lack of representative benign traffic, poor data labeling, and the inability to reproduce and amend datasets. While recording real-world network traffic is the gold standard for high-quality data, poor reproducibility and costly labeling efforts limit its practicality as a data generation technique. Virtualized environments offer a cost-effective alternative in which network hosts generate traffic according to scripted behavior profiles, with fully automated data collection and labeling.
In this thesis, we create a host orchestration framework within a Linux-based virtual machine, which we use to generate network traces and labeling metadata for small-scale, clearly defined network environments consisting of multiple benign hosts on an internal network and multiple external attackers. We develop custom networking scenarios that detail each host's role on the network, and implement scripted host behavior profiles that specify malicious host-to-host and benign host-to-Internet-service interactions. We execute these networking scenarios using our host orchestration framework and select the most diverse dataset in the generated collection for quality analysis. We evaluate the challenge this dataset poses to unsupervised Machine Learning (ML)-based anomaly detection algorithms. We extract network flows from the dataset's raw packet capture using several feature representations, and perform streaming-based and static analysis on these flows. Detection results are evaluated using algorithm performance scores and visual time-series-based analysis.
Our dataset proves extremely challenging for fully unsupervised streaming-based algorithms because of the high percentage of malicious flows in the dataset (81%) and their high spatial and temporal density relative to the sparse benign traffic; most algorithms adapt to attack traffic as the baseline for normal behavior and rank benign traffic as anomalous. An important exception is the Sparse Data Observers (SDO) algorithm, which successfully detects malicious traffic because it leverages a fully benign (semi-supervised) training phase to learn normality before anomalies are introduced. This shows that our datasets, despite being analytically demanding, are potentially solvable, presenting an attractive and necessary challenge for research and training in network security. In summary, our framework is capable of generating diverse, clearly attributable network traffic, which is useful for investigating and explaining failure modes of ML-based detection approaches in cybersecurity research. This addresses a gap that experts in the technical-scientific community have repeatedly identified.

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Erkennung von Netzwerkangriffen

dc.subject

Hochwertige Datensätze

dc.subject

Bewertung von Machine-Learning-Verfahren

dc.subject

Network Intrusion Detection

dc.subject

Quality Data

dc.subject

Machine Learning Algorithm Evaluation

dc.title

A Controlled Virtual Environment for High-Quality, Realistic, and Accurately Labeled Data Generation in Network Security

dc.title.alternative

Eine kontrollierte virtuelle Umgebung zur Generierung hochwertiger, realistischer und präzise annotierter Daten in der Netzwerksicherheit

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2026.133086

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Denis Vystaukin

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Zseby, Tanja

tuw.publication.orgunit

E389 - Institute of Telecommunications

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC17890507

dc.description.numberOfPages

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.advisor.orcid

0000-0001-6081-969X

tuw.assistant.orcid

0000-0002-5391-467X

item.languageiso639-1

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.fulltext

with Fulltext

item.mimetype

application/pdf

item.grantfulltext

open

item.openairetype

master thesis

item.cerifentitytype

Publications

item.openaccessfulltext

Open Access

crisitem.author.dept

E384-01 - Forschungsbereich Software-intensive Systems

crisitem.author.parentorg

E384 - Institut für Computertechnik

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(2.09 MB)

In Copyright
