A semi-supervised approach for the configuration and optimization of machine learning based anomaly detection algorithms

Beck, Viktor

doi:10.34726/hss.2024.121600

Record link:

https://doi.org/10.34726/hss.2024.121600
http://hdl.handle.net/20.500.12708/202326

Title:

A semi-supervised approach for the configuration and optimization of machine learning based anomaly detection algorithms

Citation:

Beck, V. (2024). A semi-supervised approach for the configuration and optimization of machine learning based anomaly detection algorithms [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.121600

reposiTUm DOI:

10.34726/hss.2024.121600

CatalogPlus:

AC17334279

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Beck, Viktor

Advisor:

Rauber, Andreas

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2024

Number of Pages:

Keywords:

anomaly detection; cybersecurity; configuration; automation; optimization; log data; feature selection; hyperparameter tuning; similarity; stability

Abstract:

Cyber threats evolve continuously, and new attack techniques are developed rapidly. Anomaly detection (AD) in system log data is therefore an increasingly important task, because it can detect both known and previously unknown kinds of attack. The configuration of AD algorithms depends heavily on the data and involves complex feature selection as well as the definition of parameters such as thresholds or window sizes. The process is consequently not straightforward and often requires manual intervention by domain experts, which restricts the accessibility and effectiveness of AD algorithms.

This work therefore introduces the Configuration-Engine (CE), a semi-supervised approach that automates the configuration process of AD algorithms. The CE applies a data science approach to identify properties of parts of log lines. To this end, it uses a parser to recognize meaningful static and variable tokens in the log lines that AD detectors can analyze. The CE categorizes variables based on their characteristics and their behavior over time. Based on the requirements of the AD detectors at hand, the CE specifies which log parts a detector should observe and determines the appropriate configuration parameters. This thesis considers 6 detectors of the AMiner, an advanced AD pipeline encompassing a wide range of AD algorithms. In addition, the CE contains an optimization approach for further refinement of configurations.

The performance was evaluated on point and collective anomalies occurring in a set of Apache Access and audit datasets. For collective anomalies, the CE provided configurations that reached an average precision of over 0.95 for Apache and over 0.9 for audit datasets for 5 of the 6 detectors, while maintaining a recall of 1.0. It thereby competes with the performance of configurations handcrafted by 3 different experts, which formed the baseline for the evaluation. The optimization additionally improved the precision of both CE and expert configurations in 29 of 32 cases for Apache data and in 6 of 20 cases for audit data. Moreover, configurations can be represented as dictionaries and thus compared for similarity using the Jaccard index. This comparison shows that the experts' configurations are significantly dissimilar to those of the CE, whereas the CE's configurations exhibit remarkable similarity to each other across various datasets, which suggests effective portability of CE configurations across different datasets of the same type. The CE represents a significant advancement in AD: it reduces the need for domain expertise and manual configuration and makes AD more accessible and efficient across different datasets and detection techniques.
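
The dictionary-based similarity comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the thesis turns a configuration dictionary into a set, so the flattening scheme below (one item per leaf value, keyed by its path in the dictionary) is an assumption. The detector name, log paths, and threshold values are hypothetical and are not taken from the AMiner or from the thesis.

```python
from typing import Any, Iterator


def flatten(config: dict, prefix: tuple = ()) -> Iterator[tuple]:
    """Yield (key_path, value) pairs for every leaf of a nested dictionary.

    Assumption: leaf values (and list elements) are hashable scalars.
    """
    for key, value in config.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        elif isinstance(value, (list, tuple, set)):
            # Treat every element of a collection as its own item.
            for element in value:
                yield (path, element)
        else:
            yield (path, value)


def jaccard_index(config_a: dict, config_b: dict) -> float:
    """Return |A ∩ B| / |A ∪ B| over the flattened items of two configurations."""
    items_a = set(flatten(config_a))
    items_b = set(flatten(config_b))
    union = items_a | items_b
    if not union:
        # Convention chosen here: two empty configurations count as identical.
        return 1.0
    return len(items_a & items_b) / len(union)


# Hypothetical configurations for one detector (illustrative values only).
ce_config = {
    "value_detector": {"paths": ["/request/method", "/status"], "threshold": 0.5}
}
expert_config = {
    "value_detector": {"paths": ["/status"], "threshold": 0.7}
}

similarity = jaccard_index(ce_config, expert_config)
print(similarity)
```

In this sketch the two configurations share only the observed path `/status`, so one of four distinct items overlaps. A value of 1.0 means the configurations are identical under this flattening, and 0.0 means they share no item.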

License:

In Copyright

Appears in Collections:

Thesis