Wurzenberger, M. (2021). Resource-efficient log analysis to enable online anomaly detection in cyber security [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2021.90967
Keywords: log data analysis; intrusion detection systems; anomaly detection; clustering; template generation; parser generation; machine learning; character-based log analysis; online data analysis; system behavior analysis
Language: en
Abstract:
The sheer number of different attack vectors and the large amount of data produced by computer systems make it impossible to secure network infrastructures using traditional security measures such as antivirus software, firewalls, and signature-based intrusion detection systems (IDS), which mostly detect only known attacks. Additionally, end-to-end encryption, virtualization, and containerization make monitoring and analyzing network traffic non-trivial. Therefore, this thesis investigates the potential of anomaly-based intrusion detection that monitors textual log data, such as system logs, audit logs (syscalls), web logs (e.g., access logs), and application logs. The thesis identifies research gaps in state-of-the-art log-based anomaly detection, including missing online analysis features and efficient log line parsing without loss of information when analyzing un- and semi-structured log data. Furthermore, we propose a novel incremental clustering approach motivated by high-performance bioinformatics tools that enables online analysis of large amounts of log lines. Moreover, we introduce a character-based template generator that solves the problem of computing multi-line alignments for arbitrary strings and provides detailed cluster descriptions. This enables the creation of meaningful log line templates that overcome the disadvantages of token-based templates, including handling of similar but not equal strings, and covering large parts of log lines with wildcards. State-of-the-art parsers apply lists of regular expressions or signatures. Hence, they require large amounts of resources to process log lines and consequently remove large parts of log messages during the parsing procedure, which leads to loss of information in the anomaly detection process.
To overcome this weakness and enable detailed online log parsing requiring just a minimum amount of resources, the thesis proposes a parser generator that creates tree-like parsers, which effectively reduce the complexity of parsing without information loss. Finally, we demonstrate the potential of the developed algorithms in three application cases. The first one introduces a time series analysis approach that uses the incremental clustering approach in combination with cluster evolution to detect frequency anomalies. Next, we describe a log-based anomaly detection system that applies the tree-like parser generator to enable online intrusion detection with a minimum amount of resources. Lastly, we propose a novel concept that enables automatic evaluation, comparison, and optimization of IDS and their configurations with respect to a specific network infrastructure.
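
The idea of incremental, character-based clustering of log lines mentioned in the abstract can be illustrated with a minimal sketch. It is not the algorithm developed in the thesis, whose details the abstract does not give. It assumes a single-pass scheme in which each incoming line is compared against one representative per cluster and either joins the first sufficiently similar cluster or opens a new one. The class name `IncrementalClusterer`, the threshold of 0.8, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1].

    difflib is only a stand-in here (an assumption), not the
    measure used in the thesis.
    """
    return SequenceMatcher(None, a, b).ratio()


class IncrementalClusterer:
    """Hypothetical single-pass clusterer: each line is processed once."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold              # assumed similarity cut-off
        self.representatives: list[str] = []    # one representative line per cluster
        self.counts: list[int] = []             # number of lines seen per cluster

    def add(self, line: str) -> tuple[int, bool]:
        """Assign a line to a cluster; return (cluster index, is_new_cluster)."""
        for idx, rep in enumerate(self.representatives):
            if similarity(line, rep) >= self.threshold:
                self.counts[idx] += 1
                return idx, False
        # No existing cluster is similar enough: open a new one.
        self.representatives.append(line)
        self.counts.append(1)
        return len(self.representatives) - 1, True


if __name__ == "__main__":
    clusterer = IncrementalClusterer()
    for log_line in [
        "Connection from 10.0.0.1 port 22",
        "Connection from 10.0.0.2 port 22",
        "Disk usage at 91 percent",
    ]:
        cluster_id, is_new = clusterer.add(log_line)
        print(cluster_id, is_new, log_line)
```

In such a scheme, a line that opens a new cluster is a candidate anomaly, and the per-cluster counts tracked over time windows could feed the kind of frequency analysis the abstract describes. Because lines are compared only against cluster representatives and never against each other, memory grows with the number of clusters rather than the number of lines.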