Landauer, M. (2021). Extraction of cyber threat intelligence from raw log data [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.103764
The omnipresence of digital systems has led to an interconnected economy and society. Unfortunately, the introduction of new technologies in rapidly expanding global networks has also enabled previously unimaginable threats. Cyber attackers use advanced tools and techniques to compromise systems and exploit vulnerabilities for the purpose of data exfiltration and destruction. Frequent targets are corporations and organizations that often have no methods in place to detect such targeted attacks in time, resulting in financial and reputational losses. As a consequence, cyber security relies on so-called intrusion detection systems (IDSs) to monitor system behavior and disclose suspicious activity. While signature-based IDSs that search for predefined patterns in logs are highly effective against known attacks, they are unable to detect unknown attacks and rely on manually maintained databases of attack signatures. The main problems with such signatures are that they are often easy to evade, that they are too simple to detect complex attack cases, and that their generation is slow and relies on domain knowledge. Anomaly-based IDSs seem to resolve some of these issues by leveraging machine learning to detect unknown attacks; however, they are notorious for high false positive rates and produce anomalies that are difficult to interpret and relate to specific attacks. The idea presented in this dissertation is therefore to combine the advantages of both methods by generating so-called meta-alerts from sequences of anomalies. Like signatures, these meta-alerts enable detection of the same or similar attacks on other systems. For this purpose, a new alert aggregation mechanism is proposed that does not rely on any predefined knowledge about the deployed IDSs, observed attacks, or monitored systems. In particular, the method groups anomalies and alerts by their occurrence times and uses similarity metrics to cluster and merge groups into meta-alerts.
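The aggregation idea described above can be illustrated with a minimal sketch. It is not the mechanism developed in the dissertation: the `Alert` structure, the gap-based grouping with a fixed `delta`, the Jaccard similarity over attribute tokens, and the greedy merge by set union are all assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float       # occurrence time in seconds (assumed unit)
    attributes: frozenset  # hypothetical "key=value" tokens describing the alert


def group_by_time(alerts, delta):
    """Start a new group whenever the gap to the previous alert exceeds delta."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= delta:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


def group_tokens(group):
    """Describe a group by the union of the attribute tokens of its alerts."""
    tokens = set()
    for alert in group:
        tokens |= alert.attributes
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_meta_alerts(groups, threshold):
    """Greedily merge each group into the first sufficiently similar meta-alert."""
    meta_alerts = []  # each meta-alert is a set of tokens (assumed representation)
    for group in groups:
        tokens = group_tokens(group)
        for i, meta in enumerate(meta_alerts):
            if jaccard(tokens, meta) >= threshold:
                meta_alerts[i] = meta | tokens
                break
        else:
            meta_alerts.append(tokens)
    return meta_alerts
```

In this sketch, two bursts of alerts with the same attributes that occur far apart in time fall into separate groups but are merged into a single meta-alert, while an unrelated alert forms its own meta-alert. The values of `delta` and `threshold` are free parameters that would need tuning for real data.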
For evaluation of the approach, anomalies are generated by a publicly available anomaly-based IDS. As part of this dissertation, this IDS is extended by a concept for analyzing categorical values in log data, in which statistical tests are used to recognize changes in value correlations as anomalies. Evaluating the ability to detect attacks requires labeled log data. The dissertation therefore also proposes a method for automatic testbed deployment. In particular, testbeds are instantiated from abstract templates following principles from model-driven engineering. This makes it possible to generate arbitrary numbers of testbeds with dynamically assigned random values for specific testbed parameters, which introduces variations in the infrastructure, normal system behavior, and attack executions. The resulting log datasets are representative of diverse system environments and thus improve evaluations.
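Instantiating testbeds from an abstract template with randomly assigned parameters can be sketched as follows. This is only an illustration of the general idea, not the deployment method of the dissertation. The template fields (`hostname`, `num_users`, `webserver`, `attack_delay_s`) and their value ranges are hypothetical.

```python
import random
from string import Template

# Hypothetical abstract template; $-placeholders mark the variable parameters.
TEMPLATE = Template(
    "hostname: $hostname\n"
    "num_users: $num_users\n"
    "webserver: $webserver\n"
    "attack_delay_s: $attack_delay_s\n"
)

# Assumed parameter space: each entry draws one random value per testbed.
PARAMETER_SPACE = {
    "num_users": lambda rng: rng.randint(5, 30),              # normal behavior
    "webserver": lambda rng: rng.choice(["apache", "nginx"]),  # infrastructure
    "attack_delay_s": lambda rng: rng.randint(60, 3600),      # attack execution
}


def instantiate(index, rng):
    """Fill the template with freshly drawn values for one testbed."""
    values = {name: draw(rng) for name, draw in PARAMETER_SPACE.items()}
    values["hostname"] = f"testbed-{index}"
    return TEMPLATE.substitute(values)


def generate_testbeds(count, seed):
    """Create `count` concrete testbed descriptions, reproducible via `seed`."""
    rng = random.Random(seed)
    return [instantiate(i, rng) for i in range(count)]
```

Seeding the generator keeps the set of testbeds reproducible while still varying infrastructure, simulated user behavior, and attack timing across instances.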