Meghdouri, F. (2023). Machine learning for network traffic analysis : Feature spaces and model optimization [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2023.70923
Machine Learning (ML) has revolutionized the field of network traffic analysis and anomaly detection, providing promising and efficient methods for predicting and defending against cyber threats. Traditional systems frequently rely on manual inspection and rule-based techniques, which have difficulty detecting zero-day attacks and coping with encrypted traffic, among other issues. In contrast, ML enables the creation of dynamic, adaptive systems that learn continuously and adjust automatically to changing conditions.
However, the application of ML in network traffic analysis still falls short of the expectations raised in recent decades and introduces new challenges, particularly in relation to network traffic data. Two key challenges are: (a) designing highly discriminative traffic representations in the presence of encrypted network traffic, and (b) improving the accuracy of algorithms while maintaining computational feasibility.
In this thesis, I address both of these issues. Specifically, I (a) propose new network traffic representations that can effectively handle traffic encryption and (b) enhance prevalent detection and classification architectures by incorporating novel concepts that improve training, performance, and time complexity. In the first part of the thesis, I present three new network traffic representations: the Multi-key feature vector, a cross-layer representation that significantly outperforms the state of the art in attack detection; I-Notice, a novel approach for handling unaggregated packet sequences; and FlowCount, a method for counting flows in IPsec tunnels. I also explore the effects of data scaling, as well as the importance of data transformation for training speed, model stability, and interpretability.
In the second part of the thesis, I present three methods for improving ML techniques when applied to traffic classification and anomaly detection: ODM, a sampling algorithm that provides coresets for faster and more accurate training; FlowGan, a framework for solving data imbalance and controlling network traffic generation; and EagerNet, a strategy for speeding up fully-connected neural network training while using fewer resources.
I conduct a thorough performance and explainability analysis; the results show that the proposed methods are highly effective in improving analysis accuracy and speed while overcoming limitations set by traffic encryption. Overall, the advances presented in this thesis cover critical aspects necessary to address the requirements of modern network traffic analysis.