Bretterbauer, M. (2024). Comparison of RDF triplestores in a Kubernetes environment [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.87104
Semantic networks model concepts (e.g. persons) and the relations between them. These networks are often modelled using the Resource Description Framework (RDF), a W3C standard, which results in graph-structured data. An advantage of RDF as a data model is that reasoning (rule inferencing) can be applied to infer additional knowledge. In today’s systems, a knowledge graph can hold up to a few terabytes of data. Single-machine systems reach their limits in such cases because of memory and performance constraints. Some existing systems claim to have solved this issue by running a triplestore in a distributed environment. However, these systems use different techniques, which makes it difficult to decide which system should be used for which use case. In addition, some systems are benchmarked only with a fixed number of CPU cores or workers, making it difficult to predict how they scale when these settings change. Furthermore, it is unclear to what extent these triplestores support running on Kubernetes, the widely used container orchestration framework, and benefit from its elasticity capabilities. In this work, we address the problem of selecting an optimal triplestore for a Kubernetes environment. To this end, we define a fraud detection use case against which we evaluate our candidate systems. We specify nine functional and three performance evaluation criteria for evaluating triplestores. Through a literature search, we identified three promising triplestores for the cloud that also support reasoning, namely Apache Rya Accumulo, Apache Rya MongoDB and SANSA-Stack, and we show how to deploy them in a Kubernetes environment. We analyse these triplestores against our functional and performance evaluation criteria and discuss to what extent they benefit our defined use case.
To assess their performance, we measure the data loading time, the query response time and the response times for concurrent queries for different data sizes using the LUBM benchmark. Finally, we analyse the advantages and drawbacks of each system. We show that Apache Rya MongoDB fulfills the most functional requirements for our specified use case. In terms of performance, Apache Rya MongoDB scales well for concurrent access when more resources are added. SANSA-Stack generally scales well with more resources; however, it requires a large amount of memory. Apache Rya Accumulo fails to load larger datasets in a reasonable time, which is why we did not run every test for this triplestore.
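The concurrent-query measurement described in the abstract can be illustrated with a minimal sketch. It is not the thesis's benchmark harness: the function names (`timed`, `run_concurrent`, `fake_query`) are hypothetical, and `fake_query` is a stand-in that only sleeps, where a real setup would send a SPARQL query to the triplestore under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed(fn, *args):
    """Call fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_concurrent(query_fn, queries, clients):
    """Run all queries using `clients` parallel workers; return per-query latencies."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        timings = pool.map(lambda q: timed(query_fn, q), queries)
        return [elapsed for _, elapsed in timings]


def fake_query(sparql):
    # Assumption: placeholder for a real endpoint call; it only simulates latency.
    time.sleep(0.01)
    return len(sparql)


if __name__ == "__main__":
    queries = ["SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"] * 8
    for clients in (1, 4):
        latencies = run_concurrent(fake_query, queries, clients)
        print(f"clients={clients} mean latency={mean(latencies):.4f}s")
```

Repeating such a run for several data sizes and worker counts would give the kind of scaling picture the abstract refers to; data loading time could be captured with the same `timed` helper around a bulk-load call.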
Additional information:
Title differs according to the author's translation