Azzam, A. (2023). Querying knowledge graphs at web scale [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2023.117440
While Linked Data (LD) provides standards for publishing (RDF) and querying (SPARQL) Knowledge Graphs (KGs) on the Web, serving, accessing and processing such open, decentralized KGs is often practically impossible, as query timeouts on publicly available SPARQL endpoints show. To this end, Linked Data Fragments (LDF) have introduced a foundational framework that has sparked research exploring a spectrum of potential Web querying interfaces, ranging from server-side query processing via SPARQL endpoints to client-side query processing of data dumps. Current proposals in between typically suffer from an imbalanced load on either the client or the server. In this thesis, we present a novel approach to share the load between servers and clients, while significantly reducing data transfer volume, by combining server-side query processing with shipping compressed KG partitions. Next, we present the first work that combines client-side and server-side query optimization techniques in a truly dynamic fashion. It employs a cost model that dynamically delegates the load between servers and clients based on current server workload and client capabilities, combining client-side processing of shipped partitions with efficient server-side processing of star-shaped sub-queries. Thereafter, we investigate alternative interfaces able to ship partitions of KGs from the server to the client, aiming to reduce server resource consumption. To this end, we align formal definitions and notations of the original LDF framework to uniformly present partition-based LDF approaches. Our thesis is a step towards a better-balanced share of the query processing load between clients and servers by shipping graph partitions driven by the structure of RDF graphs, which group entities described with the same sets of properties and classes.
Throughout the thesis, we empirically evaluate our approach on real-world and synthetic RDF KGs, using both pre-existing benchmarks for highly concurrent query execution and a novel query workload benchmark inspired by query logs of existing SPARQL endpoints. Our experiments show that our proposed work significantly outperforms state-of-the-art solutions in terms of average total query execution time per client, while at the same time decreasing network traffic and increasing server-side availability, moving towards more cost-effective and balanced hosting of open and decentralized KGs.
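The partitioning idea mentioned in the abstract, grouping entities described with the same sets of properties and classes, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `partition_by_signature`, the tuple representation of triples, and the `ex:` sample data are assumptions made purely for illustration.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed prefixed form of the RDF type predicate


def partition_by_signature(triples):
    """Group triples by their subject's (property set, class set) signature.

    Hypothetical helper: subjects that use exactly the same predicates and
    belong to exactly the same classes end up in the same partition.
    """
    props = defaultdict(set)
    classes = defaultdict(set)
    # First pass: collect the predicates and classes of every subject.
    for s, p, o in triples:
        props[s].add(p)
        if p == RDF_TYPE:
            classes[s].add(o)
    # Second pass: assign each triple to the partition of its subject.
    partitions = defaultdict(list)
    for s, p, o in triples:
        key = (frozenset(props[s]), frozenset(classes[s]))
        partitions[key].append((s, p, o))
    return dict(partitions)


# Illustrative data only; not taken from the thesis.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:vienna", "rdf:type", "ex:City"),
    ("ex:vienna", "ex:population", "1900000"),
]
partitions = partition_by_signature(triples)
```

In this toy input, `ex:alice` and `ex:bob` share one partition, while `ex:vienna` falls into another. A star-shaped sub-query over `ex:name` and `rdf:type` would therefore only need the first partition.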