<div class="csl-bib-body">
<div class="csl-entry">Kitzberger, G. (2026). <i>Optimizing Distributed LLM Inference for Heterogeneous Workers through Dynamic Graph Partitioning</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.138984</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2026.138984
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/227225
-
dc.description
Work not yet received at the library - data not verified
-
dc.description
Title differs according to the author's own translation
-
dc.description.abstract
High parameter counts of frontier large language models (LLMs) restrict inference to well-resourced institutions with server-grade hardware. Distributed inference using heterogeneous consumer devices offers a path forward; however, existing systems require manual configuration and setup, limiting accessibility to expert users. We propose a dynamic worker-aware graph partitioning algorithm for ONNX-based LLMs that reduces model partitioning to a variant of the Ordered Partition Problem, solvable in O(n²m) time for n layers and m workers. The algorithm jointly optimizes worker memory, execution speed, network conditions, and cached model weights to minimize end-to-end inference latency. We integrate this algorithm into a distributed inference server implemented in over 5,500 lines of Rust. The server dynamically repartitions the model at runtime in response to workers joining or leaving the system. Using the browser as a distribution mechanism enables zero-setup participation, eliminating the need for manual worker configuration. Empirical evaluation shows that our cost model achieves a mean absolute percentage error (MAPE) of 8.4% overall and 4.4% for large models. Dynamic partitioning consistently outperforms static equal-layer splitting, and an ablation study confirms that each worker metric contributes meaningfully to assignment quality. We demonstrate the system recovering from unexpected worker disconnects and distributing a model totaling 60 GB of weights across heterogeneous devices.
en
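The abstract's reduction to an Ordered Partition Problem can be illustrated with a minimal dynamic-programming sketch in Rust. The cost model below (per-layer compute cost divided by a scalar worker speed, with contiguous layer ranges assigned to workers in a fixed order) is a deliberate simplification for illustration; the thesis's actual cost model additionally accounts for worker memory, network conditions, and cached weights. All identifiers here are hypothetical, not taken from the thesis code.

```rust
/// Minimal O(n²m) ordered-partition DP sketch (illustrative only).
/// `layer_cost[i]`: hypothetical compute cost of layer i.
/// `worker_speed[j]`: hypothetical relative speed of worker j.
/// Returns the minimum total latency when layers are split into
/// contiguous (possibly empty) ranges assigned to workers in order.
fn min_cost_partition(layer_cost: &[f64], worker_speed: &[f64]) -> f64 {
    let n = layer_cost.len();
    let m = worker_speed.len();

    // Prefix sums so any contiguous segment cost is O(1) to evaluate.
    let mut prefix = vec![0.0; n + 1];
    for i in 0..n {
        prefix[i + 1] = prefix[i] + layer_cost[i];
    }

    // dp[j][i]: min latency assigning the first i layers to the first j workers.
    let mut dp = vec![vec![f64::INFINITY; n + 1]; m + 1];
    dp[0][0] = 0.0;

    for j in 1..=m {
        for i in 0..=n {
            // Worker j takes layers k..i (k == i means an empty assignment).
            for k in 0..=i {
                let seg = (prefix[i] - prefix[k]) / worker_speed[j - 1];
                let cand = dp[j - 1][k] + seg;
                if cand < dp[j][i] {
                    dp[j][i] = cand;
                }
            }
        }
    }
    dp[m][n]
}
```

With four unit-cost layers and two workers of speeds 1.0 and 2.0, the DP assigns all layers to the faster worker, since splitting only adds slower-worker time under this additive cost model.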
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
Distributed LLM Inference
en
dc.subject
LLM Partitioning
en
dc.subject
Dynamic Graph Partitioning
en
dc.subject
Dynamic Programming
en
dc.subject
Web-Based LLM Inference
en
dc.subject
ONNX
en
dc.title
Optimizing Distributed LLM Inference for Heterogeneous Workers through Dynamic Graph Partitioning
en
dc.type
Thesis
en
dc.type
Hochschulschrift
de
dc.rights.license
In Copyright
en
dc.rights.license
Urheberrechtsschutz
de
dc.identifier.doi
10.34726/hss.2026.138984
-
dc.contributor.affiliation
TU Wien, Austria
-
dc.rights.holder
Gabriel Kitzberger
-
dc.publisher.place
Wien
-
tuw.version
vor
-
tuw.thesisinformation
Technische Universität Wien
-
dc.contributor.assistant
Furutanpey, Alireza
-
tuw.publication.orgunit
E194 - Institut für Information Systems Engineering