High parameter counts of frontier large language models (LLMs) restrict inference to well-resourced institutions with server-grade hardware. Distributed inference across heterogeneous consumer devices offers a path forward; however, existing systems require manual configuration and setup, limiting accessibility to expert users. We propose a dynamic worker-aware graph partitioning algorithm for ONNX-based LLMs that reduces model partitioning to a variant of the Ordered Partition Problem, solvable in O(n^2 m) time for n layers and m workers. The algorithm jointly optimizes over worker memory, execution speed, network conditions, and cached model weights to minimize end-to-end inference latency. We integrate this algorithm into a distributed inference server implemented in over 5,500 lines of Rust. The server dynamically repartitions the model at runtime in response to workers joining or leaving the system. Using the browser as a distribution mechanism enables zero-setup participation, eliminating the need for manual worker configuration. Empirical evaluation shows that our cost model achieves a mean absolute percentage error (MAPE) of 8.4% overall and 4.4% for large models. Dynamic partitioning consistently outperforms static equal-layer splitting, and an ablation study confirms that each worker metric contributes meaningfully to assignment quality. We demonstrate the system recovering from unexpected worker disconnects and distributing a model totaling 60 GB of weights across heterogeneous devices.
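To illustrate the shape of the O(n^2 m) ordered-partition dynamic program described above, the following is a minimal sketch in Rust. It assumes a deliberately simplified cost model: each worker has a per-layer compute time, and each hand-off between consecutive workers incurs a fixed network cost. The real cost model in the thesis additionally weighs worker memory limits, link conditions, and cached weights; the function name `partition` and the `handoff` parameter are illustrative, not taken from the system's code.

```rust
/// Sketch of an O(n^2 * m) ordered-partition DP.
/// `layer_cost[w][i]` = time for worker `w` to run layer `i`;
/// `handoff` = fixed network cost per hop between consecutive workers.
/// Returns the minimal end-to-end latency (simplified model).
fn partition(layer_cost: &[Vec<f64>], handoff: f64) -> f64 {
    let m = layer_cost.len();    // number of workers, in pipeline order
    let n = layer_cost[0].len(); // number of layers

    // prefix[w][i] = total compute time of layers 0..i on worker w
    let prefix: Vec<Vec<f64>> = layer_cost
        .iter()
        .map(|c| {
            let mut p = vec![0.0; n + 1];
            for i in 0..n {
                p[i + 1] = p[i] + c[i];
            }
            p
        })
        .collect();

    // dp[w][i] = min latency covering the first i layers with the first w workers
    let mut dp = vec![vec![f64::INFINITY; n + 1]; m + 1];
    dp[0][0] = 0.0;

    for w in 1..=m {
        for i in 0..=n {
            for k in 0..=i {
                // worker w-1 takes layers k..i as one contiguous block
                let block = prefix[w - 1][i] - prefix[w - 1][k];
                // a hand-off only happens if earlier workers ran layers
                // and this worker runs a non-empty block
                let hop = if k > 0 && i > k { handoff } else { 0.0 };
                let cand = dp[w - 1][k] + block + hop;
                if cand < dp[w][i] {
                    dp[w][i] = cand;
                }
            }
        }
    }
    dp[m][n]
}
```

The three nested loops (workers, layers, split points) give the O(n^2 m) bound; because blocks are contiguous and worker order is fixed, this is the Ordered Partition structure the abstract refers to. Extending the block cost to include memory feasibility and cached-weight discounts does not change the asymptotic complexity.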
Additional information:
Thesis not yet received by the library - data not verified. Variant title per the author's own translation.