High parameter counts of frontier large language models (LLMs) restrict inference to well-resourced institutions with server-grade hardware. Distributed inference across heterogeneous consumer devices offers a path forward; however, existing systems require manual configuration and setup, limiting accessibility to expert users. We propose a dynamic worker-aware graph partitioning algorithm for ONNX-based LLMs that reduces model partitioning to a variant of the Ordered Partition Problem, solvable in O(n^2 m) time for n layers and m workers. The algorithm jointly optimizes over worker memory, execution speed, network conditions, and cached model weights to minimize end-to-end inference latency. We integrate this algorithm into a distributed inference server implemented in over 5,500 lines of Rust. The server dynamically repartitions the model at runtime in response to workers joining or leaving the system. Using the browser as a distribution mechanism enables zero-setup participation, eliminating the need for manual worker configuration. Empirical evaluation shows that our cost model achieves a mean absolute percentage error (MAPE) of 8.4% overall and 4.4% for large models. Dynamic partitioning consistently outperforms static equal-layer splitting, and an ablation study confirms that each worker metric contributes meaningfully to assignment quality. We demonstrate the system recovering from unexpected worker disconnects and distributing a model totaling 60 GB of weights across heterogeneous devices.
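To illustrate the shape of the O(n^2 m) ordered-partition dynamic program described above, the following is a minimal sketch in Rust. It assumes a deliberately simplified cost model: each worker has a per-layer compute time, and each hand-off between consecutive workers incurs a fixed network cost. The real cost model in the thesis additionally weighs worker memory limits, link conditions, and cached weights; the function name `partition` and the `handoff` parameter are illustrative, not taken from the system's code.

```rust
/// Sketch of an O(n^2 * m) ordered-partition DP.
/// `layer_cost[w][i]` = time for worker `w` to run layer `i`;
/// `handoff` = fixed network cost per hop between consecutive workers.
/// Returns the minimal end-to-end latency (simplified model).
fn partition(layer_cost: &[Vec<f64>], handoff: f64) -> f64 {
    let m = layer_cost.len();    // number of workers, in pipeline order
    let n = layer_cost[0].len(); // number of layers

    // prefix[w][i] = total compute time of layers 0..i on worker w
    let prefix: Vec<Vec<f64>> = layer_cost
        .iter()
        .map(|c| {
            let mut p = vec![0.0; n + 1];
            for i in 0..n {
                p[i + 1] = p[i] + c[i];
            }
            p
        })
        .collect();

    // dp[w][i] = min latency covering the first i layers with the first w workers
    let mut dp = vec![vec![f64::INFINITY; n + 1]; m + 1];
    dp[0][0] = 0.0;

    for w in 1..=m {
        for i in 0..=n {
            for k in 0..=i {
                // worker w-1 takes layers k..i as one contiguous block
                let block = prefix[w - 1][i] - prefix[w - 1][k];
                // a hand-off only happens if earlier workers ran layers
                // and this worker runs a non-empty block
                let hop = if k > 0 && i > k { handoff } else { 0.0 };
                let cand = dp[w - 1][k] + block + hop;
                if cand < dp[w][i] {
                    dp[w][i] = cand;
                }
            }
        }
    }
    dp[m][n]
}
```

The three nested loops (workers, layers, split points) give the O(n^2 m) bound; because blocks are contiguous and worker order is fixed, this is the Ordered Partition structure the abstract refers to. Extending the block cost to include memory feasibility and cached-weight discounts does not change the asymptotic complexity.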
Additional information:
Thesis not yet received by the library - data not verified. Variant title per the author's own translation.