Kitzberger, G. (2026). An Architecture for Web-Based Distributed LLM Inference [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.138985
LLMs reach many billions of parameters, requiring hundreds of gigabytes of memory to run inference for the largest frontier models. This restricts access to large organizations with server-grade hardware. Distributed LLM inference enables multiple devices to serve models that would be too large for any single device. However, existing solutions remain restrictive. In particular, common inference servers, such as vLLM, assume homogeneous hardware, whereas frameworks like Petals necessitate manual configuration and setup.

To this end, we propose an architecture for web-based distributed LLM inference. Using the browser as a distribution tool, we enable zero-cost deployment: participants connect to a website, and a Rust-based orchestrator automatically assigns model partitions. The system uses ONNX Runtime Web for browser-based inference, WebGPU for acceleration, and Protocol Buffers over WebSockets for efficient communication. To accommodate hardware heterogeneity, we implement a dynamic partitioning algorithm that minimizes end-to-end latency based on worker capabilities.

Our experiments show that the orchestration server introduces negligible overhead, accounting for less than 0.15% of the inference time. In high-bandwidth environments, networking time remains below 25% for up to 10 workers, suggesting that execution is primarily compute-bound. Integrating token decoding into the ONNX graph reduces data transfer by 75 MB per request, while prefix caching improves Time to First Token (TTFT) by up to 30%. Although throughput is lower than that of local inference engines such as Ollama, we empirically demonstrate that web-based distributed inference is a viable, zero-setup solution for accessing large-scale models.
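The capability-aware partitioning described above can be illustrated with a minimal sketch. Assuming each worker reports a single scalar throughput estimate (the function name, signature, and proportional-split heuristic here are illustrative assumptions, not the thesis's actual algorithm), transformer layers are assigned in proportion to worker speed so that the slowest stage no longer dominates end-to-end latency:

```rust
// Illustrative sketch only: split `total_layers` transformer layers across
// workers in proportion to each worker's reported throughput, so faster
// devices receive more layers. Names and heuristic are assumptions.
fn partition_layers(total_layers: u32, throughputs: &[f64]) -> Vec<u32> {
    let total: f64 = throughputs.iter().sum();
    // Ideal share per worker, rounded down to whole layers.
    let mut shares: Vec<u32> = throughputs
        .iter()
        .map(|t| ((t / total) * total_layers as f64).floor() as u32)
        .collect();
    // Hand leftover layers to the fastest workers first.
    let mut order: Vec<usize> = (0..throughputs.len()).collect();
    order.sort_by(|a, b| throughputs[*b].partial_cmp(&throughputs[*a]).unwrap());
    let mut assigned: u32 = shares.iter().sum();
    let mut i = 0;
    while assigned < total_layers {
        shares[order[i % order.len()]] += 1;
        assigned += 1;
        i += 1;
    }
    shares
}

fn main() {
    // Two fast workers and one slow worker sharing a 32-layer model.
    let plan = partition_layers(32, &[2.0, 2.0, 1.0]);
    println!("{:?}", plan); // the slow worker gets roughly half a fast worker's share
}
```

A real orchestrator would also have to account for memory limits and inter-worker link bandwidth, which a pure throughput-proportional split ignores.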
Additional information:
Thesis not yet received by the library - data not verified