Kitzberger, G. (2026). An Architecture for Web-Based Distributed LLM Inference [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.138985
LLMs reach many billions of parameters, requiring hundreds of gigabytes of memory to run inference for the largest frontier models. This restricts access to large organizations with server-grade hardware. Distributed LLM inference enables multiple devices to serve models that would be too large for any single device. However, existing solutions remain restrictive. In particular, common inference servers, such as vLLM, assume homogeneous hardware, whereas frameworks like Petals necessitate manual configuration and setup.

To this end, we propose an architecture for web-based distributed LLM inference. Using the browser as a distribution tool, we enable zero-cost deployment: participants connect to a website, and a Rust-based orchestrator automatically assigns model partitions. The system uses ONNX Runtime Web for browser-based inference, WebGPU for acceleration, and Protocol Buffers over WebSockets for efficient communication. To accommodate hardware heterogeneity, we implement a dynamic partitioning algorithm that minimizes end-to-end latency based on worker capabilities.

Our experiments show that the orchestration server introduces negligible overhead, accounting for less than 0.15% of the inference time. In high-bandwidth environments, networking time remains below 25% for up to 10 workers, suggesting that execution is primarily compute-bound. Integrating token decoding into the ONNX graph reduces data transfer by 75 MB per request, while prefix caching improves Time to First Token (TTFT) by up to 30%. Although throughput is lower than that of local inference engines such as Ollama, we empirically demonstrate that web-based distributed inference is a viable, zero-setup solution for accessing large-scale models.
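The capability-aware partitioning described above can be illustrated with a minimal sketch. Assuming each worker reports a single scalar throughput estimate (the function name, signature, and proportional-split heuristic here are illustrative assumptions, not the thesis's actual algorithm), transformer layers are assigned in proportion to worker speed so that the slowest stage no longer dominates end-to-end latency:

```rust
// Illustrative sketch only: split `total_layers` transformer layers across
// workers in proportion to each worker's reported throughput, so faster
// devices receive more layers. Names and heuristic are assumptions.
fn partition_layers(total_layers: u32, throughputs: &[f64]) -> Vec<u32> {
    let total: f64 = throughputs.iter().sum();
    // Ideal share per worker, rounded down to whole layers.
    let mut shares: Vec<u32> = throughputs
        .iter()
        .map(|t| ((t / total) * total_layers as f64).floor() as u32)
        .collect();
    // Hand leftover layers to the fastest workers first.
    let mut order: Vec<usize> = (0..throughputs.len()).collect();
    order.sort_by(|a, b| throughputs[*b].partial_cmp(&throughputs[*a]).unwrap());
    let mut assigned: u32 = shares.iter().sum();
    let mut i = 0;
    while assigned < total_layers {
        shares[order[i % order.len()]] += 1;
        assigned += 1;
        i += 1;
    }
    shares
}

fn main() {
    // Two fast workers and one slow worker sharing a 32-layer model.
    let plan = partition_layers(32, &[2.0, 2.0, 1.0]);
    println!("{:?}", plan); // the slow worker gets roughly half a fast worker's share
}
```

A real orchestrator would also have to account for memory limits and inter-worker link bandwidth, which a pure throughput-proportional split ignores.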
Additional information:
Thesis not yet received by the library - data not verified