Distributed Training of Deep Learning Models in the Edge-Cloud-Space Continuum

Maresch, Maximilian

doi:10.34726/hss.2026.136960

Record link:

https://doi.org/10.34726/hss.2026.136960
http://hdl.handle.net/20.500.12708/228892

Title:

Distributed Training of Deep Learning Models in the Edge-Cloud-Space Continuum

Citation:

Maresch, M. (2026). Distributed Training of Deep Learning Models in the Edge-Cloud-Space Continuum [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.136960

reposiTUm DOI:

10.34726/hss.2026.136960

CatalogPlus:

AC17902037

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Maresch, Maximilian

Advisor:

Nastic, Stefan

Co-advisor:

Stanisic, Andrija

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2026

Number of Pages:

Keywords:

AI Training; Distributed Training; Edge Cloud Space Continuum; Compound AI; Distributed AI; Deep Learning Models; Data Parallelism; Hybrid Pipeline Parallelism

Abstract:

Edge devices are becoming more capable and widespread than ever, offering significant compute capacity and low-latency access to raw data. In the context of the edge-cloud-space continuum, edge devices are emerging in novel forms, such as low Earth orbit satellites. Existing solutions for distributed training in edge computing environments employ hybrid pipeline parallelism. However, their high profiling and planning overhead prevents them from accommodating the edge-cloud-space continuum and its mobility concerns, which hinders applicability in scenarios where training participants change frequently. In this work, we investigate the ability to leverage edge, cloud, and space resources for distributed training and tackle selected challenges with fast-moving nodes and imbalanced memory limits, motivated by the increasing availability of such highly mobile edge resources. We introduce LoftNN, a framework for distributed training in the edge-cloud-space continuum. LoftNN leverages hybrid pipeline parallelism to support edge-cloud-space settings. We propose a novel planning algorithm that reduces the planning overhead by a factor of more than 97 and allows scaling to a larger number of devices while retaining competitive performance. Furthermore, to address the memory limitations of edge and space devices, we propose a novel algorithm to determine activation checkpointing budgets, enabling 70% larger models to be trained on the same infrastructure. Lastly, LoftNN offers an experimentation platform that covers multiple types of parallelism and, unlike existing solutions, can easily be integrated into existing training scripts.

Additional information:

Thesis not yet received by the library - data not verified

License:

In Copyright

Appears in Collections:

Thesis