Evaluating the impact of faults on the Quality of Service of edge-cloud applications

Sepin, Stefan

doi:10.34726/hss.2026.133116

DC Field

Value

Language

dc.contributor.advisor

Dustdar, Schahram

dc.contributor.author

Sepin, Stefan

dc.date.accessioned

2026-05-07T08:14:56Z

dc.date.issued

2026

dc.date.submitted

2026-03

dc.identifier.citation

<div class="csl-bib-body"> <div class="csl-entry">Sepin, S. (2026). <i>Evaluating the impact of faults on the Quality of Service of edge-cloud applications</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.133116</div> </div>

dc.identifier.uri

https://doi.org/10.34726/hss.2026.133116

dc.identifier.uri

http://hdl.handle.net/20.500.12708/227989

dc.description

Thesis not yet received by the library - data not verified

dc.description.abstract

With the rise of the Internet of Things (IoT), the limits of traditional cloud computing are becoming apparent. Sensors at the edge of the network can be part of applications with high Quality of Service (QoS) requirements, such as low latency or high throughput. Cloud computing alone cannot meet these requirements because of its centralized nature and the physical distance between edge devices and data centers. The concept of edge computing was therefore introduced, in which parts of the computation run directly on edge devices to minimize latency. However, applications that run at the edge face additional challenges compared to pure cloud applications. Unreliable wireless networks can cause faults such as packet loss, limited bandwidth, or even network partitioning. Resource-constrained edge devices can reach their limits under high workload, and outages or hardware defects can cause application crashes. The effects of such faults on edge applications have not yet been studied comprehensively, and data illustrating the relationship between faults and QoS metrics is lacking. In this thesis, we present an approach for evaluating the impact of faults on the QoS of edge-cloud applications by means of fault injection experiments. Two applications from the domain of edge AI inference are deployed on a Kubernetes (k8s) cluster that runs on a physical testbed consisting of several edge devices. QoS metrics are collected by the Prometheus monitoring system, while faults are injected using Chaos Mesh. The experimental data collected is used to build a dataset that illustrates the behavior of the applications during various fault scenarios.
This dataset is then analyzed with both statistical and Explainable Artificial Intelligence (XAI) methods to identify the patterns that faults cause in the metric data. In addition, we train a Machine Learning (ML) model for fault classification on the dataset to investigate whether these patterns can be used to detect faults. The results show that CPU stress, network packet corruption, and bandwidth limitation lead to service degradation in the form of longer response times and reduced throughput. Faults such as network partition and container failure cause a complete service failure, whereas packet duplication, reordering, and RAM stress have no noticeable effect on the QoS of the applications. Our XAI analysis shows that time-based metrics and resource utilization metrics are particularly important for detecting faults. The fault classifier achieves an accuracy of 83%, which suggests that the patterns found in the monitoring data are characteristic enough to detect most faults. Faults that show similar patterns in the metric data cannot be distinguished by the classifier. Our approach enables developers of edge applications to identify which faults have the greatest impact on their apps. This allows them to focus their efforts on resilience mechanisms against these fault scenarios.

dc.description.abstract

With the emergence of the Internet of Things (IoT), the limitations of traditional cloud computing become apparent. Sensors located at the edge of the network may be part of applications with high Quality of Service (QoS) requirements, such as low latency or high throughput. These requirements cannot be satisfied by cloud computing alone, due to its centralized nature and the physical distance between edge devices and data centers. Therefore, the concept of edge computing was introduced, where parts of the computation are performed directly on edge devices, minimizing latency. However, applications running at the edge face additional challenges compared to cloud-only applications. Unreliable wireless networks may lead to faults such as packet loss, limited bandwidth, or even network partition. Resource-constrained edge devices may struggle under high workloads, and outages or hardware failures may result in application crashes. The effects of such faults on edge applications have not yet been studied extensively, and there is a lack of data that illustrates the relation between faults and QoS metrics. In this thesis, we present an approach for evaluating the impact of faults on the QoS of edge-cloud applications by performing fault injection experiments. Two applications from the domain of edge AI inference are deployed on a Kubernetes (k8s) cluster running on a physical testbed consisting of multiple edge devices. QoS metrics are collected by the monitoring system Prometheus, while faults are injected through Chaos Mesh. The captured experimental data is used to create a dataset that illustrates the applications’ behavior during various fault scenarios. This dataset is then analyzed using both statistical and Explainable Artificial Intelligence (XAI) methods to identify the patterns that faults cause in metric data.
Additionally, we train a fault classification Machine Learning (ML) model on the dataset to investigate whether these patterns can be used to detect faults. The results show that CPU stress, network packet corruption, and bandwidth limitation lead to service degradation in the form of longer response times and reduced throughput. Faults like network partition and container outage cause complete service failure, whereas packet duplication, reordering, and RAM stress have no noticeable effects on the applications’ QoS. Our XAI analysis shows that time-based and resource utilization metrics are especially important for detecting faults. The fault classifier achieves an accuracy of 83%, indicating that the patterns found in the monitoring data are sufficiently distinctive for detecting most faults. Faults that exhibit similar patterns in the metric data cannot be distinguished by the classifier. Our approach enables edge application developers to identify which faults have the highest impact on their apps, allowing them to focus their efforts on resilience mechanisms against these fault scenarios.

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Edge computing

dc.subject

Fault injection

dc.subject

Chaos Engineering

dc.subject

Quality of Service

dc.subject

Dependability

dc.subject

Resilience

dc.title

Evaluating the impact of faults on the Quality of Service of edge-cloud applications

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2026.133116

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Stefan Sepin

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Raith, Philipp Alexander

tuw.publication.orgunit

E194 - Institut für Information Systems Engineering

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC17856765

dc.description.numberOfPages

115

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.advisor.orcid

0000-0001-6872-8821

tuw.assistant.orcid

0000-0003-3293-9437

item.mimetype

application/pdf

item.languageiso639-1

item.fulltext

with Fulltext

item.grantfulltext

open

item.openaccessfulltext

Open Access

item.openairetype

master thesis

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.cerifentitytype

Publications

crisitem.author.dept

E194 - Institut für Information Systems Engineering

crisitem.author.parentorg

E180 - Fakultät für Informatik

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(6.08 MB)

In Copyright
