Resource conflict handling in computing continuum using deep reinforcement learning

Popescu-Vifor, Vlad

doi:10.34726/hss.2025.127561

Record link:

https://doi.org/10.34726/hss.2025.127561
http://hdl.handle.net/20.500.12708/220245

Title:

Resource conflict handling in computing continuum using deep reinforcement learning

Citation:

Popescu-Vifor, V. (2025). Resource conflict handling in computing continuum using deep reinforcement learning [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.127561

reposiTUm DOI:

10.34726/hss.2025.127561

CatalogPlus:

AC17677776

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Popescu-Vifor, Vlad

Advisor:

Dustdar, Schahram

Co-advisor:

Murturi, Ilir

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2025

Number of Pages:

Keywords:

Resource-Oriented Orchestration; Performance-Aware Resource Management; Conflict Handling; Resource Management Framework; Simulation of Conflicts; Deep Reinforcement Learning (DRL); Distributed Systems; Distributed Systems Group (DSG); Edge Computing; Cloud Computing; Kubernetes; Autonomous Agents; Neural Networks; Cluster Infrastructure

Abstract:

This research examines conflict-handling mechanisms in resource-oriented orchestration within computing continuum environments and proposes a solution for handling conflicts between resource management agents. These agents handle resource requests and allocate resources to deployments to achieve optimal performance. When requests contradict each other, however, the agents enter a state of conflict, which leads to an endless loop of changes in resource allocation and degrades the performance of the affected services. This thesis aims to mitigate such conflicts with a deep reinforcement learning (DRL) model. By integrating neural network models and adaptive reinforcement learning techniques, the proposed solution seeks to improve the efficiency and stability of resource management in the cloud when resource allocation conflicts occur.
The main contribution of this research is a framework that serves as a blueprint for creating a simulation environment in which resource conflicts can be triggered on services running under the criteria of the computing continuum. In addition to the proposed resource management system, the framework includes a DRL system that communicates with the resource management system to suggest the best-fitting optimization steps for services that might otherwise enter a conflicting state or keep running in a state that wastes more resources than required. The reinforcement learning aspect of this system relies on communication between the components of the proposed framework. Because the system is performance- and latency-aware, a watcher component can send feedback to the DRL model if the performance of the services degrades after the neural network's decision is applied. With this functionality, the DRL model can improve itself based on previous experience.
Multiple component tests and benchmarks on the implemented framework show that it can successfully detect conflicts between contradicting agents and apply resource quotas to whole nodes, mostly without triggering performance issues. If performance degradation occurs, the DRL system corrects itself so that services with different thresholds and resource requirements are treated differently. The results show that it is possible to create a framework that generates conflicts between resource management agents in a cloud-based system and has them resolved by the DRL system. This implies that the resources allocated on a node can be reduced by a specified quota in a performance-aware and self-repairing resource management environment.
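
The record does not describe how the framework detects conflicts or how the watcher scores a decision, so the following is only an illustrative sketch under stated assumptions, not the thesis's implementation. It assumes a conflict shows up as two agents alternately requesting different CPU allocations for the same service, and that the watcher rewards saved resources unless a latency threshold is exceeded. All names (`Request`, `find_conflicts`, `watcher_reward`, the `window` and `penalty` parameters) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    agent: str           # hypothetical agent identifier
    service: str         # deployment the request targets
    cpu_millicores: int  # CPU allocation the agent asks for


def find_conflicts(history, window=4):
    """Return services whose last `window` requests alternate between
    different agents asking for different allocations (assumed conflict signal)."""
    by_service = defaultdict(list)
    for req in history:
        by_service[req.service].append(req)

    conflicts = set()
    for service, reqs in by_service.items():
        recent = reqs[-window:]
        if len(recent) < window:
            continue  # not enough history to call it a loop
        pairs = zip(recent, recent[1:])
        if all(a.agent != b.agent and a.cpu_millicores != b.cpu_millicores
               for a, b in pairs):
            conflicts.add(service)
    return conflicts


def watcher_reward(cpu_before, cpu_after, latency_ms,
                   latency_threshold_ms, penalty=1.0):
    """Assumed feedback signal: fraction of CPU saved, or a fixed
    negative penalty if the service became too slow."""
    if latency_ms > latency_threshold_ms:
        return -penalty
    return (cpu_before - cpu_after) / cpu_before


history = [
    Request("agent-a", "svc-1", 500),
    Request("agent-b", "svc-1", 200),
    Request("agent-a", "svc-1", 500),
    Request("agent-b", "svc-1", 200),
    Request("agent-a", "svc-2", 300),
]
conflicting = find_conflicts(history)
```

In this sketch `svc-1` is flagged because two agents keep overriding each other, while `svc-2` is not. A DRL model trained on such a reward would be pushed toward reductions that save resources without breaching the latency threshold, which matches the performance-aware behaviour the abstract describes only in general terms.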

License:

In Copyright

Appears in Collections:

Thesis