D* Lite algorithm vs Dyna Q+ Algorithm for navigating agents in a railway network

Rothschedl, Dominik

doi:10.34726/hss.2022.87562

Datensatz Zitierlink:

https://doi.org/10.34726/hss.2022.87562
http://hdl.handle.net/20.500.12708/19822

Titel:

D* Lite algorithm vs Dyna Q+ Algorithm for navigating agents in a railway network

Zitat:

Rothschedl, D. (2022). D* Lite algorithm vs Dyna Q+ Algorithm for navigating agents in a railway network [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.87562

reposiTUm-DOI:

10.34726/hss.2022.87562

CatalogPlus:

AC16486971

Publikationstyp:

Hochschulschrift - Diplomarbeit

Sprache:

Englisch

Autor_innen:

Rothschedl, Dominik

Betreuer_in:

Breitenecker, Felix

Mitbetreuer_innen:

Bicher, Martin

Organisationseinheit:

E101 - Institut für Analysis und Scientific Computing

Datum (veröffentlicht):

2022

Umfang:

Keywords:

Neural Networks; Agenten-basierte Modellbildung; Heuristische Pfadsuche; Reinforcement Learning; Q-Learning; Deep Reinforcement Learning

Neural Networks; Agent Based Model; Heuristic path finding; Reinforcement Learning; Q-Learning; Deep Reinforcement Learning

Abstract:

Diese Arbeit untersucht Pfadfindungsalgorithmen, um den zeiteffizientesten Weg für Agenten in einem dynamischen Modell, das auf einem realen Eisenbahnnetz basiert, zu finden. Aufgrund von zeitabhängigen Eigenschaften, wie ausgelastete Bahnhöfe und somit der daraus resultierenden Verspätung, sind herkömmliche Pfadfindungsalgorithmen wie A* nicht zufriedenstellend. Im Zuge dieser Fragestellung analysieren wir das Verhalten von zwei spezifischen Algorithmen: dem heuristischen D*-Lite Algorithmus und der Reinforcement Learning Methode Dyna Q+. Der D*-Lite Algorithmus ist eine inkrementelle heuristische Suchmethode, die Informationen wiederverwendet, um in unbekanntem Terrain zielgerichtet zu navigieren. Dieser Algorithmus kann in seiner ursprünglichen Form blockierte Knoten mittels Überprüfung der Umgebung in jedem Schritt erkennen. Durch eine Anpassung bezieht er in unserem Fall auch in Zukunft mögliche Entwicklungen der Auslastung eines Knoten mit ein. Außerdem wird das Konzept des Reinforcement Learning vorgestellt, welches zu der Anwendung des sogenannten Dyna Q+ Algorithmus führt. Dieser lernt durch einen Markov Decision Process (MDP) die optimale Strategie, um sein Ziel zu erreichen. Er ermittelt für jeden Zustand die beste Aktion, genauer gesagt jene welche die größte Belohnung bringt, mittels einer Kombination aus bisher Gelerntem und Erkundung. Einer der größten Schwächen von Dyna Q+ ist die Speicherung der Werte für jede mögliche Kombination eines Zustands und einer Aktion. Um dies zu umgehen, wird die Deep Variante von dieser Reinforcement Learning Methode eingeführt, welche einen Türöffner zur Theorie der Neuronalen Netze darstellt. Hierbei müssen nur die Modellarchitektur und die Parameter gespeichert werden. Um die Unterschiede der Algorithmen aufzuzeigen, wird einerseits darauf geachtet, ob Pfade überhaupt gefunden werden können und andererseits die qualitativen Unterschiede aufgezeigt. Hierfür betrachten wir anhand zweier Aspekte ob globale oder lokale optimale Lösungen berechnet werden, nämlich die Lösung für den Agenten selbst und wie sich diese auf das Zugnetzwerk auswirkt. Beim ersten Aspekt werden wir die Länge des Pfades und die benötigte Dauer bis zum Erreichen des Ziels betrachten. Außerdem wird ein Augenmerk auf die Aktionen gelegt, die der Agent aufgrund von zeitlich blockierten Stationen nicht durchführen kann. Der zweite Aspekt, der die Unterschiede der Algorithmen aufzeigen soll, ist die Auswirkung des Agenten, der den berechneten Pfad folgt, auf das Schienennetz. Hier wird in Anzahl der blockierten Züge und Minuten der Verzögerung im gesamten System unterteilt. Zusammenfassend beleuchtet diese Arbeit nicht nur die konkrete Aufgabenstellung, sondern versucht auch den allgemeinen Trend zur Nutzung künstlicher Intelligenzen zu begründen und sowohl die Vor- als auch Nachteile dieser Technologie in der theoretischen und praktischen Anwendung aufzuzeigen.

This thesis studies path planning algorithms to find the most time efficient way for agents in a dynamic model based on a real-life railway network. Because of the dynamic nature of blocked nodes during various time steps while operating on an historical timetable, classical path finding algorithms such as A* are not sophisticated enough to allow agents to find a viable path through the system. Therefore, we analyze the behaviour of two specific algorithms: the heuristical D*-Lite algorithm and the Reinforcement Learning method Dyna Q+. The D*-Lite algorithm is an incremental heuristic search method which reuses information from previous searches to navigate goal-directed in unknown terrain. It will be adapted to the specific setting with reusing blocked nodes. Furthermore, the concept of Reinforcement Learning is introduced, which leads to Dyna Q+. This one will find the optimal policy to achieve the goal by a Markov Decision Process (MDP). More precisely, it approximates the best action for each state in the environment through the right balance of exploitation and exploration, and draws its conclusions from the resulting outcomes. With respect to already taken actions in a state, Dyna Q+ will tackle the arising issues of the dynamically changing environment. To circumvent one of the biggest weaknesses of Dyna Q+, where the value of each state-action combination has to be stored in a matrix, Deep variants of Reinforcement Learning methods will be presented. Those are door openers to Neural Network theory, where only the model architecture and parameters need to be stored.In order to show the differences between the algorithms, we consider on the one hand whether paths can be found at all and on the other hand the qualitative differences. For this purpose, we consider two aspects whether global or local optimal solutions are computed, namely the solution for the agent itself and how it affects the railway network. In the first aspect, we will look at the length of the path and the duration needed to reach the destination. We will also pay attention to the actions that the agent cannot perform due to time-blocked stations. The second aspect to show the differences of the algorithms is the impact of the agent following the calculated path on the railway network. Here it is divided into number of blocked trains and minutes of delay in the whole system. This work does not only illuminate the concrete task but also tries to justify the general trend of using artificial intelligence and to shows both the advantages and disadvantages of this technology in the theoretical and practical application.

Weitere Information:

Abweichender Titel nach Übersetzung der Verfasserin/des Verfassers

Lizenz:

Urheberrechtsschutz

Enthalten in den Sammlungen:

Thesis