Towards Improving Monte Carlo Tree Search for Games with Imperfect Information DIPLOMA THESIS submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Logic and Computation by Lukas Grassauer, BSc. Registration Number 01526001 to the Faculty of Informatics at the TU Wien Advisor: Ao.Univ.Prof. Ing. Mag.rer.soc.oec. Dr.rer.soc.oec. Horst Eidenberger Vienna, February 23, 2024 Lukas Grassauer Horst Eidenberger Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Erklärung zur Verfassung der Arbeit Lukas Grassauer, BSc. Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwen- deten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe. Wien, 23. Februar 2024 Lukas Grassauer iii Acknowledgements First and foremost, I want to thank my thesis advisor, Professor Horst Eidenberger, for the invaluable guidance, patience, and for the ongoing encouragement. His advice was pivotal in determining the course and execution of this work. Furthermore, I’d like to express my appreciation to my parents, Dr. Andreas Grassauer, and Dr. Daniela Dörfler, for their continued support, and for believing in me. My gratitude goes out to all those who listened to my ideas. You shaped this thesis, and your support has been invaluable. v Kurzfassung Die Monte-Carlo-Baumsuche (mcts, engl. Monte Carlo tree search) ist ein weit verbrei- teter Suchalgorithmus für Spiele. Bei Spielen mit perfekter Information wurden bereits beeindruckende Ergebnisse erzielt, und der Algorithmus hat sich als bevorzugte Methode im anspruchsvolleren Bereich des allgemeinen Spielens herauskristallisiert. Beim allgemei- nen Spielen müssen Agenten Spiele meistern, ohne deren Regeln vorher zu kennen. Viele Verbesserungen und Optimierungen für mcts sind jedoch bei Spielen mit imperfekter (oder versteckter) Information oder beim allgemeinen Spielen nicht verfügbar. Daher waren Neuentwicklungen für mcts mit dem Fokus auf diese Beschränkungen notwen- dig. Um die Spielstärke zu verbessern, schlug der Autor dieser Arbeit den neuartigen Ansatz vor, die Expansionstiefe des Suchbaums zu begrenzen. Der Ansatz erforderte die Identifizierung einer grundlegenden mcts-Modifikation und ein Mittel zur Konfiguration der maximalen Expansionstiefe. Anschließend wurde der Ansatz in einem Experiment quantitativ bewertet. Dazu musste eine Engine implementiert werden, die in der La- ge ist, allgemeine Spielpartien zu orchestrieren, Spiele mit imperfekter Information zu interpretieren und eine Reihe von Agenten zu entwickeln. In dieser Arbeit werden die wichtigsten Details der Engine- und Agentenimplementierungen vorgestellt. Sie präsen- tiert eine Analyse geeigneter mcts-Modifikationen. Außerdem wird die Verwendung von Merkmalsextraktion als Mittel zur Konfiguration der maximalen Expansionstiefe unter- sucht. Abschließend wird über die Ergebnisse zweier Turniere berichtet, die durchgeführt wurden, um die Veränderung der Spielstärke bei Begrenzung der Expansionstiefe zu be- werten. Für die betrachteten Fälle war die Multi-Beobachter-Informationsset mcts (mo-ismcts, engl. multi observer-information set mcts) die bevorzugte Modifikation. 
Die Merk- malsextraktion war hilfreich, aber nicht ausreichend, um den Agenten mit begrenzter Expansionstiefe zu konfigurieren. Bei diesem Agenten war im Vergleich zum Basis- agenten ein deutlicher Unterschied in der Spielstärke. Überraschenderweise erreichte er in einer Variante von Corridor eine um 6, 67% höhere Gewinnrate (10 mehr, von 150 Spielen) im Vergleich zum Basisagenten. Infolgedessen sollte mo-ismcts mit einer begrenzten Expansionstiefe für Spiele mit imperfekter Information in Betracht gezogen werden. Es bleibt unklar, ob der Vorteil nur für ausgewählte Spiele gilt oder ob er auf eine breitere Klasse von Spielen anwendbar ist. Zukünftige Arbeiten werden sich auf die Klärung dieser Frage und die Weiterentwicklung der Engine konzentrieren. vii Abstract Monte Carlo tree search (mcts) is a commonly used search algorithm for games. Im- pressive results have been achieved in perfect information games, and the algorithm emerged as the preferred method within the more challenging field of general game play- ing, where agents are required to play games without knowing their rules beforehand. However, many improvements and optimizations for mcts are not available in imperfect (or hidden) information games or in general game playing. Therefore, new developments for mcts with a focus on these constraints were necessary. To improve playing strength, the author of this thesis proposed the novel approach of limiting the search tree’s expan- sion depth. The approach required to identify a baseline mcts modification and a means to configure the maximum expansion depth. Subsequently, the approach was quantita- tively evaluated with an experiment. This required implementing an engine capable of orchestrating general game playing matches, interpreting imperfect information games, and a set of agents. This thesis highlights the key details of the engine’s and agent’s implementations. It presents analysis of suitable mcts modifications. Additionally, it examines the use of feature extraction as a means of configuring the maximum expan- sion depth. Finally, the thesis reports on the results of two tournaments, conducted to evaluate the change in playing strength when limiting the expansion depth. For the considered cases, multi observer-information set mcts (mo-ismcts) was the preferred modification. Feature extraction was helpful, but not sufficient for configuring the agent using limited expansion depth. This agent had a distinct playing strength compared to the baseline agent. Surprisingly, in the game of the Corridor family, it achieved a 6.67% higher win rate (10 more, out of 150 matches) compared to the baseline. As a consequence, mo-ismcts with a limited expansion depth should be considered for imperfect information games. It remains unclear whether the advantage is only applicable to the chosen games, or if it applies to a wider class of games. Future efforts will be focused on addressing this question and further developing the engine. ix Contents Kurzfassung vii Abstract ix Contents xi 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Research Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background and Literature Review 7 2.1 General Game Playing . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Games as Trees . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 19 2.3 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3 Methods 31 3.1 Scientific Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Comparison of Existing Modifications . . . . . . . . . . . . . . . . . . 33 3.3 Feature Extraction for Setting the UCT-Border . . . . . . . . . . . . . 34 3.4 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4 Implementation 39 4.1 Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Tree Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5 Results 53 5.1 Effectiveness of Existing Monte Carlo Tree Search Modifications . . . 53 5.2 Characteristic Features of Games . . . . . . . . . . . . . . . . . . . . . 59 5.3 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6 Discussion 69 xi 6.1 Monte Carlo Tree Search Extensions . . . . . . . . . . . . . . . . . . . 69 6.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.3 Limited Expansion Depth Agent . . . . . . . . . . . . . . . . . . . . . 71 7 Conclusion 75 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 7.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 List of Figures 79 List of Tables 81 List of Code Listings 83 Acronyms 85 Bibliography 87 CHAPTER 1 Introduction Games encapsulate complex decision-making processes. They are finite systems with well-defined metrics that have a long tradition in benchmarking agents. The advance- ments in the broader category of artificial intelligence (AI) resonate in the field of agent game playing, which have produced impressive results. Through the improvements made in hardware and development of new methods, games which deemed to be impossible to tackle, became accessible. Despite this, the class of games with imperfect (viz. hidden) information in the context of general game playing (GGP) remain insufficiently explored. This thesis aims to investigate the use of Monte Carlo tree search (MCTS) for playing games of this class. Thereby, providing insights in a novel modification of limiting the expansion depth, and its viability for imperfect information games. The ensuing sec- tions will outline the motivation for the topic, define the problem statement, detail the objectives of this research and present the structure of this thesis. 1.1 Motivation There is a fascination in letting agents play games that reaches back to the dawn of the computer itself. Alan Turing conceptualized a Chess engine before programming in a modern sense was possible (see [29]) and a few decades later the topic reached popular culture, after IBM’s Deep-Blue beat the then-reigning Chess World Champion Garry Kasparov. Recently, interest peaked again after Lee Sedol lost to DeepMind’s AlphaGo in the ancient game of Go, an achievement long thought to be impossible. While playing a specific game is certainly impressive, the field of GGP forces agents to be more adaptive, as they have to be able to play any game. The leading method for this task is MCTS, which has received significant attention after the result of the match between Lee Sedol and AlphaGo. Games serve as the ideal proving ground for algorithms. 
They are approachable for a broad audience, yet rich complexity arises from them. They allow systematic exploration 1 1. Introduction and evaluation of search methods, pattern recognition and strategy learning. They have a clear metric of success— win, tie or loss, which is often lacking in other fields. There is a tight conceptual link between game playing and AI-systems. Such systems act upon input received from their environment. Games provide low-stakes playground and benchmark environments. They challenge agents in planning, foresight and decision- making, features commonly associated with intelligence. GGP confronts the agent with unknown environments. It requires agents to play games without human intervention, and without knowing the rules beforehand. Real-world problems require agents to adapt to the rules of the situation, while staying performant enough to be able to deal with new constraints. Specialization has to be done ad-hoc or not at all. A fascinating class of games are those with imperfect information. These include ele- ments of hidden information. Think of examples like Poker, where the opponent’s hands are not known to a player, but also games like Kriegspiel (sometimes called Phantom Chess), where the moves are not communicated to the opponent. In contrast, are perfect information games, for example Go, where the whole state is known, and each move is clearly visible for all players. Imperfect information games are analogous to problems where not every detail is avail- able to the agent. Think of the field of medical diagnosis, where critical data about the patient might be missing. Another example would be financial forecasting, where it is sometimes in the best interest for some agents to hide their moves to gain an advantage. Or consider the discipline of autonomous driving, in which many uncertainties arise naturally— other vehicles and the weather conditions obscure the view of the agent. The synthesis of playing imperfect information games in a GGP context is a true challenge for agents. It pushes the methods available today to their limits and serves as a proper benchmark for performance. It allows measuring if an algorithm is ready for the messy incomplete data encountered in the real world and can handle unseen as well as unknown rules. In the long history of search methods, MCTS has emerged as the most promising method for both GGP and imperfect information games. It uses a sampling method to replace a heuristic function, which is often required for other methods to perform well enough. Additionally, because of the sampling nature of the algorithm, dealing with hidden in- formation is conceptually only a short leap. As the hardware improved, results grew became more notable and new avenues and methods opened up, such as machine learning (ML). As a result, agents using MCTS with neural nets guiding the search, quickly became the state-of-the-art for solving specific games, where a heuristic developed with expert-knowledge used to be the best available method. Many suggested methods remain unexplored and to be evaluated to this day. Subsequently, the results MCTS produced were impressive. AlphaGo beat Lee Sedol 4:1 using MCTS with two deep neural networks (see Silver et al. [25]). The networks were 2 1.2. Problem Statement trained by reinforcement learning using self-play. The approach was further demon- strated to be viable for the games Chess and Shogi in Silver et al. [24]. 
In principle, the technique could be used for any game, given that the rules are known beforehand and sufficient resources are available. Note that these conditions are not met in a GGP context. 1.2 Problem Statement In the discipline of GGP, games with imperfect information present a unique challenge. The combinatorial explosion due to the limited view and the partially observable moves, necessitates algorithms capable of effectively handling uncertainty, while staying appli- cable for all and any game. MCTS is a powerful tool for perfect information contexts, and impressive results have been produced using it as the primary method for game playing. Without the correct modifications, however, it is not effective when applied to games with imperfect information. Furthermore, many optimizations and approaches available normally, are no longer available in a GGP context. The lack of knowledge in imperfect information games introduces a new quality of com- plexity. Think of the game Battleship. Battleship is a two-player game where each player has an arrangement of ships on a grid that are hidden from the opponent. In turn, each player shoots the enemies’ grid and the other player announces if it was a hit or a miss. Consider a variant of this game where only one of the player’s ships were hidden. Certainly, this player would have a significant advantage, as the other player would be forced to guess, while they themselves could pick off the opponent’s ships without ever missing. The player with this advantage does not need to consider the uncertainty. Agents are no longer solely responsible for devising their strategies but are also required to consider the available data, the potential knowledge of their adversaries, and the potential adaptation of new information upon its availability. Prior work has focused on improving MCTS for perfect information contexts, and the remaining research is divided between GGP and imperfect information games. There is a significant gap in performance between games with imperfect information and perfect information scenarios, where impressive results have been achieved. This is especially apparent in the discipline of GGP, where game-specific knowledge cannot be used to improve playing strength. However, many of the optimizations carry over and might be used in the joint context of GGP and imperfect information games. It remains to be revised and tested which of these optimizations are effective for the joint case. When playing games with imperfect information, there is a significant increase in com- plexity, because of the uncertainty about the game’s state. Predictions made about the state of the game decrease in value as the distance from the current state increases. The conditionals of the uncertainty compound. As the expansion depth of MCTS is un- bounded, the procedure fails to deal with this efficiently. This is addressed as by a novel modification, proposed by the author. This modification sets a limit to the maximum expansion depth available to MCTS. 3 1. Introduction A fundamental limitation of playing games without knowing the rules beforehand is that aids like handcrafted heuristics are not available, or guiding the search with neural networks trained by reinforcement learning is infeasible. Nonetheless, the literature suggests different approaches for generating such improvements just-in-time, including statically analyzing the ruleset and dynamically sampling the games. 
Both methods allow extracting characteristic features of the game and can subsequently be used for ad- hoc heuristics. This approach will also be refered to as feature extraction. Previous work has not yet produced enough data to test if this approach is also viable for informing the value of the limit of the expansion depth. By contributing quantitative and qualitative evaluations, the author intents to firm the use of these approaches. As GGP and MCTS have been a topic of discussion for a relatively long time, many implementations exist. There is no agreed standard, and the tooling surrounding is often outdated. Furthermore, the focus of these tools is more often than not on perfect information games. Thus, there is no apt implementation of an engine capable of GGP and imperfect information games. One of the contributions of this thesis is such an artifact. 1.3 Research Object The contribution of this thesis is fourfold. Firstly, it surveys the available modifications for MCTS, while it provides reference implementations for the most important of those. Secondly, it studies the viability of limiting the expansion depth by experiment. In this experiment, GGP-agents played against each other using a custom implementation of an engine ready for imperfect information games. The data resulting from the experiment serves as the third contribution, while the engine used is the fourth and final contribution of the thesis. The available literature presents many modifications of MCTS for games with imperfect information. Some of those modifications work better than others, and many share their fundamentals. The author of this thesis aimed to examine the large body of work and use qualitative analysis on two concrete examples to answer the question of which of the modifications for MCTS are effective for playing imperfect information games in a GGP context. As MCTS is a general method, it can be configured to suite a specific problem. Similarly, the proposed modification of limiting the maximum expansion depth, can be adapted by changing where this cap actually is. During the quantitative evaluation of the validity of this novel approach, the author used static and dynamic feature extraction on two example games to inform the value of the limit. From this, a suggestion for a more general method for arbitrary games was extracted. This answered the question of which features can be extracted from descriptions of imperfect information games to construct a heuristic to set the maximum expansion depth for the MCTS algorithm. The metric of success – frequency of wins – is clearly defined in GGP, thus quantitative 4 1.4. Outline evaluation is relatively accessible. Often, previous work uses running tournaments of games as experiments to indicate how well a method does. To evaluate how a MCTS- agent with limited expansion depth compares in benchmarks, the author conducted such an experiment as a measure of quantitative evaluation. The results show that in special cases, the approach appears to be viable. Besides the data produced during the benchmark of the modification, the engine used for the experiment serves as a workbench for research in playing games. It is writ- ten in Python using contemporary software engineering principles. Together with the implementation of agents with effective modifications of MCTS for playing games with imperfect information and a GGP-orchestrator, it is a comprehensive work environment for future experiments in this field. 
1.4 Outline This thesis is divided into seven chapters. Chapter 1 motivates the broader topic, states the fields’ most recent problems, lists the research hypotheses and finally outlines the thesis. Chapter 2 provides background and reviews the available literature. In chapter 3 a discussion of scientific method is followed by a detailed report of the experiment setup. An overview of the engine’s and agent’s implementation is provided by chapter 4. The results of the research can be found in chapter 5, and they are discussed in chapter 6. Finally, a summary, outlook for future work, and most important contributions are presented in chapter 7. Chapter 2 will lay out the necessary theoretical background while summarizing the prior art. Section 2.1 introduces the elementary principles and the protocol of GGP. Fur- thermore, it highlights the difference between perfect and imperfect information games from the context of GGP, and finally introducing the game description language (GDL). In section 2.2 the necessary definitions and concepts for viewing perfect and imperfect information games as trees, are introduced. These preliminaries are vital for section 2.3, which reviews MCTS and the most important modifications for playing imperfect infor- mation games found in literature. The scientific methods and experiment setup used in this thesis are discussed in chapter 3. At first, the overall scientific approach is reviewed in section 3.1, and afterward the details for the qualitative and quantitative evaluations are laid out. Section 3.2 covers the comparison of the existing MCTS modifications, while section 3.3 discusses the qualitative approach for extracting features, that are used to set the maximum expansion depth. Finally, section 3.4 goes into detail for how the quantitative evaluation was planned and conducted. The engine used for the experiment is the primary focus of chapter 4. This chapter touches upon the most interesting details, namely the overall architecture of the engine, the interpreter, and the match orchestration in section 4.1, while section 4.2 dives deep 5 1. Introduction into the tree data structure and section 4.3 outlines implementation details for the agents themselves. Chapter 5 reports on the results of the research. The qualitative evaluation of the existing modifications for MCTS can be found in section 5.1, while the results for the static feature extraction may be read in section 5.2. The data produced by the experiment is in section 5.3. A discussion of the results may be read in chapter 6. The chapter is sectioned according to the research questions. Finally, chapter 7 summarizes the thesis, concludes the findings, gives perspectives for future work, and reviews the most important contributions. 6 CHAPTER 2 Background and Literature Review This chapter introduces the required background for tracing the experiment setup (see section 3.4), to understand the implementation (see chapter 4), and reviews the prior-art. Section 2.1 gives an introduction to the field of general game playing (GGP). Section 2.2 presents the background and reviews the literature for viewing games as trees in the context of GGP. The search method Monte Carlo tree search (MCTS) is introduced in section 2.3. 2.1 General Game Playing General game playing (GGP) is the challenge of playing games only by their definition, and no prior knowledge of them. It is a generalization of the field of autonomous com- puter playing. 
Through the generalization, the implementation of a search algorithm is less prone to the bias of domain knowledge. GGP encompasses both perfect and imperfect information games. This section of the thesis introduces the general concepts (see subsection 2.1.1), the commonly used protocol (see subsection 2.1.2), GGP in the context of perfect information (see subsection 2.1.3), as well as imperfect information (see subsection 2.1.4), and a standardized language for defining games (see subsection 2.1.5).

2.1.1 Elementary Principles

The terminology around games is highly overloaded. To avoid confusion, this subsection discusses the most important terms and introduces the elementary principles underlying GGP. The terms are game, action, actor, role, move, turn, ply, history, and match.

Figure 2.1: An abstract entity-relationship diagram of the terms used for games.

The word game is a synonym for its rules. The rules define a deterministic and finite state machine. A state machine has an internal state that is changed by actions. A state is defined by a set of atoms, which are either true or false. Actions are performed by the actors.

A role is a distinct entity in a game, assumed by an actor. There can be any non-zero, finite number of roles. In the example of Chess, there are two roles, one role controlling the white pieces, and the other role controlling the black pieces. Poker may have three or more roles, a dealer and at least two players.

A move is a pair consisting of a role and an action. Given a state, a move can be legal or illegal. Not every move is legal in every state, and not all roles can do the same actions (and by extension the same moves). The state contains the information which role has to do an action. As a convention, this thesis uses “a role is in control” instead of “a role has to do an action”.

Turns are groups of moves done at the same time. A turn consists of a move from every role that is in control. Depending on the roles that are in control, a turn can consist of a single move or multiple moves.

A ply is a state-turn pair. Plies are counted and ordered. The first ply contains the initial state and the first turn. The second ply consists of the first turn applied to the initial state, and so forth.

A series of plies is called a history (see also [20]). This thesis discerns between open and closed histories. The last state of an open history is not terminal, while a closed history ends in a terminal state.

A match describes an instance of a game. Therefore, the sentence “Player 𝐴 and Player 𝐵 play a match of Chess” precisely means that two players 𝐴 and 𝐵 use the rules of the game Chess to play against each other once. This definition relieves the word game of that second interpretation. Each match has a history. The history serves as the record.

2.1.2 Protocol

To conduct GGP matches, Genesereth, Love, and Pell [10] suggest a protocol. The protocol tasks a gamemaster with the mediation of the match. Furthermore, it conceptually splits a match into four distinct phases and defines how the phases change.

The gamemaster is the match's administrator and, by assuming this role, ensures that no actor has an advantage over the other actors.
The gamemaster has to keep the canonical copy of the state (separate to the copies of the actors) and progress it once all actors made their moves. Furthermore, it keeps the actors informed about the game. An actor cannot manipulate the state to its advantage, as the gamemaster only allows legal moves. A match can be in one of four phases, the start-, play-, completed-, and aborted-phase (see fig. 2.2). During the start-phase, the actors are informed about the game they are playing. The actors submit their moves during the play phase, which leads to another iteration of the play-phase or is followed by either transitioning to completed or aborted. A match is completed if a terminal state is reached, should the match end in any other way, it is aborted. Before the match starts, the gamemaster sends to the actors, the rules, the mapped roles and how much time the actor has to answer in the start phase and how much time per ply the actor has. Afterward, the actors acknowledge and the main game loop starts. The rules are usually defined via the game description language (GDL) (see subsection 2.1.5). The gamemaster initializes the state with the definition given by the rules. To ensure fair conduct, the gamemaster keeps two clocks per actor, the start clock and the game clock. The start clock counts down as soon as the gamemaster sends the message to the agents. It is stopped once the gamemaster receives the acknowledgment. The game clocks count down during the plies of the main game loop if the player is in control. If a clock reaches zero, the match is aborted, and the associated actor is disqualified. Games may be classified into two types, simultaneous- and non-simultaneous moving games. Simultaneous moving games include those where more than one player at a time is in control. On the contrary, non-simultaneous moving games only have exactly one 9 2. Background and Literature Review Start Play Completed Aborted a ¬a ¬t t dnf dsq Figure 2.2: The phases of a match. The filled black state is the entry point, the state with double borders is the final phase and the red state indicates an error state. The transation𝑎, requires all actors acknowledge in time, 𝑡 is applicable if the state is terminal, dnf means that some actor failed to submit move in time, while if some actor submitted an illegal move dsq is applicable. player in control. In general, every non-simultaneous moving game can be expressed as a simultaneous moving game by modeling the inaction of the opponents with a null or no-operation move. The inverse direction is only possible under certain conditions (for more information see subsection 2.1.3 and subsection 2.1.4). During the play loop, each actor whose role is in control, must submit a move. If the move is deemed illegal, the match is aborted. Otherwise, the gamemaster combines all moves to a turn and applies it to the current state, advancing to the next state. Until a terminal state is reached, the play loop repeats. Each role has a goal value associated with it, for every state in the game. The goal value is either an integer or a null value. The goal values of terminal states have to be integers. They are synonymous with the utility of the roles. The actor with the highest utility wins the match. The match switches to the completed phase if it reaches a terminal state. The gamemaster uses the rules and the terminal state to calculate the utility for each role. Finally, it submits the utility to the actors. Afterward, the match has ended. 10 2.1. 
2.1.3 Perfect Information Games

Perfect information games describe a class of games without hidden information. Therefore, every entity involved knows everything that can be known. Each match starts with a defined initial state. The gamemaster informs every actor about the moves of the other roles. As the rules are deterministic, the actors may synchronize their state by applying the turns submitted by the gamemaster. Popular games with perfect information include Tic-tac-toe, Chess and Go.

The formal specification of an 𝑛-player perfect information game requires the following definitions (adapted from [10]):

$$
\begin{aligned}
& S && \text{set of all game states} \\
& s_i \in S && \text{initial state} \\
& S_t \subseteq S && \text{terminal states} \\
& R = \{r_1, \dots, r_n\} && \text{set of the } n \text{ roles} \\
& A && \text{set of all actions} \\
& T = \{\{(r, a) \mid r \in R', a \in A\} \mid R' \in 2^R,\ R' \neq \emptyset\} && \text{set of all turns} \\
& M = R \times A && \text{set of all moves} \\
& \ell \subseteq S \times M && \text{legal relation} \\
& n \subseteq S \times T \times S && \text{next state relation} \\
& g \subseteq S \times R \times (\mathbb{Z} \cup \{\text{null}\}) && \text{goal relation}
\end{aligned}
\tag{2.1}
$$

Elements of chance, such as dice throws or card draws, are modeled with an additional role random. This role selects its move randomly and has the lowest possible goal value. Hidden information cannot be modeled; this requires an extension (see subsection 2.1.4).

Any non-simultaneous game can be translated to a simultaneous one. Without a formal proof: Add a null or no-operation move to 𝐴 (the set of all actions). Let this move be called null. The legal relation ℓ is extended by pairs of all roles and null for every state where the given role is not in control. Furthermore, the turns of the next state relation 𝑛 require null from every role that is not in control.

Generally, simultaneous perfect information games cannot be converted to non-simultaneous games. Simultaneous turns would need to be broken up into sequential turns. Thus, one player would have the advantage of seeing the opponent's moves before having to make their move.

Another example of a simple perfect information game is Nim. Nim was named and extensively studied by Bouton [3]. It is often used, as it is simple to understand and rich in tools for analysis. Nim is a basic alternating two-player game. At the start, there are a defined number of straws and the goal is to take the last straw. A player may take one, two, or three straws from the heap, thereby reducing the number of straws. There are many versions known, with differently sized heaps and additional rules.

To illustrate, consider a variant of Nim with a single heap of five straws. Let this variant be called 5-Nim. Definition 2.2 characterizes the game, adhering to definition 2.1. A state is written as its number of straws 𝑛, set in bold if first is in control and in italics if second is in control. The actions are 𝟙, 𝟚 and 𝟛, taking 1, 2 or 3 straws respectively.
𝑆 = {0, 1, 2, 3, 5} ∪{0, … , 4}𝑠𝑖 = 5𝑆𝑡 = {0, 0}𝑅 = {first, second}𝐴 = {𝟙, 𝟚, 𝟛}ℓ = (𝑆 ∖ {0, 0}) ⨯ (𝑅 ⨯ {𝟙}) ∪(𝑆 ∖ {0, 1, 0, 1}) ⨯ (𝑅 ⨯ {𝟚}) ∪(𝑆 ∖ {0, 1, 2, 0, 1, 2}) ⨯ (𝑅 ⨯ {𝟛})𝑛 = {(5, {(first, 𝟙)}), 4), (5, {(first, 𝟚)}), 3), (5, {(first, 𝟛)}), 2),(4, {(first, 𝟙)}), 3), (4, {(first, 𝟚)}), 2), (4, {(first, 𝟛)}), 1), (2.2)(3, {(first, 𝟙)}), 2), (3, {(first, 𝟚)}), 1), (3, {(first, 𝟛)}), 0),(2, {(first, 𝟙)}), 1), (2, {(first, 𝟚)}), 0),(1, {(first, 𝟙)}), 0),(4, {(second, 𝟙)}), 3), (4, {(second, 𝟚)}), 2), (4, {(second, 𝟛)}), 1),(3, {(second, 𝟙)}), 2), (3, {(second, 𝟚)}), 1), (3, {(second, 𝟛)}), 0),(2, {(second, 𝟙)}), 1), (2, {(second, 𝟚)}), 0),(1, {(second, 𝟙)}), 0), }𝑔 = {(0, first, 1), (0, second, −1),(0, first, −1), (0, second, 1)} ∪(𝑆 ∖ 𝑆𝑡) ⨯ 𝑅 ⨯ 0 To reason about matches, the author of this thesis adapts a notation for histories pro- posed by Schiffel and Thielscher [20]. Histories are prefixed with a unique identifier, features states, which are separated by arrows. These arrows represent the turns, la- beled with moves. For example, consider a history ℎ starting from the state □ and the roles 𝑟1 and 𝑟2 doing the action 𝑎 and 𝑏 respectively resulting in the state ○. From this state 𝑟1 does 12 2.1. General Game Playing the action 𝑎 again and this leads to a terminal state △ with the goal values 1 for 𝑟1 and0 for 𝑟2. Therefore, ℎ ∶ □ 𝑟1∶𝑎,𝑟2∶𝑏−−−−−→ ○ 𝑟1∶𝑎−−→ △ | 𝑟1 ∶ 1, 𝑟2 ∶ 0 would compactly represent this history. As a more concrete example, consider a match of 5-Nim. In this match, first takes one straw from the heap and second answers by taking another straw from the heap. Finally, first wins the game by taking the final three straws. Compactly:ℎ𝟙,𝟙,𝟛 ∶ 5 first∶𝟙−−−→ 4 second∶𝟙−−−−−→ 3 first∶𝟛−−−→ 0 | first ∶ 1, second ∶ −1 2.1.4 Imperfect Information Games Opposed to perfect information games, are imperfect information games, which describe a class of games with hidden information. Therefore, at least one actor, does not know everything that can be known. The gamemaster is a clairvoyant observer. The informa- tion visible to the actors, is defined by the rules. Moves of their opponents are no longer submitted to the actors. Instead, only the visible parts of the state are shared. An actor may reconstruct the possible states by applying all legal moves and subsequently com- paring which states are compatible with the visible information. Popular games with hidden information include Battleship and Poker. Formally, a view is a partial state. A partial state only defines an assignment for visible atoms, all non-visible atoms may be true or false. The union of all partial states may be total or partial in respect to the actual state. The formal specification of imperfect information games expands the definition 2.1. The expansion includes the representation of views. It includes a new relation 𝑣. The relation maps a partial state to each role and state.𝑣 ⊆ 𝑅 ⨯ 𝑆 ⨯ 𝑉 view relation (2.3)𝑉 is the set of all partial states 𝑉 = {2𝑠 | 𝑠 ∈ 𝑆}. It is the power set of every state in 𝑆. Imperfect information games introduce a shift of focus from the state to a set of pos- sible states. In general, the number of possible states is exponential. Therefore, the computational effort is increased. Games with imperfect information can be freely translated between simultaneous- to non-simultaneous-moving formulations. 
Without a formal proof, the non-simultaneous moving formulation can be translated akin to perfect information games, where the non-action is explicitly replaced by a null or no-operation move. Simultaneous moving formulations may break up turns with multiple substates, where the effect of each move is not visible to the involved actors. 13 2. Background and Literature Review To illustrate the formal definition of imperfect information games, consider the game Phantom-Nim, a variant of Nim where the participants cannot directly see which size the heap has. Taking more straws than available, leads to a loss. The changes required for 5-Phantom-Nim, a variant with five initial straws, are listed in definition 2.4. When a term is omitted, definition 2.2 still applies. 𝑉 = {h, h}𝑣 = (𝑅 ⨯ {0, 1, 2, 3, 5} ⨯ {h}) ∪(𝑅 ⨯ {0, … , 4} ⨯ {h})ℓ = 𝑅 ⨯ 𝐴 (2.4)𝑛 = 𝑛′ ∪ {(2, {(first, 𝟛)}), 0),(1, {(first, 𝟛)}), 0), (1, {(first, 𝟚)}), 0),(2, {(second, 𝟛)}), 0),(1, {(second, 𝟛)}), 0), (1, {(second, 𝟚)}), 0)} Where 𝑛′ is the next state relation of definition 2.2. A state where first is in control has the view h and a state where second is in control has the view h. To illustrate, consider a match of Phantom-5-Nim. A clairvoyant observer sees that at the first ply first takes one straw, afterward second also only takes one straw, which first answers with again taking one straw. Finally, second wins the game by taking the last straws. Therefore, the canonical history for the match is:ℎ𝟙,𝟙,𝟙,𝟚 ∶ 5 first∶𝟙−−−→ 4 second∶𝟙−−−−−→ 3 first∶𝟙−−−→ 2 second∶𝟚−−−−−→ 0 | first ∶ −1, second ∶ 1 However, after the first ply, the possible histories for second are: ℎ𝟙 ∶ 5 first∶𝟙−−−→ 4ℎ𝟚 ∶ 5 first∶𝟚−−−→ 3ℎ𝟛 ∶ 5 first∶𝟛−−−→ 2 After second made their move, one of three histories may be possible for first. ℎ𝟙,𝟙 ∶ 5 first∶𝟙−−−→ 4 second∶𝟙−−−−−→ 3 14 2.1. General Game Playing ℎ𝟙,𝟚 ∶ 5 first∶𝟙−−−→ 4 second∶𝟚−−−−−→ 2ℎ𝟙,𝟛 ∶ 5 first∶𝟙−−−→ 4 second∶𝟛−−−−−→ 1 On the third ply, only three histories are possible for second. ℎ𝟙,𝟙,𝟙 ∶ 5 first∶𝟙−−−→ 4 second∶𝟙−−−−−→ 3 first∶𝟙−−−→ 2ℎ𝟙,𝟙,𝟚 ∶ 5 first∶𝟙−−−→ 4 second∶𝟙−−−−−→ 3 first∶𝟚−−−→ 1ℎ𝟚,𝟙,𝟙 ∶ 5 first∶𝟚−−−→ 3 second∶𝟙−−−−−→ 2 first∶𝟙−−−→ 1 From the information second received, they can infer that first did not take three straws from the heap; Otherwise, the match would already be over. The match ends after second takes the last straws. From the perspective of first, two closed histories are possible, ℎ𝟙,𝟙,𝟙,𝟚 ∶ 5 first∶𝟙−−−→ 4 second∶𝟙−−−−−→ 3 first∶𝟙−−−→ 2 second∶𝟚−−−−−→ 0 | first ∶ −1, second ∶ 1ℎ𝟙,𝟚,𝟙,𝟙 ∶ 5 first∶𝟙−−−→ 4 second∶𝟚−−−−−→ 2 first∶𝟙−−−→ 1 second∶𝟙−−−−−→ 0 | first ∶ −1, second ∶ 1 where first cannot know which of those two is the canonical history. 2.1.5 Game Description Language Game description language (GDL) or sometimes game definition language is a formal language to describe a deterministic finite state machine. It is used to define a game more consisely (compare to definition 2.1 and 2.3). It was introduced by Genesereth, Love, and Pell [10] and subsequently extended by Thielscher [27] to allow for games with imperfect information. This subsection adapts the definitions of the cited sources. It reviews the syntax, and gives possible definitions of 5-Nim and Phantom-5-Nim. It discusses how it is evaluated, and what the advantages of formulating games in GDL are. The language is logic-based, making its syntax similar to first-order logic. A definition of a game consists of facts and rules. The facts build the base. They are relations that hold statically. 
A rule consists of a head, which is also a relation and a body. The head holds if the body— a conjunctive clause of relations, holds. Rules can depend on facts, or on other rules. The language is a subset of Datalog, which itself is a subset of Prolog [2, 10, 14]. The syntax is commonly in infix (head :- body(1), body(2).) or prefix 15 2. Background and Literature Review ((=> head ((body 1) (body 2)))) notation. This thesis uses the infix notation. Relations are denoted in the name/arity notation, where name is the name of the relation and arity the number of arguments of that relation. Some relations are reserved to describe the components of the game. The relation role/1 defines which entities can do actions. The state has three keywords associated with it, init/1, true/1 and next/1. The initial state is described by init/1, while the current state is described by true/1. Valid GDL-definitions require next/1 to only be used in the head of rules. It indicates which atoms become true in the next state. Frequently, these rules depend on true/1 and does/2 relations. The moves of the roles during a ply are considered facts and are added as does/2 relations. The resulting next/1 relations are the next state’s true/1 relations. The special true(control/1) relation indicates which players are in control. Accordingly, next(control/1) describes who is in control in the next ply, and init(control/1) describes which players are in control at the initial state. The three relations legal/2, terminal/0 and goal/2, are also reserved, however, unre- lated to the next state. A move is legal, precisely if the first argument of the relation legal/2 contains the role, and the second the action. The atom terminal/0 is only true if the state is terminal. For these states, goal/2 indicates the utility of a role, where the first argument is the role, and the second an integer value (by convention in the range of 0 − 100). In comparison to programming languages, GDL is low-level in scope. The scope explicitly excludes arithmetic operations. Should arithmetic operations be needed, they have to be defined via the rules. Mittelmann and Perrussel [18] proposed an extension called GdlZ, that expands GDL with integer arithmetic. The only higher level relation available is distinct/2. The relation is true if the two arguments do not unify (viz. are different). Intentionally, GDL is a compact representation of definition 2.1. The notation is still transparent while allowing evaluation by an interpreter. Moreover, it is succinct, as all states do not need to be enumerated. To illustrate, consider the game of 5-Nim (see definition 2.2). Listing 2.1 defines 5-Nim in GDL. The roles are defined in line 1. The initial state is defined by lines 3 and 4. The actions are enumerated in line 26 and the legal actions in lines 38 to 41. The state transition is handled in lines 28 to 36, where 28 to 31 defines that the moves are alternating and 33 to 36 the effects of taking from the heap. An empty heap leads to a terminal state, which is defined on line 43. The goal values for non-terminal states are defined in 45 to 47. The losing conditions are listed in 49 to 51, while the winning conditions are stated on lines 53-57. Lines 6 to 24 define the arithmetic needed for taking from the heap. The style of programming in GDL is as opinionated as any other programming language. For example, in the lines 46, 50 and 54, it is possible to substitute true(heap(0)) with terminal. 
The author of this thesis argues that it should not be, as the goal value does not actually depend on the completion of the game but the size of the heap. Another 16 2.1. General Game Playing 1 role(first). role(second). 2 3 init(control(first)). 4 init(heap(5)). 5 6 succ(0,1). succ(1,2). succ(2,3). 7 succ(3,4). succ(4,5). 8 9 num(N) :- succ(N,N'). 10 num(N') :- succ(N,N'). 11 12 plus(S,0,S) :- num(A). 13 plus(A,1,S) :- succ(A,S). 14 plus(A,B',S') :- 15 succ(B,B'), 16 succ(S,S'), 17 plus(A,B,S). 18 plus(B,A,S) :- plus(A,B,S). 19 20 le(A,A) :- num(A). 21 le(A,B) :- lt(A,B). 22 23 lt(A,B) :- succ(A,B). 24 lt(A,C) :- succ(B,C), lt(A,B). 25 26 action(1). action(2). action(3). 27 28 next(control(R)) :- 29 true(control(R')), 30 role(R), role(R'), 31 distinct(R,R'). 32 33 next(heap(N)) :- 34 does(R,A), 35 true(heap(N')), 36 plus(N,A,N'). 37 38 legal(R,A) :- 39 true(heap(N)), 40 role(R), action(A), 41 le(A,N). 42 43 terminal :- true(heap(0)). 44 45 goal(R,0) :- 46 not true(heap(0)), 47 role(R). 48 49 goal(R,-1) :- 50 true(heap(0)), 51 true(control(R)). 52 53 goal(R,1) :- 54 true(heap(0)), 55 true(control(R')), 56 role(R), role(R'), 57 distinct(R,R'). Listing 2.1: A possible definition of 5-Nim using GDL. example is the commonly seen redundant pattern to define that the legality of an action is conditional on the role being in control. The gamemaster only allows roles in control to submit a move, which makes the inclusion of the clause redunant. The relation of GDL-definitions and logic programs leads to parallels in how they are evaluated. Two distinct approaches are possible, forward evaluation and evaluation via backtracking. Forward evaluation, or sometimes Herbrand interpretation, gathers a set of true relations, the facts. The set of true relations is expanded by all rules that are true based on the current true relations. This is done iteratively until a fixpoint is reached. Backtracking-based evaluation, starts from a relation, the query. All rules compatible with the query are checked. If a rule’s head unifies with a query, it is compatible. A relation is supported if there is a compatible rule and the body of this rule is supported. The backtracking-evaluation picks an untried arbitrary rule, with a compatible head, and adds it to the query. If everything is properly supported, the initial query is true. 17 2. Background and Literature Review Should a contradiction occur, the query backtracks until the contradiction resolves. If a query cannot be supported, it is false. Both forward and backtracking-based evaluation have relevant use cases. The perfor- mance is highly dependent on the game. Backtracking is usually better if only a few relations need to be checked, while it suffers compared to forward evaluation in the con- trary case. Both can be abstracted and used interchangeably, depending on the usecase. In fact, successful agents decide which technique they use, based on the task [2, 22, 28]. Defining games in a formal language, allows them to be executed, reasoned about, and compared to each other. Executing (or playing) the game does no longer need a sepa- rate implementation, a GDL-interpreter can use the game’s definition to do that. This generality comes at the cost of computational overhead. Another advantage of this rep- resentation of games is the formal reasoning that is possible. The definition can be checked if it satisfies certain requirements. For example, playability, which is the notion if there is a state where a role cannot move. 
Another example would be winability, it requires the existence of a series of moves that results in the winning condition of a role [11]. Similarly, the differences between games can be highlighted by the differences in their definitions. If a game is more difficult to play, its definition might include more relations and rules than a simpler example, or more legal moves are possible in a given state, which increases the size of the search space. The formal description of GDL does not allow for games with elements of chance, such as dice throws or hidden information akin to the hand of other players in a card game. To address this shortcoming, Thielscher [27] proposed game description language with imperfect information (GDL-II) (sometimes incomplete instead of imperfect). Since then, Schiffel and Thielscher [20] expanded the GGP-protocol to accommodate the changes. GDL-II is evaluated just the same as standard GDL. GDL-II has only one new relation, sees/2 and a special role called random. A role perceives a relation, if the relation is the second argument of sees/2 and the role is the first argument. This relation is the equivalent to the view relation 𝑣 of definition 2.3. To incorporate elements of chance, the role random chooses arbitrarily from the possible moves. Listing 2.2 defines the additional (or changed) rules for Phantom-5-Nim. The definitions for the two games share most of their rules. Lines 4 to 16 define that if a role takes more than is available on the heap, they are still in control in the next round, otherwise the control alternates. Lines 20 to 23 specifies that taking more than is available from the heap leads to an empty heap. Lines 25 and 26 define which roles see who is in control (if they are in control). Finally, line 28 expands the legal actions to all actions. 18 2.2. Games as Trees 1 % Lines 1-26 from 5-Nim 2 3 next(control(R)) :- 4 does(R',A), 5 true(control(R')), 6 true(heap(N)), 7 role(R), role(R'), 8 distinct(R,R') 9 le(A,N). 10 11 next(control(R)) :- 12 does(R,A), 13 true(control(R)), 14 true(heap(N)), 15 lt(N,A). 16 17 % Lines 33-36 from 5-Nim 18 next(heap(0)) :- 19 does(R,A), 20 true(heap(N)), 21 lt(N,A). 22 23 sees(Everyone,control(R)) :- 24 role(R), role(Everyone). 25 26 legal(R,A) :- role(R), action(A). 27 28 % Lines 43-57 from 5-Nim Listing 2.2: A possible definition of Phantom-5-Nim using GDL. 2.2 Games as Trees Games can be viewed as trees. This representation lends itself to using tree search methods. Tree search methods are studied in many fields of computer science. This section discusses trees in the context of perfect information games (see subsection 2.2.1) imperfect information games (see subsection 2.2.2) and a common pitfall of tree search methods called the horizon effect (see subsection 2.2.3). 2.2.1 Perfect Information Games The following paragraphs review how trees can be used to define games and how they are used to find the best move. As complete search is infeasible for large enough games, this subsection also discusses remedies for this limitation, and quantifies a good agent. The nodes of a perfect information game’s tree, hold the state and the edges are labeled with the turns. The tree is rooted in the initial state. From this root node, each possible move leads to a child node holding the resulting state. The leafs of the game tree hold the terminal states. This encodes definition 2.1. For an example of a game tree for the game 5-Nim, see fig. 2.3. The tree may be used by an agent to calculate the best move. 
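The paragraph that follows spells out how this calculation works. As a purely illustrative sketch (assuming a hypothetical GameTreeNode type and a two-player zero-sum setting such as 5-Nim, not code from the engine described in chapter 4), the backward induction could look as follows.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical placeholder types, for illustration only.
Role = str
Move = str

@dataclass
class GameTreeNode:
    in_control: Role | None = None     # role that has to move; None in terminal states
    utilities: dict[Role, int] = field(default_factory=dict)  # only defined for terminal states
    children: dict[Move, GameTreeNode] = field(default_factory=dict)

def value_for(node: GameTreeNode, role: Role) -> int:
    """Utility of `node` from the point of view of `role`, by backward induction.

    Terminal nodes return their stored utility. For non-terminal nodes, the value
    of a move is the value of the resulting child; the role in control picks the
    move that is best for itself, which in a two-player zero-sum game is the worst
    case for the other role.
    """
    if not node.children:
        return node.utilities[role]
    child_values = [value_for(child, role) for child in node.children.values()]
    return max(child_values) if node.in_control == role else min(child_values)
```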
The utility of a move is equivalent to the role’s utility of the resulting state. Usually, only terminal states have a defined utility. For non-terminal states, the utility is equivalent to the value of the best move. Crucially, the value of a move depends on which player is in control. Furthermore, if a tree is too large, enumerating all non-terminal states might be infeasible. In these cases, heuristics can be used to approximate the utility. 19 2. Background and Literature Review 5 4 3 2 𝟙 𝟚 𝟛 3 2 1 𝟙 𝟚 𝟛 2 1 0 𝟙 𝟚 𝟛 1 0 𝟙 𝟚 2 1 0 𝟙 𝟚 𝟛 1 0 𝟙 𝟚 0 𝟙 1 0 𝟙 𝟚 0 𝟙 0 𝟙 1 0 𝟙 𝟚 0 𝟙 0 𝟙 0 𝟙 0 𝟙 Figure 2.3: The game tree of 5-Nim. The round and blue nodes denote states where first is in control and rectangular and red nodes denote states where second is in control. The top-most node is the initial state and the states with a double border are terminal states. The edges represent the moves. Fat, blue and dashed edges are actions done by first. Thin and red edges are actions done by second. The label of the edge represents the arithmetic operation on the heap. In a match of a game with perfect information, every player views the same game tree. Each player may use this tree to find the best move. However, for most games, searching for the optimal move becomes akin to a search for the needle in the haystack, as game trees are prohibitively large for non-trivial games. Each state has at least one node, and each possible move results in a branch. There are too many nodes and branches to consider for a complete search. To remedy this limitation, agents build their own tree, called a search tree. Starting from the root node, the tree is built by expanding nodes. The expansion is done by calculating every possible follow-up state from the combinations of legal moves. The states correspond to the nodes and the moves correspond to the edges. Step-by-step, a subtree of the game tree is built. A good agent does not only consider more nodes (viz. handles a larger search-tree), but is also cautious about what parts of the tree are considered. Traditionally, this is done via heuristics, handcrafted by human-expert knowledge. Recently, these were replaced by better performing neural networks. They are trained by reinforcement learning, where the data is generated via self-play [26]. After a ply, the agents never need to ascend the game tree again. The tree is rerooted after each ply to the current node. As the current node holds all the available information, nothing is lost by pruning the tree above the current ply level. 20 2.2. Games as Trees 2.2.2 Imperfect Information Games The trees for perfect and imperfect information games fundamentally differ. The hidden information introduces the knowledge of a state. Encoding knowledge into the tree complicates the data structure. This subsection introduces the difference between perfect and imperfect information game trees, using the running example of Nim. Furthermore, it discusses the theoretical underpinnings for these differences and how this affects agents playing against each other. For games with imperfect information, agents can no longer accurately determine in which branch of the game tree they are. They cannot observe their opponent’s moves, and they do not view the full state, therefore, multiple nodes (and by extension states) need to be considered. Their states are indistinguishable for this player. See this visual- ized for the game Phantom-5-Nim in fig. 2.4. A set of states that is indistinguishable by a player is called an information set. 
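To make this concrete, the following is a minimal Python sketch of one way such an information set could be held in a search-tree node. The State and Move aliases are hypothetical placeholders, and the sketch is illustrative only, not the data structure used by the engine in chapter 4.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical placeholder types, for illustration only: a State is any hashable
# representation of a game state and a Move is a (role, action) pair.
State = str
Move = tuple[str, str]

@dataclass
class InformationSetNode:
    """A search-tree node for imperfect information games.

    Instead of a single state, the node holds the set of states that are
    indistinguishable for the observing player.
    """
    information_set: set[State] = field(default_factory=set)
    # Edges are keyed by the assumed state together with the move.
    children: dict[tuple[State, Move], InformationSetNode] = field(default_factory=dict)

    def child(self, assumed_state: State, move: Move) -> InformationSetNode:
        """Return the child reached by `move` when `assumed_state` is the actual state."""
        key = (assumed_state, move)
        if key not in self.children:
            self.children[key] = InformationSetNode()
        return self.children[key]
```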
This concept forms the basis of the search tree for games with imperfect information. De- pending on the assumed state, a move of a player might have different effects. Therefore, a single move might point to different information sets. To differentiate, they are only ever meaningful together with the assumed state. In imperfect information games, multiple different histories are possible. By contrast, in perfect information games, there is only one history. The current state can be recon- structed by applying the moves in order. In imperfect information games, moves are only partially observable, and only a partial state is visible. Therefore, multiple states may be consistent with the available information. Histories implicitly describe information sets [23]. An information set is the set of all states of all indistinguishable histories. Histories are important for navigating the search tree. They describe which branches need to be taken to reach the current node of the tree. Searching games with hidden information requires a new kind of data structure for search trees. The search tree’s nodes no longer hold one single state, but a set of indistinguishable possible states— the information set. The edges are labeled with the assumed state and the moves, as the effect of a turn can differ depending on the actual state. In the case some players have different views, their respective search tree differs, contrary to perfect information games, where full knowledge is assumed. In perfect information games, a node holds the actual state. While, in imperfect information games, the actual state is some state of the intersection of each player’s information set. 2.2.3 Horizon Effect in Games Tree search algorithms are susceptible to problems, that surface as anomalies in playing strength. One such difficulty is known as the horizon effect. This subsection reviews 21 2. Background and Literature Review 5 4 3 2 𝟙 𝟚 𝟛 3 2 1 𝟙 𝟚 𝟛 2 1 0 𝟙 𝟚 𝟛 1 0 0 𝟙 𝟚 𝟛 0 0 0 𝟙 𝟚 𝟛 0 0 0 𝟙 𝟚 𝟛 1 0 0 𝟙 𝟚 𝟛 0 0 0 𝟙 𝟚 𝟛 0 0 0 𝟙 𝟚 𝟛 2 1 0 𝟙 𝟚 𝟛 1 0 0 𝟙 𝟚 𝟛 0 0 0 𝟙 𝟚 𝟛 0 0 0 𝟙 𝟚 𝟛 1 0 0 𝟙 𝟚 𝟛 0 0 0 𝟙 𝟚 𝟛 Figure 2.4: The game tree of Phantom-5-Nim. The round and blue nodes denote states where first is in control and rectangular and red nodes denote states where second is in control. The top-most node is the initial state and the states with a double border are terminal states. The edges represent the moves. Fat, blue and dashed edges are actions done by first. Thin and red edges are actions done by second. The label of the edge represents the arithmetic operation on the heap. 22 2.3. Monte Carlo Tree Search what the horizon effect is, where the term was coined, and discusses why it is an issue to keep in mind for games with imperfect information. The search tree is only a part of the full game tree. At best, it is still sufficient for finding the optimal move, in the worst case, it produces inaccurate results. These inaccurate results are caused by the missing information the search is blind to. In some games, these blind spots can be mitigated by strong heuristics. Still, however accurate the heuristic is, it is an approximation of the actual utility. Naturally, an agent using a search tree (that is partial in respect to the full game tree) has a horizon. Whatever lies beyond this horizon— a good move, a trap laid by the opponent, might ultimately be opaque for the agent. This phenomenon was first named by Berliner [1], and is called the horizon effect. Berliner outlines a negative and a positive horizon effect. 
Even though the naming suggests that sometimes the effect is beneficial for playing strength, it usually is not. The negative horizon effect broadly describes cases where a bad move initially appears deceptively strong. The move has a high heuristic value, but due to some kind of downside that typically only appears after some plies, it is of low utility to the player. In contrast, the positive horizon effect includes moves that have an upside for which searching deeper reveals an alternative with an even greater upside.

While the initial definition of the horizon effect was meant for Chess programs, it describes a phenomenon found in many fields of artificial intelligence. If the search depth is limited, unpredictable discrepancies between the actual utility and a heuristic can occur.

The author of this thesis argues that games with imperfect information and high branching factors have a greater tendency to suffer from the horizon effect. While it has not been a major focus of previous work, the literature generally agrees [5]. One possible explanation is that the heuristic has even less information to base its evaluation on, and anything beyond the horizon may be hidden among an exponential number of possible states.

2.3 Monte Carlo Tree Search

The Monte Carlo tree search (MCTS)-algorithm is a procedure to efficiently generate a search tree from the rules of a game. It is used for games, and in other fields such as planning, chemical synthesis, and security [26]. Due to properties of the algorithm, it is particularly interesting for GGP. Namely, it does not need a heuristic and only needs the rules of a game to be applicable. In the following subsections, the author of this thesis reviews the core algorithm and four extensions for games with imperfect information. Subsection 2.3.1 gives the principles of operation for all MCTS variants. Determinizing MCTS is the first discussed simple extension for imperfect information games (see subsection 2.3.2). Cowling, Powley, and Whitehouse [8] introduce three variants of information set Monte Carlo tree search (ISMCTS). The first of those three is reviewed in subsection 2.3.3, the second in subsection 2.3.4, and the last in subsection 2.3.5.

2.3.1 Principle of Operation

While there are many resources [4, 26] covering the details of MCTS and its modifications, the implementation of the core principles differs from author to author. This subsection discusses the most important aspects for implementing MCTS variants and how to configure them for different kinds of games. It spans the four characteristic phases, and goes into detail for UCT, the primary algorithm for the selection phase. Lastly, it discusses the susceptibility towards the horizon effect.

MCTS is divided into four phases: selection, expansion, simulation and backpropagation. During selection, the procedure walks down the tree until a leaf node is met. The children of this node are linked in the expansion phase, turning the leaf into a branch. The simulation phase evaluates the node, usually by simulating random plays starting from the current state. By propagating upwards until the root is reached, the backpropagation phase updates the tree, thus completing one step of the procedure.

The goal of the selection phase is to determine which of the available leaf nodes is the most worthwhile to expand.
A leaf node is considered worthwhile to expand either if the state of the node has a high valuation or if there is still high uncertainty about the state. How these are quantified depends on the configuration. Selecting nodes with high valuation is called exploitation, while choosing nodes with high uncertainty is considered exploration.

The procedure starts at the root node and observes all children. If the current node is a leaf node, the selection phase ends. Otherwise, each child is assigned a weight. Considering these weights, a selector decides on a node. The procedure then repeats iteratively until a leaf node is reached.

Depending on the game, the selector and the assignment of the weights may be configured, which in turn configures the algorithm's preference for exploitation or exploration. Simple selectors include minimum or maximum selectors that always pick the node with the lowest or highest weight, or random selectors where nodes with higher weight have a higher probability of selection [7].

A common way to assign weights is the UCT algorithm introduced by Kocsis and Szepesvári [13]. The UCT algorithm quantifies the degree of exploration and exploitation by assigning a value 𝑢 to each node that depends on the valuation and the number of times the node has been valuated in comparison to its siblings, and selecting the node with the highest score [13, 26]. The value for 𝑢 is given by evaluating the function

\[ \operatorname{ucb}(n) = Q(n) + C \cdot \sqrt{\frac{\ln N(n_{\mathrm{pred}})}{N(n)}} \]

where 𝑄(𝑛) is the valuation of the node 𝑛, 𝐶 is the exploration weight (usually decided empirically), 𝑁(𝑛) is the number of iterations done in 𝑛, and 𝑛pred denotes the parent of node 𝑛 (see also [26, 6, eq. 2]). A small code sketch of this selection rule is given at the end of this subsection.

In the expansion phase, the procedure creates new nodes from the selected leaf node's state and links them as children, thereby transforming the leaf node into a branch node. Due to the selection phase, the expansion of the tree is asymmetric. If the weights and selector of the algorithm prefer exploration, the resulting tree is more shallow and symmetric, whereas a high preference for exploitation produces deep and asymmetric trees.

During the simulation phase, the algorithm repeatedly applies random turns to the state of a node until a terminal state is reached. The terminal state's utility is then used as a valuation for the initially selected node. While a single simulation might not be meaningful, the algorithm repeatedly simulates, thus increasing confidence. Using simulations as valuations for the nodes divorces the search algorithm from the need for a heuristic. Heuristics usually estimate the utility for a given state, while simulations quantify the possibilities of a win. If a state has a high valuation, there exist enough sequences of moves that result in a win, which indicates a high utility.

In the backpropagation phase, the procedure updates the tree with the information gained during the simulation phase. It starts at the node chosen during the selection phase and uses the link to the parent to walk up the tree. Each node is updated until the current root is reached. At this point, one iteration of the procedure is completed.

While MCTS is not immune to the horizon effect (see section 2.2.3), it is far less susceptible to it. The horizon effect is more likely to occur if the heuristic valuation and the actual utility of a node differ greatly [1]. As MCTS uses simulations, the heuristic is gradually improved, mitigating the horizon effect.
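To make the selection rule concrete, the following is a minimal Python sketch of UCT-based child selection. It is an illustration, not the implementation described in chapter 4: the node attributes (visits, value_sum, children, parent) and the default exploration weight are assumptions chosen for the example.

    import math
    from typing import Optional


    class Node:
        """Minimal search-tree node for this sketch: visit count, summed utility, children."""

        def __init__(self, parent: Optional["Node"] = None) -> None:
            self.parent = parent
            self.children: list["Node"] = []
            self.visits = 0       # N(n)
            self.value_sum = 0.0  # accumulated utility, so Q(n) = value_sum / visits


    def ucb(node: Node, exploration: float = math.sqrt(2)) -> float:
        """ucb(n) = Q(n) + C * sqrt(ln N(n_pred) / N(n)); unvisited nodes are tried first."""
        if node.visits == 0:
            return math.inf
        exploitation = node.value_sum / node.visits
        return exploitation + exploration * math.sqrt(math.log(node.parent.visits) / node.visits)


    def select_child(node: Node, exploration: float = math.sqrt(2)) -> Node:
        """Selection step: among the children of a branch node, pick the highest ucb value."""
        return max(node.children, key=lambda child: ucb(child, exploration))

The exploration weight defaults to the square root of two here purely as a common textbook choice; as noted above, 𝐶 is usually decided empirically per game.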
2.3.2 Determinizing Monte Carlo Tree Search

On its own, MCTS cannot handle imperfect information games. It assumes the search to be on a tree with a state per node and a move per edge. In imperfect information games, however, the agent has to handle a multi-edge search tree with multiple possible states in each node and a state-turn pair as edge label. A simple extension to handle this shortcoming is determinizing MCTS. This subsection conceptually reviews the algorithm, why it ultimately is not suited for imperfect information games in the context of GGP, and why it is still a considerable alternative for certain games.

Determinizing MCTS dissects an imperfect information game tree into several perfect information game trees. Starting from the root node, each possible state is assigned a separate tree. Before an iteration, one of those trees is selected. The results of each tree are then aggregated. The move with the most visits over all trees is chosen as the selected move.

As alluringly simple as this procedure is, it suffers from a grave weakness, referred to in the literature as strategy fusion [8]. Strategy fusion describes the phenomenon that averaging over all possibilities leads to weak playing performance. The weakness surfaces when the optimal strategy depends on the fact that the hidden information cannot be known.

To illustrate, consider the game Mini-Single-Call. Mini-Single-Call is a single-player game with randomness and hidden information. First, the player has to decide whether they want to place a bet or fold. If they fold, the utility is +5. If the player bets, they are dealt a card with one of two equally likely suits, red or black. Without looking at the card, the caller must decide if they call red or black. Both moves end the game. If the player calls the same suit as the dealt card, they receive a utility of +10; if they guess wrongly, −10. See fig. 2.5 for an illustration of the game tree.

Figure 2.5: The game tree of Mini-Single-Call.

The best move is clearly not placing the bet at the beginning, as this results in a sure +5 utility. Placing a bet, on the other hand, forces the outcome to be either +10 half of the time or −10 the other half of the time. The expected utility is 0. Nonetheless, determinizing MCTS would select to bet, due to strategy fusion. Strategy fusion occurs because the two determinized trees both have the optimal strategy of betting and then selecting the correct call. When combining the strategies, the branch of betting seems to promise a utility of +10. However, as the agent cannot see what card has been dealt, it is unable to call the correct suit with certainty; therefore the strategy is suboptimal.

Experiments indicate that determinizing MCTS is still competitive in games where strategy fusion is not a major factor [8]. Determining whether a game is susceptible to strategy fusion is difficult a priori. There are methods of indirectly measuring it [16], but this is impractical for non-trivial games.

A major benefit of determinizing MCTS is the relatively low runtime cost. In comparison to more sophisticated algorithms (see the subsequent sections), Cowling, Powley, and Whitehouse [8] report up to more than double the number of iterations per time frame for certain games.
If strategy fusion is not a factor, this may result in better playing performance when the calculation time is fixed.

2.3.3 Single Observer-Information Set Monte Carlo Tree Search

Single observer-information set Monte Carlo tree search (SO-ISMCTS) is part of the family of algorithms referred to as ISMCTS [8]. These algorithms abstract the ideas of MCTS to build information set trees instead of game trees. This subsection reviews the changes required to the core MCTS algorithm, namely how the four phases are affected. It then reviews the advantages and drawbacks regarding playing strength and runtime cost.

Instead of splitting the information set's states into the roots of multiple trees, SO-ISMCTS directly samples the possible states. Starting at the root, the procedure selects one possible state and only considers edges of the multi-edge tree where the label is compatible with the selected state.

A node no longer has only a single state which can be used to valuate it. There are multiple ways of dealing with this. Cowling, Powley, and Whitehouse [8] suggest determinizing the tree at the root node before each iteration. In this way, the four phases of MCTS do not need to be altered conceptually. The agent in control determinizes the tree by selecting one of the possible states of the root node. From the root, the edges labeled with pairs of states and moves lead to the children. Every edge whose label does not contain this determinization is incompatible. Those children with a compatible in-going edge are considered. The selection process then continues as in perfect information MCTS.

The selected node is then expanded. In perfect information MCTS this is straightforward: each node only contained a single possible state, the edges corresponded to the turns, and each follow-up state was assigned its child node. The expansion in the new search tree for imperfect information games is different. As the information set may contain a prohibitively large number of possible states, it is not tractable to enumerate them all and consider each state. Instead, only the determinized state is used. When expanding a node via a determinized state, existing children are either populated or new nodes are created. A node may be a leaf node in one determinization, but not in another. The determinization approach ensures that after selecting the appropriate leaf node, expansion, simulation and backpropagation are done as in perfect information MCTS.

While SO-ISMCTS resolves the drawback of not being directly applicable to imperfect information games, it still suffers from strategy fusion. Fundamentally, SO-ISMCTS does not differ greatly from determinizing MCTS. The difference is the accumulation of data. Determinizing MCTS first splits the information sets into perfect information trees and later recombines the resulting search trees, while SO-ISMCTS directly operates on the information set tree. They both produce similar results; however, SO-ISMCTS is more efficient in allocating iterations.

In the experiments done by Cowling, Powley, and Whitehouse [8], SO-ISMCTS is comparable to determinizing MCTS. At the same number of iterations, it is slightly more competitive. Its obvious downside is the weak opponent model, which is why it is weaker than multiple observer-information set Monte Carlo tree search (MO-ISMCTS) (see subsection 2.3.5).
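As both determinizing MCTS and SO-ISMCTS evaluate fixed determinizations, the strategy fusion effect in the Mini-Single-Call example from subsection 2.3.2 can be reproduced with a few lines of arithmetic. The sketch below is purely illustrative and does not model either algorithm in full; the variable names are invented for the example.

    # Utilities of Mini-Single-Call (see subsection 2.3.2 and fig. 2.5).
    FOLD_UTILITY = 5
    CORRECT_CALL = 10
    WRONG_CALL = -10

    # Within each determinization the dealt card is fixed and effectively visible,
    # so the best line is to bet and call the known suit: each determinized tree
    # values "bet" at +10.
    value_bet_red = CORRECT_CALL
    value_bet_black = CORRECT_CALL

    # Averaging the determinized results still yields +10 for "bet" versus +5 for
    # "fold", so a determinization-based search prefers betting.
    fused_value_bet = (value_bet_red + value_bet_black) / 2

    # The true expected value of "bet" for a player who cannot see the card is 0,
    # because the call is a coin flip between +10 and -10.
    true_value_bet = 0.5 * CORRECT_CALL + 0.5 * WRONG_CALL

    print(f"fold: {FOLD_UTILITY}, fused bet: {fused_value_bet}, true bet: {true_value_bet}")
    # fold: 5, fused bet: 10.0, true bet: 0.0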
2.3.4 Single Observer-Information Set Monte Carlo Tree Search with Partially Observable Moves

Single observer-information set Monte Carlo tree search with partially observable moves (SO-ISMCTS+POM) is an evolution of the ISMCTS variants introduced by Cowling, Powley, and Whitehouse [8]. This subsection discusses the differences between SO-ISMCTS+POM and SO-ISMCTS. Furthermore, it lists the advantages and disadvantages in comparison to the other reviewed MCTS variants, including performance considerations.

The continuation SO-ISMCTS+POM alleviates some shortcomings of its predecessors. It no longer differentiates between moves that result in outcomes with the same view, thereby removing the assumption that the opponents' moves are fully observable. It no longer suffers from strategy fusion, but significantly weakens the model of the opponents.

To accurately model partially observable moves, the search tree is extended by distinguishing between two kinds of nodes: visible and hidden nodes. Visible nodes represent plies where the observer is in control, and are updated with the current view of the game state. Because the observer can distinguish its own moves, the edges can be differentiated. Hidden nodes, on the other hand, represent plies where it is opaque to the observer which moves were made.

The player in control observes the current partial game state and determinizes from the possible states. Each incompatible edge remains unconsidered. One of the children is selected. Should this child be a visible node, the selection procedure continues in the same way. If this child is a hidden node, the selection procedure randomly selects a compatible edge.

Expansion, simulation and backpropagation are done as in SO-ISMCTS and perfect information MCTS. Expanding the leaf node selected in the final step is done lazily via the determinization. Simulation is carried out by random playouts. Backpropagation aggregates the already present data and amends it with the new result of the simulation.

The advantages of SO-ISMCTS+POM over SO-ISMCTS are minimal. Their playing strength is similar, with SO-ISMCTS+POM having a tendency to be slightly less competitive at the same number of iterations [8]. In games where there is limited interaction between opponents, the playing strength should be closer to MO-ISMCTS (see subsection 2.3.5).

2.3.5 Multiple Observer-Information Set Monte Carlo Tree Search

The last of the three ISMCTS algorithms introduced by Cowling, Powley, and Whitehouse [8] is multiple observer-information set Monte Carlo tree search (MO-ISMCTS). MO-ISMCTS solves both the strategy fusion problem of determinizing MCTS and SO-ISMCTS, as well as the weak opponent model of SO-ISMCTS+POM. This subsection highlights the differences to the other variants of ISMCTS and reviews the required extensions to the four phases of MCTS. Lastly, it discusses the performance in comparison to the other three reviewed MCTS variants for imperfect information.

MO-ISMCTS uses an independent search tree for each observer in the game. Wherever SO-ISMCTS+POM would assume a random move in the selection phase, MO-ISMCTS instead consults the respective tree, informing the decision. As a consequence, the models of the opponents are significantly stronger, at the cost of higher complexity.

The phases of selection, expansion and backpropagation need to be carried out concurrently on all trees, while a single simulation can be used for all trees.
Where the previous variants and perfect information MCTS implicitly chose moves by successive selection of favorable states, MO-ISMCTS constructs turns from the roles' moves, ply by ply. Expansion is still done on a specific determinization. Each tree grows lazily. The utility is backpropagated in each tree depending on the role.

In the selection phase, at the start of the iteration, a possible state is determinized by selecting one from the information set of the current root. This determinization may lie in many indistinguishable histories and in separate branches of each tree; therefore, all non-main trees need to be rerooted. Rerooting requires backtracking up the tree until the last known state is reached and generating a development that is compatible with this last known state and the determinization.

After all trees are rooted in the common determinization, the algorithm queries which roles are in control. Using UCT, the most eligible child node for each tree is determined. Potentially, many edges point to that child. By arbitrarily choosing an edge, the procedure selects a move for that role. All those moves are collected and combined into a turn. The turn is used to progress the determinized state. In each tree, the edge that is compatible with the respective move and the progressed state is used to select the next node. After enough iterations, once a leaf node with respect to the determinized state is found in the main tree, the algorithm transitions into the expansion phase.

In the expansion phase, every follow-up state of the determinized state is either added to the information set of an already existing node, or it is the basis of a new child node. This needs to be done for every tree.

Simulation via random playouts can be done just as in perfect information MCTS. After it reaches the terminal state, the utility for every role is extracted (instead of only the actor's role). The utility of each role is backpropagated in each tree. This ensures that the moves in the selection phase mimic those of strategic opponents.

In the experiments done by Cowling, Powley, and Whitehouse [8], MO-ISMCTS is the best option in the family of ISMCTS. While its runtime performance is comparatively poor, the stronger opponent model and the ability to search the tree deeply without strategy fusion offset its limitations. In some games, the margin is bigger than in others, depending on branching factor, susceptibility to strategy fusion and player interaction.

CHAPTER 3
Methods

Autonomous play of imperfect information games in the context of general game playing (GGP) remains insufficiently explored. A promising family of methods are the Monte Carlo tree search (MCTS)-algorithms. One objective of this body of research was to survey the literature on MCTS and gather the most important variants for imperfect information games and GGP. Additionally, this thesis describes a new method in this family, based on the found variants. When applying MCTS, there are multiple configuration options available to accommodate different use cases. Using methods found in the literature, the author of this thesis suggests ways of configuring the proposed method, similar to the known options. To evaluate the method, this thesis describes a tournament, and reports on a quantitative experiment designed according to this tournament. A dependency of the tournament is an engine to run games in, and agents to play those games.
The author of this thesis designed and wrote an engine and agents, informed by similar programs described in the literature.

Section 3.1 reviews the overall scientific approach to these objectives. The methodology of the comparison of existing MCTS modifications is discussed in section 3.2. In section 3.3 the focus lies on the path towards configuring the proposed method. Section 3.4 offers the details of the design of the experiment. Finally, chapter 4 is devoted to the implementation of the engine and the agents.

3.1 Scientific Approach

The thesis had three research questions. First: Which of the modifications for MCTS are effective for playing imperfect information games in a GGP context? Second: Which features can be extracted from descriptions of imperfect information games to construct a heuristic to set the maximum expansion depth for the MCTS algorithm? Third: How does an MCTS-agent with limited expansion depth compare in benchmarks?

The first research question led to literature research and to a comparative, qualitative evaluation of two concrete games. The literature research started with the extensive survey papers by Browne et al. [4] and Świechowski et al. [26]. Both papers aimed to gather and review all available methods, modifications, and applications. From there, the most relevant publications on the topics of imperfect information games and GGP became the subject of the literature review. After selecting two games for the experiment (see section 3.4), the author of this thesis used those games for applying the methods described in the publications and comparing them qualitatively.

The survey of the chosen literature focused on whether an approach allowed playing imperfect information games, and whether any specific optimizations ruled out its usage in a GGP context. More specifically, for playing imperfect information games, the survey checked how a method mitigates strategy fusion (see subsection 2.3.2), whether there have been experiments comparing it to other methods, and whether code is available. Some methods are not usable in a GGP context. These were identified by investigating whether the variant relies on game-specific optimizations or machine learning techniques.

The most promising methods were qualitatively compared using the two example games Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4). The two games were also used for the experiment. The author of this thesis chose them for the qualitative analysis for two reasons. Firstly, they were small enough for manual analysis, but still rich enough in complexity such that they were not trivial. Secondly, after having implemented them, he was intimately familiar with the games, and thus able to quickly falsify hypotheses by running examples.

The methodology of the qualitative analysis itself was to use a characteristic state of each game, then analyze it manually, and finally infer how the MCTS variants would evaluate the game state. The variants considered were determinizing MCTS (see subsection 2.3.2), single observer-information set Monte Carlo tree search (SO-ISMCTS) (see subsection 2.3.3), single observer-information set Monte Carlo tree search with partially observable moves (SO-ISMCTS+POM) (see subsection 2.3.4), and multiple observer-information set Monte Carlo tree search (MO-ISMCTS) (see subsection 2.3.5). Focusing on specific games loses generality of analysis, but greatly simplifies the scope, while being relatively insightful.
Furthermore, it may serve as a stepping stone for more general analysis in future work.

When configuring the devised method of limiting the expansion depth for MO-ISMCTS, the choice of the concrete depth is subject to a range of parameters. The author of this thesis proposes a set of parameters and methods, based on the approaches by Kuhlmann and Stone [15], Schiffel and Thielscher [21], Clune [6], and Mańdziuk and Świechowski [17].

The experiment was closely modelled after the quantitative evaluation done by Cowling, Powley, and Whitehouse [8]. It is a double-round-robin tournament including self-play with three contestants: a clairvoyant MCTS-agent, an agent using unlimited depth-MO-ISMCTS, and a limited depth-MO-ISMCTS agent. Each contestant played as the first-moving player and the second-moving player, against each other contestant, including themselves.

The design process of the engine followed an iterative prototype model. Starting from the elementary data structures, the author aimed to identify the minimal required changes to arrive at the features needed to conduct the experiment. Whenever possible, unit test cases were added before implementing, which is also known as test-driven development. A goal during the implementation was to reuse existing libraries and projects as much as possible to minimize how much code the author had to write, and thus to start the experiment as early as possible. However, due to the scope of the project, the code base was still sizeable. Naturally, while implementing one part of the engine, insights accrued during the process revealed design mistakes and better alternatives for implementation details. Thus, a balance had to be kept between iteratively improving and striving to finish.

3.2 Comparison of Existing Modifications

Four variants of MCTS (see section 2.3) are subject to comparison. These are the suggested primary methods for playing imperfect information games [26]. Naturally, their relative playing strengths differ from game to game. To compare the playing strength, the author of this thesis used a method of qualitative evaluation, adapted from previous work [8].

The survey paper by Świechowski et al. [26] identified improvements for MCTS used for playing imperfect information games. The primary suggested methods were determinizing MCTS, and three variants of information set Monte Carlo tree search (ISMCTS). All of them were extensively covered in the publication by Cowling, Powley, and Whitehouse [8]. Furthermore, the paper features several experiments directly comparing the methods. The authors use a small example game to showcase how the different MCTS variants work. In the showcase, they chose a particular state and visualized the search tree for every algorithm. As a consequence, Cowling, Powley, and Whitehouse [8] highlight the differences between the search trees, and emphasize the differences in expected outcome.

For this thesis, the author expanded upon the idea of directly comparing the algorithms with specific examples, specifically by considering more complex games. The intended outcome was a richer analysis of the differences, at the cost of simplicity. However, the goal was to compare, and not to instruct on how the algorithms work. The chosen games were Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4). These were the games used for the experiment. Thus, technical feasibility was the main consideration.
Nonetheless, among the possible choices of games, the author selected these games for additional reasons. For a detailed explanation of the games see subsections 5.1.1 and 5.1.2.

By choosing Phantom_Connect(4,4,4), the results of the experiment could be compared directly to those by Cowling, Powley, and Whitehouse [8]. The game is a hidden information version of Connect that is well understood, while not being trivial.

The author picked Dark_Split_Corridor(3,4) because it features a comparatively wide decision tree with a complex state, while still being technically feasible. Furthermore, the moves are partially visible, leading to indirect player interaction and incentivizing knowledge management. These properties indicated an interesting game to analyze the algorithms on.

The choices for characteristic game states follow from the histories produced by the experiment. The matches for Phantom_Connect(4,4,4) ran chronologically before those of Dark_Split_Corridor(3,4). Unfortunately, 53 of the 450 matches did not record the history in an easily extractable format. Thus, only 397 histories were available for Phantom_Connect(4,4,4) for analysis. For Dark_Split_Corridor(3,4) this mistake was rectified, and the analysis included all 450 histories.

The states were produced by methodical selection of actions, informed by the number of occurrences in all histories. The number of occurrences is further aggregated by the outcome of the match. From this follows, for each action, a count over all histories for each ply and for each outcome (first player wins, first player loses, or a tie). The selection of the moves corresponds to (one of) the most often used moves that ultimately lead to the most wins for the role in control. This process of selection results in a history. The history is cut off one ply before the shortest recorded history. The shortest recorded history for Phantom_Connect(4,4,4) was 7 plies long. The shortest recorded history for Dark_Split_Corridor(3,4) was 5 plies long. The last state of the generated history is the characteristic state for that game.

3.3 Feature Extraction for Setting the UCT-Border

The proposed modification of MCTS is characterized by limiting the maximum expansion depth. The maximum expansion depth is also called the UCT-border. While there is no formal analysis of the effect of where exactly the UCT-border lies, there is prior work on feature extraction in the context of GGP. These features ideally inform the decision for setting the maximum expansion depth. Furthermore, after sufficient quantitative evidence, this may also pave the way towards a formal investigation in future work.

The literature includes two approaches, static analysis (sometimes also offline analysis) and dynamic sampling (sometimes also online analysis). The former is discussed in Schiffel and Thielscher [21], and Kuhlmann and Stone [15]. The latter is featured in the publications by Clune [6], and Mańdziuk and Świechowski [17].

Static analysis concentrates on the static information of a game, which may be extracted from the description. These features are characteristic for a game, and indicative of properties relevant to search. For example, a game with a narrow tree may only have a small set of actions. Another example is a game with a high number of atoms; this feature may indicate a complicated state. A more thorough analysis yields more expressive results.
When resources are limited, such as a competitive environment, the expense may not be justifiable. Dynamic sampling – another approach – includes sampling states, and doing statistical analysis upon the recorded states. This is less general, but nonetheless useful in most cases. For example, counting the number of possible states, given a role’s perspective. This indicates how much information is actually hidden from this role. For the value of the maximum expansion depth, two static-, and two dynamic features were considered. The static features were the size of the state space, and the size of the action space. The size of the state space may be given an upper bound by counting the atoms defined by the game description language (GDL) description. The state space is less or equal to the powerset of atoms. The tightness of the upper-bound varies, and can be improved by incorporating rules. The size of the action space is the number of actions defined by the GDL description. The dynamic features were the mean arity (viz. number of branches of a node) of the game tree, and the average distance to the terminal nodes. To determine the state- and the action space’s sizes, ground the logic program described by the GDL description. The set of relations that are arguments of init/1, true/1, and next/1, form the state space. The set of relations that are the second argument of legal/2 constitute the action space. The mean arity of the game tree is approximated by generating random histories, and counting the number of possible turns at each ply. Starting from the initial state, ap- ply random turns to the current state until reaching a terminal state. During each step, record previously unseen non-terminal states into a set. For each state in the set, calculate the number of unique turns. Similarly, the average depth is calculated by sampling closed histories, and counting their length. Starting from the initial state until reaching a terminal state corresponds to a path from the root- to a terminal node in the game tree. Thus, the number of plies is equivalent to the depth. 3.4 Experiment Setup The conducted experiment was a tournament, playing two games, with the implemented agents as contestants. The tournament was double-round-robin with self-play. Self- play includes games where two instances of the same agent play against each other. The selected games were Phantom_Connect(4,4,4), and Dark_Split_Corridor(3,4). Both games are alternating two player zero-sum games with hidden information. Each agent played fifty games as first moving-, and fifty games as second moving role, against each contestant. The number of MCTS iterations were fixed, with an unlimited game clock time, to remove any influence of implementation details. The contestants were a 35 3. Methods clairvoyant MCTS agent, a MO-ISMCTS agent without a limited expansion depth, and an agent using MO-ISMCTS with a fixed expansion depth. The design of the tournament was informed by Cowling, Powley, and Whitehouse [8]. The mode of double-round-robin with self-play, ensures an equal number of matches for each agent, and for each role. By balancing the number of matches, statistical com- parison is more representative. Alternatives to double-round-robin tournaments include elimination tournaments. Elimination tournaments are well suited for finding the best agent among the contestants, using a smaller number of games. However, the smaller number of games makes it more difficult to analyze why the agent is better. 
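Returning to the dynamic features of section 3.3, they can be estimated with a short sampling routine. The following sketch approximates the mean arity and the average depth via random playouts; the interpreter interface used here (get_init_state, get_legal_moves, get_next_state, is_terminal) and the representation of turns are simplifying assumptions for illustration, not the engine's exact API.

    import random
    from itertools import product
    from statistics import mean


    def sample_features(interpreter, roles, playouts=100):
        """Approximate the mean arity and the average depth of a game via random playouts.

        Assumes states are sets of atoms (hence frozenset below) and that `interpreter`
        offers get_init_state(), is_terminal(state), get_legal_moves(state, role)
        and get_next_state(state, turn); this is a simplified stand-in interface.
        """
        arities = []   # number of distinct turns for each previously unseen state
        depths = []    # length of each sampled closed history
        seen = set()

        for _ in range(playouts):
            state = interpreter.get_init_state()
            depth = 0
            while not interpreter.is_terminal(state):
                moves_per_role = [list(interpreter.get_legal_moves(state, role)) for role in roles]
                turns = list(product(*moves_per_role))         # joint moves of all roles
                key = frozenset(state)
                if key not in seen:                            # record unseen non-terminal states
                    seen.add(key)
                    arities.append(len(turns))
                turn = dict(zip(roles, random.choice(turns)))  # pick a random turn and advance
                state = interpreter.get_next_state(state, turn)
                depth += 1
            depths.append(depth)

        return mean(arities), mean(depths)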
The author of this thesis chose Phantom_Connect(4,4,4) as the first game because it was also used as a game in the tournament conducted by Cowling, Powley, and Whitehouse [8]. The conclusions of the paper served as a point of reference and verification. Additionally, the small state and action spaces classify it as a computationally tractable game.

The second game, Dark_Split_Corridor(3,4), is computationally more intense. Furthermore, there seems to be less available literature on it. In contrast to the first game, it does not feature ties, has a more complicated game state, and features a more immediate player interaction.

The number of matches played was fifty per game, per role, and per agent. In total, for each game this equates to 450 matches, and 250 matches per agent (not double counting the games they played against themselves). The number of games was limited by the high computational load, and consequently the required time to run the experiment. The number of iterations was fixed, while there was no time limit per ply. This made it possible to run the experiment distributed over multiple machines, while keeping the results representative.

For both games, the number of MCTS iterations was chosen as high as possible, while keeping the total runtime feasible. With Phantom_Connect(4,4,4) the agents applied 7500 iterations of MCTS per ply. For Dark_Split_Corridor(3,4), due to its more complex state and the resulting longer individual iterations, the number was reduced to 2500 iterations.

The implemented agents were a clairvoyant (viz. cheating) MCTS agent, an unlimited, and a limited MO-ISMCTS agent. The clairvoyant agent plays with the full game state in view, instead of a partial view. Therefore, there is no information hidden from it. Obviously, this leads to a significant advantage. The clairvoyant agent serves as an upper bound of playing strength. Furthermore, an agent winning, tying, or losing more often against the clairvoyant agent is an indication of an advantage or disadvantage in playing strength. The unlimited agent implemented MO-ISMCTS, and would be directly compared to the limited counterpart. The limited agent had a set expansion depth of 4 for Phantom_Connect(4,4,4), and 5 for Dark_Split_Corridor(3,4).

The expansion depth depended on the static analysis, dynamic sampling, and most importantly on the histories of the matches conducted by the unlimited agent. These matches were conducted chronologically before those of the limited MO-ISMCTS agent. The histories showed that the chosen values resulted in a lower maximum expansion depth, while not being too restrictive.

CHAPTER 4
Implementation

The implementation consists of three modules: the engine, the tree (or node) data structure, and the agents. The engine is the foundational layer of the implementation. It covers the elementary data structures, the game description language (GDL)-interpreter and the match orchestration. The tree data structure is a representation of the dynamic search tree discussed in section 2.2. The implementations of the three available agents apply Monte Carlo tree search (MCTS) to the tree data structure.

The programming language of choice was Python with type-hints. The targeted versions included 3.8 up to 3.12. The choice of Python was based on its readability, widespread usage, and, because it is interpreted rather than compiled, ease of debugging and introspection.
Furthermore, by using libraries for solving logic programs, writing parsers, and orchestrating multithreading, the scope of the project was more manageable.

4.1 Engine

The engine covers three parts: the elementary data structures, the interpreter, and the match orchestration. The elementary data structures are modeled after the definition of GDL. The interpreter uses the elementary data structures to receive and answer requests. The match orchestration module uses the interpreter to conduct matches. This section briefly summarizes the most important details for the implementation of the elementary data structures, the interpreter, and the match orchestration.

4.1.1 Elementary Data Structures

The elementary data structures are organized into a hierarchy of classes (see fig. 4.1), which closely follow the definitions of GDL. Symbol forms the bottom of the hierarchy. It is a disjoint union of Primitive and Relation. One level up is Sentence. A sentence is either a fact or a rule. It consists of a head and a body. The head is a relation. The body is a sequence of Literal objects. A literal is a relation and a sign. A sequence of sentences forms a Ruleset. The design and architecture were informed by the Python API of the answer set programming (ASP)-solver clingo [12].

Figure 4.1: The class hierarchy of the elementary data structures.

The Primitive class is the abstract base class of Variable, Number and String. A variable is only a name that may be used instead of any other symbol. A number is a zero-ary relation with an integer as its name. Finally, a string is also a zero-ary relation, with arbitrary characters as its name. Strings are denoted inside quotes to distinguish between relations and variables. These primitives are the bedrock of the hierarchy. They do not have arguments, and therefore no children.

The Relation class consists of a name, which is either a str or None, and a sequence of arguments of type Subrelation. A relation with no name is a tuple. A subrelation is a wrapper for either a relation or a primitive. The wrapper class allows a type-safe total ordering between all symbols, and distinguishes a top-level relation from a relation used as an argument.

Sentences are constructed with the Sentence class, consisting of a head of type Relation and a body, a sequence of elements of type Literal. If the body is empty, the sentence is considered to be a fact. If the body is non-empty, the sentence is a rule. A Literal is a Relation together with a Sign, which is either positive (i.e., NOSIGN) or negative (i.e., NEGATIVE). Finally, an ordered collection of sentences forms a ruleset, which is instantiated with the Ruleset class. In the implementation, Ruleset builds the rule-graph, and from that, extracts the strongly connected components.

4.1.2 Interpreter

The interpreter uses the definition of a game to fulfill requests. The definition of a game is given in GDL (see subsection 2.1.5). GDL is a dialect of Datalog, and therefore a stratified logic program. The interpreter transforms the logic program, depending on the request, and delegates it to a solver.
Then the interpreter processes the solution, extracts and returns the answer, and thus carries out the request.

The architecture of the interpreter is inspired by Kowalski and Szykuła [14]. The interpreter uses the ASP-solver clingo [12], more specifically its Python API, to solve logic programs. This approach was first described by Thielscher [28] and later by Möller et al. [19].

The interpreter has to fulfill static, dynamic, and temporal requests. Static requests are neither dependent on the state nor sensitive to actions. These include what roles exist in the game and what the initial state is. Dynamic requests depend on the state, and for one method also on the actions of the actors. The dynamic requests include what the next state is, which role sees what, what the legal moves are, what the goal values for each role are, and whether the state is terminal. Finally, there are two temporal requests: enumerating all possible histories, and listing all possible states. See fig. 4.2 for the methods of an interpreter.

Figure 4.2: The class diagram of the interpreter. The Interpreter offers the methods
get_roles() Role[],
get_init_state() State,
get_next_state(current: State | View, turn: Mapping[Role, Move]) State,
get_sees(current: State | View) Mapping[Role, View],
get_legal_moves(current: State | View, role: Role) Mapping[Role, Set[Move]],
get_goals(current: State | View) Mapping[Role, int | None],
is_terminal(current: State | View) bool,
get_developments(record: Mapping[int, State]) Iterator[State[]],
get_possible_states(record: Mapping[int, State | View], ply: int) Iterator[State],
where Role and Move are Subrelation and State is Set[Subrelation].

The interpreter only considers relevant rules when fulfilling requests, by slicing the original GDL definition. The relevant rules are grouped in the strongly connected components of the rule graph. The rule graph is a directed graph where nodes correspond to the signature of a relation, and an edge represents the inclusion in the body of a rule.

Some rules depend on the distinct/2 relation, which holds if its arguments do not unify. Literals containing the relation are substituted with the ASP binary comparison (!=)/2 (viz. does-not-equal operator). A backtracking-solver would need to delegate to the (/=)/2 (viz. does-not-unify operator).

The method for the role request is called get_roles. It answers which roles exist in a game. The method extracts the rules whose heads are in the same strongly connected component as the role/1 relation. The solution to the corresponding logic program contains role/1 relations. Every argument of such a relation is a role. The solver's answer may be cached.

The request for the initial state is the interpreter's get_init_state method. The solver receives all rules which are relevant with respect to the init/1 relation. In the solution, all arguments of the init/1 relations form the true atoms of the initial state. The solver's answer may be cached.

Dynamic requests include an encoding of the state into the logic program. The state is a set of atoms, representing the atoms that are true in this state. All other atoms are assumed to be false. The encoding of a state is a series of facts. These facts match the signature true/1.

The method get_next_state represents the request for the next state. It takes two inputs: the current state and the turn. The solver receives all rules which have the relation next/1 as head, and their relevant rules.
Additionally, the current state is encoded, and the turn is encoded with does/2 relations. A turn consists of moves, which are pairs of roles and actions. The role is the first argument of the does/2 relation, while the action is the second. The solution to the logic program contains the next state encoded with next/1 relations. All other relations in the solution may be 42 4.1. Engine ignored. Given a state, the method get_sees returns the views for each role in the game. The derived logic program includes all rules with the sees/2 relation as head, their relevant rules, and the encoded state. The solution consists of sees/2 relations where the first argument corresponds to the roles, and the second to the atom that is visible to the role. All other relations in the solution are irrelevant. Similarly, the method get_legal_moves represents the request for the legal moves for each role in the game, given a particular state. By slicing the GDL definition, such that, only the relevant rules for all rules with legal/2 are included, and the state is encoded, the interpreter derives the appropriate logic program. The solution of the logic program encodes the legal actions of each role in the given state via the legal/2 relation. To determine if a state is terminal, the interpreter offers the is_terminal method. The method takes the state as input. The logic program derived from the GDL definition includes all relevant rules to the terminal/0 relation, as well as the encoded state. If the solution to the program includes terminal/0 the state is terminal. Temporal requests model the match as a series of states, as a logic program. This logic program is called the temporal extension [28]. The temporal extension of a GDL definition swaps out the init/1, true/1, and next/1 relations for a holds_at/2 relation. The holds_at/2 relation’s first argument is the atom, while the second argument is the ply number. It also transforms relations that are relevant in a temporal context by adding the _at suffix, and adding the ply number as another argument. The get_developments method is a temporal request, and can enumerate all possible histories. It takes any number of states paired with their ply number, and returns all histories that are compatible with the states. The interpreter transforms the original GDL definition into the temporal version, and encodes the states at their respective ply with holds_at/2 relations. The resulting logic program has multiple solution. Each solution represents a history. The state at a given ply of a history is encoded with holds_at/2. Similarly, the get_possible_states method may be used to enumerate all possible states given a collection of views and their ply numbers. Optionally, the method also takes states with ply numbers. The resulting temporal logic program consists of holds_at/2 and sees_at/3 rules. Each solution encodes a possible history. The interpreter extracts the possible solutions by filtering the holds_at/2 relations by ply number. 4.1.3 Match orchestration Match orchestration is divided into two modules, the actors, and the general game playing (GGP)-gamemaster (see subsection 2.1.2 and [10]). The actors are objects defined by the Actor class. They serve as the interface between the agents and the gamemaster. 43 4. Implementation The gamemaster is represented by the Match class. It centrally mediates the actors, and manages the match’s history. The Actor class serves as the bridge between the gamemaster, and the agent. 
As a bridge, there is nearly no business logic, except enforcing disqualification when the game clocks run out. The gamemaster sends the signals discussed in subsection 2.1.2 via the methods send_start, send_play, send_stop, and send_abort to notify the agent of transitions in the match’s phase. The return values of those methods are the respective answers. The Match class is the representation of the gamemaster. It has a copy of the ruleset, an interpreter, a mapping of roles to their actor, the start- and game clocks, and the history. Once the match is over, the gamemaster also holds a mapping of roles to utilities. 4.2 Tree Data Structure All implemented agents (see section 4.3) use a specialized tree data structure as a search tree. The tree data structure is a root node that links to its children. Each node may link to its parent (except the root node), and to its children (except the terminal nodes). The implementation consists of two types of nodes, perfect information nodes and infor- mation set nodes. There are no subtypes to perfect information nodes. There exist two subtypes of information set nodes, hidden information nodes, and visible information nodes. Hidden information nodes are for plies where the agent is not in control, while at visible information nodes the agent makes a move. 4.2.1 Perfect Information Nodes Perfect information nodes (see subsection 2.2.1) are implemented as a class. The class’ name is PerfectInformationNode and it specifies a mutable stateful object with five attributes, and four methods. The attributes are named parent, children, state, turn, and valuation. The methods are called expand, evaluate, develop and trim. As the name suggests, the parent attribute is a link to the parent of a node. If the parent is None the node is the (current) root node. The distance from the root node, also implicitly defines the depth of a node. Generally, the depth of a node is defined to be parent.depth+1 where the depth of the root is 0. In the implementation of search trees, the edges correspond to the mutable mapping children. The mapping’s keys are turns, while its values are the child nodes. The keys represent the edge’s labels. In perfect information nodes, the mapping is assumed to be bijective. Therefore, each turn maps to exactly one node, and there is an edge for every child node. Each node represents the state of a game, stored in the attribute state. The root node holds the initial state. The state of a node cannot change— it is immutable. 44 4.2. Tree Data Structure After a ply, an agent may record the turn in the attribute turn. While not strictly necessary, it is useful for pruning branches that are no longer relevant. Namely, after turn is set, all entries in the mapping children with a different key are obsolete. The attribute valuation stores the current valuation. It is mutable, as the MCTS- algorithm updates it during its iterations. The method expand creates the child nodes, links them, and stores the links into chil- dren. It takes one argument, an interpreter. Using the interpreter, it enumerates each follow-up state, and the corresponding turn. It creates a PerfectInformationNode with the followup state as state, for each follow-up state. These nodes are mapped, with the turns as keys. The expand operation is idempotent. With the evaluate method, the node is evaluated, thereby setting or updating the valuation attribute. The method takes three arguments, an interpreter, an evaluator, and a valuation factory. 
The evaluator uses the interpreter to produce a utility value from the state of the node. If the node already has an valuation that is not None, the propagate method aggregates the utility to the existing valuation. In the other case, the valuation factory produces a valuation and stores it in the valuation attribute. During the progression of the game, the agents need to keep the tree up-to-date. The update process is fairly simple in perfect information games. As the moves are observable, updating the tree consists of setting the turn value for each node and walking the edges with the correct turns. Nodes are updated with the develop method. Calling develop returns the node on a depth that is equal to the ply number, and for perfect information games, a state that equals the view— the current state. The method takes three arguments, an interpreter, the ply number, and the view. With the interpreter’s get_developments method, the method iterates through all possible histories. For perfect information games, there is (usually) only one possible history. With the turns generated by the history, the method walks down the tree until reaching the correct node. Finally, trim deletes all links to obsolete children. Without the links, the garbage collector is free to collect the children. 4.2.2 Hidden Information Nodes Hidden Information Nodes are implemented via the HiddenInformationNode, which is informed by the concepts discussed in subsection 2.2.2. It is used for plies where the role of the actor is not in control. The class defines stateful and mutable objects. It has six attributes role, possible_states fully_enumerated, fully_expanded, parent, and children. It has three methods: branch, evaluate, and develop. The attributes of HiddenInformationNode encode the current knowledge of the match. In role is the role from which perspective the tree is. It is the same value for all nodes of the tree. The information set of a node is stored in possible_states. It is a set 45 4. Implementation of states that share the same view. The attribute fully_enumerated is true if the information set is complete. The attribute fully_expanded is true if, for every possible state, all possible turns are linked in children. Both parent, and children fulfill the same function as in the implementation for perfect information nodes. The mapping children represents the edges to its children, similar to PerfectInforma- tionNode, however, the mapping is no longer guaranteed to be bijective. Furthermore, a hidden information node may only have up to two children, a visible child or another hidden child. As the moves are unobservable, all keys in the mapping point to one of those two children. The key of an entry consists of the assumed state, paired with the turn. To add edges, the method branch takes two arguments, an interpreter and a state of possible_states. Using the interpreter, the method expands the children map with an edge per turn and the follow-up state, only for the state in the arguments. If the role in role is in control in the follow-up state, the visible node is the target. In the other case, the hidden child is the target. As there are only two children for hidden information nodes, the develop method for in- formation set nodes walks down the hidden information children until the depth matches the last ply. Then, it chooses the visible child node. Should a link not be branched yet, the method uses arbitrary states from possible_states to reach its target. 
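To summarize the data structure described above, the following is a condensed Python sketch of a hidden information node. The attribute names follow the text; the method body, the helper interpreter calls (get_turns, is_in_control) and the stub for the visible child are simplifying assumptions, not the thesis's actual code.

    from dataclasses import dataclass, field
    from typing import Optional


    @dataclass
    class _VisibleChildStub:
        """Stand-in for the visible information node of subsection 4.2.3."""
        role: object
        parent: Optional[object] = None
        possible_states: set = field(default_factory=set)


    @dataclass
    class HiddenInformationNode:
        """Node for plies where `role` is not in control; states are assumed hashable."""

        role: object
        possible_states: set = field(default_factory=set)   # the information set
        fully_enumerated: bool = False                       # information set known to be complete
        fully_expanded: bool = False                         # every turn of every possible state linked
        parent: Optional[object] = None
        children: dict = field(default_factory=dict)         # (assumed state, turn) -> child node
        _visible_child: Optional[_VisibleChildStub] = None   # at most two distinct children in total
        _hidden_child: Optional["HiddenInformationNode"] = None

        def branch(self, interpreter, state) -> None:
            """Add one edge per legal turn, but only for the given assumed state."""
            for turn in interpreter.get_turns(state):
                follow_up = interpreter.get_next_state(state, turn)
                if interpreter.is_in_control(follow_up, self.role):
                    if self._visible_child is None:
                        self._visible_child = _VisibleChildStub(role=self.role, parent=self)
                    child = self._visible_child
                else:
                    if self._hidden_child is None:
                        self._hidden_child = HiddenInformationNode(role=self.role, parent=self)
                    child = self._hidden_child
                child.possible_states.add(follow_up)
                self.children[(state, turn)] = child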
4.2.3 Visible Information Nodes The implementation of the second type of information set node, the class VisibleIn- formationNode, is also informed by the concepts discussed in subsection 2.2.2. The class describes a stateful mutable object. Compared to HiddenInformationNode it has two additional attributes view, and move. Additionally, the children attribute differs. The class has one additional method, trim. The attribute view is an optional view during the ply. The attribute is set to None, until the agent receives the view from the gamemaster. Once it is set, all incompatible states in possible_states are obsolete. The trim method will later remove these states. The attribute move is an optional move. It is set after the agent has committed to a move. All edges that are incompatible become obsolete, once it is set. The mapping children points from pairs of states (the assumed state), and moves to children. It is neither guaranteed to be injective, nor surjective. Therefore, two edges may point to the same node, and not all follow-up states are covered. Other than HiddenInformationNode, an instance of the VisibleInformationNode class may have more than two children. Visible children are grouped by the view of the follow-up states. Therefore, two follow-up states with the same view, are part of the same child’s possible_state attribute. Hidden nodes are not grouped. Therefore, each move leads to a different hidden node. Thus, if two edges point to the same hidden child, their states differ, but not their move. 46 4.3. Agents As with PerfectInformationNode, trim deletes all links to obsolete children. Without the links, the garbage collector is free to collect the children. 4.3 Agents The agents are autonomous players, using MCTS to play games. The agents use the data structures and algorithms from the previous sections to search for the best move. The author implemented three types of agents: an agent using MCTS for games with perfect information, called MCTSAgent; an agent using single observer-information set Monte Carlo tree search with partially observable moves (SO-ISMCTS+POM), called SO- ISMCTSAgent; and an agent using multiple observer-information set Monte Carlo tree search (MO-ISMCTS) called MOISMCTSAgent. This section discusses the architecture used to implement the agents, as well as each agent separately. The design and implementa- tion of the agents is informed by Cowling, Powley, and Whitehouse [8]. 4.3.1 Architecture The architecture of the agent’s implementation is based on the composition over in- heritance design principle. The design principle recommends to use composition over inheritance, if it can be used instead of inheritance. Consequently, the implementation is separated into multiple classes that could be aggregated. This decision led to a more modular code base, and was easier to understand for the author. For an overview, consult the class diagram in fig. 4.3. The entry point for an agent is the calculate_move method. The method receives the numbered ply, the current view and information about the game clock. As the name suggests, it calculates the next move. How it calculates the move is based on the search algorithm. The search algorithm for the implemented agents was an adapted form of MCTS. Any of the three MCTS-agents is also a tree-agent. Tree agents characteristically contain two minor- and two major operations. The minor operations are descend and get_- key_to_evaluation. The major operations are update and search. 
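The signatures of these four operations, as shown in fig. 4.3, can be summarized roughly as the following Python protocols. The type aliases are placeholders and the hierarchy is condensed; this is a sketch for orientation, not the exact class layout of the implementation.

    from typing import Mapping, Protocol, TypeVar

    # Placeholder aliases; in the engine, roles and moves are Subrelations and a view
    # is a partial state.
    Role = object
    View = object
    Move = object
    Ruleset = object

    K = TypeVar("K")  # type of an edge label (a turn, or a move/turn)
    E = TypeVar("E")  # type of an evaluation


    class Agent(Protocol):
        """The general game playing agent interface (cf. fig. 4.3)."""

        def prepare_match(self, role: Role, ruleset: Ruleset, *args) -> None: ...
        def calculate_move(self, ply: int, total_time_ns: int, view: View) -> Move: ...
        def conclude_match(self, view: View) -> None: ...
        def abort_match(self) -> None: ...


    class TreeAgent(Agent, Protocol[K, E]):
        """Tree agents add the two minor and two major operations discussed above."""

        def descend(self, key: K) -> None: ...                                    # minor: prune and walk down
        def get_key_to_evaluation(self) -> Mapping[K, E]: ...                     # minor: edge label -> evaluation
        def update(self, ply: int, view: View, total_time_ns: int) -> None: ...  # major: bring the tree(s) up to date
        def search(self, search_time_ns: int) -> None: ...                        # major: run MCTS iterations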
The method descend receives a key as input and is called at the very end of calculate_move. The key is the label of an edge and corresponds to the best move; the method prunes all incompatible edges and, if possible, descends the (main) search tree.

The get_key_to_evaluation operation builds a map that points each edge to an evaluation. Depending on the agent and the type of game, an edge is not necessarily associated with only one move. Thus, to evaluate the moves, a calling method has to further process this mapping.

With the update method, the agent ensures that the tree(s) are up to date before it starts the search. The update method receives as input the current ply number and a view. The view in perfect information games is the full state. In imperfect information games, it is only the actual view of the agent.

Finally, in search the agent uses the MCTS algorithm to advance the search tree. As input, the method only takes the remaining available time. It repeats iterations of MCTS until the available time expires.

Figure 4.3: The class diagram for the agents. The diagram shows the protocols Agent (prepare_match, calculate_move, conclude_match, abort_match) and TreeAgent[K, E] (descend, get_key_to_evaluation, update, search); the abstract classes MonteCarloTreeSearchAgent[K, Tuple[int, float]] (adding step), AbstractMCTSAgent[K], SingleObserverMonteCarloTreeSearchAgent[K] (with tree, selector, and evaluator), MultiObserverMonteCarloTreeSearchAgent[K] (with trees, roles, selectors, and evaluators), and AbstractSOMCTSAgent[K]; and the concrete classes MCTSAgent[Turn], SOISMCTSAgent[Move | Turn], and MOISMCTSAgent[Move | Turn]. Protocols (viz. interfaces) are in a light shade of gray and with dashed borders. Abstract classes have a darker shade of gray. Concrete implementation classes are in the darkest shade of gray.

The MCTS agents' calculate_move is shared among all implemented agents. First, calculate_move calls update to bring the tree(s) to the latest version. Then, it invokes search. Afterward, it uses get_key_to_evaluation to map the edge labels to their evaluation. The mapping is then processed by grouping and aggregating by move. The resulting mapping points from a move to an evaluation, from which the move with the maximum evaluation can be picked. Finally, it calls descend to trim the tree(s) of all branches that are incompatible with that move. A condensed sketch of this shared flow follows below.
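A hedged sketch of the shared flow: key_to_move (projecting an edge label onto the move it contains) and the choice of ranking moves by visit count are illustrative placeholders, not the engine's API.

```python
# Sketch of the shared calculate_move flow; helper names are placeholders.
from typing import Callable, Dict


def calculate_move(agent, ply: int, view, total_time_ns: int,
                   search_time_ns: int, key_to_move: Callable):
    agent.update(ply, view, total_time_ns)      # bring the tree(s) up to date
    agent.search(search_time_ns)                # run MCTS until the time budget is used

    # Group the per-edge evaluations (assumed to be (visits, utility)) by move.
    visits_by_move: Dict[object, int] = {}
    for key, (visits, _utility) in agent.get_key_to_evaluation().items():
        move = key_to_move(key)
        visits_by_move[move] = visits_by_move.get(move, 0) + visits

    best_move = max(visits_by_move, key=visits_by_move.get)   # one possible aggregation
    agent.descend(best_move)                    # prune branches incompatible with it
    return best_move
```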
Both the MCTSAgent class and the SOISMCTSAgent class inherit from the abstract class AbstractSOMCTSAgent. An object of type AbstractSOMCTSAgent is defined by a tree, a selector, and an evaluator. The tree is the search tree (see section 4.2) to which MCTS is applied. The selector picks a child node from the available children of a node; in this implementation, it uses the UCT algorithm for selection. The evaluator returns the utility of a state; in this implementation, it uses random playouts to evaluate a given state.

The MOISMCTSAgent is characterized by the attributes trees, selectors, and evaluators. These three attributes are maps that point from each role to its respective data structure. The trees are search trees (see section 4.2). As the UCT value depends on the valuation of a node, which in turn depends on the role, each selector uses UCT to pick a child node for its associated tree. Finally, the evaluators do random playouts from the perspective of the respective role to evaluate a given state.

4.3.2 Single Observer-Monte Carlo Tree Search

The abstract base class AbstractSOMCTSAgent is the foundation of both the MCTSAgent (for perfect information games) and the SOISMCTSAgent (for imperfect information games). The base class applies the four phases of MCTS in step: the selector handles the select phase, the tree itself handles the expand phase, the agent determines the valuation of a node using the evaluator, and, finally, the tree handles the backpropagate phase. MCTSAgent assumes the tree is a perfect information tree (see subsection 4.2.1), while SOISMCTSAgent assumes an information set tree (see subsections 4.2.2 and 4.2.3).

The method update of AbstractSOMCTSAgent ensures that the tree is up to date. It uses the tree's develop method to walk to the node whose depth corresponds to the ply number. For SOISMCTSAgent, the method additionally spends some time sampling from the possible states and adding them to the node, which makes more possible states available for the selection phase.

During step, AbstractSOMCTSAgent iterates the MCTS algorithm once. The MCTS algorithm (see section 2.3) is divided into the four phases select, expand, simulate, and backpropagate. With the agent's selector, the step routine iteratively picks child nodes until it reaches an unexpanded node, which concludes the select phase. The expand phase uses the state of the node to add child nodes and link them. The same state is used for the simulation phase: the step routine uses the agent's evaluator to estimate a utility for the state. In the backpropagate phase, starting from the expanded node, the routine updates each node's valuation with the utility and walks up to the parent. The phase ends after reaching the current root node.

The MCTSAgent overrides descend and get_key_to_evaluation. The method descend receives a key, which is the edge's label; the labels in perfect information trees are turns. Thus, the agent can already walk the tree on non-simultaneous plies. The operation get_key_to_evaluation maps each possible turn to the evaluation of a node. A turn is associated with the node n if there is an edge from the current node c to n that carries the turn as its label.

The implementation of step in SOISMCTSAgent differs slightly from MCTSAgent's. Before the select phase starts, it picks one of the available possible states. Based on this state, the selector picks a child. From this child, the method selects another possible state, and so forth. Upon reaching an unexpanded node, the method expands the tree based on the last determinized possible state. From there, the method uses the same implementation as its superclass for simulation and backpropagation.

After a SOISMCTSAgent has picked a move (during calculate_move), it calls the descend method. The method can no longer walk down the tree, as it does not know the actual state. Therefore, it can only prune incompatible branches. Incompatible branches include those that do not carry the picked move on the labels of their edges.

4.3.3 Multiple Observer-Monte Carlo Tree Search

The MOISMCTSAgent uses MO-ISMCTS to find the best move.
As the name suggests, it uses multiple trees, one for each role in the game, to accomplish this goal (see also subsection 2.3.5). Every role’s tree is inherently linked to their perspective. The tree that matches the role of the actor’s is synonymously called the main tree. The MOISMCTSAgent describes a stateful object with three maps, each pointing the roles in the game to a tree, a selector, and an evaluator. The tree consists of hidden and visible information nodes. Modelling the available information, and how it is updated, necessitates this distinction. At hidden information nodes, the agent did (or rather will) not receive new information. Visible information nodes support incorporating new information. The former are used for plies where the agent is not in control, while the latter are used if the agent has to move. Thus, if a role is not in control, the corresponding node is a hidden node, and it is a visible node if the role is in control. As a consequence, the trees of each role are shaped differently and contain different information. When incorporating new information with update, it only updates the main tree. During the update, the tree’s develop method ensures the current node is at the same depth as the ply’s number. Furthermore, it guarantees the current node’s possible states are compatible with the view, discarding any that are incompatible. In the second phase of update, the method spends some time sampling the space of compatible states and adding them to the possible states of the node. As with SOISMCTSAgent, this enables more possible states to be picked during the selection phase. Conceptually, step describes the four phases of MCTS, more specifically MO-ISMCTS. It starts by picking one of the possible states, called the determinization. The deter- minization may only originate from the main tree. Afterward, the interpreter generates a matching history for the determinization. The history enables synchronization of the trees to compatible branches. From there, the method applies the role-specific selector to its matching tree for all roles that are in control. As there is a determinization, and each role is in control, there is a unique edge in each tree. By extension, there exists a unique move for each role. These moves are combined into a turn. The turn is applied to the state. The state together with the moves determine the edges of the trees. This 50 4.3. Agents is repeated, until the method reaches an unexpanded node, concluding the select phase. During the expand phase, the procedure expands the main tree’s current node, assuming the determinization. As a consequence, only edges where the label contains the deter- minization are added. Each selector simulates a separate playout which is propagated back to their respective tree. Each call of step during calculate_move, is significantly pricier, in comparison to MCTSAgent or SOISMCTSAgent. Therefore, MOISMCTSAgent takes significantly more time to reach the same number of iterations. However, the procedure overall takes fewer iterations to provide good moves. Nonetheless, the implemented agents all feature a common base of implementation. The common base of implementation, was useful in debugging and for comparison during the experiment. The results of the experiment are featured in the next chapter, chapter 5. 51 CHAPTER 5 Results The results include insights due to the analysis and the outcome of the experiment. 
First is a qualitative evaluation of the effectiveness of the four studied Monte Carlo tree search (MCTS) modifications. Second follows a review of characteristic features found for the games Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4). Third and final is a report on the results of the experiment run on the same games.

5.1 Effectiveness of Existing Monte Carlo Tree Search Modifications

To evaluate the effectiveness of determinizing MCTS and of information set Monte Carlo tree search (ISMCTS), this section contains case studies for Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4). First, the case studies outline the most important rules and properties of the games. Then, given a characteristic game state, the different algorithms are considered. For each algorithm, the author of this thesis discusses the process of calculating the next move and highlights the differences. Commonly, determinizing MCTS, single observer-information set Monte Carlo tree search (SO-ISMCTS), and single observer-information set Monte Carlo tree search with partially observable moves (SO-ISMCTS+POM) end up reaching similar conclusions. However, the convergence is mostly incidental and due to the structure of the games.

5.1.1 Case Study Phantom_Connect(4,4,4)

The game Phantom_Connect(4,4,4) is an alternating two-player zero-sum game with hidden information. It is an instance of the n, m, k-connect family. Those games feature an n × m board. Each player marks cells with their initial (usually x and o), and the first player to mark k in a row, column, or diagonal wins the game. The game ends in a draw if the win condition is not yet met but all cells are already marked. Arguably the most well known game of this family is Tic-tac-toe (Connect(3,3,3)). Some Connect games feature modifications to their rules, like gravity and phantom. With gravity, a player only chooses the column, and the mark lands on the lowest unoccupied row of that column. A famous game in this category is known as 4-Connect (Gravity_Connect(7,6,4)). The phantom modifier hides the status of a cell from all players until they try to mark it. Once they mark a cell, they either learn that it was not occupied before, resulting in their mark being placed, or they reveal the other player's mark. Should a player attempt to mark an already occupied cell, they may move again. Thus, the variant is no longer alternating. The notation for the actions is (C, R), and the moves are denoted with Role : (C, R), where C denotes the column of the marked cell and R denotes the row. The roles are x (the first-moving player) and o (the second-moving player).

The characteristic game state for the case study is in its seventh ply, after the sixth turn has been played. Eq. 5.1 lists its history. See fig. 5.1 for a visual representation of the characteristic state.

$s_0 \xrightarrow{x:(1,4)} s_1 \xrightarrow{o:(4,4)} s_2 \xrightarrow{x:(4,1)} s_3 \xrightarrow{o:(1,4)} s_4 \xrightarrow{o:(3,3)} s_5 \xrightarrow{x:(3,2)} s_6 \quad (5.1)$

Figure 5.1: The characteristic game state $s_6$ for the case study of Phantom_Connect(4,4,4). A cell is either blank or features up to three marks. The marking player's symbol is shown in the center of the cell. The marks below show which player knows the mark of a cell.

The state $s_6$ favors x, as o can no longer afford mistakes without risking a loss. Acceptable actions include (2, 3), (3, 2), and (4, 1).
Both (3, 2) and (4, 1) additionally allow an immediate subsequent move, thereby gaining more information in the process. Thus, they are both optimal moves. From the perspective of o, 78 unique states are possible, of which 5 (one of them being $s_6$) are situations where a wrong move from o might result in a loss. The board has 16 cells, 3 cells are known to o, and 5 cells are marked in total; thus $78 = \binom{13}{2}$ states are possible. A player cannot distinguish the order in which the opponent marked the cells, so $156 = \frac{13!}{(13-2)!}$ histories lead to the 78 states.

Determinizing MCTS has at least 78 determinized trees, one for each unique state. Until the algorithm runs out of time or iterations, it chooses one of the trees and applies MCTS. From the root node of every tree, the procedure counts the number of iterations of the child nodes. The number of iterations is aggregated by the move that led to the child. The algorithm returns the move with the highest number of iterations. Given that the optimal moves coincide with the acceptable moves in only 5 of the 78 determinized trees (less than 7%), determinizing MCTS would probably pick (1, 1) or (2, 2). The overwhelming number of determinizations are positions where x cannot secure a win in the next ply. In those positions, (1, 1) or (2, 2) would indeed be beneficial, as o would be able to secure a win with one more move. Determinizing MCTS fails to weigh the threat of losing accordingly.

SO-ISMCTS only features one tree, holding 78 unique possible states. Each history leads to a separate branch of the tree. The procedure cannot distinguish the correct branch and thus has to either create a new tree and discard the previous iterations, or iterate from the last certain ancestor of the current node. At the start of an iteration, the procedure assumes a possible state. Next, the algorithm applies the four phases of MCTS until the break condition. Up to this point, SO-ISMCTS and determinizing MCTS are similar. However, as there is only one tree, SO-ISMCTS does not need to combine the data of several trees, which allows the procedure to potentially reuse the previous iterations. Nonetheless, despite the gained efficiency, SO-ISMCTS likely performs poorly, as it still suffers from strategy fusion. Only 5 out of the 78 possible states are similar enough to the actual state that their optimal move coincides with the acceptable moves. In most of the possible states, (1, 1) and (2, 2) are far more likely to lead to a win than defending on (2, 3), (3, 2), or (4, 1). Even if the probability is low, SO-ISMCTS fails to consider the importance of not losing.

The improved SO-ISMCTS+POM also uses only one tree, but addresses a fundamental weakness of the previous attempts by removing the assumption that the opponent's moves can be observed. Without the assumption, the algorithm differentiates between hidden and visible nodes. Every non-terminal hidden node only has one or two children (see subsection 2.3.4). Ultimately, SO-ISMCTS+POM never needs to throw away the tree. Starting from 78 possible states, the procedure determinizes a possible state. On visible nodes, the usual MCTS selection step ensues. At hidden nodes, the algorithm determinizes again. Even though the impending loss is given adequate algorithmic weight by aggregating all possible states in hidden nodes, the algorithm still likely fails to address the threat because the opponent's model is reduced to random moves.
Given the overwhelming number of bad moves x could do i.e., not picking (2, 3) to win, it still fails to acknowledge the threat. However, this is not due to strategy fusion, but due to the high number of available options for x. Finally, multiple observer-information set Monte Carlo tree search (MO-ISMCTS) im- proves the opponent’s model, by once again using multiple trees. Each role (i.e., x and o), has their dedicated tree. Each tree has visible and hidden nodes, and thus does not assume observable moves. Instead of sampling a random state at hidden nodes, the selection continues at the other role’s tree. The benefit is two-fold. Not only is the opponent’s model more strategic, but also the iterations are efficiently allocated and can be reused for later plies. The improved MO-ISMCTS will rank the actions (2, 3), (3, 2), and (4, 1) high. Any other actions either lead to losses because x will win the diagonal, or x to block o’s attempt to get the other diagonal. The analysis shows that MO-ISMCTS has a decisive advantage in a characteristic state for this game. Additionally, Cowling, Powley, and Whitehouse [8] empirically came to a similar result. Their explanation is that Phantom_Connect(4,4,4) triggers strategy fusion, which leads to overly optimistic and pessimistic assumptions. The optimistic assumption is that the agent can respond to the opponent’s actions perfectly. The pessimistic assumption is that the opponent can also perfectly respond to the agent’s move. All four methods stand to benefit from weighting the possible states. Cowling, Powley, and Whitehouse [8] agree, and together with opponent strategy modelling, suggest such an improvement in their outlook. 5.1.2 Case Study Dark_Split_Corridor(3,4) The game Dark_Split_Corridor(3,4) is an alternating two-player zero-sum game with hidden information. It is an instance of the 𝑛, 𝑚-corridor family. An 𝑛, 𝑚-split corridor game is played on two 𝑛 × 𝑚 boards, where each player controls the sole pawn of that board. The pawn starts in the column with the index ⌊(𝑛 + 1)/2⌋ of the first row. The pawn is allowed to move to any orthogonally adjacent cell. The player, whose pawn reaches the last row first, wins. Besides moving the pawn, a player may choose to construct a barrier between two orthogonally adjacent cells on the opponent’s board. A pawn cannot cross a barrier. The player may only construct a barrier, if it is still possible to reach the last row. 56 5.1. Effectiveness of Existing Monte Carlo Tree Search Modifications The dark variant, hides the barriers from the players’ views, while the pawns’ positions, are still visible to both players. Should a player attempt to move a pawn such that it crosses a barrier, the pawn stays put, and the barrier is revealed. Players see revealed barriers. There are two classes of actions in dark split corridor, moving and blocking. All moving- actions are fully observable to the opponent. Actions that block, cannot be differentiated from one another. As a corollary, both families of actions are distinguishable. Therefore, a player can judge if the other player moved or blocked. The notation for the actions are south, west, north, east, denoting the movement, and(𝐶1, 𝑅1), (𝐶2, 𝑅2) denoting the blocks. The coordinates (𝐶1, 𝑅1) and (𝐶2, 𝑅2) ought to be orthogonally adjacent, and as a convention (𝐶1, 𝑅1) is the cell that is more northwest than (𝐶2, 𝑅2). The columns are denoted as 𝑎, 𝑏, and 𝑐. Rows are enumerated 1 to 4. 
The characteristic game state for the case study is in its fourth ply, after the third turn has been played. The history is listed in eq. 5.2. Fig. 5.2 offers a visual representation of the characteristic state.

$s_0 \xrightarrow{\mathrm{left}:\,\mathrm{south}} s_1 \xrightarrow{\mathrm{right}:\,\mathrm{south}} s_2 \xrightarrow{\mathrm{left}:\,\mathrm{south}} s_3 \xrightarrow{\mathrm{right}:\,(b,3),(b,4)} s_4 \quad (5.2)$

Figure 5.2: The characteristic game state $s_4$ for the case study of Dark_Split_Corridor(3,4). left is in control, indicated by the double border. The pawns are visible to both players. The barrier is still hidden from left, indicated by the gray color.

The state $s_4$ favors left over right. The second-moving role right is forced to respond, at least until left makes a mistake, in which case left falls back into the more reactive role. In an advantageous position, victory is assured should the disadvantaged role not block.

From the available actions, both moving and blocking are admissible. Given the power of clairvoyance, moving west or east would be desirable. Realistically, left cannot be sure which of the available crossings is blocked. Either right bluffed and set the barrier somewhere unexpected, or it set the barrier at the expected crossing (which is the case in $s_4$). Thus, there is a slight chance that calling the bluff and moving south wins the game. However, in $s_4$ the move is suboptimal. While it does not lose the game, it forces left to be the reactive player.

Given the uncertainty, blocking actions could be considered. If right did not bluff, then moving west or east can be defended with the appropriate (a, 3), (a, 4) or (c, 3), (c, 4) answer. From there, left needs at least three plies for a victory, while right can win in two. Eventually, left has to block some crossing as well to increase the distance for right. If right bluffed, then moving increases the number of plies until victory by one in the worst case, and right can increase this distance by blocking again. Therefore, left will need to block to still be able to win.

The available blocking actions can be categorized into non-bluffs and bluffs. The non-bluffs are (b, 2), (b, 3), which blocks the immediate forward movement, and (b, 3), (b, 4), which blocks the straightforward path. All other options can therefore be considered bluffs. However, blocks concerning the first row are riskier, as they may only become useful later in the game. Thus, the acceptable bluffs are (a, 2), (a, 3); (a, 2), (b, 2); (a, 3), (a, 4); (a, 3), (b, 3); (b, 2), (c, 2); (b, 3), (c, 3); (c, 2), (c, 3); and (c, 3), (c, 4).

As the moves in plies 1, 2, and 3 are fully observable, there are only 15 possible moves, and by extension only 15 possible histories, one for each potentially blocked crossing. Due to the symmetry of the game, 12 of these states have a mirrored equivalent.

An agent using determinizing MCTS considers the 15 possible states as 15 separate trees. In only one of those is south not the optimal move. Therefore, it is likely that, after aggregating all the iterations from the trees, the procedure picks south. Unfortunately, the one tree where the chosen move is suboptimal corresponds to the real state. While the move is suboptimal, not all is lost. Furthermore, south reduces the number of possible states to 1, which significantly improves the performance of determinizing MCTS for subsequent turns.

SO-ISMCTS also tracks 15 possible states, managed in a single tree.
Due to the assumption that all moves are fully observable, the procedure is either forced to rebuild the tree from scratch or to start from above the current root. At this point, the algorithm randomly determinizes from the possible states. As with determinizing MCTS, only one of the possible states has south as a suboptimal move. The possibility that right blocked is judged equal to all other states. As south leads to wins most of the time, it is the preferred move.

In the case of SO-ISMCTS+POM, the same 15 possible states also remain in a single tree. The model never assumes that a move is visible, which allows the previous iterations from earlier plies to be reused. The algorithm assumes right chose their moves randomly. Therefore, the algorithm will likely come to the same conclusion as its predecessors: for most of the moves right could have made, south is the optimal answer.

This limitation is only partially overcome with MO-ISMCTS. As the procedure determinizes at the root node, all 15 states are equally likely. Thus, south stays the most promising action. However, unlike the previous methods, MO-ISMCTS spends more iterations in this part of the tree.

Given the analysis, all methods select the same move, south. When judged with clairvoyance, this move is suboptimal. However, each method takes a different path to reach the same conclusion because the actual state is outnumbered by alternatives. Due to the equal weighting, the alternatives hide the more relevant game states. Frank and Basin [9] call this phenomenon non-locality (also described in [8]). For this game, all methods would benefit from a more sophisticated weighting of the possible states. Especially SO-ISMCTS+POM and MO-ISMCTS stand to gain the most, as they could effectively use the win ratio of the determinized playouts as weight (or the loss ratio, should the opponent be in control).

5.2 Characteristic Features of Games

The UCT-border is constrained by two boundaries: zero and the depth of the game tree's terminal nodes. Zero is the minimum, as it is not possible to expand to a depth less than zero. It is equally impossible to expand beyond terminal nodes. Usually, during a match, an agent using MCTS will only encounter terminal depth when the match has progressed and is close to the end. In the plies before that, the UCT-border is constrained by the limited resources available to the algorithm. If a proposed limit for the UCT-border is too low, the effect will be noticeable but likely prohibitively negative. If the limit is set too high, the tree might never be constrained. Finding the correct value is especially difficult in the context of general game playing (GGP), where the players cannot be fine-tuned to the games. The only available information remains the description of the game (static analysis) and sampled states (dynamic sampling).

Four characteristic features of games are the cardinality of the state space, the cardinality of the action space, the average arity of the game tree, and the average depth of the terminal nodes. All four features are closely related to the size of the game tree and, by extension, the search tree. They can be summarized compactly, as sketched below.
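The snippet below is only an illustrative container for these four features; the numeric values anticipate table 5.1 further below, and, as noted there, the depth limits used in the experiment were chosen comparatively rather than computed by a formula.

```python
# Illustrative container for the four extracted features; not engine code.
from typing import NamedTuple


class GameFeatures(NamedTuple):
    state_space_bound: float       # upper bound on |S|
    action_space_size: int         # |A|
    average_arity: float           # sampled mean branching factor
    average_terminal_depth: float  # sampled mean history length


phantom_connect = GameFeatures(3.05e11, 16, 8.37, 21.86)
dark_split_corridor = GameFeatures(2.46e15, 19, 6.61, 43.25)
# In the experiment, the maximum expansion depths (4 and 5, respectively) were
# chosen by comparing these features, not derived from a closed-form rule.
```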
5.2.1 Cardinality of State Space

The state space is the set of all possible states. The cardinality of the state space serves as a lower bound for the total number of nodes in the game tree; in the best case, each state is mapped to exactly one node. For complex games, it is often impractical to calculate the exact number of states. In service to brevity, the analysis for both Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4) only attempts to find approximate upper bounds.

A state of Phantom_Connect(4,4,4) consists of up to 66 atoms: 2 control/1 atoms, 32 cell/3 atoms, and 32 revealed/3 atoms. Each of the 66 atoms may be included in or excluded from a state. Therefore, a first upper bound for the size of the state space is $2^{66}$. A tighter upper bound for the cardinality of the state space is $2 \cdot 5^{16} \approx 3.05 \times 10^{11}$, significantly less than the naive $2^{66} \approx 7.37 \times 10^{19}$. Of the 66 atoms, 2 atoms describe which role is in control. The rules dictate that exactly one role is in control; therefore, there are only 2 classes of states. The board has 16 cells. For each cell, there are two atoms, one for each role that could potentially mark the cell. There can either be no mark or one mark of a single role (but not of both), so a cell has 3 possible marking states. Finally, a marked cell can be revealed to the opponent or remain hidden, adding two states per marked cell, while unmarked cells cannot be revealed to either player. Thus, a cell can be in 5 different states: first, not marked; second, marked by x and not revealed; third, marked by x and revealed; fourth, marked by o and not revealed; and fifth, marked by o and revealed. In total, this results in $5^{16}$ board configurations.

The game Dark_Split_Corridor(3,4) has a state consisting of up to 86 atoms. The sum consists of 2 control/1 atoms, 24 at/2 atoms, 30 border/2 atoms, and 30 revealed/2 atoms. A naive upper bound for the cardinality of the state space is $2^{86}$. This upper bound can be reduced to $2 \cdot (6 \cdot 3^{14} + 3 \cdot 3^{13} + 3 \cdot 3^{12})^2 \approx 2.46 \times 10^{15}$. The board of the game is split into two separate sub-boards, one for each player. On a sub-board, there is exactly one pawn, which can occupy 12 different positions. The number of barriers is harder to count exactly, as it depends on the position of the pawn. A crossing can be in three states: the barrier not constructed, the barrier constructed and not revealed, and the barrier constructed and revealed. Per sub-board there are 15 crossings. With the pawn on the last two rows, there has to be at least one crossing where no barrier is constructed; on the second row, there have to be at least two empty crossings; and on the first row, at least three crossings remain empty. In total, each sub-board has $6 \cdot 3^{14} + 3 \cdot 3^{13} + 3 \cdot 3^{12}$ unique states.
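The two bounds derived above are easy to double-check numerically; the following is a standalone arithmetic snippet, not part of the engine.

```python
# Plain-arithmetic check of the state-space bounds derived above.
phantom_naive = 2 ** 66                  # one bit per atom
phantom_bound = 2 * 5 ** 16              # 2 control states x 5 states per cell
print(f"{phantom_naive:.2e}")            # ~7.38e+19
print(f"{phantom_bound:.2e}")            # ~3.05e+11

corridor_naive = 2 ** 86
per_sub_board = 6 * 3**14 + 3 * 3**13 + 3 * 3**12   # states of one sub-board
corridor_bound = 2 * per_sub_board ** 2              # 2 control states x 2 sub-boards
print(f"{corridor_naive:.2e}")           # ~7.74e+25
print(f"{corridor_bound:.2e}")           # ~2.46e+15
```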
5.2.2 Cardinality of Action Space

The action space is the set of all actions. The cardinality of the action space is an upper bound for the arity of a single node and, by corollary, a bound for the average arity of the tree.

The action space of Phantom_Connect(4,4,4) has cardinality 16. All 16 actions have the signature cell/2, with one action per cell, and each cell may be marked by either player. The action space of Dark_Split_Corridor(3,4) has cardinality 19: a pawn may move in four directions, and there are 15 crossings at which a barrier may be constructed.

5.2.3 Average Arity

The arity of a state is the number of unique turns. The number of turns depends on the number of actions. Furthermore, a turn consists only of legal actions from roles in control. In non-simultaneous moving games, the number of turns is the number of legal actions. Both Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4) are non-simultaneous moving games.

For Phantom_Connect(4,4,4), the average arity is approximately 8.37. The average arity was estimated experimentally: 486,000 histories were generated, and during generation 8,002,496 unique states were recorded. The estimate for the average arity is the mean number of legal actions of these states. The average arity of Dark_Split_Corridor(3,4) is approximately 6.61. Similarly to Phantom_Connect(4,4,4), the average arity was estimated by experiment. The estimation consisted of 354,000 histories, resulting in 13,342,062 unique states; the average arity was again estimated as the mean number of legal actions of these states.

While choosing an action in Phantom_Connect(4,4,4) removes the option for later plies, the same is not true for Dark_Split_Corridor(3,4), since moving the pawn is (typically) reversible. Once all barriers are constructed, the legal actions reduce to a minimum of 1 (moving the pawn in one direction) up to a maximum of 4 (moving the pawn in all four directions).

5.2.4 Average Depth of Terminal Nodes

The depth of a node is defined by its distance to the root node. The leaf nodes of a game tree correspond to the terminal states, and the distance between the root and a leaf node is the length of the associated history. Therefore, the average length of histories corresponds to the average depth of the terminal nodes.

The average depth of terminal nodes for Phantom_Connect(4,4,4) is approximately 21.86. The estimate is the arithmetic mean length of 619,801 histories. The theoretical minimum depth is 7: a game with this depth features x marking a line, with o neither prohibiting the line nor marking any of its cells. The theoretical maximum depth is 32, a draw where every cell is attempted to be marked twice, revealing the whole board to both roles.

The terminal nodes of the game Dark_Split_Corridor(3,4) have an average depth of approximately 43.25. The estimation considered 127,601 histories. The theoretical minimum is 5: a game where left moves down the board in every one of its plies and right does not block it. The theoretical maximum is not finite, as both left and right could repeatedly move back and forth, never reaching the last row.

           Phantom_Connect(4,4,4)    Dark_Split_Corridor(3,4)
|S|        ≤ 3.05 × 10^11            ≤ 2.46 × 10^15
|A|        16                        19
ā          ≈ 8.37                    ≈ 6.61
d̄          ≈ 21.86                   ≈ 43.25
l          4                         5

Table 5.1: The extracted numeric features compared for the games Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4). |S| is the cardinality of the state space, |A| is the cardinality of the action space, ā is the average arity, d̄ is the average depth of terminal nodes, and l is the maximum expansion depth chosen for the experiment (see section 5.3).
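The sampled entries in table 5.1 (ā and d̄) were obtained by dynamic sampling. The sketch below only illustrates the idea: the interpreter methods it calls (initial_state, is_terminal, legal_turns, next_state) are hypothetical stand-ins for the engine's actual interface, and it averages the arity over visited rather than unique states.

```python
# Minimal dynamic-sampling sketch; interpreter methods are hypothetical stand-ins.
import random
import statistics


def sample_features(interpreter, num_histories: int, seed: int = 0):
    rng = random.Random(seed)
    arities, depths = [], []
    for _ in range(num_histories):
        state, depth = interpreter.initial_state(), 0
        while not interpreter.is_terminal(state):
            turns = list(interpreter.legal_turns(state))
            arities.append(len(turns))                    # branching factor here
            state = interpreter.next_state(state, rng.choice(turns))
            depth += 1
        depths.append(depth)                              # terminal depth of this history
    return statistics.mean(arities), statistics.mean(depths)
```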
5.2.5 Choice of Maximum Depth

The feature evaluation of Phantom_Connect(4,4,4) shows a significantly smaller state space and a game tree with slightly more branches but less depth than that of Dark_Split_Corridor(3,4) (see table 5.1). This direct comparison influenced the decision for the limited agent's maximum expansion depth in the experiment (see section 5.3).

A tree with a smaller average arity can usually be searched deeper within the same number of MCTS iterations. The matchup of the unlimited agent against itself in the experiment confirmed that: the unlimited agent searched deeper for Dark_Split_Corridor(3,4) than for Phantom_Connect(4,4,4).

A match's outcome typically is not decided in the very last move. In most games, after a key ply, the match's outcome cannot be changed anymore. The nature of the game dictates when that ply is and how long the game continues after it. For example, in Chess, winning with only a bishop and a knight can take more than 30 moves (over 60 plies) even without mistakes; the outcome, however, is inevitable. Matches of Phantom_Connect(4,4,4) are typically shorter than matches of Dark_Split_Corridor(3,4). This indicates that the tail of moves where the outcome is inevitable is shorter than in Dark_Split_Corridor(3,4). If victory is guaranteed, the match usually ends soon in Phantom_Connect(4,4,4). By contrast, a match of Dark_Split_Corridor(3,4) can already be decided while 9 plies are still left to be played (given no player makes mistakes).

Nonetheless, the length of a typical game, together with the average arity (bounded by the cardinality of the action space), is a benchmark for how deep an agent should calculate. Given the loose connection between the size of the state space and the number of the tree's nodes, that metric is less meaningful. The number of iterations the MCTS procedure performs indirectly limits the depth. Together with the other metrics, the choice of the maximum expansion depth can therefore be made comparatively. This makes it possible to estimate the influence of a changed maximum expansion depth. For example, increasing the maximum expansion depth without also increasing the number of iterations will likely lead to deeper trees that are more similar to a tree searched by unconstrained MCTS. Similarly, decreasing the expansion depth in comparison to another game where the average arity was higher will lead to a more balanced tree with a high average expansion depth.

5.3 Experiment Results

The experiment consisted of two double-round robin tournaments with self-play, in each of which three MCTS agents played 450 matches. The number of iterations was fixed, and there was no time limit. The played games were Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4). All percentages in this section are rounded to two decimal places.

5.3.1 Results of the Phantom Connect Tournament

The Phantom_Connect(4,4,4) tournament had 450 matches in total, where each of the three agents played 250 matches. Each agent has 300 results; this is due to the double-counting of the games with self-play. The three agents were a clairvoyant agent using MCTS (agent 𝐶), an unlimited MO-ISMCTS agent (agent 𝑈), and an agent using MO-ISMCTS with a maximum expansion depth of 4 (agent 𝐿(4)). All agents performed 7,500 iterations per ply.

The clairvoyant agent capitalized on its advantage and did not lose a single game. The unlimited agent is a significantly better player in the role x than the limited agent. For the role o, their playing strengths are closer, with a slight advantage for the unlimited agent. Table 5.2 shows the result matrix in absolute numbers. Fig. 5.3 displays the win, tie, and loss rates for both roles. Furthermore, the figure displays the total number of points, where a win is worth one point, a tie half a point, and a loss no points.

The clairvoyant agent played 250 matches, 50 of which were against itself. It was the strongest player. In total, it won 82 (27.33%), lost none, and drew 218 (72.67%) of its results. As the player moving first, agent 𝐶 won 59 (39.33%) of the matches, lost none, and tied 91 (60.67%). As the player moving second, it won 23 (15.33%), lost none, and drew 127 (84.67%) of the matches.
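The point totals shown in fig. 5.3 follow directly from the scoring scheme described above; the tiny check below uses the clairvoyant agent's record (82 wins, 218 ties, 0 losses) as an example.

```python
def total_points(wins: int, ties: int, losses: int) -> float:
    # scoring scheme of fig. 5.3: win = 1, tie = 1/2, loss = 0
    return wins + 0.5 * ties + 0 * losses

print(total_points(82, 218, 0))   # 191.0, agent C's total in fig. 5.3
```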
In a direct comparison, both the unlimited and the limited agent fared similarly against the clairvoyant agent: against 𝑈, it won 42 (42%) and tied 58 (58%) matches; against 𝐿(4), it won 40 (40%) and tied 60 (60%) matches.

Of the 250 matches the unlimited agent played, 200 were against other players and 50 against itself. 𝑈's playing strength was in between 𝐶's and 𝐿(4)'s, winning 71 (23.67%), losing 96 (32.00%), and drawing 133 (44.33%) of its 300 results. As the first-moving player, agent 𝑈 managed to win 62 (41.33%) matches, while losing 18 (12%) and tying 70 (46.67%). Going second resulted in 9 (6%) wins, 78 (52%) losses, and 63 (42%) draws. The direct matchup 𝑈 versus 𝐿(4) favors the former, winning 42 (42%), losing 25 (25%), and tying 33 (33%) out of 100 matches.

                     x-role
           𝐶              𝑈              𝐿(4)
o-role     x    o    ½    x    o    ½    x    o    ½
𝐶          0    0    50   0    12   38   0    11   39
𝑈          30   0    20   26   3    21   22   6    22
𝐿(4)       29   0    21   36   3    11   26   3    21

Table 5.2: The result matrix of the Phantom_Connect(4,4,4) tournament. The cells give the total number of distinct outcomes. The horizontal axis denotes the agent of the x-role (the player moving first); the vertical axis denotes the agent of the o-role (the player moving second). The subcolumns of each cell describe the outcome: x (the player moving first won), o (the player moving second won), and ½ (the match ended in a draw).

Figure 5.3: The results for the Phantom_Connect(4,4,4) tournament: the absolute numbers of wins, ties, and losses for each agent as x and as o, and the total points (win = 1, tie = ½, loss = 0), namely 191 for 𝐶, 137.5 for 𝑈, and 121.5 for 𝐿(4).

The limited agent participated in 250 matches, giving 300 results due to the 50 matches against itself. It was the weakest player. Of the 300 results, 54 (18%) are wins, 111 (37%) are losses, and 135 (45%) are draws. As the player moving first, 𝐿(4) managed to accrue 48 (32%) wins, 20 (13.33%) losses, and 82 (54.67%) ties. The results as the second-moving player were 6 (4%) wins, 91 (60.67%) losses, and 53 (35.33%) ties.

The clairvoyant agent seems to converge to optimal play, as all matches in its self-matchup resulted in ties. The game Connect(4,4,4) is a known draw [30], and a match between two clairvoyant agents in Phantom_Connect(4,4,4) is isomorphic to a match of Connect(4,4,4). At suboptimal play, spurious non-draws would occur. Furthermore, both 𝑈 and 𝐿(4) have similar results against 𝐶, even though their playing strengths differ in the direct comparison. This suggests that 𝐶 is as strong as possible; otherwise, it would fare distinctly against each of its opponents. Notably, 𝑈 has a higher win rate than 𝐶 as x: 𝑈 could capitalize on 𝐿(4)'s weakness better than 𝐶 could. One possible explanation is the wrongly made pessimistic assumption: the clairvoyant agent assumes that any move can be countered, while the unlimited agent models the lack of knowledge the limited agent has.

The direct comparison of the unlimited agent 𝑈 and the limited agent 𝐿(4) shows that the former is stronger in the x-role. When 𝑈 assumed the x-role, it won 36 (72%), lost 3 (6%), and drew 11 (22%) out of 50 matches.
Against the same opponent, 𝐿(4) achieved only 26 (52%) wins, the same number of losses— 3 (6%), and 21 (42%) ties. The playing strength of 𝑈 and 𝐿(4) for the o role is relatively similar, with an advantage for the unlimited agent. 𝑈 managed to win twice as often, totaling 6 (12%) matches, losing only 22 (44%) matches, and tieing 22 (44%) matches, in comparison to the mir- rored case. While playing against itself, the outcome is identical to the 𝐿(4) versus𝐿(4) matchup. A possible explanation for the identical outcome, is the relatively weak playing performance of 𝐿(4) as x. The experiments done by Cowling, Powley, and Whitehouse [8] bear similar results. However, the paper omitted exact numbers of the experiment, [8, fig. 9] shows that the cheating UCT (equivalent to clairvoyant MCTS) is equally matched against itself, and winning most of the games against MO-ISMCTS (equivalent to unlimited MO-ISMCTS). Furthermore, when weighting ties as half-wins, the win percentage suggests about 60%, which is slightly more than indicated in [8, fig. 10]. The win percentage of 𝑈 with about45% is also slightly lower than the indicated value in the same figure. In comparison to the experiment done in this thesis, the number of games was significantly higher, with a total number of matches at 37 880. 65 5. Results 5.3.2 Results of the Dark Split Corridor Tournament The Dark_Split_Corridor(3,4) tournament featured 450 matches. As with the Phan- tom_Connect(4,4,4) tournament, each agent played 250 matches, which leads to 300 results, due to double counting the self-play matches. The contestants were the clair- voyant MCTS agent (denoted as 𝐶), an unlimited agent using MO-ISMCTS (abbreviated with 𝑈), and a limited expansion depth MO-ISMCTS (denoted as 𝐿(5)). The maximum expansion depth of 𝐿(5) was 5. The agents did 2500 iterations per ply. The clairvoy- ant agent was the strongest player, as expected. However, the advantage was not as clear-cut as with Phantom_Connect(4,4,4). Surprisingly, the limited agent won the direct comparison against the unlimited agent, assuming the left-role (the first-moving role). Furthermore, 𝐿(5) was able to accrue more wins moving first, compared to 𝑈 , when playing against 𝐶. Playing as the right-role, the advantage turns the other way around, and 𝑈 beat 𝐿(5). The result matrix is shown in table 5.3. The win and loss-rate for both roles is visualized in fig. 5.4, as well as the total number of points, if a win is worth one point, and loss is worth no points. The clairvoyant agent was featured in 250 matches, resulting in 300 results, due to the double counting in self-play matches. Of the three, it was the strongest agent. The total number of wins was 219 (73%), while it lost 81 (27%) of the times. Moving first, it won134 (89.33%) matches, while losing 16 (10.67%) out of 150 matches. Of the three agents, it was the strongest as left. Moving second, it won 85 (56.67%) and lost 65 (43.33%) matches, making it the only agent with a positive win rate for this role. As expected, this agent was the strongest when assuming the right-role. Going first, in the direct matchups, the most formidable opponent was itself, with 40 (80%) wins and 10 (20%) losses. Both against 𝑈 and 𝐿(5) it achieved identical results, winning 47 (94%) games and losing only 3 (6%). The unlimited agent contested in 250 matches, totaling 300 results. It won 114 (38%) matches and lost 186 (62%). As the first-moving player, it was the sole agent that lost more often than not. 
With 73 (48.67%) wins and 77 (51.33%) losses, it was the weakest first-moving player. Going second, 𝑈 managed to secure 41 (27.33%) wins while losing 109 (72.67%), which makes it the second best of the agents when considering only the right-role. In direct matchups when moving first, 𝐶 was the strongest opponent, with a win-loss ratio of 10:40 (20%) for 𝑈. Against itself, it won 29 (58%) and lost 21 (42%) matches. Finally, against 𝐿(5) the tally was 34 (68%) wins and 16 (32%) losses. Going second, it was beaten soundly by 𝐶, with only 3 (6%) wins and 47 (94%) losses for 𝑈. In the matchup against itself, it won 21 (42%) and lost 29 (58%) matches. Surprisingly, 𝐿(5) was a stronger opponent, with 17 (34%) wins and 33 (66%) losses out of 50 matches for 𝑈.

The limited agent also played 250 matches, totaling 300 results. Marginally, 𝐿(5)'s win rate was the second best, with 117 (39%) wins versus 183 (61%) losses. When assuming the left role, 83 (55.33%) of the played matches were wins and 67 (44.67%) were losses, making it the second-best agent when moving first. As the second-moving player, it lost to both 𝐶 and 𝑈, with a combined win-loss ratio of 34:116 (22.67%). The direct comparison shows it to be an apt agent for the left role, winning 15 (30%) and losing 35 (70%) matches against 𝐶. Against 𝑈 it won 33 (66%) and lost 17 (34%) matches. Playing against itself yielded similar results; 𝐿(5) won 35 (70%) matches and lost 15 (30%).

                     left-role
            𝐶             𝑈             𝐿(5)
right-role  left  right   left  right   left  right
𝐶           40    10      10    40      15    35
𝑈           47    3       29    21      33    17
𝐿(5)        47    3       34    16      35    15

Table 5.3: The result matrix of the Dark_Split_Corridor(3,4) tournament. The cells give the total number of distinct outcomes. The horizontal axis denotes the agent of the left-role (the player moving first); the vertical axis denotes the agent of the right-role (the player moving second). The subcolumns of each cell describe the outcome: left (the player moving first won) and right (the player moving second won).

Figure 5.4: The results for the Dark_Split_Corridor(3,4) tournament: the absolute numbers of wins and losses for each agent as left and as right, and the total points (win = 1, loss = 0), namely 219 for 𝐶, 114 for 𝑈, and 117 for 𝐿(5).

The data suggest that, in contrast to Phantom_Connect(4,4,4), the clairvoyant agent did not yet converge towards optimal play. Currently, the game Dark_Split_Corridor(3,4) lacks formal analysis. However, assuming the first-moving player has the advantage, 𝐶 should have a perfect win rate at optimal play, especially in the matchups against 𝑈 and 𝐿(5), where 𝐶 alone has perfect information.

With the chosen parameters, 𝐿(5) is surprisingly a better first-moving player than 𝑈. Not only does it win the direct matchup, it also wins more often against the superior clairvoyant agent. While it was able to eke out an advantage when moving first, it is weaker by approximately the same margin in the second-moving case. The results do not provide an obvious explanation for why 𝐿(5) is better when moving first. A discussion featuring an interpretation and the implications continues in the next chapter, specifically section 6.3.
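As a small cross-check (a standalone script over the counts in table 5.3, not thesis code), the headline first-mover numbers can be recomputed directly:

```python
# Wins as the left (first-moving) role, summed over the three opponents
# (50 matches each, 150 in total per agent); counts taken from table 5.3.
wins_as_left = {"C": 40 + 47 + 47, "U": 10 + 29 + 34, "L(5)": 15 + 33 + 35}
for agent, wins in wins_as_left.items():
    print(agent, wins, f"{wins / 150:.2%}")   # C: 134 (89.33%), U: 73 (48.67%), L(5): 83 (55.33%)

# L(5) won 10 more first-moving matches than U: 10/150, i.e. 6.67 percentage points.
print(f"{(wins_as_left['L(5)'] - wins_as_left['U']) / 150:.2%}")   # 6.67%
```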
68 CHAPTER 6 Discussion The contributions of this thesis cover three topics, that coincide with the three research questions. First is a comparison of the extensions for Monte Carlo tree search (MCTS) for games with imperfect information in the context of general game playing (GGP) available in the literature. Second is a suggestion of what features of a game may be useful for determining the depth of a limited expansion depth MCTS algorithm, and how those features may be quantified. Third and final is an evaluation of a limited expansion depth algorithm using an experiment. 6.1 Monte Carlo Tree Search Extensions Of the considered MCTS modifications for imperfect information games in the context of GGP, multiple observer-information set Monte Carlo tree search (MO-ISMCTS) was the most apt extension. MO-ISMCTS is a part of the information set Monte Carlo tree search (ISMCTS) family. It mitigates the issues of strategy fusion and non-locality. Furthermore, both weighting the possible states, and more sophisticated modelling of the opponent, are powerful suggestions for improving the method further. The results show, in line with Cowling, Powley, and Whitehouse [8], MO-ISMCTS is an effective extension for playing imperfect information games in the context of GGP. MO-ISMCTS is a designated algorithm for games with hidden information. Furthermore, it does not require game-specific heuristics to approximate the next best move. Yet, it is not perfect. The case study of Dark_Split_Corridor(3,4) showed, that without further modifications, the best move does not always coincide with the likely chosen move. Furthermore, while the opponent model of MO-ISMCTS is certainly stronger than that of single observer-information set Monte Carlo tree search with partially observable moves (SO-ISMCTS+POM), it is an area of improvement. Nevertheless, MO-ISMCTS suffers less from strategy fusion and non-locality, while keeping a strong enough opponent model. 69 6. Discussion When building an agent capable of GGP with imperfect information, MO-ISMCTS should be considered as the primary search algorithm. Previous research focused on either GGP (with perfect information) or games with hidden information. Work that assumes both is sparse. These results demonstrate that MO-ISMCTS is the most effective expansion of MCTS for a wide class of applications. Due to the evaluation on specific examples, the comparison of the existing MCTS methods usable for imperfect information games in the context of GGP is lacking generalizabil- ity. The chosen games are both relatively simple two-player games. The differences in the algorithms are possibly not as pronounced as with more complex examples. Cer- tainly, the differences would be greater in games with more than two roles (especially for MO-ISMCTS). However, the chosen games are important representatives. Furthermore, their relative simplicity allows for a clear analysis. The list of considered MCTS methods was not exhaustive. There are alternative ap- proaches available. Most can be categorized by determinization and information set approaches. Many resort to multiple paradigms—hybrid approaches. The goal was to represent their base variants by focusing on determinizing MCTS, and the three instances of ISMCTS. Finally, the aspect of computational cost viability was left out of scope entirely. Cowling, Powley, and Whitehouse [8] report that MO-ISMCTS is computationally pricier, while producing better results. 
Analysis covering such a scope would have hindered the brevity of this thesis. Overall, the issue can and should be considered separately, and does not at all affect the playing strength. While MO-ISMCTS is indeed a good base-expansion of the MCTS algorithm fit for games with imperfect information in the context of GGP, it is still not ideal. Avenues for future research include the improvement of MO-ISMCTS with better opponent modelling, and more sophisticated probabilistic analysis in respect to the possible states. Moreover, future research may aim to further develop the other available approaches. Furthermore, in future research, these methods should be additionally validated and their difference quantified. As a first suggestion, the author of this thesis recommends tournaments. Tournaments lend themselves to direct comparison, even though they are computationally intensive endeavors. They are easy to replicate, and emphasize differences in playing strength. Quantifying playing strength is difficult qualitatively. 6.2 Feature Extraction The features obtained by dynamic sampling, and static analysis, as applied in the prior art, characterize a game. This characterization lends itself to heuristically determining where the UCT-border at a given number of MCTS iterations might be. The discussed features were the cardinality of the state space, the cardinality of the action space, estimate the average arity, and estimate the average depth of terminal nodes. 70 6.3. Limited Expansion Depth Agent The underlying assumption for extracting features directly from game descriptions was that these are characteristic enough to determine a value for the maximum expansion depth for a limited MCTS agent. This assumption was confirmed by applying two meth- ods from the prior art, static analysis and dynamic sampling. Static analysis was useful in providing lower and upper bounds for features related to the game tree’s size. Their result was less meaningful, but computationally cheap. Dynamic sampling is an easy to apply method, but computationally intensive. Furthermore, while the values are more directly applicable, they might be sensitive to biases in the sampling method, and may require a substantial sample size to be meaningful. Static analysis and dynamic sampling are methods previously used by Kuhlmann and Stone [15], Schiffel and Thielscher [21], Clune [6] and Mańdziuk and Świechowski [17], to generate ad-hoc heuristics for playing games in a GGP context. The qualitative evaluation of using these methods to set the maximum expansion depth, built on the existing use- case and extended the idea. The methods can be used for configuring search methods, especially in the context of GGP. The configuration may span not only the maximum expansion depth in the case of the limited UCT-border agent, but also the exploration parameter in a general MCTS agent. Beyond the scope of this thesis was to quantify how good the features extracted from the game were at heuristically suggesting the value for the limit of the expansion depth for the limited agent. Given that the ultimiate decision for the limit in the experiment was not only based on the extracted features, there was no direct quantitative evaluation of the method. Nonetheless, as the scope was restricted to the initial viability of the method, this limitation was not confounding. Feature extraction via static analysis and dynamic sampling seems promising. Further research is needed to establish the viability of the approach in an ad-hoc scenario. 
An ad-hoc scenario might be a tournament with an unlimited agent and a limited expansion depth agent that can adjust the UCT-border according to the game and the phase of the game. Another area of interest is to explore other features extractable by static analysis or dynamic sampling. A promising feature, not yet considered, is the number of possible states. The number of possible states is an important factor for ISMCTS. More possible states likely require more iterations to accurately estimate the expected utility of a node. In addition, more qualitative analysis on the feature extraction is necessary. Feature extraction was proposed as a mechanism for generating heuristics for GGP. Thus, it is mainly discussed in a quantitative setting and sufficient theoretical evaluation is still lacking. 6.3 Limited Expansion Depth Agent The experimental data shows that limiting the maximum expansion depth for games with imperfect information, in the context of GGP, can have a negative and surprisingly 71 6. Discussion also a positive effect. Specifically, in the example of Phantom_Connect(4,4,4) the effect was decisively negative, whereas with Dark_Split_Corridor(3,4) there was a positive effect when moving first. The positive effect was not present when assuming the second- moving role. The outcome of the Phantom_Connect(4,4,4) tournament contradicts the hypothesis that limiting the expansion depth has a positive effect on playing strength. The agent with a limited UCT-border was severely disadvantaged. This was measurable in the tally of the direct matchup, as well as the aggregated points. The Dark_Split_Corridor(3,4)-tournament’s results show that in certain conditions, it can be beneficial to reduce the maximum expansion depth of an agent using MO-ISMCTS. Surprisingly, the positive effect was only present when the agent was assuming the first- moving role. The results indicate that as the second-moving role, the limited expansion depth had a negative effect. The results of the experiment do not fit with the theory that a deeper search tree is strictly better. One of the reasons why MCTS is lauded, is that it refrains from exploring parts of the tree that are less promising. This allows to produce deeper trees, while remaining accurate. Previous research has focused on increasing the depth MCTS can foray into. The results of the experiment demonstrate that in certain conditions this might be counterproductive. There are multiple possible explanations for why the limited agent surpassed the uncon- strained agent in Dark_Split_Corridor(3,4). One such explanation would be, that due to the relatively low number of MCTS iterations, the search tree of the unconstrained agent was too optimistic, placing too much weight on parts of the search tree which are still highly uncertain due to the hidden information. Due to the fixed UCT-border, the uncertainty was better accounted for, as each expansion step compounds the unpre- dictability of the future (what move will my opponent make) with the ambiguity of the hidden game state. Perhaps, each game has an associated characteristic curve, parameterized by the ex- ploitation parameter and the number of MCTS iterations. The curve indicates in how many cases the algorithm converged to the actual best move, given a maximum ex- pansion depth. Three possible families of curves might exist. The simplest case is a monotonically increasing curve; The farther the algorithm can look ahead, the better the playing strength. In fig. 
Perhaps each game has an associated characteristic curve, parameterized by the exploration parameter and the number of MCTS iterations. The curve indicates in how many cases the algorithm converges to the actual best move, given a maximum expansion depth. Three possible families of curves might exist. The simplest case is a monotonically increasing curve: the farther the algorithm can look ahead, the better the playing strength. In fig. 6.1 an example of such a curve is labeled Monotonically Increasing. The global optimum for the expansion depth is the maximum height of the game tree. The second case describes all curves with peaks and troughs, until there is again a global optimum at the maximum height of the game tree. An example curve is labeled Non-Monotonically Increasing in fig. 6.1. The third case comprises all curves where the global optimum does not reside at the maximum height of the game tree. See fig. 6.1 for an example; the curve is labeled Early Peak. Possibly, Dark_Split_Corridor(3,4) with the chosen iterations and exploration parameter has a characteristic curve belonging to the second or third family.

Figure 6.1: The hypothesized three families of game characteristic curves. A curve is parameterized by the exploration parameter and the number of MCTS iterations. The x-axis quantifies the expansion depth, from zero on the left up to the depth of the game tree D on the right. The y-axis quantifies an abstract notion of playing strength, the probability that the chosen move coincides with the best move. The three example curves are labeled Monotonically Increasing, Non-Monotonically Increasing, and Early Peak.

More data is required to confirm that the hypothesis of game characteristic curves is applicable. Furthermore, if this hypothesis is not rejected, additional evidence is needed to determine whether Dark_Split_Corridor(3,4) indeed has a game characteristic curve of the family Non-Monotonically Increasing or Early Peak.

One of the tournaments' limiting factors was the number of matches per matchup. With 50 matches per matchup, the sample is certainly on the lower end for establishing a correlation between playing strength and performance in a tournament. This was mitigated by including the clairvoyant agent as an overwhelmingly stronger opponent. In both tournaments, the non-clairvoyant agent which achieved overall better results also fared better against the clairvoyant agent. This permits drawing conclusions from a lower number of matches.

The restricted number of iterations for Dark_Split_Corridor(3,4) limits the generalizability of the results. Without more data it is impossible to confirm whether the limited expansion depth agent has an advantage because of the game, the number of iterations, or other factors such as a better rate of convergence at the chosen exploration parameter. However, in tournaments and real-world scenarios, the number of iterations is most likely suboptimal. In these situations, it may then be beneficial to limit the expansion depth.

Future studies should not only include tournaments with different games and more matches per matchup; ideally, these tournaments would also vary the exploration parameter among agents, as well as their number of iterations and the maximum expansion depth, thereby allowing the impact of these parameters to be quantified.
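A sketch of how such a study could trace a game's characteristic curve empirically is given below. The play_match function is a placeholder for the engine (here it returns a random outcome so the example runs on its own); its name, the parameters, and the chosen values are assumptions for illustration, not the setup used in the thesis.

import random

def play_match(max_depth, iterations, exploration, rng):
    # Placeholder: a real study would orchestrate one full GGP match here,
    # pitting a depth-limited agent against the unconstrained baseline, and
    # return True if the limited agent won.
    return rng.random() < 0.5

def estimate_curve(depths, matches=150, iterations=500, exploration=1.4, seed=0):
    """Estimate the limited agent's win rate for each candidate expansion depth."""
    rng = random.Random(seed)
    curve = {}
    for depth in depths:
        wins = sum(play_match(depth, iterations, exploration, rng) for _ in range(matches))
        curve[depth] = wins / matches
    return curve

if __name__ == "__main__":
    for depth, win_rate in estimate_curve(range(2, 17, 2)).items():
        print(f"max expansion depth {depth:2d}: win rate {win_rate:.2f}")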
CHAPTER 7
Conclusion

This research aimed to examine Monte Carlo tree search (MCTS) as a search method for games with imperfect information in the context of general game playing (GGP). The thesis pursued three goals. First was to identify effective modifications of MCTS for GGP in hidden information scenarios. Second was to suggest a set of features that could inform the maximum depth of a limited expansion depth MCTS variant, as well as to examine a method for evaluating these features. Third and final was to conduct a quantitative evaluation of an agent using said MCTS variant. The quantitative evaluation required the implementation of an engine capable of orchestrating GGP matches and interpreting the standard description language, game description language with imperfect information (GDL-II).

7.1 Summary

Drawing upon a large body of work in a literature review, and a qualitative analysis using concrete examples, this thesis provided the relevant background for imperfect information games and GGP. Furthermore, it compared the variants determinization MCTS and information set Monte Carlo tree search (ISMCTS). As MCTS is a commonly used method for GGP, the goal was to find and verify the most promising variant of MCTS for imperfect information games in the context of GGP. In accordance with the hypothesis, the results indicated that determinizing MCTS is not an ideal extension, while from the family of ISMCTS, multiple observer-information set Monte Carlo tree search (MO-ISMCTS) was the most promising method for imperfect information games in the context of GGP.

By reusing a method found in the literature, this thesis provided a concrete way of determining a set of features useful for characterizing games without prior knowledge about the game. The characterization of the game hinted at the size of the game tree. The size of the game tree subsequently influenced the value for the maximum depth of the limited expansion depth MO-ISMCTS algorithm used by an agent in the experiment. Previously, the method was used for generating heuristics for search methods without any game-specific knowledge, which is why it was also applicable to this use case.

The limited expansion depth MO-ISMCTS variant was shown to have an advantage only when moving first in a game with a more complex state, but fewer possible moves. The agent was evaluated against an overwhelmingly stronger clairvoyant MCTS agent and a baseline unconstrained expansion depth MO-ISMCTS agent using two tournaments. In the tournament for the game Phantom_Connect(4,4,4), a variant of Tic-tac-toe on a 4 × 4 board where the players do not see the marks of their opponents, the limited expansion depth agent was disadvantaged. In the other tournament for the game Dark_Split_Corridor(3,4), where both players race towards a finish line and may be obstructed by invisible barriers placed by the opponent, the limited expansion depth agent achieved an approximately 6.67% (10 out of 150 matches) higher win rate as the first-moving player.

In an effort to support GDL-II and orchestrate matches of GGP agents, one of the contributions of this thesis is an integrated game engine capable of these features. The engine supports loading GDL-II definitions and provides an interpreter interface for agents. Included in the engine are agents using MCTS. The agents use the interpreter to manage their search trees and calculate the next best move. This thesis reviews the implementation of the engine to make it more accessible to learn from. Ultimately, the goal was to shape future revisions and iterations.
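To give a flavor of what such an interpreter-facing contract could look like, the sketch below models it with Python Protocols; the agent class diagram in fig. 4.3 likewise describes protocols (viz. interfaces). All names and signatures here are hypothetical illustrations of the described design, not the engine's actual API.

from typing import Any, Mapping, Protocol, Sequence

Role = str
Move = Any

class Interpreter(Protocol):
    # Minimal surface an agent might need from a GDL-II interpreter (hypothetical names).
    def roles(self) -> Sequence[Role]: ...
    def initial_state(self) -> Any: ...
    def legal_moves(self, state: Any, role: Role) -> Sequence[Move]: ...
    def next_state(self, state: Any, joint_move: Mapping[Role, Move]) -> Any: ...
    def sees(self, state: Any, joint_move: Mapping[Role, Move], role: Role) -> Any: ...
    def is_terminal(self, state: Any) -> bool: ...
    def goal(self, state: Any, role: Role) -> int: ...

class Agent(Protocol):
    def begin(self, interpreter: Interpreter, role: Role) -> None: ...
    def choose_move(self, percepts: Any) -> Move: ...

def orchestrate(interpreter: Interpreter, agents: Mapping[Role, Agent]) -> Mapping[Role, int]:
    """Run one match: collect a joint move each ply, compute per-role percepts,
    and advance the state; agents only ever receive their own percepts."""
    for role, agent in agents.items():
        agent.begin(interpreter, role)
    state = interpreter.initial_state()
    percepts = {role: None for role in agents}
    while not interpreter.is_terminal(state):
        joint_move = {role: agent.choose_move(percepts[role]) for role, agent in agents.items()}
        percepts = {role: interpreter.sees(state, joint_move, role) for role in agents}
        state = interpreter.next_state(state, joint_move)
    return {role: interpreter.goal(state, role) for role in agents}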
7.2 Future Work

While this thesis provides an introduction into the usage of MCTS for imperfect information games in the context of GGP, there remain many insufficiently explored topics. These topics include the combination of modifications and the application of extension-specific optimizations. Specifically for MO-ISMCTS, its authors hypothesize a benefit when applying opponent modelling and probability weighting of the possible states (see [8]). Experiments are needed to quantify the effects of these suggested improvements.

After the promising initial evaluation of static analysis and dynamic sampling for feature extraction, further research is necessary. Future studies could address other features (e.g. the number of possible states). Furthermore, to determine the viability of the approach, a more in-depth analysis of the computational cost remains open. Likewise, finding specific counterexamples where the characterization via feature extraction is misleading may help to improve and refine the method.

This thesis showed that the limited expansion depth MCTS variant can play better in specific games. This ultimately leads to the question for which class(es) of games this is also true. Furthermore, future research could consider why there is such a big difference between moving first and moving second. Moreover, it remains to be validated whether there is an ideal limit given a range of parameters, such as characterizing information about the game, the number of MCTS iterations, and data about the opponent.

The engine serves as an iterative prototype influenced by previous similar efforts. While the performance was acceptable, it is still far from optimal. The literature provides a plethora of unexplored optimizations for evaluating GDL and improving the computational speed of the tree data structures. Future versions may improve the throughput of the engine to greatly increase the number of matches that are economical to run.

7.3 Contributions

The engine and its Python interface, implemented as part of the experiment, pave the way towards an improved ecosystem in the field of GGP. Similar predecessors use Java as an implementation language, making it difficult to embed them with commonly used artificial intelligence (AI) packages. Most of those packages are meant to be used via Python, one of the most widely used languages in AI and machine learning (ML). Therefore, Python was chosen as the implementation language of the engine, facilitating its usage in the context of AI and ML.

This thesis offers a novel approach of limiting the expansion depth of MO-ISMCTS. Moreover, the data provides initial evidence that the approach is indeed viable. An experiment revealed an unexpected conditional advantage in a specific game. This challenges the often implicitly made assumption that MCTS always benefits from a deeper search tree. The approach and the engine together form the main contributions to the research topic of GGP and the broader field of AI.

List of Figures

2.1 An abstract entity-relationship diagram of the terms used for games. . . . . 8
2.2 The phases of a match. The filled black state is the entry point, the state with double borders is the final phase, and the red state indicates an error state. The transition 𝑎 requires all actors to acknowledge in time, 𝑡 is applicable if the state is terminal, dnf means that some actor failed to submit a move in time, and dsq is applicable if some actor submitted an illegal move. . . . . 10
2.3 The game tree of 5-Nim. The round and blue nodes denote states where first is in control and rectangular and red nodes denote states where second is in control. The top-most node is the initial state and the states with a double border are terminal states. The edges represent the moves.
Fat, blue and dashed edges are actions done by first. Thin and red edges are actions done by second. The label of the edge represents the arithmetic operation on the heap. . . . . 20
2.4 The game tree of Phantom-5-Nim. The round and blue nodes denote states where first is in control and rectangular and red nodes denote states where second is in control. The top-most node is the initial state and the states with a double border are terminal states. The edges represent the moves. Fat, blue and dashed edges are actions done by first. Thin and red edges are actions done by second. The label of the edge represents the arithmetic operation on the heap. . . . . 22
2.5 The game tree of Mini-Single-Call. . . . . 26
4.1 The class hierarchy of the elementary data structures. . . . . 40
4.2 The class diagram of the interpreter. . . . . 42
4.3 The class diagram for the agents. Protocols (viz. interfaces) are in a light shade of gray, and with dashed borders. Abstract classes have a darker shade of gray. Concrete implementation classes are in the darkest shade of gray. . . . . 48
5.1 The characteristic game state 𝑠6 for the case study of Phantom_Connect(4,4,4). A cell is either blank, or features up to three marks. The marking player's symbol is seen in the center of the cell. The marks below show which player knows the mark of a cell. . . . . 54
5.2 The characteristic game state 𝑠4 for the case study of Dark_Split_Corridor(3,4). left is in control, indicated by the double border. The pawns are visible to both players. The barrier is still hidden to left, indicated by the gray color. . . . . 57
5.3 The results for the Phantom_Connect(4,4,4) tournament. . . . . 64
5.4 The results for the Dark_Split_Corridor(3,4) tournament. . . . . 67
6.1 The hypothesized three families of game characteristic curves. A curve is parameterized by the exploration parameter, and the number of MCTS iterations. The 𝑥-axis quantifies the expansion depth. Left is zero, on the very right is the depth of the game tree 𝐷. The 𝑦-axis quantifies an abstract notion of playing strength, the probability that the chosen move coincides with the best move. . . . . 73

List of Tables

5.1 The extracted numeric features compared for the games Phantom_Connect(4,4,4) and Dark_Split_Corridor(3,4). |𝑆| is the cardinality of the state space. |𝐴| is the cardinality of the action space. ā is the average arity. d̄ is the average depth of terminal nodes. 𝑙 is the chosen maximum depth for the experiment (see section 5.3). . . . . 62
5.2 The result matrix of the Phantom_Connect(4,4,4) tournament. The cells describe the total number of distinct outcomes. The horizontal axis denotes the agent of the x-role (the player moving first). The vertical axis denotes the agent of the o-role (the player moving second). The subcolumns of the cell's matrix describe the outcome. They are written as x (the player moving first won), o (the player moving second won), and ½ (the match ended in a draw). . . . . 64
5.3 The result matrix of the Dark_Split_Corridor(3,4) tournament.
The cells describe the total number of distinct outcomes. The horizontal axis denotes the agent of the left-role (the player moving first). The vertical axis denotes the agent of the right-role (the player moving second). The subcolumns of the cell’s matrix describe the outcome. They are written as left (the player moving first won), and right (the player moving second won), . . . . . . 67 81 List of Code Listings 2.1 A possible definition of 5-Nim using game description language (GDL). . . 17 2.2 A possible definition of Phantom-5-Nim using GDL. . . . . . . . . . . . . . 19 83 Acronyms GGP general game playing MCTS Monte Carlo tree search AI artificial intelligence ML machine learning GDL game description language GDL-II game description language with imperfect information ISMCTS information set Monte Carlo tree search SO-ISMCTS single observer-information set Monte Carlo tree search SO-ISMCTS+POM single observer-information set Monte Carlo tree search with par- tially observable moves MO-ISMCTS multiple observer-information set Monte Carlo tree search ASP answer set programming 85 Bibliography [1] Hans Berliner. “Some Necessary Conditions for a Master Chess Program”. In: Pro- ceedings of the 3rd International Joint Conference on Artificial Intelligence. Inter- national Joint Conference on Artificial Intelligence. Vol. 3. Aug. 20, 1973, pp. 77– 85. [2] Y. Bjornsson and H. Finnsson. “CadiaPlayer: A Simulation-Based General Game Player”. In: IEEE Transactions on Computational Intelligence and AI in Games 1.1 (Mar. 2009), pp. 4–15. issn: 1943-068X, 1943-0698. doi: 10 .1109/TCIAIG. 2009.2018702. url: http://ieeexplore.ieee.org/document/4804731/ (visited on 01/10/2023). [3] Charles L. Bouton. “Nim, A Game with a Complete Mathematical Theory”. In: The Annals of Mathematics 3.1/4 (1901), p. 35. issn: 0003486X. doi: 10.2307/1967631. JSTOR: 1967631. url: https://www.jstor.org/stable/1967631?origin=crossref (visited on 12/07/2023). [4] Cameron B. Browne et al. “A Survey of Monte Carlo Tree Search Methods”. In: IEEE Transactions on Computational Intelligence and AI in Games 4.1 (Mar. 2012), pp. 1–43. issn: 1943-068X, 1943-0698. doi: 10.1109/TCIAIG.2012.2186810. url: http://ieeexplore.ieee.org/document/6145622/ (visited on 01/11/2023). [5] Paolo Ciancarini and Andrea Gasparro. “Priority Level Planning in Kriegspiel”. In: Entertainment Computing - ICEC 2012. Ed. by Marc Herrlich, Rainer Malaka, and Maic Masuch. Vol. 7522. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 333–340. isbn: 978-3-642-33541-9 978-3-642-33542-6. doi: 10.1007/978-3-642- 33542- 6_29. url: http://link.springer.com/10.1007/978- 3- 642- 33542- 6_29 (visited on 11/30/2023). [6] James Clune. “Heuristic Evaluation Functions for General Game Playing”. In: AAAI. AAAI. Vol. 7. 2007. [7] Rémi Coulom. “Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search”. In: Computers and Games. Ed. by H. Jaap Van Den Herik, Paolo Cian- carini, and H. H. L. M. Donkers. Red. by David Hutchison et al. Vol. 4630. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 72–83. isbn: 978-3-540-75537-1 978-3-540-75538-8. doi: 10.1007/978-3-540-75538-8_7. url: http://link.springer. com/10.1007/978-3-540-75538-8_7 (visited on 11/20/2023). 87 [8] P. I. Cowling, E. J. Powley, and D. Whitehouse. “Information Set Monte Carlo Tree Search”. In: IEEE Transactions on Computational Intelligence and AI in Games 4.2 (June 2012), pp. 120–143. issn: 1943-068X, 1943-0698. doi: 10.1109/TCIAIG. 2012.2200894. 
url: http://ieeexplore.ieee.org/document/6203567/ (visited on 12/21/2022). [9] Ian Frank and David Basin. “Search in Games with Incomplete Information: A Case Study Using Bridge Card Play”. In: Artificial Intelligence 100.1-2 (Apr. 1998), pp. 87–123. issn: 00043702. doi: 10.1016/S0004-3702(97)00082-9. url: https:// linkinghub.elsevier.com/retrieve/pii/S0004370297000829 (visited on 12/09/2022). [10] Michael Genesereth, Nathaniel Love, and Barney Pell. “General Game Playing: Overview of the AAAI Competition”. In: AI Magazine 26.2 (2005), pp. 62–72. doi: 10.1609/aimag.v26i2.1813. [11] Giuseppe De Giacomo, Yves Lesperance, and Adrian R Pearce. “Situation Calculus Game Structures and GDL”. In: (2016), p. 9. [12] Roland Kaminski et al. How to Build Your Own ASP-based System?! Nov. 5, 2021. arXiv: 2008.06692 [cs]. url: http://arxiv.org/abs/2008.06692 (visited on 09/29/2022). preprint. [13] Levente Kocsis and Csaba Szepesvári. “Bandit Based Monte-Carlo Planning”. In: Machine Learning: ECML 2006. Ed. by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou. Red. by David Hutchison et al. Vol. 4212. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 282–293. isbn: 978-3-540-45375-8 978-3-540- 46056-5. doi: 10.1007/11871842_29. url: http://link.springer.com/10.1007/ 11871842_29 (visited on 11/16/2023). [14] Jakub Kowalski and Marek Szykuła. “Game Description Language Compiler Con- struction”. In: AI 2013: Advances in Artificial Intelligence. Ed. by Stephen Crane- field and Abhaya Nayak. Red. by David Hutchison et al. Vol. 8272. Cham: Springer International Publishing, 2013, pp. 234–245. isbn: 978-3-319-03679-3 978-3-319- 03680-9. doi: 10.1007/978-3-319-03680-9_26. url: http://link.springer.com/10. 1007/978-3-319-03680-9_26 (visited on 01/02/2023). [15] Gregory Kuhlmann and Peter Stone. “Automatic Heuristic Construction in a Com- plete General Game Player”. In: AAAI. Vol. 6. 2006, pp. 1457–62. [16] Jeffrey Long et al. “Understanding the Success of Perfect Information Monte Carlo Sampling in Game Tree Search”. In: Proceedings of the AAAI Conference on Ar- tificial Intelligence 24.1 (July 3, 2010), pp. 134–140. issn: 2374-3468, 2159-5399. doi: 10.1609/aaai.v24i1.7562. url: https://ojs.aaai.org/index.php/AAAI/article/ view/7562 (visited on 01/02/2023). [17] Jacek Mańdziuk and Maciej Świechowski. “Generic Heuristic Approach to General Game Playing”. In: SOFSEM 2012: Theory and Practice of Computer Science. Ed. by Mária Bieliková et al. Vol. 7147. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 649–660. isbn: 978-3-642-27659-0 978-3-642-27660-6. doi: 10.1007/978- 88 3-642-27660-6_53. url: http://link.springer.com/10.1007/978-3-642-27660-6_53 (visited on 01/10/2023). [18] Munyque Mittelmann and Laurent Perrussel. “Game Description Logic with Inte- gers: A GDL Numerical Extension”. In: Foundations of Information and Knowledge Systems. Ed. by Andreas Herzig and Juha Kontinen. Vol. 12012. Cham: Springer International Publishing, 2020, pp. 191–210. isbn: 978-3-030-39950-4 978-3-030- 39951-1. doi: 10.1007/978-3-030-39951-1_12. url: http://link.springer.com/10. 1007/978-3-030-39951-1_12 (visited on 01/06/2023). [19] Maximilian Möller et al. “Centurio, a General Game Player: Parallel, Java- and ASP-based”. In: KI - Künstliche Intelligenz 25.1 (Mar. 2011), pp. 17–24. issn: 0933- 1875, 1610-1987. doi: 10.1007/s13218-010-0077-4. url: http://link.springer.com/ 10.1007/s13218-010-0077-4 (visited on 11/08/2023). [20] S. Schiffel and M. Thielscher. 
“Representing and Reasoning About the Rules of General Games With Imperfect Information”. In: Journal of Artificial Intelligence Research 49 (Feb. 14, 2014), pp. 171–206. issn: 1076-9757. doi: 10.1613/jair.4115. url: https : / / www . jair . org / index . php / jair / article / view / 10862 (visited on 01/02/2023). [21] Stephan Schiffel and Michael Thielscher. “Automatic Construction of a Heuristic Search Function for General Game Playing”. In: Department of Computer Science (2006), pp. 16–17. [22] Stephan Schiffel and Michael Thielscher. Fluxplayer: A Successful General Game Player. 2007. url: https://www.aaai.org/Papers/AAAI/2007/AAAI07-189.pdf (visited on 12/29/2022). [23] Michael Schofield and Michael Thielscher. “General Game Playing with Imperfect Information”. In: Journal of Artificial Intelligence Research 66 (Dec. 13, 2019), pp. 901–935. issn: 1076-9757. doi: 10.1613/jair.1.11844. url: https://jair.org/ index.php/jair/article/view/11844 (visited on 12/30/2022). [24] David Silver et al. Mastering Chess and Shogi by Self-Play with a General Rein- forcement Learning Algorithm. Dec. 5, 2017. arXiv: 1712.01815 [cs]. url: http: //arxiv.org/abs/1712.01815 (visited on 11/01/2023). preprint. [25] David Silver et al. “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature 529.7587 (Jan. 28, 2016), pp. 484–489. issn: 0028-0836, 1476-4687. doi: 10.1038/nature16961. url: http://www.nature.com/articles/ nature16961 (visited on 01/09/2023). [26] Maciej Świechowski et al. “Monte Carlo Tree Search: A Review of Recent Modifi- cations and Applications”. In: Artificial Intelligence Review (July 19, 2022). issn: 0269-2821, 1573-7462. doi: 10 . 1007 / s10462 - 022 - 10228 - y. url: https : / / link . springer.com/10.1007/s10462-022-10228-y (visited on 12/01/2022). 89 [27] Michael Thielscher. “A General Game Description Language for Incomplete Infor- mation Games”. In: Proceedings of the AAAI Conference on Artificial Intelligence 24.1 (July 4, 2010), pp. 994–999. issn: 2374-3468, 2159-5399. doi: 10.1609/aaai. v24i1.7647. url: https://ojs.aaai.org/index.php/AAAI/article/view/7647 (visited on 12/30/2022). [28] Michael Thielscher. “Answer Set Programming for Single-Player Games in General Game Playing”. In: Logic Programming. Ed. by Patricia M. Hill and David S. Warren. Vol. 5649. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 327– 341. isbn: 978-3-642-02845-8 978-3-642-02846-5. doi: 10.1007/978-3-642-02846- 5_28. url: http://link.springer.com/10.1007/978-3-642-02846-5_28 (visited on 01/09/2023). [29] Alan M. Turing. “Chess”. In: Computer Chess Compendium. Ed. by David Levy. New York, NY: Springer New York, 1988, pp. 14–17. isbn: 978-1-4757-1970-3 978- 1-4757-1968-0. doi: 10.1007/978-1-4757-1968-0_2. url: http://link.springer.com/ 10.1007/978-1-4757-1968-0_2 (visited on 10/30/2023). [30] J.W.H.M. Uiterwijk and H.J. Van Den Herik. “The Advantage of the Initiative”. In: Information Sciences 122.1 (Jan. 2000), pp. 43–58. issn: 00200255. doi: 10. 1016/S0020-0255(99)00095-X. url: https://linkinghub.elsevier.com/retrieve/pii/ S002002559900095X (visited on 01/18/2024). 90