Hand Tracking in Colocated Multi-User Virtual Reality DISSERTATION submitted in partial fulfillment of the requirements for the degree of Doktor der Technischen Wissenschaften by Dennis Reimer, M.Sc. Registration Number 11728991 to the Faculty of Informatics at the TU Wien Advisor: Univ.Prof. Dr. Hannes Kaufmann Second advisor: Prof. Dr. Daniel Scherzer The dissertation has been reviewed by: Tiare Feuchtner Eike Langbehn Vienna, 1st July, 2024 Dennis Reimer Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Erklärung zur Verfassung der Arbeit Dennis Reimer, M.Sc. Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwen- deten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe. Wien, 1. Juli 2024 Dennis Reimer iii Acknowledgements A dissertation like this is certainly not the result of a single person, which is why there are many people whom I would like to thank and who have made it possible for me to produce this thesis in this form. First of all, I would like to thank my supervisor Hannes Kaufmann, who always supported me with helpful feedback, guidance, valuable advice, and a lot of patience. I would also like to thank my second supervisor Daniel Scherzer, who made the initial start to my PhD possible and also who has always helped me with advice and support. I also thank Iana Podkosova for her help in finding the topic, for the very interesting exchange, and for her help as coauthor on two articles. Of course, I would also like to thank the entire VR Group at Vienna University of Technology, which welcomed me with open arms and offered me a valuable exchange of ideas. Many thanks also go to Tiare Feuchtner and Eike Langbehn, who kindly agreed to review the thesis. This thesis was funded by Ravensburg-Weingarten University, who kindly employed me for the research and writing of my dissertation, and the TU Wien Bibliothek for financial support through its Open Access Funding Program. I would especially like to thank my fiancée Anna-Catrin. Her moral support and the fact that she always had an open ear helped me through the difficult phases, and this work would never have been possible without her. To my daughter Linnea, who makes every day a beautiful one for me and who I hope I can inspire to always be curious and explore the world. Many thanks go to my parents Peter and Renate Reimer, without whom my path to my studies, my enthusiasm for computer science and ultimately this thesis would not have been possible. Additional thanks go to my good friend Thorsten, who was always open to an exchange and helped me out as a test person in every experiment. Finally, thanks go to all the testers who agreed to participate in the user tests that were part of this thesis and thus helped to generate valuable results. v Abstract To enhance immersion in virtual reality applications, this dissertation addresses the challenges and opportunities of implementing natural hand interactions through hardware- based hand tracking, particularly in the context of multi-user colocated VR environments. 
The research introduces ’EasyHand’, a hand-tracking framework that unifies detection mapping, visualization, interaction, and networking for several tracking systems to facilitate intuitive hand tracking for visualization and interaction while addressing the limitations of tracking range dead spots in colocated scenarios. The first experiment describes a novel approach for creating colocated multi-user VR scenarios for SLAM-tracked (Simultaneous Localization and Mapping) VR headsets without the need for external tracking cameras to continuously track all users. By leveraging the hand recognition capabilities of the VR headset, the system synchronizes the virtual space for colocated users. This method is compared to alternative approaches such as initial positioning and ArUco marker recognition, with a comprehensive evaluation of accuracy, consistency, and simplicity, demonstrating the superior performance of the proposed hand tracking-based calibration method. The dissertation proceeds with an experiment that demonstrates a method for trans- forming hand data detected via an RGB camera and the MediaPipe framework into 3D space. This technique includes user-specific hand length estimation to determine 3D hand positions, enabling the detection of multiple hands simultaneously. This allows a hand detection system to track the hands of other users and thus also support their detection systems in the event that they are not able to see these hands. Comparative analysis with Oculus Quest and Leap Motion, conducted under different conditions (static & dynamic) and distances from the tracking device, confirms the effectiveness of the proposed method, with significantly extended tracking ranges for colocated scenarios. Finally, the dissertation explores methods for assigning tracked hands to colocated virtual users, introducing an algorithm that leverages past assignments to enhance future assignments’ robustness and effectiveness. Multiple assignment algorithms are evaluated, highlighting the precision of the proposed algorithm. Overall, this work provides initial insights into calibrating multi-user environments, compensating for tracking loss and assignment of hands to users for colocated VR scenarios, while maintaining user-specific interactions, representing a substantial advancement in VR technology. vii Contents Abstract vii 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Problem Statement and Contribution . . . . . . . . . . . . . . . . . . . 3 1.3 Resulting Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Related Work 7 2.1 SLAM Tracking for Virtual Reality Headsets . . . . . . . . . . . . . . 8 2.2 Advancing Hand Detection . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 Hand Tracking and Pose Estimation Using RGB-D Cameras . 10 2.2.2 Hand Tracking and Pose Estimation Using RGB Cameras . . . 13 2.2.3 Evaluating Hand Tracking Solutions . . . . . . . . . . . . . . . 15 2.3 Advancing Multi-user VR Systems . . . . . . . . . . . . . . . . . . . . 16 2.3.1 Exploring Colocated VR . . . . . . . . . . . . . . . . . . . . . . 18 2.3.2 Hand Interactions in Multi-User Systems . . . . . . . . . . . . 21 2.4 Fostering Hand-Body Association in Multi-user Scenarios . . . . . . . 23 3 EasyHand - A Modular Hand Interaction and Visualization Frame- work for Single and Colocated VR Scenarios 25 3.1 System Overview . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . 26 3.1.1 Overall System Design . . . . . . . . . . . . . . . . . . . . . . . 26 3.1.2 Implementation Environment . . . . . . . . . . . . . . . . . . . 28 3.2 Hand Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.1 Low-Level Tracking Modules . . . . . . . . . . . . . . . . . . . 29 3.2.2 Unified Joint Mapping . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.3 Gesture Recognition - Design and Implementation . . . . . . . 32 3.3 Visualization and Interaction . . . . . . . . . . . . . . . . . . . . . . . 33 3.3.1 Skeletal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.2 Mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.3 Direct 3D Object Manipulation in Unity3D . . . . . . . . . . . 35 3.3.4 EasyHandRig Template Implementation . . . . . . . . . . . . . 35 ix 3.4 Multi-User Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.5 Distribution and Extensions . . . . . . . . . . . . . . . . . . . . . . . . 38 3.6 Current State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4 Using Tracked Hands to Create Colocated VR 43 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2 Calibration Methods for Creating Colocated VR . . . . . . . . . . . . 45 4.2.1 Fixed-Point Calibration . . . . . . . . . . . . . . . . . . . . . . 45 4.2.2 Marker-Based Calibration . . . . . . . . . . . . . . . . . . . . . 46 4.2.3 Hand Tracking-Based Calibration . . . . . . . . . . . . . . . . . 47 4.3 Usability Evaluation Experiment . . . . . . . . . . . . . . . . . . . . . 49 4.3.1 Experimental Design and Evaluation . . . . . . . . . . . . . . . 49 4.3.2 Pilot Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.3 Setup and Procedure . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4.1 Calibration Error Outcome . . . . . . . . . . . . . . . . . . . . 54 4.4.2 Consistency and Potential for Improvement . . . . . . . . . . . 57 4.4.3 Ease of Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.4.4 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.4.5 Exploring Colocation in a Different VR Scenarios . . . . . . . . 59 4.4.6 Applicability and Future of Hand Tracking-Based Calibration . 60 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5 Evaluate and Improve an RGB-Based Hand Tracking Solution for Colocated VR Usage 63 5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.2 3D World Position Estimation of a Camera Tracked Hand . . . . . . . 65 5.2.1 2.5D Joint Detection with MediaPipe . . . . . . . . . . . . . . 66 5.2.2 Estimating Real Size of Users’ Hands . . . . . . . . . . . . . . 67 5.2.3 Estimating Hand Depth . . . . . . . . . . . . . . . . . . . . . . 69 5.3 Accuracy Evaluation Experiment . . . . . . . . . . . . . . . . . . . . . 69 5.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.3.2 Experiment 1: Comparison to Integrated Tracking Solutions . . 70 Static Error Evaluation . . . . . . . . . . . . . . . . . . . . . . 71 Dynamic Error Evaluation . . . . . . . . . . . . . . . . . . . . . 73 Lost and Acquired Tracking Distances . . . . . . . . . . . . . . 76 5.3.3 Experiment 2: Accuracy of MediaPipe-Based Hand Tracking for Multiple Users . . . . . . . . . . . . . . . . . . . . . . . . . . . 
78 5.4 Demonstration in a Colocated Setup . . . . . . . . . . . . . . . . . . . 81 5.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6 Assignment of Tracked Hands in Colocated VR 87 6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.2 Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.2.1 Empirical Estimation Methods . . . . . . . . . . . . . . . . . . 90 6.2.2 Hand Assignment with Machine Learning Agents . . . . . . . . 92 6.2.3 Dynamic Method Selection . . . . . . . . . . . . . . . . . . . . 94 6.2.4 Assignment History . . . . . . . . . . . . . . . . . . . . . . . . 95 6.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.3.1 Experimental Design and Evaluation . . . . . . . . . . . . . . . 96 6.3.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.4.1 History Algorithm Effectiveness . . . . . . . . . . . . . . . . . . 100 6.4.2 Assignment Accuracy . . . . . . . . . . . . . . . . . . . . . . . 101 6.4.3 Performance Results . . . . . . . . . . . . . . . . . . . . . . . . 103 6.4.4 Usability in Realtime Applications . . . . . . . . . . . . . . . . 105 6.5 Conclusion and Future Outlook . . . . . . . . . . . . . . . . . . . . . . 107 7 Conclusion 109 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 7.2 Open questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Bibliography 113 List of Figures 127 List of Tables 133 CHAPTER 1 Introduction "Curiosity is the most powerful driving force in the universe because it can overcome the two greatest braking forces in the universe: Reason and Fear." — Walter Moers - The City Of Dreaming Books Hand tracking has emerged as a popular feature in Virtual Reality (VR) systems, offering users a multitude of interaction possibilities while granting maximum hand freedom without the need for handheld controllers. Controllers are the modern standard method for interaction, supporting direct manipulation of virtual content via a series of buttons and gestures through stable tracking of hand position as well as multimodal feedback (vibration and sound, passive haptic feedback from buttons, triggers, etc.). However, since the use of controllers obscures the user’s own hands and makes free gesticulation and the use of props impossible, hand tracking and hand interactions offer significantly more scope and more natural interaction possibilities. While the majority of applications and use cases for virtual reality in general and hand interactions in particular have historically focused on single-user experiences, there is a growing trend towards multi-user approaches. Specifically, colocated multi-user environments, where individuals share both virtual and physical spaces, hold the promise of immersive virtual activities. As there is a push in colocated VR due to affordability, technical feasibility and user acceptance, new challenges and opportunities for hand tracking also arise: hands of several users must be tracked accurately and assigned to the correct user. 
Also, tracking redundancy when multiple systems recognize the same hand, or when one system recognizes multiple hands, can be used to improve tracking stability and range, or even to synchronize the coordinate systems of the head-mounted displays (HMDs). Solving these challenges and exploiting the possibilities mentioned above is the goal of this work. This dissertation presents solutions designed to enable colocated VR environments for SLAM-tracked virtual reality headsets equipped with hand-tracking technology. Furthermore, it explores the usability of this approach in comparison to more complex alternatives. Additionally, this work explores the limitations encountered when using hand tracking in such scenarios and proposes strategies to overcome these limitations. These enable the integration of multiple hand-tracking systems that can assist each other in cases where hands become obscured. These solutions are combined and integrated in a comprehensive system that provides an accessible framework for developers who want to create a unified application independent of the tracking system and to create colocated virtual environments that rely solely on hand-tracking technology, without additional hardware such as an external camera rig. Additionally, it leverages the transformation of two-dimensionally tracked hands into the three-dimensional world through pose estimation and robust user assignment. Furthermore, this work evaluates the usability of these systems and explores potential enhancements that can improve usability and enable more effective interactions within colocated multi-user VR environments, ultimately supporting a more natural VR experience and eliminating the need for physical controllers. 1.1 Motivation Since the release of the Oculus Rift and HTC Vive, virtual reality (VR) has experienced significant growth within the commercial sector [34]. This expansion is evident both in personal gaming, with Sony’s PSVR and the VR stores offered by HTC Vive and Oculus, as well as in the industrial sector, where VR is extensively used for training (e.g. Strivr1) and marketing purposes (e.g. IKEA Virtual Showroom2). VR has evolved to offer the most immersive digital experiences available and has become increasingly user-friendly and easy to set up, featuring fewer cables and cameras [2][34]. With the advent of Simultaneous Localization and Mapping (SLAM) technology, VR systems can now function without outside-in tracking technology, enhancing their portability. The trend to minimize hardware dependency continues with the emerging technology of optical hand tracking. This allows users to interact within the virtual world solely using their hands, eliminating the need for additional controllers or gloves. Therefore, the virtual representation of the hand can better match the real hand, and interaction with bare hands in the virtual environment may have a positive influence on the sense of presence. In recent years, a growing desire among people to share digital experiences has been observed, as evidenced by the increasing usage of social media platforms [6]. This desire for shared experiences extends to VR, particularly in colocated environments. VR arenas and similar spaces (e.g. YULLBE3) take advantage of this technology to create collaborative and immersive virtual reality experiences.
1Strivr: https://www.strivr.com/ (Accessed: 2024-06-21) 2IKEA VR Showroom: https://present.digital/ikea/ (Accessed: 2024-06-21) 3YULLBE: https://yullbe.com/ (Accessed: 2024-06-21) Many of these colocated experiences currently rely on an array of additional hardware, including body-worn trackers, external camera setups, and cable-bound head-mounted displays attached to user-worn backpacks. Such scenarios would benefit, in terms of setup effort and space limitations, from SLAM-tracked headsets capable of creating colocated environments without the need for extra hardware or cameras. Furthermore, the integration of hand-tracking capabilities could enhance the sense of presence and collaboration, especially in cases where a single user’s hand tracking may fail due to obscured hands, where limited tracking volume and the handling of physical objects may cause the hands not to be properly visible to the tracking sensor. In colocated setups, multiple hand-tracking systems could assist each other, enlarging the tracking range in which a hand can be detected. This would result in a scenario where users only need to wear a lightweight HMD to fully immerse themselves in a shared VR experience, making it independent of the surrounding hardware setup and having the devices support each other in their tracking. Ideally, this is then not limited to a specific HMD and thus allows interoperability, which to our current knowledge is not yet possible with current headsets (e.g. Meta Quest or Apple Vision Pro). 1.2 Problem Statement and Contribution In this section, we briefly summarize the goals of this work, the problems we encounter, and the contributions this work offers to achieve these goals. Overall, we want to advance what has been achieved so far in the area of multi-user VR to improve hand-based interactions within a shared physical and virtual space in a virtual reality (VR) application (colocated VR). Such a shared area offers the advantage that users are often within sight of other users. This means that their hand recognition systems can also see the hands of the other users. Using this fact to enable the systems to support each other in tracking is one of the overarching goals of this work. This makes it possible for the systems to recognize hands and make them virtually usable in situations in which hands may be hidden or not visible to their own system (for example, when they are held behind the back). When developing the methods, it was also important for us to limit ourselves to hardware that is easily available to a potential end user without incurring high additional costs. This results in the use of widely available SLAM-tracked headsets (such as the Meta Quest) and of RGB cameras in our work, as this allows the use of widespread webcams without the need for expensive hardware such as depth cameras. This should increase the range of users who can easily use the solution. Of course, these limitations also entail problems which, to our knowledge, have not yet been solved. These problems and our contribution to solving them are discussed below. Numerous hardware options and implementations are available to enable hand tracking in virtual reality, such as the Leap Motion Sensor, the HTC hand tracking solution or Megatrack [43][14][37]. However, each of these implementations originates from a different manufacturer and comes with its own software development kit (SDK). Consequently,
each implementation possesses its own detection events, gesture recognition, finger joint index mapping, and other specific features. This diversity makes it challenging to develop hand-tracking applications that work seamlessly across multiple VR systems. Developers are often forced to focus their development efforts on one particular system, such as Leap Motion, Meta Quest, or Vive. Thus, this thesis introduces a comprehensive hand-tracking framework that consolidates multiple hand-tracking APIs, simplifying the development process for applications across various systems. Furthermore, it facilitates colocation capabilities for all supported systems with the capability to send networking messages between all connected users (chapter 3). All the implementations presented in this thesis are conducted within this unified framework. Many colocation implementations require additional sensors or cameras to track all users in a shared environment [122][15]. This introduces a more complex setup, which requires extra effort and expense. Such a setup runs counter to the purpose of SLAM-tracked headsets, which are designed to be independent of external cameras for positional tracking. To address these issues, this thesis proposes a method to enable colocation for SLAM-tracked headsets using only tracked hands (chapter 4), thus synchronizing the self-contained coordinate systems of SLAM headsets and eliminating the need for external tracking systems. To avoid ambiguity when tracking the hands of multiple people, commercial hand tracking systems are typically limited to tracking only two hands: left and right [37]. However, in scenarios where tracking more hands is required, as when multiple users are physically colocated, alternative tracking methods must be considered. Ideally, additional costly hardware should be avoided (such as RGB-D sensors), restricting the choice to RGB-based tracking methods, such as the MediaPipe framework [64]. The challenge lies in the absence of 3D pose capabilities in such systems due to missing depth sensors. In response, this thesis proposes a method to transform a two-dimensionally tracked hand with relative depth data of finger joints into a three-dimensional representation. This work compares this approach with state-of-the-art systems and demonstrates its high accuracy, enabling interactions and significantly extending the tracking range, rendering it suitable for assistance in colocated multi-user scenarios (chapter 5). In colocated setups, where the limitation of only two concurrent hands no longer applies, it is possible to track more than two hands or multiple left or right hands simultaneously. In such cases, a system must assign each tracked hand to a virtual user. While straightforward in cases where users are far apart, this assignment becomes non-trivial in close proximity (within an arm’s length). This thesis presents a solution for accurately assigning tracked hands to virtual users, achieving a remarkable 99% accuracy, whether in close or far proximity. Importantly, this method relies solely on positional data of hands and users obtained from the HMDs and does not require additional hardware (chapter 6). In summary, we aim to offer the following contributions: • I: A framework that offers versatility for use in both single- and multi-user scenarios, with the ability to seamlessly switch between tracking systems.
• II: A method to calibrate colocated virtual environments for SLAM-tracked headsets using only tracked hands. • III: A method to position a two-dimensionally tracked hand in a three-dimensional space. • IV: Enabling the simultaneous tracking and virtual representation of more than two hands. • V: A method to assign virtual hands to the corresponding user in colocated multi-user environments. We aim for a system that offers the creation of colocated VR environments and utilizes RGB hand tracking to monitor all visible hands in the view frustum and assign them to the respective users, potentially facilitating tracking and interaction with the virtual environment even when hands are obscured to their owners’ tracking systems. 1.3 Resulting Publications This PhD dissertation is the result of the following peer-reviewed publications, which are incorporated as published, with minor changes to conform to the overall thesis: 1. Dennis Reimer, Iana Podkosova, Daniel Scherzer and Hannes Kaufmann. "Colocation for SLAM-Tracked VR Headsets with Hand Tracking". In the journal of Advances in Seated Virtual Reality, Computers 2021, 10(5), 58, 2021. 2. Dennis Reimer, Iana Podkosova, Daniel Scherzer and Hannes Kaufmann. "Evaluation and improvement of HMD-based and RGB-based hand tracking solutions in VR". In the journal of Beyond Touch: Free Hand Interaction in Virtual Environments, Frontiers in Virtual Reality, 4:1169313, 2023. 3. Dennis Reimer, Daniel Scherzer and Hannes Kaufmann. "Ownership Estimation for Tracked Hands in a Colocated VR Environment". In the proceedings of the 33rd International Conference on Artificial Reality and Telexistence & Eurographics Symposium on Virtual Environments (ICAT-EGVE), Dublin, Ireland, pp. 105-114, 2023. As the first author of the publications listed above, I carried out the main work, including the design, implementation, and execution of the experiments, the analysis, and the writing of the publications. The other authors were involved in a supporting role and provided advice during the conception, writing, and analysis. This also applies to the presented ’EasyHand’ framework, whose design and implementation were carried out by me, but for which I received supportive feedback during the design process. Due to the advisory contribution of the additional authors and supervisors, I use the term ’we’ instead of ’I’ in this dissertation. The contents of the first paper are described in detail in chapter 4, the second paper is described in chapter 5, and the third paper is described in chapter 6. 1.4 Thesis Structure The remaining thesis is structured as follows: First, chapter 2 presents the background and related work in the areas that are relevant to this dissertation. This includes an overview of work on SLAM tracking for HMDs, on detecting hands with RGB and RGB-D cameras, and on multi-user virtual reality. Also, research on multi-user interaction and hand-body association is presented. Then, chapter 3 presents ’EasyHand’, the underlying system that is used in all presented experiments to combine different hand tracking systems and render the results, as well as to create colocated scenarios and assign tracked hands to users. Next, chapter 4 describes the first experiment, which creates colocated scenarios for SLAM-tracked headsets with hand tracking and compares this approach with other colocation methods.
The following chapter 5 presents a method for hand size estimation of the user and the usage of this information to do a 3D pose estimation for tracked hands with an RGB camera. Tracking accuracy and tracking range are compared with state-of-the-art tracking systems for VR. For the final experiment, chapter 6 describes methods to assign tracked hands to virtual hands in a colocated multi-user VR setup as well as the experiment to determine the method with the highest accuracy. Finally, the dissertation is concluded in chapter 7 and an outlook for future development and improvements is given. 6 CHAPTER 2 Related Work This chapter provides an overview of the relevant previous research conducted in the central areas of this thesis. Research topics include SLAM tracking, multi-user VR systems, hand detection in VR environments, and hand-body association. Before we go deeper into previous work on the specific research topics relating to this paper, we want to position our work in the reality-virtuality continuum of Milgram et al. [75]. They present a scale that can be used to classify whether an application takes place in complete virtual reality (VR), the virtual augments the real (augmented reality - AR), the real augments the virtual (augmented virtuality) or whether it takes place completely in reality. They define everything that mixes the real and the virtual as mixed reality (MR). Since our work deals with the area of colocated virtual reality and the recognition of real hands and their virtual representation (AR), we would see the overall context of this work in the mixed reality area, even if the main use case of this work takes place in a completely virtual space (VR). Since we create colocated VR scenarios for SLAM-tracked VR headsets, we begin by introducing existing SLAM-tracked headsets, elucidating their operational principles, and addressing the research related to their limitations. Next, we examine hand detection in both VR and non-VR contexts. We explore research involving RGB depth (RGB-D) cameras and non-depth (RGB) cameras, highlighting their respective limitations and advantages. Furthermore, we review studies on 3D pose estimation employing these camera types. Given that our work involves evaluating the accuracy of our system, we also present existing research on the accuracy assessment of hand detection systems in static and dynamic interaction scenarios. Then, we delve into the domain of multi-user VR systems, categorizing existing approaches and show work done in interactions within such systems. Additionally, we discuss research on colocation, with focus on SLAM headsets, as well as experiments done using external camera systems. 7 2. Related Work Lastly, we analyze the existing research on hand-body association, their limitations, and present the reasons why the existing work in this area is of limited use for our experimental purposes. 2.1 SLAM Tracking for Virtual Reality Headsets SLAM, an acronym for ’Simultaneous Localization and Mapping’, refers to a set of algorithms designed to detect the surroundings using one or more cameras or sensors, build a map of the tracked environment, and determine the position of an object, typically a robot, vehicle, or VR headset, within this environment [42]. 
This technology enables the tracking of objects within a confined space without the need for an external camera system, such as the HTC VIVE with its external Lighthouse tracking system, which is used to track objects in a physical space [81], or the Oculus Rift headset, which uses low-cost external micro-electromechanical systems sensors and a predictive strategy to minimize lag [57]. An example of SLAM tracking is illustrated in Figure 2.1, where a camera tracks environmental features, creating a point cloud, while the user is simultaneously positioned within the mapped environment.1 Figure 2.1: An example of SLAM tracking where features of the environment are tracked in a point-cloud an the user is positioned in the mapped environment. SLAM technology finds extensive applications in mobile robots, improving the navigation of autonomous devices like robot vacuum cleaners [78][52]. It extends to a wide range of domains, including robotics [53][46], autonomous driving [11], and virtual reality headsets [9][72]. In our context, we focus on Visual SLAM, which employs images from cameras, 1https://medium.com/@mccartneykyle12/whats-the-deal-with-s-l-a-m-and-track ing-technology-d1ef6ede9bbb (Accessed: 2023-10-09) 8 2.2. Advancing Hand Detection such as RGB and RGB-D cameras, to achieve SLAM, as opposed to LiDAR SLAM, which relies on LiDAR sensors. Barros et al. conducted a survey of various Visual SLAM algorithms, providing an overview for developers to explore existing solutions and challenges in Visual SLAM research [65]. Similarly, Saputra et al. conducted a survey specifically on SLAM in dynamic environments, addressing three key problems: robustness, dynamic object segmentation and tracking, and joint motion segmentation and reconstruction [97]. An algorithm specifically designed for indoor environments and mobile RGB-D cameras was presented by Brunetto et al. [8]. They proposed a real-time 3D reconstructed environment that, while usable, operates too slowly (10Hz) for real-time applications in VR. In the context of virtual reality headsets, it is essential to have a fast and mobile SLAM technique. Additionally, since common commercial VR headsets are typically equipped with RGB cameras without the ability to track depth input, alternative algorithms are required for Visual SLAM with RGB cameras. Williams et al. presented a mapping generation system for SLAM tracking with VR headsets, offering keypoint recognition for monocular SLAM within a 33ms frame time, improving the approach of Brunetto et al. and facilitating real-time VR scenarios [124]. Bustos et al. dived into the bundle adjustment technique commonly used in SLAM systems and its applications in VR tracking for 6DOF tracking and environment mapping, proposing further improvements to accelerate calculations and enhance the overall complexity of the SLAM system [9]. In our work, we use the Meta Quest and Meta Quest 2 headsets, that are employing Visual SLAM tracking with four cameras on the headset for inside-out tracking [72]. Building upon existing SLAM research, they leverage their ’Oculus Insight’ AI technology to enhance SLAM, controller recognition, and 6DOF tracking for their headset [39]. This involves using motion capture data and device simulation in predefined scenarios as training data to improve tracking accuracy, along with presenting a computer vision architecture to optimize map generation and updating based on changes in the user environment. 
The result is a submillimeter tracking accuracy for the Meta Quest 2, even surpassing the accuracy of the HTC Vive Trackers 2.0, that use outside-in tracking technology [40]. 2.2 Advancing Hand Detection In addition to the conventional controllers that became standard with commercial VR headsets, hand tracking presents a natural alternative to translate users’ hand movements into the virtual space, enabling intuitive interaction. For example, a comparative user study by Voigt-Antons et al. demonstrated that tracked hand interactions in a virtual environment enhance the sense of presence and offer improved usability for activities such as grasping and virtual typing [116]. Manresa-Yee et al. introduced a real-time algorithm to track and recognize hand gestures 9 2. Related Work and emphasize the usability for interaction with video games. Their approach involved identifying various hand characteristics, extracting these features, and using a finite-state classifier to determine hand configurations [67]. Feuchtner explored the phenomenon of body ownership illusion in virtual and augmented reality interactions and its potential to transcend the limitations of our physical bodies, leading to more effective and engaging interactions [21]. Khundam et al. argued that hand tracking holds great promise in medical applications due to its naturalness in real-world scenarios and usability in contrast to conventional controllers, highlighting the ongoing importance of research in this area [48]. Hand recognition techniques can be broadly categorized into two areas: recognition using gloves worn by the user [133][110][82], and visual recognition employing cameras and sensors [43][41][37][130]. In 1986, the company ’VPL Research’ developed the first commercial glove for hand tracking, which used glass fibers to detect finger curvature [133], as illustrated in Figure 2.2. Subsequent commercial data gloves, including the CyberGlove [47], emerged in the wake of this development. Temoche et al. introduced a low-cost glove designed for VR and commercial use [110], and other researchers sought to enhance glove tracking [50]. Brendan O’Flynn et al. presented a case of using data gloves for arthritis rehabilitation, introducing a smart glove equipped with sensors, processors, and wireless technology to quantitatively assess range of motion with the goal of improving the rehabilitation process, where recalibration is no longer required for each user [82]. Xu et al. proposed a collision detection algorithm that relies on predicting the movement direction of a virtual hand and that is effective in conjunction with a data glove, particularly in a mine training simulation system. They also suggested a dynamic algorithm to resolve collision detection issues between two solid simulation models [126]. Modern gloves frequently incorporate haptic feedback to enhance immersion [111][89]. Notable examples include the Senseglove Nova 2), Haptx glove 3), and TESLAGLOVE 4). However, these devices, while enhancing presence, can be costly for end-users and require the user to wear a fabric device consistently. Alternatively, there are tracking systems that optically track the user’s hands using cameras and sensors. One such system is the LeapMotion controller, which relies on infrared cameras and emitters to capture hand movements [3]. The subsequent sections will delve into systems and research that employ RGB-D or simple RGB cameras for hand tracking. 
Figure 2.2: One of the first data gloves, developed by ’VPL Research’ for interaction in virtual environments [133]. 2Senseglove Nova: https://www.senseglove.com/product/nova/ (Accessed: 2023-10-17) 3Haptx: https://haptx.com/ (Accessed: 2023-10-17) 4TESLAGLOVE: https://teslasuit.io/products/teslaglove/ (Accessed: 2023-10-17) 2.2.1 Hand Tracking and Pose Estimation Using RGB-D Cameras A large selection of different types of cameras is available, which can also be used for hand detection. One approach to hand tracking involves the use of RGB-D cameras. These cameras typically employ additional sensors, such as infrared sensors, to determine the depth of visually tracked objects in the image. Many different techniques exist5. Structured light cameras, which use light projection and infrared sensors to perform pattern recognition for depth calculation, can be very precise but also require long computing times. Time-of-flight cameras, on the other hand, calculate the time of flight of a light pulse from their infrared emitter for depth calculation. This means that depth data can be provided in real time, but there may be a loss of precision. Additionally, there are stereo vision cameras, which calculate depth information by triangulating the images from two camera lenses. These are easy to replicate in terms of hardware design and are inexpensive, but their range can be limited and they require long computing times. Prominent examples of such cameras include the Intel RealSense Depth Camera D435 (stereo vision camera), which is equipped with two stereo RGB cameras and an infrared sensor for depth detection of up to 10 meters 6), or the ASUS Xtion PRO LIVE camera (structured light camera), which features an RGB camera, two infrared detectors, an accelerometer, and a microphone 7). Both cameras are depicted in Figure 2.3. RGB-D cameras find utility in various applications, such as 3D reconstruction using depth and color images [114] or in real-time hand tracking, an area of particular interest for our use case. 5Depth Camera Principles: https://wiki.dfrobot.com/brief_analysis_of_camera_principles (Accessed: 2023-07-01) 6Intel RealSense: https://www.intelrealsense.com/depth-camera-d435/ (Accessed: 2023-10-24) 7ASUS Xtion PRO LIVE: http://xtionprolive.com/asus-xtion-pro-live (Accessed: 2023-10-24) Figure 2.3: Two RGB-D cameras are depicted. Left: Intel RealSense Depth Camera D435; Right: ASUS Xtion PRO LIVE. In a prior work by Frati et al., a commercial camera, the Microsoft Kinect, was utilized in combination with a wearable haptic device for hand tracking and rendering, allowing for hand interactions within a virtual reality environment [24]. While our system does not depend on additional hardware worn on the hand, this work highlights the necessity and potential of depth cameras in providing interactive hands within virtual scenes. Huang et al. conducted a survey and performance analysis of hand shape and pose estimation techniques using RGB-D cameras [41]. They identified several state-of-the-art methods achieving low estimation errors (<10mm), enabling practical interactive applications. Despite their effectiveness and efficiency, certain challenges, such as occlusion and self-similarity, require further investigation. The authors emphasize the need for additional research to address issues in 3D pose estimation. Researchers Sharp et al. [103] and Malik et al.
[66] presented implementations for hand tracking and pose estimation with depth cameras. Both approaches use a single depth camera to estimate the pose of the tracked hand. Sharp et al.’s approach combined a multilayered discriminative reinitialization strategy for per-frame pose estimation with a model fitting process based on stochastic optimization of an objective function. Their evaluation yielded a tracking range of 0.5 to 4 meters and a precision of less than 3 centimeters [103]. Malik et al. improved upon this by generating a 3D mesh representation of a hand from a single depth image. They used a synthetic data set with accurate joint annotations, segmentation masks, and mesh files derived from depth maps for neural network recognition, achieving a joint location error of less than 1.5 centimeters. The processing time from the depth image to the mesh was 3.7 ms [66]. However, it is important to note that depth cameras are typically more expensive than RGB cameras (e.g. simple webcams). Consequently, these approaches are not always intended for widespread commercial use. On the contrary, most computer users possess webcams that lack depth estimation sensors, necessitating alternative methods to estimate the pose of tracked hands. The lower costs as well as the better availability are the reasons why this work focuses on the use of RGB cameras. The following section will delve into works related to hand tracking and depth estimation with RGB cameras. Our proposed solution to this challenge is also presented in section 5.2.3. 2.2.2 Hand Tracking and Pose Estimation Using RGB Cameras Images retrieved from RGB cameras consist of only three color channels: red, green, and blue. With this information alone, tracking hands and estimating their virtual 3D positions becomes more challenging in the absence of depth information. Consequently, much research has turned to machine learning-based classification techniques, utilizing information from the image detection process to derive 3D hand and joint poses [10][107][119][62]. Our solution to this challenge, explained in section 5.2.3, also incorporates additional information, specifically the real size of the user’s hand, to transform tracked hands into the 3D world. Before transitioning the tracked hand into three-dimensional space, the 2D image must be analyzed to detect the hand in it. For instance, Zariffa et al. achieved hand segmentation in egocentric video using pixel-wise skin classifiers, followed by shape-based post-processing techniques. This allowed hand tracking in a 2D space using a single camera [129]. Hammer et al. presented their own algorithm, also relying on contour detection. Due to their requirement for use in interactive environments, their detection process operates in real time [36]. Both systems identify the contour of the hand without the need for markers worn on the hand, but do not include finger joint detection. The MediaPipe framework, as introduced by Lugaresi et al., offers perception pipelines for various tasks, including object detection, utilizing machine learning with RGB cameras [64]. Zhang et al. presented a hand tracking implementation with MediaPipe that extends the base hand detection with 2.5D joint recognition for more than two simultaneously visible hands, with good performance and precision [130].
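To illustrate the kind of data such a tracker exposes, the following minimal Python sketch queries MediaPipe's hands solution for several simultaneously visible hands from a webcam stream. It only illustrates the 2.5D output discussed here and is not the integration used in this thesis, which works within the Unity3D-based EasyHand framework; the webcam index and the limit of four hands are assumptions made solely for this example.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Allow up to four hands, e.g. both hands of two colocated users (assumed limit).
hands = mp_hands.Hands(static_image_mode=False,
                       max_num_hands=4,
                       min_detection_confidence=0.5,
                       min_tracking_confidence=0.5)

cap = cv2.VideoCapture(0)  # assumed webcam index
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV delivers BGR frames.
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        for label, lm in zip(result.multi_handedness, result.multi_hand_landmarks):
            wrist = lm.landmark[mp_hands.HandLandmark.WRIST]
            tip = lm.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP]
            # x and y are normalized image coordinates; z is only a depth
            # relative to the wrist, so the distance to the camera is unknown.
            print(label.classification[0].label, wrist.x, wrist.y, tip.z)
cap.release()
hands.close()
```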
Due to its open accessibility and strong performance, especially on mobile devices, with processing times ranging from 1.1 to 7.5 ms on an iPhone 11 depending on the model used, we selected the MediaPipe framework as the foundation for our RGB-based hand tracking method. However, like many similar methods, Zhang et al.’s approach provides 3D finger joint positions relative to an origin in the middle of the hand (see Figure 2.4 [130]). The distance between the hand and the tracking camera remains unknown. However, this information is essential to enable reliable interactions in three-dimensional space. Figure 2.4: Tracked hands using the MediaPipe framework, which is used in our implementation. Left: Tracked finger joints with relative depth to the wrist. Right: Simultaneously tracked multiple hands. Estimating the 3D pose of the hand is a widely researched area. For instance, Sun et al. proposed the first real-time one-stage method for pose estimation from a single RGB image, predicting 2.5D hand joint coordinates while locating two hand regions. Their experiments on public datasets demonstrated results competitive with state-of-the-art methods [107]. Zimmermann et al. published a method for 3D hand pose estimation based on RGB input, tested on a synthetic dataset for neural network hand recognition, with performance comparable to existing depth-based approaches [134]. Additionally, Lin et al. utilized a neural network-based pipeline for hand tracking and created a new 3D dataset to train the algorithm for two simultaneously tracked hands, reporting a mean End Point Error of 12.47mm, but only within arm’s length range and for a maximum of two simultaneously visible hands [62]. Han et al. used four monochromatic cameras to track a maximum of two hands in 3D space for real-time virtual reality applications [37]. Their solution was subsequently implemented in Meta Quest headsets for hand tracking. Panteleris et al. achieved 3D pose estimation for hands tracked in 2D by using OpenPose for 2D hand recognition and non-linear least-squares minimization to fit a 3D hand model to the estimated 2D joint positions, thus recovering the 3D hand pose [87]. Che et al. presented a detection-guided method capable of recovering 3D hand posture with a color camera, lifting the 3D pose from the estimated 2D joints through a model-fitting approach [10]. Wang et al. developed a method to track 3D real-time interactions using a monocular RGB camera. They employed a multi-task convolutional neural network (CNN) that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions [119]. This approach also proposed intra-hand relative depth and inter-hand distance maps. While several methods exist to track hands with an RGB camera and lift detected hands into 3D space, none of them fully suited our use case. They are either limited to tracking two simultaneous hands, operate solely in 2D or 2.5D space, or do not track beyond the user’s arm’s reach. However, for our planned use within colocated scenarios, it is necessary for the hands of several users to be recognized at greater distances and correctly positioned in three-dimensional space to enable interactions within the virtual environment. This led us to propose our solution for estimating the 3D pose of tracked hands. As explained earlier, we selected the MediaPipe hand tracking implementation by Zhang et al.
as our base tracking system due to its open accessibility and good performance for our 3D pose estimation, as discussed in section 5.2.3. 2.2.3 Evaluating Hand Tracking Solutions Several reports and evaluations have examined the precision and usability of common hand tracking systems, which is highly relevant to our work as we plan to compare these systems, namely the Meta Quest and the Leap Motion Controller, with our implemented depth solution in section 5.3. With this, we want to show that the proposed solution is suitable for use in colocated VR scenarios and has advantages over existing commercial solutions. We first look at the metrics that can be evaluated for hand recognition systems. The first metric to be mentioned is accuracy, which indicates how closely the tracked hand positions and rotations match the real positions and rotations. To determine this, a ground-truth comparison can be used, for example against an external tracking system, as done by Abdlkarim et al. or by us in this work [1]. Latency can also be used as a metric; it measures the delay between a user’s physical hand movement and the corresponding response. In the latency evaluation of modern HMDs, for example, Gruen et al. use external cameras that observe virtual and real inputs and measure the differences with a clock with submillisecond accuracy [33]. Tracking range is another important metric that we will discuss in chapter 5. This can also be measured by an external ground-truth tracking system, as in our work. Other metrics are robustness, which evaluates how the system behaves in difficult situations (e.g. covered hands or changing lighting conditions); usability, such as ease of calibration or user comfort, which is usually evaluated with user tests [20]; and performance, for which the computational effort is measured [130]. For example, Schneider et al. reported the accuracy of finger tracking for touch-based tasks like pointing or drawing in the virtual environment, finding that Meta Quest and Leap Motion outperformed HTC Vive hand tracking in terms of spatial accuracy. Users generally preferred the Leap Motion sensor [98]. Their previous study also demonstrated superior tracking accuracy for Leap Motion compared to HTC Vive, with a Z error of approximately 2.6 cm for interactions with horizontally aligned surfaces (and 1 cm for vertically aligned surfaces) in walk-up-and-use scenarios [99]. Vysocky et al. conducted an accuracy study on Leap Motion, reporting an error of up to 1 cm for measurements taken at a distance of 20 cm [117], while Bachmann et al. performed an initial evaluation of the general usage and interactability of Leap Motion as a contact-free pointing device [3]. Even earlier, Weichert et al. reported that the Leap Motion controller could deliver near- and submillimeter precision when detecting hands, allowing developers to accurately visualize the pose of a user’s hand inside a virtual environment [121]. Additionally, Mizera et al. compared the visual tracking accuracy of the Leap Motion sensor with the accuracy of two data gloves [76]. They found that the Leap Motion was less precise when measuring finger bending but very precise in estimating fingertip positions, making it the best device for fine manipulation tasks involving the thumb and opposing fingers. A recent study by Matulic et al. proposes a mirror-based solution that reflects the front camera of a smartphone to estimate the 3D position of the fingertip with the help of a deep neural network. They report a mean precision of 6mm and describe how their system could be adapted to VR systems in the future [69].
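Once the samples of both systems are expressed in a common coordinate frame and aligned in time, the accuracy metric above reduces to simple arithmetic. The following Python sketch, using purely hypothetical synthetic numbers, illustrates such a ground-truth comparison as a mean per-joint Euclidean error; it is only a conceptual illustration, not the evaluation code used later in chapter 5.

```python
import numpy as np

def mean_joint_error(tracked, ground_truth):
    """Mean Euclidean distance (in the units of the input, e.g. meters)
    between tracked joint positions and ground-truth positions from an
    external tracking system. Both arrays have shape (samples, joints, 3)
    and are assumed to be time-aligned and in the same coordinate frame."""
    return float(np.linalg.norm(tracked - ground_truth, axis=-1).mean())

# Hypothetical data: 100 frames of 21 hand joints from both systems.
rng = np.random.default_rng(0)
ground_truth = rng.uniform(-0.5, 0.5, size=(100, 21, 3))
tracked = ground_truth + rng.normal(0.0, 0.01, size=ground_truth.shape)  # synthetic per-axis noise
print(f"Mean positional error: {mean_joint_error(tracked, ground_truth):.4f} m")
```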
In the context of strategies to measure tracking accuracy in VR hand tracking systems, Abdlkarim et al. introduced a framework and demonstrated its application on the Meta Quest 2 [1]. Their setup involved a height-adjustable table that could be lowered to a height difference of 82 cm. The Oculus headset was attached to the top to track the user’s hand during the experiment. Ground truth data was collected from infrared markers placed on the table and tracked by an optical camera system. The users were instructed to point at different markers during the experiment to collect data. The authors reported an average error of 1.1 cm in the position of fingertips for the Meta Quest 2. Ferstl et al. introduced and evaluated strategies to mitigate the impact of hand tracking loss, highlighting the potential influence of tracking loss on the user experience [20]. As a result, our evaluation design includes tracking loss distances for all hand tracking systems evaluated in section 5.3.2. However, for VR hand tracking systems, standardized evaluation techniques for dynamic hand movements are lacking. Such techniques are commonly used in fields like machine tools, iGPS, and laser trackers. For example, Wang et al. tracked the trajectories of an industrial robot using a standardized strategy [120], and Ding et al. measured dynamic tracking error for five-axis CNC machining [17]. Due to the absence of standardized evaluation techniques for dynamic hand movements in VR, we have developed our own method, tailored to our experiment, to determine tracking accuracy and error curves for dynamic hand movements in section 5.3.2. 2.3 Advancing Multi-user VR Systems With social aspects such as communication, cooperation, and collaboration, multi-user applications introduce a realm of possibilities that single-user applications cannot offer. Even in non-VR contexts, it has been demonstrated that multi-user virtual environments can facilitate enhanced learning experiences compared to traditional direct instruction methods [112]. When it comes to implementing multi-user experiences in the domain of virtual reality, developers have various options to consider. Podkosova has introduced a taxonomy that categorizes different types of such systems, which is depicted in Figure 2.5. Figure 2.5: Taxonomy of multi-user VR created by Podkosova [92]. In contrast to single-user VR experiences, where each user typically occupies their own physical and virtual space, multi-user VR can be characterized by whether users share the physical space, the virtual space, or both. One common scenario is where users share the virtual space while maintaining their distinct physical spaces. This configuration is prevalent in multi-player VR games 8, virtual meeting environments that emphasize
When using physical devices, however, a mismatch between physical space and physical device must be aligned, as proposed by Fink et al. with their Re-locations method [22]. Conversely, the colocated non-shared VR approach involves users occupying the same physical space while experiencing their unique virtual spaces. For example, the commercial VR experience ’YULLBE GO’ offers free-roaming colocated non-shared VR experiences, where up to ten users share a physical space of approximately 80 square meters, each 9Spatial: https://www.spatial.chat/ (Accessed: 2023-10-10) 10VRChat: https://hello.vrchat.com/ (Accessed: 2023-10-10) 17 2. Related Work engaging with their own VR application asynchronously 11. These experiences typically include collision avoidance mechanisms, with visual cues within the application signaling proximity to other users. However, effectively managing collisions while maintaining a high level of presence and immersion remains an ongoing challenge. Various approaches have been explored in the creation of multi-user VR applications. As early as 1996, Benford et al. investigated shared spaces and provided an example of an internet foyer to showcase the possibilities offered by multi-user VR [5]. Just two years later, Frécon et al. introduced ’DIVE’, a network architecture for creating distributed virtual environments. Although it was not developed explicitly for VR, it presents techniques that are applicable in this context [28]. Research in the area of multi-user VR has continued to evolve, often intersecting with other research domains. For example, Langbehn et al. playfully explored redirected walking in a colocated virtual reality scenario, where multiple users share the physical and virtual space. They introduce a technique that enables users to explore virtual spaces five times larger than the shared physical space [55]. Redirected walking in separate physical spaces and shared virtual spaces have also been studied by Xu et al., who developed methods to reduce the frequency of resets by predicting potential reset points based on a user’s current reachable area [125]. The last approach in the spectrum of multi-user virtual reality is colocated shared VR, where all users occupy both the physical and virtual space collectively. This configuration introduces unique challenges in terms of implementation, as it involves the simultaneous management of physical and virtual experiences. These challenges will be examined in greater detail in the subsequent section. 2.3.1 Exploring Colocated VR As described in the previous section, colocated shared VR refers to multi-user applications where multiple users share both the virtual space and the physical space. An example on colocated users can be found in Figure 2.6. Recent studies have shown that this type of multi-user experience can create a strong social atmosphere with a positive impact on social closeness between users [108]. Such a colocated scenario can be either symmetric, where all users use the same interface [15][91], or asymmetric, where users use different interfaces to interact with the shared environment [86]. For instance, Drey et al. conducted a comparative study of symmetric and asymmetric methods in the context of pair-learning and found that both approaches yielded equivalent results in terms of learning success [19]. In our work, we focus on symmetric approaches to give the users of the application a similar experience. 
To reliably synchronize both realities, precise localization of all users in the physical space is essential. One common approach involves continuously synchronizing all users with the help of external camera systems. These camera systems are frequently used in 11YULLBE Go: https://yullbe.com/yullbe-go/ (Accessed: 2023-10-10) 18 2.3. Advancing Multi-user VR Systems Figure 2.6: Two colocated users in both the physical and virtual space, with hand tracking enabled. The point of view (POV) is from the left user within the virtual scene. VR arena experiences and are offered by companies such as Vicon 12 and OptiTrack 13. However, these systems can be expensive and have limitations in terms of spatial size and the number of users they can accommodate. As an alternative, Weissker et al. proposed a low-cost colocation system using the Lighthouse tracking system of the HTC Vive [122]. They use a single tracking system to synchronize the tracking data of multiple users to a common coordinate system, enabling colocation in a more cost-effective manner. In contrast, SLAM-tracked VR headsets, such as the Meta Quest, use inside-out tracking without the need for external cameras. These headsets create their own internal environ- mental mapping of the physical space. Therefore, solutions are required to synchronize the physical positions of all users within their own environmental data. Researchers have already developed solutions to address this challenge for virtual (VR) and augmented (AR) reality headsets. McGill et al. investigated practical and cost-effective ways to implement colocated scenarios for SLAM-tracked headsets. They categorized these solutions into two main types: ’Aligning to a single known point’ and ’Aligning to two known points’ [70]. The single-point calibration method is straightforward. It involves having a single point in the real world and a corresponding reference point in the virtual world. When the location of the user’s headset relative to the real-world point is known, the user’s virtual pose can be set accordingly. An example of this setup can be seen in the CAVE experience by Layng et al., where each user is assigned a seat with a predetermined position that is mirrored in the virtual world [58]. Herscher et al. employed a similar approach in their CAVRN system, where users are positioned in real-world seats corresponding to virtual orientations and positions [38]. However, these systems are limited as they only recognize 12Vicon: https://www.vicon.com/hardware/cameras/ (Accessed: 2023-10-11) 13OptiTrack: https://optitrack.com/applications/virtual-reality/ (Accessed: 2023- 10-11) 19 2. Related Work users based on their seating positions, and further tracking and movement within the virtual world are not supported after relocalization. ’TritonVR’ 14, a multiplayer shooting game, offers a different approach by allowing users to stand in the same physical location, followed by recalibrating their positions in the virtual environment to match their real- world poses. The precision of this method depends on how accurately users stand on the predefined positions. On the other hand, the two-point calibration method uses two points or anchors in the real world to better approximate the users’ poses and reduce drifting in larger rooms [70]. While this calibration method requires more calibration points and takes longer than single-point calibration, it can potentially provide more precise results by compensating for tracking errors. 
In our research, we implement both one-point calibration and two- point calibration in a standard room-sized environment (4x4 meters) and compare their results in section 4.4. Colocation can also be achieved by tracking markers attached to headsets. DeFanti proposed a solution in which multiple users within a colocated space track each other using cameras on their HMDs to track ArUco markers attached to each user. These data are then shared among users to recalculate their relative positions in the virtual world [15]. However, this method depends on accurate and consistent recognition of each other’s markers. We investigate a modified version of marker-based colocation in our experiments in section 4.2.2. A drawback of the previously mentioned methods is their reliance on either additional AR-tracking cameras or the precise positioning of users in the real world. As more SLAM-tracked VR headsets, which also offer integrated hand tracking, become available (e.g., Meta Quest, or VIVE Focus15), we propose a solution in section 4.2.3 to use this feature for colocation. This approach ensures precision independent of the user and relies solely on the tracking system. Additionally, it eliminates the need for additional hardware, using only the SLAM-tracked headsets. We can also differentiate between continuous calibration and one-time calibration. Con- tinuous calibration, as the name suggests, updates each user’s location every frame in the tracking system’s coordinate system when external cameras are used. An example of this approach is the calibration using the Lighthouse tracking system of the HTC Vive by Weissker et al. [122]. Waller et al. used an infrared optical outside-in tracking solution with eight cameras and an HMD attached to a rendering computer worn by a user for their ’HIVE’ system [118]. DeFanti used cameras attached to the users’ HMDs to continuously track other users [15]. Furthermore, Podkosova et al. created ImmersiveDeck, a system that combines inside-out head tracking with motion capture, enabling multiple users to move freely in a large area (i.e. 200m2) [91]. In the field of robotics, a shared spatial map can be used for collaborative applications to maximize efficiency in environment 14Triton VR: https://www.tetrastudios.com.au/tritonvr (Accessed: 2023-10-12) 15VIVE Focus: https://enterprise.vive.com/de/product/vive-focus/ (Accessed: 2023-10-12) 20 2.3. Advancing Multi-user VR Systems exploration [23]. Unlike continuous calibration, one-time calibration aligns all devices only once. The accuracy of colocated systems after calibration depends on the tracking accuracy of the device. If the error becomes too significant, a recalibration is necessary. For instance, McGill et al. used one-time calibrations for their experiments, whether for a one-point or two-point calibration [70]. In seated experiences, such as CAVE or CAVRN, users are placed in real-world seats that correspond to virtual orientations and positions, and continuous calibration is not used [58][38]. Given our desire to minimize additional hardware and efforts, our approach relies on one-time calibration, leveraging the low drift of our SLAM devices after calibration. However, we ensure that recalibration remains a viable option with minimal effort. 2.3.2 Hand Interactions in Multi-User Systems Interaction stands as a very important concept within the domain of virtual environments, such as in virtual reality, augmented reality or even for tabletop. 
It can be manifested through various means, including gestures and direct manipulation. There is existing work comparing hand interactions to conventional controller input. Khundam et al. compared interaction time and usability between hand and controller interactions in intubation training, with the result that there are no significant differences [49]. Zhao et al. also found comparable user preference and performance between hand and controller interactions for virtual locomotion [131]. However, Schäfer et al. found significantly better performance and accuracy for interactions in which objects are picked up and put down between the use of hands and controllers [101]. This is also inline with other studies by Masurovsky et al. and Hameed et al. that show the preferential usability and accuracy of controller-based interactions compared to hand interactions [68][35]. However, they also emphasize that growing acceptance and familiarization with hand interaction systems, as well as the continuous improvement and better availability of hand recognition systems and their naturalness, offer the potential to overtake conventional controllers in terms of user experience in the future [68]. Extending the interaction concept into shared virtual worlds introduces both challenges and opportunities. For instance, as early as 2007, Streuber et al. investigated the impact of multi-user environments on joint actions and social behavior. In their study, they designed a test in which a colocated user pair had to navigate a stretcher through an obstacle course. Their findings suggested that humans can quickly adapt to the lack of haptic and tactile feedback when fully immersed in the virtual environment [106]. Further research has focused on designing frameworks and strategies tailored to multi-user contexts. For example, Gong et al. conducted a case study to develop interaction design strategies for multi-user virtual reality systems in manufacturing settings, underscoring the importance of robust interaction systems in such scenarios [31]. Langbehn et al. cautioned against making substantial user manipulations, like altering a user’s voice pitch, as it may distract users from interactions in the virtual environment [54]. 21 2. Related Work In addition to multi-user VR applications, Zaman et al. aimed to create a platform that enables collaborative tasks regardless of the type of head-mounted display, thus enabling collaboration between multiple VR and AR devices [128]. Olin et al. explored cross-device interaction in a virtual environment, allowing handheld device users to engage with VR headset users. Their research provided a framework and design suggestions for such scenarios, demonstrating immersive integration of non-VR users into virtual interactions [83]. Within the scope of multi-user AR applications, Jansen et al. introduced ShARE, a head-worn AR device equipped with a projector to enable simultaneous interactions with non-AR users. Through marker detection, outside users can interact with the projected image, fostering multi-user interactions across three different applications: collaborative games, competitive games, and external visualizations, resulting in a proof-of-concept device for multi-device multi-user AR interactions in a shared virtual world [44]. 
However, in the context of multi-user VR applications, one significant aspect of interaction involves the hands themselves, which is particularly relevant to the current work aimed at improving hand recognition and interaction in colocated scenarios. Pretto et al. conducted research on hand interaction through gestures using Google Cardboard as an HMD and Leap Motion for hand tracking. They applied this technology in a forensics use case, integrating real-world objects into a VR environment [93]. Beyond the VR domain, multi-user hand interactions have been explored in the media domain using multimedia tabletops. Del Bimbo et al., for instance, created a computer vision-based system and algorithms that uses hand recognition to enable multi-user interactions at a display table, even with distinguishing hands of two users that were overlapped in the 2D area [16]. Dohse et al. developed a tabletop that utilized a combination of computer vision for hand recognition and touch detection to distinguish hands in close proximity. Multiple users interacting with the tabletop can be seen in Figure 2.7. This work emphasized the importance of distinguishing and assigning hands to different users to enable interactions in multi-user and close proximity scenarios [18]. We address this issue in the context of colocated multi-user VR in chapter 6. In the realm of colocated VR scenarios, an area of focus in this work, researchers have investigated various interaction possibilities and their impact on the user experience. Salzmann et al. examined collaborative assembly tasks using only hands in the automotive industry, demonstrating their suitability for two simultaneous users [96]. Their findings revealed higher precision in task execution and user preference for these tasks. Li et al. conducted user testing to evaluate the influence of interactions using hand-held controllers in multi-user co-location scenarios for cultural heritage. Their conclusion highlighted the positive effects of social influence on performance expectancy and effort expectancy [61]. All of these systems emphasize the importance of interactions in shared virtual experiences and the challenges they involve, such as obscured hands and hand-body ambiguity. Consequently, the system presented in this thesis represents a significant step towards providing reliable hand interactions in a colocated multi-user application. 22 2.4. Fostering Hand-Body Association in Multi-user Scenarios Figure 2.7: Three users interacting with the multi-touch tabletop by Dohse et al. On top is a camera attached that is used for hand tracking [18]. The image is taken the authors’ work. 2.4 Fostering Hand-Body Association in Multi-user Scenarios Assigning tracked hands to specific users in a colocated shared virtual environment is crucial for creating a seamless and immersive experience. In 2D space, various research studies have explored hand ownership in egocentric views to assign detected hands to users. For example, Frati et al. utilized the capabilities of the Kinect in conjunction with wearable haptic devices for hand tracking (as stated in section 2.2.1), making it easier to assign these hands to users due to the additional full-body tracking provided by the Kinect [24]. However, most tracking systems are limited to tracking only the hands of users, and the user’s full body may not always be in the tracking device’s field of view, which requires alternative solutions for hand assignment. Narasimhaswamy et al. 
employed neural networks to associate hands with bodies by utilizing image detection of bodies and hands. They used overlapping tracking bounding boxes from image detection, as well as the locations of heads and hands, to perform hand assignments [80]. This approach demonstrates the potential for improved hand tracking and hand contact estimation with hand-body association. However, it relies on having a user’s body within the camera’s field of view for bounding box creation, as shown in Figure 2.8, an image taken from the work of Narasimhaswamy et al. [80]. Tsutsui et al. also used visual cues provided by image tracking, such as color, depth, skin texture, and shape, for hand identification in egocentric views [113]. However, their approach focused solely on identifying a user’s own hands, without considering other users and their hands 23 2. Related Work in the environment. Furthermore, their approach demonstrated a verification error rate of 36%, making it impractical for real-world scenarios and not applicable to our scenario. Figure 2.8: Hand-Body association by Narasimhaswamy et al. involves the creation of 2D bounding boxes to assign hands to users in the detected image [80]. Lee et al. explored the distinction of multiple hands in an egocentric view by utilizing the location of the hand in the two-dimensional recognition frame image. However, this method was not designed for three-dimensional tracked hands [59]. Lin et al. proposed a method for detecting hand raising in classrooms using image detection on a single image and a selection algorithm, achieving a mean Average Precision of 90% [63]. Building on this method, Zhou et al. extended it with an improved detection algorithm, a pose algorithm, and a matching technique for the raised hand and the hand raiser by utilizing the location of the hand and various keypoints of the hand-raisers body (face and limbs) with an average recognition precision of 83% [132]. However, all the mentioned research is either confined to hand assignment in the 2D space [59][63][132], or relies on additional tracking information such as skin color or 2D tracking boundaries [80][113]. These approaches are not directly applicable to our use case, which involves assigning hands in a colocated virtual environment in 3D space, with limited information available by the employed tracking systems, specifically, the location of the hands and user heads. Therefore, our methods presented in chapter 6 build on existing 2D domain methods and extend them with our ideas tailored to the 3D domain. 24 CHAPTER 3 EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios This chapter introduces ’EasyHand’, a Unity3D framework developed to harmonize the hand tracking capabilities of various hand tracking systems. ’EasyHand’ serves as the foundational system for visualizing, positioning, and interacting with tracked hands in all the experiments presented in this thesis. To maintain a concise and easily comprehensible structure within the chapter, the sections will typically commence with an outline of the concept and design of each component. If required, we will then proceed with an explanation of its implementation. We begin by providing an overview of the ’EasyHand’ system, outlining its primary objectives and the requirements that guided its design and development. We will also elaborate on the hardware and software compatibility of this framework. 
We demonstrate how the unification of different base tracking systems within the framework for universal usage in the development process is done. This unification includes the integration of hand detection and gesture recognition events. After that, it is explained, how visualization for skeletal and mesh rendering as well as interaction is done based on the unified data. Then, we address the conceptualization and implementation of the network layer, a mandatory component for creating colocated scenarios. This will be followed by an explo- ration of our efforts to enhance the framework’s capabilities through the implementation of plugins. Finally, we conclude this chapter with an overview of the current state of the ’EasyHand’ framework and discuss potential future development. 25 3. EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios 3.1 System Overview The framework was initially designed with specific capabilities in mind. It not only serves as a hand-interaction layer, but also encompasses hand tracking capabilities. These capabilities encompass networking features for multi-user applications, including colocation capabilities, as well as the ability to assign tracked hands to virtual users. To our knowledge, the ’EasyHand’ framework stands as the sole system currently available capable of creating colocated multi-user applications utilizing hand interactions. Moreover, it is compatible with various existing base hand tracking systems. 3.1.1 Overall System Design Initially, the system was designed to unify various hand tracking systems, such as Meta Quest, LeapMotion, or HTC Vive, providing developers with an easily accessible framework to develop applications across multiple platforms. Each of these systems is equipped with its own unique API and functionalities for visualization, gesture recognition, and interaction (see Figure 3.1). The term ’gesture’ used in this paper refers to a fixed hand posture, such as all fingers being bent (’Fist’ gesture) or the index and middle fingers being stretched (’Peace’ gesture). Our ’EasyHand’ framework serves as an intermediary layer between the application and these base APIs, allowing developers to employ a unified system for visualization, interaction with the virtual environment and gesture recognition across all supported APIs, as illustrated in Figure 3.2. Furthermore, the system is designed to facilitate the seamless addition of new hand tracking systems to expand its compatibility. Figure 3.1: Overview of hand tracking API integration of different systems in one application. Each system comes with distinct visualization, interaction and gestures. To fit our needs we defined the following requirements for the system: • The system should support several base systems. As the framework serves as an intermediary layer for developers, it is designed to 26 3.1. System Overview Figure 3.2: ’EasyHand’ acting as a layer between several hand tracking APIs and the application, unifying visualization (V), interaction (I) and gesture recognition (G). support several popular state-of-the-art hand tracking systems, including those used in the Meta Quest and Vive HMDs, as well as the Ultraleap system. • The system should implement visualization, interaction, and simple gestures. All data are unified. Developers will utilize this system to craft interactive applications. 
Consequently, the framework should support a unified visualization for both, with simple lines and a visually more realistic mesh rendering. Furthermore, interactions should remain consistent across various base systems, with a similar set of recognized gestures (e.g. pinch) being supported. The system should also provide events for the detection and loss of the hand to adequately handle the visibility of the virtual hand. Additionally, all tracked data should be accessible through a unified data structure. • The system should be easily expandable. As developers may have diverse requirements, they should have the flexibility to easily integrate hand tracking systems that were not initially included. This can be achieved by adding a unification class as an intermediate layer between the tracking system and the ’EasyHand’ framework. • The system should be usable in real time. Provided that the basic recognition systems allow it, a framerate of at least 60 frames per second (FPS), which is necessary in VR for a smooth run, should be achieved. 27 3. EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios With the refinement of experiments and the addition of desired colocation capabilities, the following requirements were subsequently defined: • The system should support networking. To create multi-user scenarios with hand tracking capabilities, the system must be capable of efficiently synchronizing all tracked hand data over a low-latency network layer. • The system should be able to create colocated scenarios. The ’EasyHand’ framework, utilizing integrated hand tracking solutions, should enable the creation of colocated multi-user scenarios, including the handling of calibration and synchronization of the coordinate systems. • The system should provide hand tracking for more than two hands. To support other tracking systems inside colocated multi-user scenarios, the system should be able to track more than two hands, respectively the hands of colocated users. To achieve this the framework should incorporate a fitting RGB camera hand tracking algorithm and seamlessly integrate it into the system. This also enhances accessibility and hand tracking is not restricted to VR environments but can also find usage in desktop applications. • The system should be able to assign hands to virtual colocated users. To be able to support each other’s hand tracking by combining tracking cameras and ensure accurate interactions, the system should possess the capability to consistently assign tracked hands to virtual users within a colocated multi-user scenario. 3.1.2 Implementation Environment Many different engines are available to develop applications for XR. Among them are the Unreal Engine1, Unity3D2, CryEngine3 or OpenXR4. To support a wide range of hand tracking systems, it is necessary to choose a well-supported engine. A 2019 TIGA study in the UK indicates that a majority (72%) of developers surveyed use the Unity3D engine [105]. Due to the wide adoption and strong developer support for this platform, as well as the fact that we don’t have to worry about rendering and physics, Unity3D was chosen as the development platform. For the basic hand detection of the different systems, the corresponding Unity3D plugins were included. This allows the ’EasyHand’ framework to be made available to a wide range of developers. 
1Unreal Engine: https://www.unrealengine.com/en-US/xr (Accessed: 2023-10-06) 2Unity3D Engine: https://www.unity.com/solutions/vr(Accessed:2023-10-06) 3CryEngine: https://www.cryengine.com/ (Accessed: 2023-10-06) 4Khronos OpenXR: https://www.khronos.org/openxr/ (Accessed: 2023-10-06) 28 3.2. Hand Detection 3.2 Hand Detection The most critical component of the system is hand detection itself. To provide a visual representation of the process, we start with integrating base tracking systems. This involves unifying the incoming base data into a standardized dataset with an index mapping of all detected finger joints that can be used independently of the tracking system in use. This unification of the base system is also where developers can add new tracking systems to the ’EasyHand’ framework, as depicted in Figure 3.13. The developer only has to provide the index mapping (as explained in section 3.2.2) and an implementation of the ’ITracker’ interface class, where the tracking systems low-level hand detection is mapped to the unified data set of ’EasyHand’. From this unified dataset, we can proceed to use low-level events or, in cases where they do not already exist, create custom events to trigger actions when hands are detected or lost by the tracking device. Additionally, we employ a similar approach to handle detected hand gestures. 3.2.1 Low-Level Tracking Modules As low-level tracking modules we chose systems that are widely used for optical hand tracking in XR scenarios. This includes the following systems (MediaPipe for RGB tracking is explained in chapter 5): Figure 3.3: Hemispherical area of the interaction zone of the LeapMotion Sensor. LeapMotion The LeapMotion hand tracking from Ultraleap is a tracking device consisting of two monochromatic IR cameras and three infrared LEDs. It can detect hands in a hemispherical area with an interaction zone of up to 60 cm and a field of view (FOV) of 120° x 150° [43] (see Figure 3.3)5. It is either used on a tabletop, on top of a 5LeapMotion Hemisphere: https://cms.ultraleap.com/app/uploads/2020/09/ultralea p-hand-tracking-interaction-zone-side.jpg (Accessed: 2024-01-07) 29 3. EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios screen or it can be attached to a VR HMD (see Figure 3.4). The underlying Ultraleap API extracts 27 distinct joints per hand with positional and rotational data (including elbow recognition which is irrelevant for our framework). The detected joints can be seen in Figure 3.4. Figure 3.4: Left: The LeapMotion controller attached to a VR HMD. Right: Detected joints of the Ultraleap API Meta Quest The Meta Quest6 (2019) / Meta Quest 2 (2020) is a state-of-the-art virtual reality head-mounted display (HMD) that utilizes SLAM (Simultaneous Localization and Mapping) technology for 6 Degrees of Freedom (6 DoF) tracking. The device is equipped with four monochronous cameras, which are used to track the user’s hands within a 3D space, a concept presented with the commercial solution MEgATrack proposed by Han et al.[37]. This hand tracking functionality is achieved through the application of deep neural networks, enabling the estimation of the hands’ location and the detection of 21 individual hand joints. See Figure 3.5 for a visual representation of the device and how tracked hands appear in a greyscale passthrough view.7 Figure 3.5: Left: The Meta Quest 2 HMD. Right: Detected hands and joints with the Meta Quest 2. 
6Originally Oculus Quest before Meta rebranding in 2021 7Quest Hand Greyscale: https://www.uploadvr.com/content/images/2021/04/QuestHan dTrackingGreyscale.png (Accessed: 2023-10-06) 30 3.2. Hand Detection Vive Hand Tracking The Vive Hand Tracking SDK provides hand tracking capabilities for supported headsets within the HTC Vive series, including the Valve Index. This SDK offers 2D and 3D tracking for a total of 21 finger joints and operates at a smooth frame rate of 60 FPS for minimal GPU requirements for VR with a NVIDIA GTX1060/AMD RX480 GP [14]. Additionally, the Vive Hand Tracking SDK includes support for base gesture recognition, which includes a set of six gestures, all of which are integrated into the ’EasyHand’ framework. While the number of finger joints is the same as with the Meta Quest hand recognition system, the indexing differs. An overview of the detected joints and available gestures can be found in Figure 3.6. Figure 3.6: Left: The detected joints of the Vive Hand Tracking. Right: Recognized Gestures of the Vive Hand Tracking. 3.2.2 Unified Joint Mapping As depicted in Figures 3.4, 3.5, and 3.6, each of the base systems is capable of detecting finger joints and assigning a specific integer index to each joint. However, it’s important to note that this indexing is not standardized across these systems. The ability to discern which part of the finger is currently being tracked is crucial for accurate hand visualization and gesture recognition. To address this challenge, we introduce our own index mapping system (where we were inspired by LeapMotion and Quest) within the ’EasyHand’ framework. This mapping defines a name for each part of the finger and assigns a corresponding index to each part. This name definition can be seen in Figure 3.7. During an initialization step, we establish a correlation between the index assigned by the base tracking system and the corresponding ’EasyHand’ joint/bone index. For instance, let’s consider the ’EasyHand’ joint named ’Wrist’. Initially, Vive and Quest assign this joint the index ’0’, while LeapMotion assigns it an index ’20’. Through the aforementioned mapping process, we ensure that we consistently obtain the correct location and rotation data when referencing the bone identified as ’Wrist.’ 31 3. EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios Figure 3.7: Joint index mapping and description of an ’EasyHand’ hand skeleton. With this mapping in place, we can easily maintain unified data structures for tracked hands. As one of the initial steps in the ’EasyHand’ update cycle, we map the detected base joints to this unified data structure. This allows us to subsequently build upon these data, enabling us to perform all logic related to gestures, visualization, and interaction without relying heavily on the specifics of the underlying base tracking system. The current implementation is limited to the joints used in Figure 3.7. If a base system has more, the most suitable ones are selected and the others are discarded. The case that the base system does not have these joints is not yet covered, but offers the potential to calculate them in a future extension by interpolation, for example. However, this case has not occurred for the base systems used in this work. In this context, the only remaining information we need to obtain from the base system, if available, is about detection events and gesture recognition. 
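To make this mapping step concrete, the following C# fragment sketches how a base system could be adapted to the unified joint set. It is a minimal illustration only: the enum EHJoint, the shape of the ITracker interface shown here, and the adapter class are assumed stand-ins rather than the framework's actual definitions, and apart from the LeapMotion wrist index of 20 mentioned above, all index values are placeholders.

using System.Collections.Generic;
using UnityEngine;

// Sketch of the unification step of Section 3.2.2. Joint names follow the
// naming scheme of Figure 3.7 (e.g. Wrist, Middle1); the concrete enum
// values and the interface signature are illustrative assumptions.
public enum EHJoint { Wrist, Thumb1, Index1, Middle1 /* ... remaining joints ... */ }

public interface ITracker
{
    // Fills the unified pose of one joint; returns false while no hand is tracked.
    bool TryGetJointPose(EHJoint joint, bool leftHand,
                         out Vector3 position, out Quaternion rotation);
}

public class LeapTrackerAdapter : ITracker
{
    // Maps EasyHand joint names to the indices used by the base API;
    // only the wrist index (20 for LeapMotion) is taken from the text above.
    private static readonly Dictionary<EHJoint, int> IndexMap =
        new Dictionary<EHJoint, int> { { EHJoint.Wrist, 20 } /* ... */ };

    public bool TryGetJointPose(EHJoint joint, bool leftHand,
                                out Vector3 position, out Quaternion rotation)
    {
        int baseIndex = IndexMap[joint];
        // Here the vendor SDK would be queried with baseIndex and its result
        // converted into the unified coordinate convention (omitted in this sketch).
        position = Vector3.zero;
        rotation = Quaternion.identity;
        return false;
    }
}

During initialization the framework builds one such correlation per base system, so that all higher-level logic can address joints by their 'EasyHand' name alone.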
3.2.3 Gesture Recognition - Design and Implementation Recognizing various hand gestures is a broad topic for scientific research in its own right. Therefore, our primary focus does not encompass the recognition of complex gestures but centers on the identification of simple gestures to facilitate basic gesture integration for developers. These simple gestures comprise six distinct hand gestures, as previously detailed in the HTC Vive hand tracking plugin overview presented in Figure 3.6. To determine whether a specific gesture has been executed, we initially verify if the underlying recognition system successfully identifies any of the predefined gestures autonomously. In cases where this recognition does not occur, we resort to our custom implementation for gesture recognition. Since all of the mentioned gestures hinge on the extension or bending of fingers, we evaluate the angles at each finger joint. Each finger consists of four joints, forming three vectors. As illustrated in Figure 3.7 for the index finger, we can establish vectors as follows: v1 = 6̄5, v2 = 7̄6 & v3 = 8̄7. We then calculate the resulting angle (α) between v1 & v2, as well as between v2 & v3, add them up, and 32 3.3. Visualization and Interaction Figure 3.8: The ’Peace’ gesture recognized by the system. Left: Labels indicate which fingers are bent and which are not. Here the system recognizes the gesture when all fingers are bent except the index and middle Finger. Right: Angle calculations between the joints are used to determine which fingers are bent and which are not. subsequently apply predefined threshold values to discern whether the finger is extended (α < 45◦) or bent (α > 100◦). An example calculation for the ’Peace’ gesture is shown in Figure 3.8, where all fingers need to be bent except the index and middle fingers. 3.3 Visualization and Interaction After detecting the hands and assigning IDs to the detected joints, users require the ability to incorporate these hands into the virtual world. To accomplish this, it is essential to provide users with a means to visualize these hands and imbue them with physical capabilities, e.g. colliders, enabling interaction with the virtual environment. Special events that originate from the base tracking system when hands are newly detected or detection is lost are used to trigger visualization. For visualization, we have implemented two widely recognized approaches (such as those used by Metzner et al. [74]): one employs an abstract representation of the detected joints and connecting bones, while the other utilizes a more realistic mesh representation of the hand. As Ricca et al. suggested that the visual representation of the hand does not always impact tool-based motor tasks training in immersive VR simulators, these visualizations promise good and interactively usable representations of the hand[95]. 33 3. EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios Figure 3.9: Skeletal rendering of detected hands. 3.3.1 Skeletal One of the fundamental representations of the hand is the skeletal visualization, comprising spheres positioned at the detected joint locations and connecting lines to illustrate the bones. This approach enables users to identify each individual finger joint, as illustrated in Figure 3.9. The hand’s world position is determined by the location of the ’Wrist’ joints, and its orientation is derived from the resulting vector connecting the ’Wrist’ and ’Middle1’ joints. 
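Two of the per-frame computations described above, the finger-bend test of section 3.2.3 and the skeletal hand pose of section 3.3.1, can be sketched as follows. This is a minimal illustration under the assumption that the four joint positions of a finger (e.g. joints 5 to 8 for the index finger) and the Wrist and Middle1 positions are available from the unified data structure; the class and method names are ours, the 45° and 100° thresholds follow the description above, and the up vector used for the hand orientation is a simplification.

using UnityEngine;

public static class HandMath
{
    public enum FingerState { Extended, Bent, Undetermined }

    // Angle test of Section 3.2.3: 'fingerJoints' holds the four joint
    // positions of one finger, ordered from knuckle to fingertip.
    public static FingerState ClassifyFinger(Vector3[] fingerJoints)
    {
        Vector3 v1 = fingerJoints[1] - fingerJoints[0];
        Vector3 v2 = fingerJoints[2] - fingerJoints[1];
        Vector3 v3 = fingerJoints[3] - fingerJoints[2];

        // Sum of the two bending angles along the finger.
        float alpha = Vector3.Angle(v1, v2) + Vector3.Angle(v2, v3);

        if (alpha < 45f) return FingerState.Extended;
        if (alpha > 100f) return FingerState.Bent;
        return FingerState.Undetermined;
    }

    // Skeletal hand pose of Section 3.3.1: position from the wrist joint,
    // orientation from the wrist-to-Middle1 vector. The world up vector is
    // an assumption; a palm normal could be used instead.
    public static void HandPose(Vector3 wrist, Vector3 middle1,
                                out Vector3 position, out Quaternion rotation)
    {
        position = wrist;
        rotation = Quaternion.LookRotation(middle1 - wrist, Vector3.up);
    }
}

A gesture such as 'Peace' is then reported when the index and middle fingers are Extended and the remaining fingers are Bent, mirroring the example in Figure 3.8.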
3.3.2 Mesh For a more lifelike representation of the tracked hand, we employ predefined meshes that offer adaptable visual presentations, allowing us to choose between realistic textures, abstract wireframes, and the ability to alter skin color, among other options. To align the hand mesh with the detected hand, each rigged bone in the mesh is mapped to the corresponding joint index, as illustrated in Figure 3.7. The position and rotation of each joint are then dynamically adjusted through code to accurately mimic the connections of real hand bones. To ensure that the mesh closely matches the user’s real hand size, we compute an ideal finger length based on the positions of the detected joints. This approach allows us to resize the bones to approximate the user’s specific hand dimensions. The resulting visual representation can be observed in Figure 3.10. We have taken advantage of existing resources, incorporating pre-assembled meshes from the ’OculusIntegration’ Unity package [73], and we have also utilized the mesh calculations already used in the Vive hand tracking system for Unity [13]. 34 3.3. Visualization and Interaction Figure 3.10: Skeletal rendering of detected hands. 3.3.3 Direct 3D Object Manipulation in Unity3D Interacting with the virtual environment constitutes a crucial aspect of any VR application. One mode of interaction involves using predefined gestures, as explained in section 3.2.3. Another approach is direct interaction with and manipulation of the virtual environment, encompassing actions such as pressing buttons, picking up and tossing virtual objects, or selecting items. To enable these interactions, we must facilitate physical interactions with the visualized hand. To achieve this, we used the existing collider component within the Unity game engine. Since we already have the locations of each finger joint, and therefore the length of each bone, we can easily generate capsule colliders with corresponding lengths for each bone. The positions of these colliders are then dynamically adjusted in each physics update loop to align with the user’s hand movements. An alternative approach would involve creating an exact mesh collider to match the hand precisely. While this approach would offer greater precision in interactions, it presents challenges due to the static nature of meshes in the Unity engine. Creating such a mesh every time the hand pose changes would impose an exceedingly high computational load. With our implemented solution, we maintain an interactable hand with good precision while simultaneously managing an acceptable computational load in the physics calculations. 3.3.4 EasyHandRig Template Implementation To seamlessly integrate the concepts presented above into the VR environment, we have devised a template structure known as ’EasyHandRig’. This template simplifies the process for developers, enabling them to effortlessly incorporate functional hand tracking 35 3. EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios into their projects. The architectural components of this template are illustrated in Figure 3.11. Figure 3.11: Components for the ’EasyHandRig’ template. At the core of this template is the ’EasyHandManager’ class, which serves as a central hub for developers to select their preferred hand tracking system (e.g., Quest or Ultraleap). 
This class is responsible for initializing the foundational tracking system, selecting the mapping joints as per the developer’s specifications, and initializing the visualization and gesture detection subsystems. The ’HandTrackingProvider’ is responsible for providing the unified joint data for each rendered frame to both the visualizer and gesture recognizer. To render the virtual scene, we use a virtual camera that is equipped with a pose tracker that handles the translation of real-world head-mounted display (HMD) movements into corresponding virtual world transforms. An example of the ’EasyHandRig’ inside the Unity Engine is illustrated in Figure 3.12. 3.4 Multi-User Capabilities To enable colocated VR scenarios, our system is designed to create multi-user VR experiences. To achieve this, we use Photon PUN (Photon Unity Networking)8, a real-time cloud networking solution designed for multiplayer games and applications. Photon PUN operates on a client-host architecture, where clients communicate with a central server, which then relays messages to other connected clients. This approach 8Photon PUN: https://www.photonengine.com/pun(Accessed:2023-10-06) 36 3.4. Multi-User Capabilities Figure 3.12: The ’EasyHandRig’ inside the Unity engine. The ’EasyHandManger’ component is placed on the parent ’EasyHandRig’ game object. Each hand object has its own visualizer and gesture recognizer. The VR camera represents the user’s HMD and is responsible for user positioning and rendering. In this example, Meta Quest hand tracking is used, which is why an OVRManager is created at runtime to obtain the low-level data of the tracking system for unification. eliminates the need for direct communication between connected clients, simplifying the communication process when multiple clients are connected simultaneously. Table 3.1: Overview over all synchronized that is sent to the Photon server and then broadcasted to all connected clients Name Type Size (Bytes) Sync Rate Connect ID Integer 4 Once Rig Transform Vector3 & Quaternion 28 60 Hz Camera Transform Vector3 & Quaternion 28 60 Hz Hand Data Hand 373++ 60 Hz Gestures GestureEventArgs 9++ Once when detected Synchronizing Data Photon automatically configures the cloud server to broadcast received data to all connected clients while also managing matchmaking and handling connection-related events, including connection losses. To minimize latency and optimize data handling, only essential data is sent to the server and subsequently broadcasted to all clients. An overview of the synchronized data is provided in Table 3.1. In cases where ’++’ follows the sent package size, it indicates that the sent data has a minimum size specified but may be larger due to the inclusion of dynamic data, such as strings with variable lengths. Hand De-/Serialization As mentioned in Table 3.1, the synchronization of detected hand data over the network is a needed component of our system. Leveraging our unified data class for recognized hands, we can precisely define the data to be serialized. Given that connected clients require more than just joint positions, we also synchronize 37 3. EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios additional relevant information. Table 3.2 provides an overview of the elements needed for synchronization. 
Table 3.2: Synchronized data for a serialized hand Name Type Size (Bytes) Left or Right Hand Boolean 1 ID Length Integer 4 ID String ID Length Hand Transform Vector3 & Quaternion 28 Tracking Method Integer 4 21 Joint Indices Integer 21 x 4 21 Joint Positions Vector3 21 x 12 This results in a data set with a minimum size of 373 bytes, generated from the hand data class during each synchronization cycle. This data set is then transmitted to the clients and parsed back into the appropriate data class. Subsequently, it can be employed for visualization and interactions, as detailed in the previous sections. The same process is applied to recognized gesture data. Alongside the recognized gesture itself, which occupies 4 bytes in size, we transmit additional information, including whether the gesture is performed by the left or right hand (1 byte) and the sender’s ID (4 bytes for the ID length, with a custom size of bytes for the actual ID, since we don’t know the length of the ID String). This information enables every client to definitely associate a gesture with a specific hand of a network user. 3.5 Distribution and Extensions The framework has been developed as an extension for Unity3D, designed to be modular and easily accessible for developers to seamlessly integrate into both new and existing projects. It adheres to the structure of Unity’s UPM (Unity Package Manager) workflow [109] and is hosted on the Git provider ’Bitbucket’9. This enables developers to effortlessly incorporate the package and all its dependencies through Unity’s Package Manager using the provided Git URL. An overview of their integration is depicted in Figure 3.13 at the ’System Integration’ part. To streamline the integration process and minimize overhead, we have externalized base-tracking systems and additional features as separate packages. The core framework comprises the following essential components: • Core Source Code: Containing fundamental code, including data structures, mapping, visualization, gesture recognition, and physics. 9EasyHand Repository: https://bitbucket.org/Densen90/easyhand (Accessed: 2023-10-17) 38 3.6. Current State • MediaPipe Tracking: Handling the logic and integration of the MediaPipe system, enabling hand tracking from webcams (see chapter 5). • Networking: Offering seamless network integration, facilitating multi-user capa- bilities, and serialization / deserialization of hand data. • Hand-Ownership Estimation: Containing classes for estimating hand ownership using straightforward methods. Should developers require additional functionality, custom plugins can be incorporated. A plugin is essentially an archive hosted on a web server containing the necessary logic implementations, external dependencies, packages, and the required assets and resources. The following plugins are available and provided by us: • LeapIntegration: Allow hand tracking with the LeapMotion tracking controller. • QuestIntegration: Allowing hand tracking with the Meta Quest or supported Meta HMDs. • ViveIntegration: Allowing hand tracking with all compatible HMDs using HTC hand tracking technology. • MLHOAgentsIntegration: Enhancing the hand ownership algorithm by integrat- ing Unity3D’s machine learning agents for assigning virtual hands (see section 6.2). • Colocation: An additional UPM package that relies on the ’EasyHand’ framework, facilitating the creation of colocated VR scenarios. This can be achieved using tracked hands, initial HMD placements, or AruCo markers (see section 4.2). 
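Returning to the Networking component listed above, the per-hand payload of Table 3.2 can be illustrated with a short serialization sketch. This is a minimal illustration only: the HandSerializer class and its BinaryWriter-based signature are assumptions and not the framework's actual Photon serialization code, but the field order and byte counts follow Table 3.2 and sum to the 373-byte minimum for an empty ID string.

using System.IO;
using System.Text;
using UnityEngine;

public static class HandSerializer
{
    // Packs one tracked hand in the field order of Table 3.2.
    public static byte[] Serialize(bool isLeftHand, string id,
                                   Vector3 handPosition, Quaternion handRotation,
                                   int trackingMethod,
                                   int[] jointIndices, Vector3[] jointPositions)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(isLeftHand);                        // 1 byte
            byte[] idBytes = Encoding.UTF8.GetBytes(id);
            writer.Write(idBytes.Length);                    // 4 bytes (ID length)
            writer.Write(idBytes);                           // variable-length ID
            WriteVector3(writer, handPosition);              // 12 bytes
            writer.Write(handRotation.x); writer.Write(handRotation.y);
            writer.Write(handRotation.z); writer.Write(handRotation.w); // 16 bytes
            writer.Write(trackingMethod);                    // 4 bytes
            for (int i = 0; i < 21; i++)                     // 21 x 4 bytes
                writer.Write(jointIndices[i]);
            for (int i = 0; i < 21; i++)                     // 21 x 12 bytes
                WriteVector3(writer, jointPositions[i]);
            return stream.ToArray();                         // >= 373 bytes in total
        }
    }

    private static void WriteVector3(BinaryWriter writer, Vector3 v)
    {
        writer.Write(v.x); writer.Write(v.y); writer.Write(v.z);
    }
}

The receiving client reads the same fields back in the same order before handing the reconstructed hand to the visualizer, as described in section 3.4.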
Archives for accessing these plugins, as well as the Colocation UPM package, are also hosted on servers of the git provider ’Bitbucket’10. 3.6 Current State The ’EasyHand’ system currently meets all the requirements outlined in section 3.1.1. A high-level overview of the system workflow is provided in Figure 3.13. The components denoted as ’System Integration’ and ’VR Headset Tracking’ are customiz- able and can be chosen to suit the user’s specific hand tracking and VR system preferences. The core logic for a unified Visualization, Interaction, and Gesture Recognition system, as explained in the preceding sections, is designed to seamlessly interface with these interchangeable integrated systems. The resulting unified data can then be synchronized 10EasyHand Plugins Repository: https://bitbucket.org/Densen90/eh_plugins (Accessed: 2023-10-17) 39 3. EasyHand - A Modular Hand Interaction and Visualization Framework for Single and Colocated VR Scenarios Figure 3.13: General overview of the flow of the ’EasyHand’ system. ’System Integration’ shows how base hand tracking systems are integrated. ’Unified Logic’ is the core logic for Visualization, Interactions, and Gesture Recognition of the system. These parts can be synchronized over the network. ’VR Headset Tracking’ is the integration of a VR headset to track the user’s head movement and work in conjunction with tracked hands in the final application. over the network to create a multi-user VR application. When there are multiple users present, the developer has the ability to use the Hand Ownership algorithm outlined in chapter 6 to discern which hand corresponds to each active user. The ’EasyHand’ framework is currently on version 1.1.811, implemented in Unity3D version 2022.3.5f1 and relies on Photon Unity Networking 2.30 for network functionality. At the time of writing this thesis, the framework supports the following hand tracking base systems: • Meta Quest Hand Tracking with ’OculusIntegration’ v.55.0 • LeapMotion Hand Tracking with Ultraleap Plugin v.6.10.0 • Vive Hand Tracking v.1.0.0 • MediaPipe with Hand Tracking Integration v.0.8.10, together with a Unity C# wrapper plugin v.0.10.312 Future work on the system could focus on the automatic integration of current versions of underlying base hand tracking APIs. Currently, the system includes plugin archives 11EasyHand Repository: https://bitbucket.org/Densen90/easyhand (Accessed: 2023-10-17) 12MediaPipe Unity Plugin: https://github.com/homuler/MediaPipeUnityPlugin (Accessed: 2023-10-06) 40 3.6. Current State hosted on a web server, each of which integrates a specific version of the base hand tracking API (e.g. OculusIntegration). Automatically referencing the officially released versions of these APIs would serve to reduce the file size of hosted archives. This approach would make it more convenient for developers to use the system and provide access to the latest features and performance updates of the tracking systems without the necessity of manually updating the plugin archives on the web server. Additionally, the integration of hand tracking capabilities of the OpenXR standard could expand the utility of the ’EasyHand’ system. This enhancement would enable compatibility with the universal XR platform, allowing for the seamless integration of multiple additional hand tracking systems, such as Microsoft Mixed Reality [32]. This expansion would facilitate uniform XR hand tracking within the system, further enhancing the user experience. 
Furthermore, it would open opportunities to incorporate networking capabilities into OpenXR, as discussed earlier. These advancements would contribute to making the ’EasyHand’ system more widely accessible and user-friendly. 41 CHAPTER 4 Using Tracked Hands to Create Colocated VR The following chapter will present our methods for synchronizing the coordinate systems the virtual worlds of two independent users using SLAM-tracked VR headsets. We will introduce three distinct approaches: one that involves the alignment of the HMDs, another where both systems track an ArUco marker, and a third where the HMDs are aligned by tracking the hands of the same user. Following this, a comparative experimental analysis will be performed to evaluate the calibration error, and we will discuss the optimal approach considering factors such as consistency, usability, ease of setup, and scalability. 4.1 Motivation In a colocated VR setup, poses of all users within the same coordinate frame need to be known. To achieve this, virtual users must be placed in correspondence to each others locations concerning their position and view direction. Figure 4.1 illustrates this. In setups where external cameras are used, this approach is easier, as the external camera system establishes the same shared coordinate system for all users inside the tracking range of the system. Such systems can utilize, for example, the Lighthouse tracking system of the Vive [122] or external camera systems like OptiTrack for large physical environments [29]. However, these systems are intricate to set up, come with high costs and are limited in their mobility. For these reasons, we aim to have no dependency on these systems for the creation of colocation. At the time of writing, head-mounted displays that use built-in visual SLAM techniques for head tracking are gaining popularity in the consumer VR market. These demonstrate the substantial advantage of not requiring an external camera setup for fast and precise 6DOF tracking. They can often allow virtual environments that are larger than those 43 4. Using Tracked Hands to Create Colocated VR Figure 4.1: Sketch of requirements for a colocated scenario. Left: Two users standing in front of each other at a distance d and their view directions v1 and v2. Right: Their virtual representations are aligned to have the fitting distance and view direction of their real-world counterparts. that similarly-priced external tracking camera installations can cover. The Meta Quest is an example of an HMD that uses SLAM and is available to consumers. As a SLAM-tracked device, each Meta Quest creates an individual environmental tracking map. However, for colocated shared environments, synchronizing the virtual space for all users becomes a challenging task, given that the tracking map cannot be read out and copied to other devices. As we aim to avoid reliance on external tracking systems for map synchronization, given the constraints of most devices that can be integrated with the HMD, our focus is on developing methods that synchronize exclusively with the user’s SLAM-tracked HMD — in this case, the Meta Quest headset. We address this challenge by investigating three methods that initially calibrate SLAM- tracked HMDs within the same physical environment to create a shared colocated VR scenario: • Fixed-point calibration: All colocated HMDs are initial placed at predefined positions within the physical environment. 
• Marker-based calibration: A marker in the physical environment is tracked simultaneously by all client applications running on the user’s HMDs. • Hand tracking-based calibration: The hands of one of the colocated users are used as spatial anchors, simultaneously tracked by all client applications. Although two of the investigated calibration methods - fixed-point calibration and marker- based calibration - have been used in previous research, to our knowledge, the hand tracking-based calibration method is entirely novel. To summarize, we present the following contributions with our experiment: 44 4.2. Calibration Methods for Creating Colocated VR • A new calibration method for shared colocated VR scenarios using SLAM-tracked HMDs with hand tracking. This method employs user hands as spatial calibration anchors, eliminating the need for additional infrastructure and demonstrating superior calibration accuracy. • An experimental comparative evaluation of the accuracy of three colocation cali- bration methods. • Analysis of limitations and future possibilities of the discussed calibration methods. To achieve our overarching goal of creating colocated shared environments with hand tracking, where tracked hands can be obtained by multiple systems and assigned to the correct user, the method presented here for creating colocated shared environments for SLAM-tracked headsets with the help of hand tracking is the first necessary step. 4.2 Calibration Methods for Creating Colocated VR In this section, we present the design and implementation of three different calibration methods that enable shared colocated VR scenarios. All discussed methods are designed to work with SLAM tracked headsets in general, are independent of external tracking systems and tested on the Meta Quest HMD. The details of our implementation of two previously published calibration methods - fixed-point calibration and marker-based calibration - are followed by the description of the design and implementation of our novel hand tracking-based method. 4.2.1 Fixed-Point Calibration This calibration method is the most simple and straightforward of the three presented methods. To prepare the physical environment for calibration, a specific point is predefined and marked (for example, the center of the room). We call this point UR. Then, a point in the virtual world UV is manually set up that should correspond to UR. After calibration, a user at position UR in the physical space should have the position UV in the virtual space. A set of distinct UR and UV positions is determined for each colocated user. Some applications use the same reference points for all users, calibrating user positions one after another [70]. We choose to set up a unique reference point pair for each user, enabling simultaneous calibration. For the calibration, we position the HMD of each user on the floor at their corresponding UR, rotated in the direction that is set with UV . Then we manually start the calibration process to align the virtual users with their reference points. Let Uprev be the virtual starting point of the user and UV be the pose we want the user to be aligned with. We define the position and rotation of the virtual user with{pprev, rprev} ∈ Uprev & {pV , rV } ∈ UV . We then determine the amount we have to rotate the user (α) and the amount we have to move the user (Δp) with the following formula: 45 4. 
Using Tracked Hands to Create Colocated VR

Δp = pV − pprev,   α = rV ∗ rprev⁻¹   (4.1)

The process can also be seen in Figure 4.2, where the locations and orientations of the users' heads are aligned with their physical HMD counterparts. In the setup used in our evaluation experiment described in section 4.3, the reference points of two users were placed one meter apart from each other, with the HMDs rotated to look at each other. However, reference points can be set at arbitrary distances, as long as their relative poses in the physical world correspond to those in the virtual environment.

Figure 4.2: Left: The users' headsets are positioned on predefined locations in the real world. Right: Virtual users who are repositioned to UV, which is the virtual representation of UR. Distance ΔdU is the same in the real and virtual world. Red arrows represent the view direction of the user.

4.2.2 Marker-Based Calibration

Our implementation of the marker-based calibration method uses an ArUco marker placed on the floor in the tracking space as a spatial anchor for the colocated HMDs. ArUco markers are square images featuring a black border and an inner matrix of white dots denoting the marker's ID [84]. These markers are easily created and sufficiently unique for our use case, as only one marker is required for our setup. We used a ZED Mini camera attached to the Meta Quest to detect the ArUco marker. However, an HMD's front-facing cameras could be used as well, as long as their video feed can be made accessible to the developer. Using the OpenCV framework, we calculate the position and rotation of the detected marker in camera space and from this recover the position and rotation of the camera in the coordinate frame associated with the marker. 46 4.2. Calibration Methods for Creating Colocated VR This marker also has a reference representation in the virtual world. This representation is our virtual anchor from which we relocate our users. Compared to fixed-point calibration, where UV is known in advance as the pose we want to be aligned with, in marker-based calibration this pose has to be calculated after the marker has been detected. Since we know the pose of the marker (MV) in the virtual world, we can determine the world-space pose (UV) to which we want to align our user. The marker recognition software provides us with the marker location in the user's camera space. This can be inverted to determine the user's location in the marker space (we define this as UM). We are now able to determine UV as UV = MV + UM. Since we now know UV, we can relocate the virtual user to this location using Equation 4.1. An illustration of the calibration process can be found in Figure 4.3.

Figure 4.3: Left: The users standing in the real world detecting an ArUco marker. Right: Virtual users who are relocated depending on the detected marker. The virtual user is moved by Δp and rotated by α to get to UV, which is given by the position pm and rotation rm of the user in the marker space.

This calibration can be used for each user independently or simultaneously. The process is triggered either by a user interaction (e.g. pressing a button on a controller) or, in our case, from an admin computer. This way, it can be controlled and easily redone during the experiment, as long as the marker is in sight of the AR camera. Figure 4.9 shows this setup in the real world.
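As a brief illustration of Equation 4.1, the following Unity C# sketch applies the computed offsets to a user's rig. The class, the method, and the rig parameter are illustrative names, and the handling of the rotation pivot is simplified compared to a complete implementation.

using UnityEngine;

public static class Relocation
{
    // Aligns the user's rig with the target pose UV (Equation 4.1).
    public static void AlignTo(Transform rig, Vector3 targetPosition, Quaternion targetRotation)
    {
        Vector3 deltaP = targetPosition - rig.position;                       // Δp = pV − pprev
        Quaternion alpha = targetRotation * Quaternion.Inverse(rig.rotation); // α = rV ∗ rprev⁻¹

        rig.position += deltaP;              // move by Δp
        rig.rotation = alpha * rig.rotation; // rotate by α (pivot handling simplified)
    }
}

For fixed-point calibration the target pose is the predefined UV; for marker-based calibration it is the pose UV recovered from the detected marker as described above.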
4.2.3 Hand Tracking-Based Calibration

Our novel calibration method based on hand tracking of colocated users requires a hand tracking-capable device. We worked with the Meta Quest in our implementation; however, an additional sensor such as a LeapMotion device attached to the HMD could be used as well.

To see whether it makes a difference in accuracy or effort, we implemented two recalibration variants based on hand tracking. The first one uses one tracked hand (here the right one) to relocate the user according to the tracked pose. The second one relocates the user by tracking both hands and calculating a mean pose to which the user is reoriented. During the calibration process, one user's hands are held behind their back to prevent the headsets from accidentally tracking these hands. The other user then holds their right hand (respectively both hands) in front of both headsets, so that both track it. If this is the case, the colocation process can be triggered by a user interaction (e.g. pressing a button). The corresponding calculations are given in Equation 4.2. In our setup, an admin computer triggers the colocation process. However, in future implementations, this could also be done by one of the users. One user then sends the position and rotation of their hand anchor to the other user as a reference point. For one hand, this is the tracked hand pose; for both hands, it is the mean pose of both tracked hands. The receiving user then reorients their virtual hand to match the received pose. The whole user is then reoriented by the difference between their own and the received hand pose.

Δp = prefHand − ph
puser = puser + Δp
Δr = rrefHand ∗ rownHand⁻¹
ruser = ruser ∗ Δr        (4.2)

In this method, Δp is the amount by which we move the user's position and Δr the amount by which we rotate the user to get them to the location UV. Figure 4.4 illustrates the reorientation process. UV is the location we want to set the users to in the virtual world. From this point on, the users are colocated. Depending on the HMDs' tracking accuracy, this reorientation can be redone every time the headsets' drift gets too big. However, such an increase in drift did not occur in the experiments carried out in this study. Since it does not require any preparation in the real world, this calibration can be done anytime and anywhere during the application. The only requirement is that the reference hand is visible to both users. Compared to the other methods, not all users are repositioned to a new virtual location. Since one user's hand is sent as a reference to the other users, this user does not need to be reoriented, because the other users are relocated to match the reference position.

Figure 4.4: Left: The users standing in the real world detecting the same hand. Right: The user gets relocated by the difference Δp. α is the difference in rotation between the tracked hand and the received reference hand. Rotation is visualized by red arrows. Compared to the other methods, only the other user gets relocated.

Variant based on the tracking of two hands

For the calibration variant where two hands are used, the mean point of both tracked hands is used. This point is the midpoint between pl and pr, which are the virtual points for the detected left and right hand, and is calculated as:

pM = pr + (pl − pr) / 2        (4.3)

This formula is used to obtain the mean of the position and of the rotation, which are then used as a single reference point when recalibrating the user, as explained in section 4.2.3.
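A minimal sketch of this recalibration, assuming hand poses are available as position vectors and scipy rotations, could look as follows; the function names are ours, and the example values are placeholders rather than recorded data.

```python
# Sketch of the hand tracking-based recalibration (Equations 4.2 and 4.3);
# illustrative Python, not the Unity implementation used in the thesis.
import numpy as np
from scipy.spatial.transform import Rotation as R

def mean_hand_pose(p_left, r_left, p_right, r_right):
    """Two-hand variant: midpoint of both hand positions and mean of both rotations."""
    p_mean = p_right + (p_left - p_right) / 2.0                      # Equation 4.3
    r_mean = R.from_quat(np.vstack([r_left.as_quat(), r_right.as_quat()])).mean()
    return p_mean, r_mean

def recalibrate_user(p_user, r_user, p_own_hand, r_own_hand, p_ref_hand, r_ref_hand):
    """Equation 4.2: move/rotate the receiving user so their tracked hand
    coincides with the reference hand sent by the other user."""
    delta_p = p_ref_hand - p_own_hand
    delta_r = r_ref_hand * r_own_hand.inv()
    return p_user + delta_p, r_user * delta_r

# example: reference hand received over the network vs. locally tracked hand
p_ref, r_ref = np.array([0.2, 1.3, 0.6]), R.from_euler("y", 30, degrees=True)
p_own, r_own = np.array([0.9, 1.3, 0.4]), R.from_euler("y", 55, degrees=True)
p_user, r_user = recalibrate_user(np.zeros(3), R.identity(), p_own, r_own, p_ref, r_ref)
```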
4.3 Usability Evaluation Experiment

The goal of our experiment is to evaluate the usability of each of the four presented calibration methods in terms of calibration accuracy, difficulty of setup, and the need for additional hardware. Our four calibration methods are clearly distinct with respect to the calibration effort and the need for additional hardware and software. While this experiment comprises a first technical evaluation, we plan to conduct a user-centered evaluation with multiple participants as part of our future work.¹

¹ Due to the COVID-19 pandemic, an extensive usability study with multiple users was not feasible at this stage.

4.3.1 Experimental Design and Evaluation

The evaluation was performed with two HMDs (Meta Quest) colocated within the same room and calibrated with each of the described methods. To estimate the precision of the calibration, we used ground-truth tracking data obtained with an externally mounted Lighthouse 2.0 tracking system with two sensor stations. An HTC Vive tracker was attached to each HMD (pictured in Figure 4.5), allowing the ground-truth distance between both HMDs, dGT(t), in the frame t to be calculated as the distance between the two trackers, adjusted by the offset between the centre of the tracker and the centre of the HMD. The calibrated distance between the HMDs, dC(t), in the frame t was calculated as the difference between their positions in the virtual scene. The difference between dGT(t) and dC(t) provides the final distance error in the frame t, as described by Equation 4.4.

δ(t) = |dGT(t) − dC(t)|        (4.4)

Figure 4.5: Meta Quest with an attached Vive Tracker and ZED-Mini camera used in the evaluation.

It is worth noting that δ(t) contains possible contributions due to imprecision or drift of the built-in SLAM-based tracking of the Meta Quest as well as the influence of tracking errors inherent to the Lighthouse tracking system used as a ground-truth reference. The tracking accuracy of a Vive tracker delivered by the Lighthouse system has been shown to be in the millimeter range with high reproducibility of position measurements, making it viable for collecting ground-truth measurements [4][7]. In comparison, the tracking accuracy of the Meta Quest was measured at a level below 1 cm under good lighting conditions [88]. These results motivate our use of the Lighthouse system as the source of ground-truth tracking data, as we believe its accuracy is sufficient to allow the comparison of the investigated calibration methods. To minimize the impact of these tracking errors, we calculate δ(t) for a number of frames after each calibration.

4.3.2 Pilot Evaluation

We collected pilot evaluation data, performing the calibration five times for each calibration method and calculating the distance error δ(t) in N=1000 consecutive frames after each calibration. These pilot recordings showed that, for each calibration method, the calculated error of the distance between the two HMDs, δ(t), is not correlated with time. An example of the time distribution of the distance error can be seen in Figure 4.6. This example shows the tracking error over time during the 1000 frames recorded. The variance could be explained by inaccurate positioning in the fixed-point calibration, but this will be discussed again later.

Figure 4.6: Time distribution of the distance error, shown on the example of a dataset from fixed-point calibration.
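As an illustration of this error measure, the following sketch (our own, with placeholder offsets and synthetic positions, not data from the experiment) computes δ(t) per frame from the two tracker positions and collapses one recording to a single robust value, as used in the analysis below.

```python
# Sketch of the per-frame distance error (Equation 4.4); all values are placeholders.
import numpy as np

def distance_error(tracker_a, tracker_b, offset_a, offset_b, virtual_a, virtual_b):
    """Equation 4.4 for one frame: |d_GT(t) - d_C(t)|."""
    d_gt = np.linalg.norm((tracker_a + offset_a) - (tracker_b + offset_b))  # ground truth
    d_c = np.linalg.norm(virtual_a - virtual_b)                              # calibrated scene
    return abs(d_gt - d_c)

# synthetic stand-in for one recording of N frames after a calibration
rng = np.random.default_rng(0)
offset = np.array([0.0, -0.05, 0.02])        # measured tracker-to-HMD offset (placeholder)
frames = [(rng.normal([0.0, 1.7, 0.0], 0.001), rng.normal([1.0, 1.7, 0.0], 0.001),
           offset, offset,
           np.array([0.0, 1.7, 0.0]), np.array([1.005, 1.7, 0.0])) for _ in range(1000)]
errors = [distance_error(*f) for f in frames]
calibration_error = float(np.median(errors))  # median over the recording
```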
Furthermore, for each method, the median distance error differed between calibration events. This fact was established with a Friedman's ANOVA repeated-measures non-parametric test for each calibration method, since all error distributions were non-normal [25][26][27]. The box-plots of the distance error obtained from the pilot recordings are presented in Figure 4.7. The box-plots indicate a considerable number of outliers in the distance error distributions. We are inclined to think that these outliers are the result of tracking inaccuracy in individual frames during the recording sessions. However, our method does not allow us to distinguish between contributions of possible inaccuracy of the SLAM-based tracking of the Meta Quest and of the Lighthouse system.

The pilot analysis shows that neither of the investigated methods provides consistent calibration performance across different calibration attempts. For the fixed-point method, the inconsistency in the calibration results can be readily explained by the inconsistency of the precision with which the users place the HMDs on the predefined calibration positions. For the remaining methods, calibration success directly depends on the precision of the respective tracking data (of the marker or the users' hands) in the frame where the calibration took place. This dependency on the precision of the tracking data used in a calibration method could be addressed in future modifications to the calibration methods, however, potentially at the cost of additional setup effort on the part of the users. We will discuss potential mitigation strategies in section 4.4.

To address the impact of the varied imprecision of every individual calibration, as well as to eliminate the impact of outliers present in the measurement of the distance error, we chose to perform each type of calibration multiple times and to record the distance error in a large number of frames after each calibration. Afterwards, we compare the median distance errors resulting from each calibration. We further use the term calibration error to refer to the median distance error calculated from the recorded distance data after each calibration.

Figure 4.7: Distance error box-plots of pilot recordings.

4.3.3 Setup and Procedure

The distributed VR application run in the experiment was developed with Unity 3D (v.2019.4.3), with the networking layer built on the basis of the Photon Unity Plugin (PUN). The networking layer ensured the synchronization of poses obtained from the input tracking data and the simultaneous execution of experimental commands on all machines. The Meta Integration asset for Unity 3D was used as an API layer providing tracked head and hand poses of the Meta Quest to each Unity3D client application. However, the rendering of tracked user hands was achieved with the help of the 'EasyHand' framework presented in chapter 3, providing a universal layer for collecting and distributing hand tracking data obtained from any input source. In the experiment, both client applications running on Meta Quest were connected to a server, also running an administrative client issuing experimental commands. All poses of VR-immersed users as well as calibration-specific tracking data (of the tracked marker or tracked hands) are visible on the administrative client. The main command issued by the administrative client triggers the calibration procedure for a selected calibration method.
A diagram illustrating the communication flow between the administrative client and the Meta Quest clients is presented in Figure 4.8.

Figure 4.8: Network communication between the admin computer and the VR users.

The evaluation data was collected with the following experimental procedure:

1. Start the administrative client, which is also the master client (host) in the PUN distribution pipeline, to open the network connection.
2. Connect both HMDs with their respective Vive trackers.
3. Ensure correct synchronization and assignment of HMDs and Vive trackers in the administrative client.
4. Collect data following the procedure detailed in section 4.3.1.

To enable the marker-based calibration method, we attached a ZED-Mini camera to each HMD as demonstrated in Figure 4.5. Figure 4.9 illustrates our experimental setup at room scale. An ArUco marker was positioned on the floor in the middle of the room. Two markers, one for each headset, were placed on the floor one meter apart for the fixed-point calibration. When users connected to the same distributed application, they were able to see each other in the same virtual environment, although their relative positions did not coincide with those in the real room before calibration. After the calibration procedure was triggered from the administrative client, the users' relative positions in the virtual environment were moved to coincide with their relative positions in the physical environment, and the recording of their poses started. Although the evaluation was conducted with two colocated users, the calibration procedure accommodates an arbitrary number of users.

Figure 4.9: Exemplary experiment setup for the marker-based calibration method.

4.4 Results and Discussion

The following section presents the results of the analysis of the aforementioned evaluation. After stating the calibration error of the individual methods, we discuss these results and elaborate on further important properties of the evaluated calibration methods for colocated SLAM-tracked HMDs.

4.4.1 Calibration Error Outcome

The data used in the evaluation was collected following the approach described in section 4.3.1. For each method, the calibration was performed 25 times, resetting the application each time and attempting to perform the same head movements each time; the error between the ground-truth distance between the HMDs and the distance derived from the calibrated positions was recorded during 1000 frames after each calibration. We then calculated the median error for each dataset of 1000 error values, obtaining four datasets of median errors with 25 entries each. The box-plots of these four datasets are presented in Figure 4.10. The median errors proved to be normally distributed in the Shapiro-Wilk test [102] after two outliers had been removed (p = 0.055 for the fixed-point method, p = .102 for the marker-based method, p = 0.021 for the method based on one-hand tracking, p = 0.066 for the method based on two-hand tracking). The removed outliers can be seen in Figure 4.10, one in the marker-based method and one in the hand tracking-based method with two hands. We then used a one-way ANOVA [85] to compare the distribution means. The analysis was performed with IBM SPSS Statistics.

Figure 4.10: Box-plots of the median distance error for the four calibration types.

Figure 4.11: Mean values of the calibration error (median distance error) for each calibration method.
Levene's test [60] showed a violation of the homogeneity of variance (p < .001). Therefore, we used Welch's test [123] for the main analysis and the Games-Howell test for post-hoc comparisons. The median error differed significantly between the calibration methods (Welch's F(3, 46.765) = 40.965, p < .001). The resulting plot of mean values is presented in Figure 4.11. The post-hoc analysis revealed that the median error was significantly larger for the fixed-point method than for all other methods. The median error of the hand tracking-based method with two hands was significantly smaller compared to all other methods. Details of the post-hoc comparisons are summarized in Table 4.1. Mean differences are reported as significant at the 0.05 level.

(i) method  | (j) method  | mean diff. (i − j) | std. error | sig.   | 95% CI lower | 95% CI upper
fixed-point | marker      |   7.60789          | 2.49586    | .02    |    .9291     |  14.2867
fixed-point | one hand    |   8.41017          | 2.68948    | .016   |   1.2433     |  15.5770
fixed-point | two hands   |  18.44063          | 2.16391    | < .001 |  12.5353     |  24.3460
marker      | fixed-point |  −7.60789          | 2.49586    | .02    | −14.2867     |   −.9291
marker      | one hand    |    .80227          | 2.20473    | .983   |  −5.0771     |   6.6817
marker      | two hands   |  10.83273          | 1.51988    | < .001 |   6.7132     |  14.9523
one hand    | fixed-point |  −8.41017          | 2.68948    | .016   | −15.5770     |  −1.2433
one hand    | marker      |   −.80227          | 2.20473    | .983   |  −6.6817     |   5.0771
one hand    | two hands   |  10.03046          | 1.82044    | < .001 |   5.0816     |  14.9793
two hands   | fixed-point | −18.44063          | 2.16391    | < .001 | −24.3460     | −12.5353
two hands   | marker      | −10.83273          | 1.51988    | < .001 | −14.9523     |  −6.7132
two hands   | one hand    | −10.03046          | 1.82044    | < .001 | −14.9793     |  −5.0816

Table 4.1: Results of post-hoc pairwise comparisons with the Games-Howell test.

The results of our evaluation show that the fixed-point calibration method had the largest median calibration error, whereas the hand tracking-based method provided the most accurate calibration when the variant based on the tracking of two user hands was used. The accuracy results of the marker-based method and the hand tracking-based method using one user hand are comparable, with their accuracy being higher than that of the fixed-point method but lower than that of the hand tracking-based method using two hands.

The observed higher accuracy of the methods using hand detection and ArUco markers, compared to the fixed-point method, can be attributed to the inherent limitations of the fixed-point method. The fixed-point method relies on manual placement of the HMD and lacks the automated detection employed by the other methods. Consequently, the introduction of human error is a significant factor in this approach. In contrast, hand recognition utilizes multiple cameras on the HMD to precisely determine the position and rotation of hands in three-dimensional space, ensuring a high level of accuracy in placement. Notably, the method involving two hands exhibits the smallest error. This can be attributed to both the high tracking precision afforded by the hand-tracking technology and the use of multiple anchors in space. The presence of two hands allows for error compensation, wherein any potential inaccuracy in the position of one hand can be mitigated by the information from the other hand. This is in line with the findings of McGill et al., who demonstrated that employing two fixed points can yield improved accuracy [70].

4.4.2 Consistency and Potential for Improvement

Consistency of the calibration result describes the extent to which each calibration method delivers similar calibration accuracy when the user performs identical actions to calibrate the colocated HMDs.
In the pilot evaluation stage, we discovered that the median distance error measured after each calibration differed for all the methods evaluated. The data of 25 calibrations for each discussed method allows us to take a more detailed look at each method's calibration consistency.

The fixed-point calibration method showed the highest variability among the four evaluated methods, demonstrated by the largest range of median error values and their interquartile range (Figure 4.10). This result is somewhat expected, given that each user needs to manually place their HMD at the marked spot to calibrate for a colocated scenario. It is hardly possible for users to achieve placement accuracy in the sub-centimeter range. The accuracy and possibly the consistency of the fixed-point method could be improved by the two-point calibration procedure suggested by McGill et al. [70].

For the marker-based calibration method, the accuracy of each individual calibration is contingent on the accuracy of marker tracking in the frame where the calibration takes place. Compared to fixed-point and hand tracking-based calibration with one hand, this method shows a smaller interquartile range and span of median errors, indicating a better consistency than these two methods. Since the tracking process uses RGB images, its accuracy can be highly dependent on lighting conditions and varied under unstable lighting. A possible solution to mitigate the shortcomings of marker tracking is to collect marker pose data over several frames and use the averaged pose value in the calibration procedure. However, a balanced number of frames needs to be found, since users would need to remain very still during the marker pose collection time. Alternatively, a larger number of markers could be used in the calibration procedure, with the mean camera pose being calculated from the tracking data of all markers (similarly to the multi-marker tracking method used by Podkosova et al. [91]).

For the calibration based on hand tracking using only one hand, the interquartile range and median error range are comparable to those of the fixed-point calibration, showing much larger variability in the distance error data compared to the setup where two tracked hands are used in the calibration process. This stark difference in the variability of the error between the variants based on the tracking of one hand and two hands might be an indication of the advantage of using multiple spatial anchors in the calibration process.

The hand tracking-based calibration method demonstrated the strongest consistency when two tracked hands were used in the calibration process. The increased consistency compared to the other tested methods is reflected in the much more compact span of the median error values and their interquartile range (Figure 4.10). According to our evaluation, the greater accuracy of this method, combined with its clearly better consistency and ease of execution on the part of the users, makes it the best method for calibrating two-user colocated scenarios.

4.4.3 Ease of Setup

The fixed-point calibration method does not require any additional hardware. It also means that no additional software or plug-ins are required for developers, making this method usable with a wide range of HMDs. However, the execution of the fixed-point calibration requires certain involvement on the part of the users, as they need to take their place (or place their HMDs) at the predefined locations as accurately as possible.
Moreover, if recalibration is necessary during the application runtime, users would have to remove their HMDs to ensure that their positions in the tracking space are accurate.

Marker-based calibration might require additional hardware and software, depending on whether the HMD has integrated cameras that can be accessed to enable marker tracking or whether an external camera needs to be used, as in our evaluation. The use of marker tracking itself requires additional implementation. In addition, a marker must be positioned in a fixed location, which restricts mobility. For users, however, the execution of the marker-based calibration does not present any difficulty, since users only need to position themselves in a way that allows the calibration marker to be seen in the camera image. When recalibrating, users need to return to the marker, making calibration and recalibration dependent on the location of the real-world marker. The calibration process can be made even easier for users if continuous tracking is used and markers are attached directly to each user, as in the work of DeFanti et al. [15].

The hand tracking-based calibration method has hardware requirements similar to those of the marker-based calibration. Either an HMD with integrated hand tracking (for example, Meta Quest) or an external sensor (for example, LeapMotion) is needed. In both cases, the developer needs to implement hand tracking detection (i.e., use the tracking system's SDK). Since integrated hand tracking is becoming more ubiquitous, this calibration method can be used on more and more HMDs without requiring additional hardware. In terms of preparation and calibration effort, this method proved to be the least demanding among the evaluated methods. Neither a physical marker nor a fixed location has to be set up in the real world. The only requirement is that the hands of one user must be simultaneously visible to both users. For recalibration, the users do not need to return to a specific spot in the real world. Still, they have to be close enough to each other that the reference hand is visible to both tracking systems.

4.4.4 Scalability

In our evaluation, two users were colocated in the same physical environment. However, each of the tested methods is designed to work for an arbitrary number of users. In the following, we briefly discuss how the applicability of each method extends to larger numbers of colocated users. A future comprehensive user test can further verify this.

The fixed-point calibration method can be easily extended to accommodate any number of users. A corresponding number of marked positions in the physical environment and their counterpart target positions in the virtual environment need to be prepared to extend the method. Although such preparations would require additional involvement of an application developer, the calibration difficulty for users remains unchanged.

For the marker-based calibration method, there might be an upper limit on the number of participating users, since the HMDs (or cameras attached to them) of all users would need to track the calibration marker simultaneously (in the case where a simultaneous calibration is required). However, this limit would be rather large - it should be possible for up to ten users to stand in a circle so that the calibration marker can be tracked in all client applications.
Such a limit on the number of user HMDs that can be calibrated simultaneously would most probably be larger than the number of users that can be physically colocated in a regularly sized tracking room. For larger tracking spaces shared by many users, several markers with known offsets could be arranged to calibrate sub-groups of users. Otherwise, it would also be possible to rely on the low drift of the SLAM-tracked HMDs and calibrate the users one after the other. In this case, the number of users would also be limited by the size of the physical environment.

As with the marker-based calibration method, the hand tracking-based method could have an upper limit on the number of participating users. For a successful calibration, the reference hand needs to be tracked simultaneously in all client applications. The range of hand tracking limits the number of possible user positions from which the reference hand can be tracked. The exact scale of this limitation needs to be examined in future work. For a larger number of users, the hand tracking-based calibration could be separated into multiple calibration steps. Users' poses could be calibrated to the same reference hand one after another until all users are correctly colocated in the virtual environment.

4.4.5 Exploring Colocation in Different VR Scenarios

The applicability of all four discussed calibration methods extends to various VR scenarios. The most common use cases relate to room-scale environments, where users stand and navigate the virtual space by walking (see Figure 4.12).

Figure 4.12: Two colocated users standing in front of each other with hand tracking enabled.

In future investigations, it would be worthwhile to delve deeper into room-size constraints and drift issues that may arise during walking, with a focus on determining the optimal frequency for recalibration. But colocation is not limited to room-scale scenarios. Seated colocated VR experiences enabled by SLAM-tracked HMDs also require calibration and environment synchronization. Colocation in seated VR can be used in a number of scenarios. For example, it can enable a meeting scenario where several participants are sitting at the same table. A calibration method will ensure that the virtual environment and all users are synchronized, allowing them to view and interact with the same 3D model or visualized data. It can also prevent collisions and enable the use of haptic elements or physical props in the environment. Likewise, the discussed calibration methods can be used for seated collaborative VR experiences in CAVE environments [58], where users are placed in physical seats aligned with virtual seats. Currently, alignment is ensured manually by measuring the poses of the physical seats in the CAVE space. This alignment can be automated using a calibration procedure with an AR marker or hand tracking for more flexible scenarios. Figure 4.13 shows two seated users interacting with their hands after being colocated.

Figure 4.13: Two seated colocated users using their hands.

4.4.6 Applicability and Future of Hand Tracking-Based Calibration

Currently, the Meta Quest is not designed to be used in colocated scenarios. This is evidenced by the absence of access to the internal tracking map, which makes colocation calibration necessary in the first place.
Its hand tracking-enabled interaction input is not designed for colocated scenarios either; the possibility of tracking the hands of users not wearing the Meta Quest device itself is clearly an artifact. It is this artifact that allowed us to use hand tracking for colocation calibration.

It is conceivable that, in the future, the neural network training used to enable hand tracking on HMDs with forward-facing cameras will be implemented in a way that prevents the hands of other users from being tracked (for example, by taking arm poses into account during the training stage). In this case, direct use of hand tracking for colocation calibration would not be possible. However, the hand tracking capabilities of HMDs equipped with frontal cameras can also be extended to track the hands of other users deliberately, even if colocated users are relatively far away in the common walkable area. Such an extension could be helpful in providing hand pose (and possibly derived full-body pose) estimations in situations where a user does not keep their hands in front of their head (for example, when the user's arms are kept down, alongside the body). If such augmentations to hand tracking are developed, they can prove beneficial for colocation calibration, potentially providing increased accuracy. Mutual tracking is a promising direction for future work on colocated VR environments overall.

4.5 Conclusion

The research presented in this chapter investigated three calibration methods that enable shared colocated VR scenarios for SLAM-tracked HMDs. We implemented and experimentally evaluated fixed-point calibration, marker-based calibration, and our novel calibration method that uses hand tracking data of colocated users as spatial anchors. Our experimental evaluation showed that hand tracking-based calibration using two user hands as anchors achieved the highest and most consistent accuracy compared to fixed-point and marker-based calibration. Not requiring any additional infrastructure and being easy to execute at any time in a colocated scenario, our hand tracking-based calibration method proved to be very advantageous. With the current trend of hand tracking being adopted by HMD manufacturers, this calibration method offers great potential for a wide range of VR solutions. Since the setup is easy to use for end-users, we hope that this encourages developers to further implement colocation in end-user VR applications.

In future applications, it would be interesting to extend the methods to more than two users. The limit on the number of concurrent users and the impact of interference between all users and VR systems could be examined. A tracking system designed to track users' hands mutually could improve the user experience in colocated scenarios and is a promising research direction. With this approach, the impact of the view direction when tracking hands on the calibration results is a promising topic of investigation.

After successfully establishing a colocated scenario using the presented method and demonstrating its applicability, the next step involves facilitating interaction among colocated users. To achieve this, it becomes essential to track more than two hands and enable these hands to interact within a three-dimensional virtual environment. This aspect will be explored in detail in the upcoming chapter.
CHAPTER 5
Evaluate and Improve an RGB-Based Hand Tracking Solution for Colocated VR Usage

In this chapter, we will provide a method to enhance the existing RGB hand tracking capabilities of the MediaPipe framework [64][130] by extending the 2.5-dimensional tracking of keypoints into three-dimensional space. This process involves a three-step approach: first, detecting the hand itself; second, estimating the real-world hand size of the user; and third, utilizing this information to calculate the three-dimensional world-space position of the hand. Subsequently, in the evaluation section, we will assess the placement accuracy of this algorithm and compare it with state-of-the-art hand tracking systems such as Meta Quest and LeapMotion. This comparative analysis will be conducted in both static and dynamic environments, as well as in real usage scenarios. We will determine and compare the tracking ranges of all systems. Finally, our findings will be summarized in a conclusion where we discuss the applicability and usability of the proposed algorithm.

5.1 Motivation

In chapter 4 we presented methods to create colocated VR scenarios with the help of hand tracking. Ensuring consistent tracking of all users' hands within the shared workspace is crucial for enabling reliable interactions over a wide tracking range. To achieve reliable hand tracking for all colocated users, we propose a method that takes advantage of a tracking system's capability to track more than two hands, operates within an extended range of distances, and furthermore accurately positions the hands in three-dimensional space, thereby facilitating colocated hand interactions. This way, each tracking device worn by a user can provide tracking input not only for this user's virtual hands but also for the virtual hands of colocated others. This idea is presented in Figure 5.1 on the right: although the hands of user B are outside the field of view of their hand tracking camera, they are tracked by the camera of user A and can be rendered correctly. In this case, the camera of user A is tracking four hands at the same time. This is not possible with the integrated hand tracking of the Meta Quest: Han et al. describe its recognition algorithm as expecting a maximum output of two hands, making the recognition of more than two hands impossible [37]. The same applies to LeapMotion and the integrated hand detection systems of other HMDs (such as the HTC Vive Cosmos).

Figure 5.1: A colocated multi-user VR setup. User B's hands are outside the range of their own hand tracking, but visible to user A. With off-the-shelf solutions, only two hands can be detected at the same time within a short range (left). Our solution allows us to detect and position the hands of other users in 3D space. This way the hands of user B can still be detected (right).

Since the hand tracking methods integrated into off-the-shelf devices are closed systems, it is currently impossible to adjust them to closely align with the interaction requirements of colocated multi-user VR. For this reason, we turn to methods that use RGB input to detect the user's hands. Camera-based methods have certain advantages: they can work with any RGB source, not being bound to any specific hardware, and they work at larger distances, the limits of tracking being set only by the resolution of users' hands in the images.
However, most RGB-based solutions offer the capability of detecting the hand pose in 2D image coordinates only, with additional calculations being necessary to obtain the full 3D pose. This chapter presents a hand tracking method that is based on the MediaPipe framework [64][130], a cross-platform solution for object recognition (including hand recognition) in 2D images using machine learning. With this framework it is possible to recognize more than two hands at the same time (see Figure 2.4), which qualifies it for our purposes. To calculate the full 3D hand pose based on the finger joint coordinates provided by MediaPipe, we have developed an algorithm that uses an estimation of the user's hand size to obtain the hand's distance from the tracking camera.

We evaluate the performance of our method in comparison with the hand tracking methods provided by Meta Quest and LeapMotion, providing an accuracy assessment for each method under static and dynamic conditions in the range from 0.25 m to 3 m from the tracking camera. With these results we want to determine whether our proposed method provides comparable or even better tracking accuracy in different tracking ranges. With good tracking accuracy at typical tracking-area-scale distances and the ability to track more than two hands, our method would present a step towards enabling reliable natural hand interactions in colocated multi-user VR.

5.2 3D World Position Estimation of a Camera-Tracked Hand

Our proposed method for enhancing the tracking data of the RGB tracking system consists of three stages:

1. Hand Detection: Hands are detected in the monocular RGB image; 3D positions of the finger joints are calculated relative to the center of each hand. This stage is carried out by the hand tracking implementation of Zhang et al. in the MediaPipe framework [130].
2. Hand Size Estimation: The real-world hand size of the user is estimated according to one of the methods described in section 5.2.2.
3. Depth Estimation: The distance of the hand to the tracking camera is calculated according to the method described in section 5.2.3, using the estimate of the real hand size.

This workflow is presented in Figure 5.2, which includes the details of each stage described in the sections below. Our objective is to develop a workflow for calculating the hand sizes of users and utilizing this information to accurately position the tracked virtual hand (using the MediaPipe framework) within a three-dimensional environment, thereby enabling natural hand interactions. To evaluate the effectiveness of our approach, we will compare it with existing off-the-shelf tracking solutions. By leveraging hand size estimation for positioning, we anticipate higher accuracy in hand tracking as the precision of the hand size estimation improves. We also expect to achieve accurate hand size estimation by inferring the hand size from the user's body height (derived from Pheasant [90]). Overall, we expect a system capable of facilitating 3D hand interactions over a much larger tracking range, surpassing the capabilities of existing off-the-shelf tracking solutions. This advancement will make our solution highly advantageous for use in colocated VR scenarios.

Figure 5.2: Step-by-step diagram for adjusting and positioning a detected hand from MediaPipe. After detection, the virtual hand is adjusted to the real-world hand size and then positioned with the help of the intercept theorem.
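As an illustration of the detection stage, a minimal Python sketch using MediaPipe's Hands solution could look as follows. The thesis implementation uses the MediaPipe Unity plugin; the confidence values, camera index, and hand limit below are placeholders.

```python
# Minimal sketch of multi-hand detection with MediaPipe's Python Hands solution.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(
    static_image_mode=False,       # video-stream mode
    max_num_hands=4,               # more than two hands, e.g. two colocated users
    min_detection_confidence=0.5)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for handedness, norm_lms, world_lms in zip(results.multi_handedness,
                                                   results.multi_hand_landmarks,
                                                   results.multi_hand_world_landmarks):
            label = handedness.classification[0].label   # 'Left' or 'Right'
            wrist = world_lms.landmark[0]                # landmark 0 = wrist
            middle_tip = world_lms.landmark[12]          # landmark 12 = middle finger tip
            # norm_lms holds the normalized viewport coordinates used later for depth estimation
    if cv2.waitKey(1) & 0xFF == 27:                      # Esc to quit
        break
cap.release()
```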
5.2.1 2.5D Joint Detection with MediaPipe

We chose the MediaPipe framework [64] (and the hand and finger detection implemented by Zhang et al. [130]) as the hand detection step in our workflow due to its ability to detect more than two hands at the same time. As they report an average precision between 86.22% and 95.7% for palm detection, we can assume a similar detection accuracy in our experiment. With the help of two TensorFlow machine learning models (a palm detector and a hand landmark model), MediaPipe tracks the finger joints of the hand with high prediction quality. As a result of the recognition, we get the following information from the framework for each detected hand:

• Handedness: A label ('left' or 'right') and an estimated probability for this handedness.
• World Landmarks: 21 landmarks (a landmark corresponds to a finger joint) consisting of x, y, and z coordinates with the origin at the hand's approximate geometric center.
• Normalized Landmarks: 21 landmarks consisting of x, y, and z coordinates in the normalized viewport space of the camera.

The landmark definitions can be seen in Figure 5.3 (taken from the official MediaPipe hand tracking website [71]). The marked landmarks are later used for hand length calculation in the virtual space. Together with the remaining landmarks in the coordinate frame of the center of the hand and the normalized landmarks, they are used in the calculation of the distance of the hand to the tracking camera. This detection step corresponds to the first step in Figure 5.2.

Figure 5.3: Landmark indices of the MediaPipe framework. Marked landmarks are used for hand length calculations.

5.2.2 Estimating Real Size of Users' Hands

We use the real-world length of the user's hand to estimate the distance of the hand to the camera. To obtain the hand's size, three different methods are used, resulting in three variants of our hand tracking method that were evaluated. In Figure 5.2 these methods are visualized in the second step.

1. We use the MediaPipe 3D hand landmarks with the origin in the center of the hand to calculate the distance between the position of the wrist and the tip of the middle finger. This distance represents the length of the hand. We refer to this method of hand size calculation as MediaPipeInternal in the rest of the chapter.
2. We measure the real length of the user's hand (wrist to the tip of the middle finger) and use the measurement as an input to our program (later referred to as MediaPipeHand).
3. The third method requires more calculations but could provide an easier setup experience for the user. Since most people do not know the length of their hand, we use the body height (measured manually for best accuracy) as an input parameter to infer the length of the hand (later referred to as MediaPipeBody).

Figure 5.4: Body size estimation excerpt from Pheasant [90].

Pheasant conducted an examination of different body part sizes and their frequency in the English population [90]. Figure 5.4 is an excerpt from this book and shows different estimates of the size of body parts (in mm) with three percentiles (including the mean) and the standard deviation for a normal distribution.
We use these values to create normal distributions and derive body part sizes from another reference body part size. Zafar et al. evaluated body-hand relations by calculating the body size based on the size of the hand with an accuracy of 2.9 cm [127]. Since their method also requires the age of the user and we want to keep the input data set as small as possible, we calculate the hand size using the tables from [90]. In this way, the actual body part does not have to be physically measured. In our case, we use the body height of the user to obtain its percentile in the normal distribution, which is then used to calculate the hand length for this percentile. For this we use the following equations:

z = (x0 − µ) / σ
p = 1/2 ∗ (1 + erf(z / √2))
x = µ + σ ∗ z        (5.1)

For our experiment in section 5.3.2, our user had a body height of 1892 mm. For this height we look up µ = 1740 mm (given at a percentile of 50%) and σ = 70 in Figure 5.4. With Equation 5.1 we calculate a percentile of 0.985. For the hand size we look up µ = 190 mm and σ = 10. With the percentile of 0.985, we can estimate a hand size of 211.71 mm for the given body height. In comparison, we measured a hand size of 213 mm for this user. We calculated the size of the hand based on the body height of all participants in our user test (n = 10, see section 5.3.2 & section 5.3.3) and compared the resulting value with the measured size of the hand.

Figure 5.5: Discrepancy between measured hand length and calculated hand length of the ten users that participated in the evaluation of section 5.3.2 & section 5.3.3.

The box-plot of the difference between the calculated and measured hand lengths can be seen in Figure 5.5. With a mean difference of 0.0787 cm, it can be seen that the hand length can be calculated accurately from the body height, even though differences of up to 1 cm are possible depending on the user.

5.2.3 Estimating Hand Depth

With the fitted 3D coordinates of the tracked landmarks we now have a correctly scaled hand with the expected hand length, which only has to be adjusted in its distance to the virtual camera. To achieve this we use the intercept theorem [100], which describes rules about the ratio of parallel line segments that are intersected by a line. We calculate the hand length of the normalized 2D landmark positions in the camera's image space (lsm), which are obtained from MediaPipe. We also transform our fitted 3D coordinates to the viewport space and calculate the hand length of these transformed viewport points (lsr). From the intercept theorem, we know the following about the ratios:

dR / dV = lsr / lsm        (5.2)

To get the final depth dR to the camera we solve the equation for dR = (lsr / lsm) ∗ dV = sf ∗ dV, where dV in this step is the current distance of the virtual hand to the virtual camera. The step-by-step procedure from detecting the hand to virtual positioning can be seen in Figure 5.2.
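The two calculations above can be sketched in a few lines of Python. The mean and standard deviation values are the ones quoted from Pheasant in the text; the function names and example viewport lengths are illustrative, not part of the original implementation.

```python
# Sketch of Equations 5.1 and 5.2: infer hand length from body height via the
# percentile transfer, then scale the hand's camera distance with the intercept theorem.
import math

# Pheasant [90]: body height and hand length, mean and standard deviation in mm
BODY_MU, BODY_SIGMA = 1740.0, 70.0
HAND_MU, HAND_SIGMA = 190.0, 10.0

def hand_length_from_height(height_mm):
    """Equation 5.1: map the user's height percentile onto the hand-length distribution."""
    z = (height_mm - BODY_MU) / BODY_SIGMA
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # percentile
    hand_mm = HAND_MU + HAND_SIGMA * z               # same z-score in the hand distribution
    return hand_mm, p

def estimate_depth(l_sr, l_sm, d_v):
    """Equation 5.2: l_sr = viewport hand length of the fitted 3D hand,
    l_sm = viewport hand length of the normalized MediaPipe landmarks,
    d_v = current distance of the virtual hand to the virtual camera."""
    scale_factor = l_sr / l_sm
    return scale_factor * d_v

hand_mm, percentile = hand_length_from_height(1892.0)  # worked example from the text:
                                                       # ~211.7 mm at a percentile of ~0.985
depth = estimate_depth(l_sr=0.21, l_sm=0.07, d_v=1.0)  # placeholder viewport lengths
```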
In Experiment 3, we conducted a demonstrative test to assess the application of our hand tracking method in an actual colocated scenario involving a pair of users. The following sections present each experiment sequentially, along with its corresponding results. This organization facilitates a clear grouping of each experiment with its respective results, enhancing the readability and understanding of the outcomes.

5.3.1 Setup

As hand tracking devices we used a LeapMotion sensor, a Meta Quest 2 HMD with its integrated hand tracking, and a 1080p webcam with a 60° horizontal field of view using MediaPipe as the tracking framework. Figure 5.6 shows a sketch of the experimental setup.

Figure 5.6: Sketch of the experimental setup.

The tracking devices were mounted on a fixture that can move back and forth along a rail. The Vive Lighthouse system was used to calculate real-world distances. The real-world position offset between the Vive tracker and the real-world position of the hand tracking device (as well as the offset between the Vive tracker and the real hand) was measured and taken into account in the virtual-world calculations. The rail is 4 m long, which was sufficient for the maximum tracking distance of the evaluated methods. Vive Lighthouse base stations were positioned around the rail system. One Vive tracker was attached to a fixed position on the bracket, and the other to the wrist of the hand to be detected. The offsets to the tracking devices and to the center of the hand were measured and added to the respective positions in the evaluation. The VR application used to collect evaluation data was developed with Unity3D (v.2021.2.14). The rendering of tracked user hands was achieved with the help of the 'EasyHand' framework presented in chapter 3. For hand tracking, the following versions of the hand tracking APIs were used: Oculus Integration v.38, Ultraleap Plugin v.5.4.0 and MediaPipeUnityPlugin¹ v.0.8.3 with MediaPipe backend v.0.8.9.

¹ MediaPipe Unity Plugin: https://github.com/homuler/MediaPipeUnityPlugin (Accessed: 2023-12-15)

5.3.2 Experiment 1: Comparison to Integrated Tracking Solutions

In this experiment, we assess the performance of our RGB-based hand tracking method by comparing it to the methods integrated into Meta Quest and LeapMotion. All selected solutions can be combined with HMDs and are therefore, in principle, suitable for use in a colocated VR setup. We evaluate the three variants of our RGB-based hand tracking method (as explained in section 5.2.2): MediaPipeInternal, MediaPipeHand and MediaPipeBody.

Since our MediaPipe-based method is not fine-tuned for a specific interaction range, in contrast to LeapMotion and Meta Quest, we do not expect it to be more accurate than those methods in the close range (within arm's length). Nevertheless, we expect to be able to cover a larger tracking range with MediaPipe hand tracking and to achieve higher tracking accuracy at distances beyond arm's length. We expect our method to provide a usable tracking capability that works with a simple RGB camera, is simple to set up, and has the ability to detect more than two hands at a time. We analyze the following metrics:

• Static distance error: Error in the distance of the hand from the tracking device, compared to the ground truth, while the hand is held still in one position. This metric is calculated at different distances from the tracking device.
• Dynamic distance error: Error in the distance of the hand from the tracking device, compared to the ground truth, while the hand is moving away from the tracking device. We analyze the correlation between the dynamic error and the distance to the tracking device for all methods.
• Tracking lost and tracking acquired distances: The distance at which tracking is lost when the hand is moving away and the distance at which tracking is acquired when the hand is moving towards the tracking device.

The ground-truth distance dr of the hand to the tracking device is measured with an externally mounted Lighthouse 2.0 tracking system, the tracking accuracy of which has been shown to be in the millimeter range with high replicability of position measurements [4][7]. HTC Vive trackers were attached to the wrist of the hand and to the tracking device, allowing the ground-truth distance dr in the frame t to be calculated as the distance between the two trackers, adjusted by the offset between the center of the tracker and the center of the hand and between the center of the second tracker and the center of the tracking device. The virtual distance dv from the virtual hand to the virtual camera is calculated from the hand tracking data. The absolute Euclidean distance error δ(t) for a frame t can thus be calculated as

δ(t) = |dv − dr|        (5.3)

Static Error Evaluation

For the static error measurement, the tracking device and the tracked hand of a user were positioned at fixed distances from each other, with d ∈ {25, 50, 75, 100, 150, 200, 250} cm. Smaller intervals of 25 cm were chosen in the range where LeapMotion and Meta Quest could consistently track the hand. For d ≥ 100 cm, the sampling interval was increased to 50 cm, since only MediaPipe was able to consistently track the hand at these distances. A maximum distance of 250 cm was chosen, as LeapMotion and Meta Quest had a much shorter possible tracking distance in initial tests and we felt that this distance was sufficient to evaluate the MediaPipe system and thus keep the evaluation time tolerably short. At every sampling position, we collected hand tracking data over N=400 consecutive frames. This was done 25 times per position for each tracking method (Quest, LeapMotion, and the three variants of the MediaPipe method). Each sample containing tracking data of 400 consecutive frames was collapsed to its median value to account for possible small movements of the user's hand. This way, our resulting evaluation sample for each tracking method consists of 25 median static error values per sampling position. Recordings were only carried out if there was consistent tracking during the 400 frames. For Meta Quest and LeapMotion this worked in the range [25 cm; 75 cm] and for all variants of the MediaPipe method in the range [25 cm; 250 cm].

Results

A few outliers were observed, which likely resulted from temporary tracking loss of the deployed ground-truth trackers in the environment. As these outliers were not attributable to the tracking of the hand itself, they were excluded from the analysis to avoid any false influence on the results. Following the removal of the outliers, the Shapiro-Wilk normality test was performed on all median error samples [102].
Because not all median error distributions were normal and because the sampling position ranges differed between the evaluated methods, we compare the median static error of the evaluated methods separately for each sampling position, using the non-parametric Independent-Samples Median test. We also analyze the impact of the distance to the tracking device on the median static error for each method with the non-parametric Friedman's two-way ANOVA test [25][26][27]. The corresponding box-plots are presented in Figure 5.7.

Figure 5.7: Box-plots for the static data collection for each method and distance.

As expected, the tracking error for close distances [25 cm; 50 cm] is the lowest for Meta Quest and LeapMotion. Interestingly, MediaPipeHand delivered a lower error of 1.53 cm than Meta Quest at a distance of 75 cm from the camera (p < 0.001), which is still within the user's arm's reach. Except for LeapMotion, the tracking methods show an increasing median error as the distance to the camera increases. This is in line with our results from the dynamic evaluation. For distances greater than 75 cm, only the values of the MediaPipe methods can be compared. However, a significant difference in mean error can be found between all three methods for all distances (p < 0.001). The error of MediaPipeInternal is significantly larger at all further sampling positions than that of MediaPipeBody and MediaPipeHand. For large distances, the error of MediaPipeInternal increases to values significantly above 10 cm, which can be too inaccurate for precise interactions. The results show that this error can be significantly reduced by using the real hand size (p < 0.001). With a maximum error of 4.47 cm at a distance of 250 cm from the camera, reasonably precise interactions in virtual space at greater distances are also possible with this method. The derivation of the hand size from the body height also shows a significantly lower error. However, as expected, the accuracy is not quite as good as when the actual hand size is given. With a maximum error of 8.57 cm at a distance of 250 cm from the camera, the error is significantly larger, but still improves the initial MediaPipe estimate by more than a factor of three (compared to an initial error of 28.48 cm). A summary of the resulting mean and median values can be found in Table 5.1.

Distance | Statistic | MPI   | MPB  | MPH  | LM   | Q
25 cm    | Mean      |  4.14 | 1.71 | 1.81 | 0.58 | 0.73
25 cm    | Median    |  4.2  | 1.71 | 1.25 | 0.6  | 0.76
50 cm    | Mean      |  6.53 | 2.01 | 1.71 | 0.67 | 1.49
50 cm    | Median    |  6.53 | 1.99 | 1.73 | 0.53 | 1.71
75 cm    | Mean      |  9.45 | 2.14 | 1.42 | 0.93 | 2.94
75 cm    | Median    |  9.40 | 2.13 | 1.24 | 0.83 | 2.58
100 cm   | Mean      | 11.41 | 3.63 | 1.52 | -    | -
100 cm   | Median    | 11.49 | 3.66 | 1.5  | -    | -
150 cm   | Mean      | 19.27 | 5.25 | 2.41 | -    | -
150 cm   | Median    | 19.72 | 4.94 | 2.46 | -    | -
200 cm   | Mean      | 24.62 | 6.95 | 3.85 | -    | -
200 cm   | Median    | 27.7  | 6.89 | 3.64 | -    | -
250 cm   | Mean      | 28.49 | 8.57 | 4.47 | -    | -
250 cm   | Median    | 28.37 | 8.47 | 4.71 | -    | -

Table 5.1: The resulting mean and median static error values for all distances (MPI = MediaPipeInternal, MPB = MediaPipeBody, MPH = MediaPipeHand, LM = LeapMotion, Q = Meta Quest). Error values are in centimeters.

These results show that the accuracy of MediaPipe tracking in 3D space can be significantly improved by providing the user's real hand size as input, allowing 3D interactions at large distances, which is not possible with the LeapMotion sensor or the Meta Quest hand tracking.
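The aggregation and per-distance comparison described above can be sketched as follows; this is illustrative Python with synthetic placeholder data, whereas the actual analysis in the thesis was performed in SPSS.

```python
# Sketch of the static-error aggregation and the per-distance median comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def collapse_samples(frame_errors):
    """Collapse each 400-frame recording to its median -> 25 values per sampling position."""
    return np.median(frame_errors, axis=1)

# simulated recordings: 25 repetitions x 400 frames for two methods at one distance
mediapipe_hand = collapse_samples(rng.normal(1.4, 0.3, size=(25, 400)))
quest = collapse_samples(rng.normal(2.9, 0.5, size=(25, 400)))

# Independent-samples median test (Mood's median test) at one sampling position
stat, p_value, grand_median, table = stats.median_test(mediapipe_hand, quest)
```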
Dynamic Error Evaluation

For the dynamic evaluation, the user held their hand at a fixed position (resting their elbow on a fixture to guarantee a consistent hand height) while the tracking device moved away from it on a rail system (starting at a distance of 25 cm) at a constant slow speed, up to a distance at the edge of the device's tracking range. The user kept all fingers extended, as this is an easy-to-hold and repeatable gesture that is usually well recognized by hand recognition systems. While a user walking away from the tracking device would have presented a more ecologically valid hand tracking situation, the rail system was used to ensure the repeatability and uniformity of data collection for this technical evaluation. The movement range (in meters) was: for the LeapMotion sensor r → [0.25; 0.75], for Meta Quest r → [0.25; 1.75], and for MediaPipe r → [0.25; 2.75]. Interestingly, the Meta Quest achieved a higher tracking range here than in the previous test, which may be due to the fact that the system can keep tracking an already recognized hand at distances at which it cannot acquire a completely new hand. However, the real reason remained hidden from us. This procedure was repeated 10 times for each tracking method. The dynamic error data resulting from these recordings were averaged and analyzed according to the procedure described in detail in the following section.

Results

For each frame, we obtain the real-world distance dr and the virtual-world distance dv of the hand to the camera. The tracking error was calculated with Equation 5.3 and paired with the real-world distance dr. Examples of dynamic error data samples for Meta Quest and MediaPipeHand prepared in this way are illustrated in the scatter plot in Figure 5.8.

Figure 5.8: Scatter plots of two example dynamic error distributions for Meta Quest and MediaPipeHand. The x-axis refers to the real-world distance of the hand to the tracking device, the y-axis refers to the error difference between the real-world and virtual-world distance. Three regression lines for the Meta Quest data are illustrated separately for NearRange, MidRange and FarRange due to the rising gradients in each range. One regression line for MediaPipeHand over all distances is illustrated. The linear equations of the regression lines are in cm/cm. The Quest median error rises more steeply at large distances.

The discretized dynamic error distributions are used to perform linear regression, with the gradient of the fitted regression line determining the rate of the error increase with distance from the tracking device. Since the tracking ranges of the evaluated tracking methods are different, we perform the linear regression separately in three distance ranges: NearRange [< 75 cm], MidRange [75 cm; 150 cm] and FarRange [> 150 cm]. The NearRange distance was chosen based on the results of the previous experiment, as both LeapMotion and Meta Quest delivered reliable results here; it is also within the usual range of one arm's length. MidRange was selected because Meta Quest provided additional results here in the current experiment. FarRange then covers all distances above MidRange, which was only reached by the MediaPipe methods in this experiment (Meta Quest reached slightly into this range, but with error values far larger than those of the other methods).
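The per-range regression can be sketched as follows; the distance and error arrays are synthetic placeholders for one recording, and the range boundaries follow the definitions above.

```python
# Sketch of the per-range regression used for the dynamic evaluation: fit a line
# error = a * distance + b within each distance range and keep the gradient a (cm/cm).
import numpy as np
from scipy import stats

RANGES = {"NearRange": (0.0, 75.0), "MidRange": (75.0, 150.0), "FarRange": (150.0, np.inf)}

def range_gradients(distances_cm, errors_cm):
    """Return the regression gradient (cm error per cm distance) for each covered range."""
    gradients = {}
    for name, (lo, hi) in RANGES.items():
        mask = (distances_cm >= lo) & (distances_cm < hi)
        if mask.sum() > 1:
            fit = stats.linregress(distances_cm[mask], errors_cm[mask])
            gradients[name] = fit.slope
    return gradients

# synthetic recording: error grows roughly linearly with distance plus noise
rng = np.random.default_rng(2)
d = np.linspace(25, 275, 500)
e = 0.05 * d + rng.normal(0, 0.5, d.size)
print(range_gradients(d, e))   # one gradient per covered range; repeat for all 10 recordings
```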
This procedure results in 10-entry distributions of regression coefficients for every tracking method, in the distance ranges covered by the method. We can now compare all five tracking methods in NearRange, Meta Quest and the three MediaPipe variants in MidRange, and the three MediaPipe methods in FarRange (the analysis for the Quest was carried out in this range as well for the sake of completeness). We use a one-way ANOVA to compare the mean regression coefficient values between the methods within each distance range (the data is normally distributed according to the Shapiro-Wilk test). The resulting plots are shown in Figure 5.9. The gradients are given in cm/cm, i.e., the increase of the tracking error in cm per cm of distance from the tracking device.

Figure 5.9: Mean gradient values for each method and distance range. Gradients are in cm/cm.

In NearRange, LeapMotion shows the most consistent tracking with a mean gradient of 0.0007. This is a significantly (p < 0.001) smaller error increase rate than for the Meta Quest and the MediaPipe methods. Meta Quest has the steepest error increase with distance, with a mean gradient of 0.213, and shows significant differences to MediaPipeBody (p = 0.006) and MediaPipeHand (p = 0.002). The mean gradient of 0.098 for MediaPipeHand is the lowest of the three MediaPipe methods, compared to 0.117 for MediaPipeBody and 0.162 for MediaPipeInternal. The statistical analysis, with a substantial effect size of η² = 0.827, reveals significant differences among all MediaPipe methods, except between MediaPipeHand and MediaPipeBody. Further information on means and pairwise comparisons can be found in Table 5.2.

MidRange shows significant differences (p < 0.001) and an effect size of η² = 0.992 for all pairwise comparisons of mean gradients, except (as in NearRange) between MediaPipeHand and MediaPipeBody (p = 0.333). With a gradient of 0.567, the Quest has the largest gradient here, more than double its NearRange value. Since the gradients of MediaPipeHand (0.054) and MediaPipeBody (0.064) have decreased somewhat compared to NearRange, the tracking error in these ranges does not increase as much. In comparison, the gradient of MediaPipeInternal (0.144) is similarly high as in NearRange. This shows that the external input of the hand size (whether measured or derived from body size) reduces the growth of the tracking error. A significant difference between hand size by measurement and by body size cannot be found in MidRange.

In FarRange, the Meta Quest again shows a higher gradient than in the closer ranges (mean gradient of 0.749). This is again significantly higher than for the MediaPipe methods (p < 0.001). This indicates a strongly increasing error for Meta Quest at large distances and suggests that the Quest's hand tracking is not designed for these distances. The MediaPipe methods again show mean gradients similar to MidRange. With a mean gradient of 0.102, MediaPipeInternal shows the largest increase in tracking error among the three methods, with a significant difference from MediaPipeHand (p = 0.006; a significant difference from MediaPipeBody could not be shown, p = 0.170). With mean gradients of 0.059 for MediaPipeBody and 0.047 for MediaPipeHand, these two are close to each other.
As there are again no significant differences between these two in FarRange (p = 0.915), it can generally be stated that the growth of the error is significantly reduced by providing the hand size externally compared to MediaPipe's internal hand-size estimate, while no significant differences could be determined between obtaining the hand size by measurement and deriving it from the body size. The one-way ANOVA yielded an effect size of η² = 0.98. This substantial effect size indicates a strong influence of the method on the observed differences in the gradients and highlights the significant impact that the choice of method has on the measured outcomes. Figure 5.8 shows the data points with their linear regression lines, where one can see how much faster the error rises for Meta Quest than for the MediaPipe method with the measured real hand size as input.

Mean gradient (cm/cm):
Range       MPI     MPB     MPH     LM      Q
NearRange   0.162   0.117   0.0982  0.001   0.213
MidRange    0.147   0.0644  0.054   -       0.567
FarRange    0.102   0.059   0.047   -       0.749

Table 5.2: The resulting mean gradients for the different ranges in the dynamic error evaluation. Mean error gradient values are in cm/cm; '-' marks ranges not covered by a method.

Lost and Acquired Tracking Distances

For each tracking method, we recorded the distance at which the tracking device loses the hand while being moved away, as well as the distance at which it first acquires the hand while being moved towards it. To determine the lost tracking distance, the user positioned his hand at a distance where it was reliably tracked. After the hand had been tracked for at least one second, the device was moved away from the hand, and the distance at the moment the device lost the hand was recorded. To avoid recordings at moments in which the tracking was only briefly disrupted, the hand had to remain lost for at least one second. The measurement of the distance at which the tracking device acquires the hand follows a similar procedure. The device was set at a distance where the hand is not detected (LeapMotion: 1 m; Quest: 1 m; MediaPipe: 3.75 m). After the program had ensured that the hand was not tracked for at least one second, the device was moved toward the hand until the hand was tracked. Again, the hand had to remain tracked for at least one second before the distance was recorded. Both procedures were repeated 25 times per method.

Results

A total of 25 data points were collected for each tracking method and both evaluations.

Figure 5.10: Box-plots of lost and acquired tracking distances for the tracking methods. Labels show median values.

The medians for lost tracking distances proved to be normally distributed in the Shapiro-Wilk test ([102]) after three outliers (caused by interferences in the ground-truth tracking) had been removed (p = 0.229 for the MediaPipe tracking, p = 0.557 for LeapMotion, p = 0.293 for the Quest tracking). The resulting box-plots can be seen in Figure 5.10. For the main analysis, Welch's test ([123]) was employed, yielding a significance level of p < 0.001 and an effect size of η² = 0.827. Subsequently, the post-hoc analysis was conducted using the Games-Howell test [30]. The findings from the post-hoc analysis demonstrated that the mean lost tracking distance for the MediaPipe tracking method (mean distance of 412.48 cm) was significantly greater than that of all other methods (p < 0.001).
The mean lost tracking distance of the Meta Quest was significantly larger than that of LeapMotion (p < 0.001), but also significantly lower than that of the MediaPipe tracking (p < 0.001). The lowest mean lost-tracking distance is found for LeapMotion, at 95.6 cm. With 237.78 cm, the Meta Quest has quite a large tracking range, but it also shows severe tracking errors at longer distances, as seen in the previous sections. The detailed results can be found in Table 5.3.

Distributions of the recorded distances at which tracking was acquired deviated from normal (in the Shapiro-Wilk normality test) for two out of three tracking methods; therefore, the median acquired tracking distances were compared with the Independent-Samples Median test. The corresponding box-plots are illustrated in Figure 5.10. The pairwise comparisons of the Independent-Samples Median test demonstrate significant differences between the methods, with a notable effect size of η² = 0.827. These comparisons reveal that the median distance of 324.94 cm for MediaPipe is significantly greater than that of LeapMotion, which measures 46.05 cm (p < 0.001), as well as that of Meta Quest, with a median distance of 40.33 cm (p < 0.001). The difference in acquired tracking distance between Meta Quest and LeapMotion is also statistically significant (p = 0.008), although the corresponding median values are much closer (with a difference of 5.72 cm). The fact that MediaPipe consistently acquires hand tracking at a range of approximately 3 meters shows that this method could be used for larger areas, while LeapMotion and Meta Quest are limited to near-range distances within arm's length.

Tracking acquired (EnterHand):
             MediaPipe   LeapMotion   Quest
Mean (cm)    321.51      46.6         40.27
Median (cm)  324.94      46.05        40.33

Tracking lost (LeaveHand):
             MediaPipe   LeapMotion   Quest
Mean (cm)    412.48      95.6         237.78
Median (cm)  411.43      95.47        235.37

Table 5.3: The resulting mean and median distances for the different tracking methods when tracking is acquired and lost. Distance values are in centimeters.

5.3.3 Experiment 2: Accuracy of MediaPipe-Based Hand Tracking for Multiple Users

The primary objective of this second experiment was to validate the effectiveness of our MediaPipe-based method for various users and to assess the influence of hand size estimation on the accuracy of the 3D positioning error in greater depth. Although it did not involve an extensive user study with joint-error measurements, it served to confirm the usability of our method and highlight other impacts on positioning errors. The user test is intended to validate the applicability of our proposed methods for calculating hand length and positioning the hand in 3D space using various input data (such as body height and hand size). In this experiment, we repeated the procedure for static and dynamic tracking error measurement from Experiment 1 for each user separately, this time focusing only on the MediaPipeInternal, MediaPipeBody, and MediaPipeHand methods. For the static error measurement, four hand tracking data recordings (400 frames each) per distance per user were made at distances d ∈ {25, 50, 75, 100, 150, 200, 250} cm between the tracking device and the user's hand, resulting in four median error values. One recording of dynamic tracking error data per user (for each method) was conducted. The calculations of the distance error in the static and dynamic conditions were the same as in the previous experiment.
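For reference, a minimal sketch of this static-error computation per recording, assuming (as the scatter-plot description in the previous experiment suggests) that the per-frame error of Equation 5.3 is the difference between the real-world and virtual-world hand-to-camera distances; names and values are illustrative.

```python
import numpy as np

def per_frame_error(d_real_cm: np.ndarray, d_virtual_cm: np.ndarray) -> np.ndarray:
    """Per-frame distance error in cm (assumed reading of Equation 5.3)."""
    return np.abs(d_real_cm - d_virtual_cm)

def recording_median_error(d_real_cm: np.ndarray, d_virtual_cm: np.ndarray) -> float:
    """Median error of one 400-frame static recording."""
    return float(np.median(per_frame_error(d_real_cm, d_virtual_cm)))

# Four recordings per distance and user yield four median error values.
rng = np.random.default_rng(42)
medians_at_75cm = [
    recording_median_error(np.full(400, 75.0),
                           75.0 + rng.normal(0.0, 1.5, size=400))
    for _ in range(4)
]
```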
The hardware and software setup used for data recording was the same as in Experiment 1. In total, hand tracking data was collected from 9 additional users, 3 female and 6 male, ranging in age from 20 to 56 years. They consisted of students from various fields of study, as well as acquaintances. They took part in the experiment voluntarily and did not receive any compensation. Including measurements of body and hand size and introduction, the procedure took about 45 minutes per user. Results - Static Error Figure 5.11 presents box plots illustrating the average static error of MediaPipeInternal, MediaPipeBody, and MediaPipeHand at different distances. To maintain a consistent analysis metric and account for the relatively low variance observed in the four median error values obtained for each user at each distance, we consolidated these errors by calculating their median value and employed this value for the analysis. As evident from the results, outliers are observed, which may have been caused by temporary tracking losses. Despite their presence, these outliers were retained in the analysis because it could not be guaranteed that their removal would be advantageous for maintaining the integrity of the test, as previously mentioned. With the data deviating from the normal distribution in many cases, we again used the Independent-Samples Median Test to find whether the error depends on the hand tracking method at each distance. The method was statistically significant for the error at the distance of 200 cm (p = 0.016; follow-up pairwise comparisons did not find any statistical significance), with the result at the distance of 250 cm being marginally below statistical significance. From the pattern seen in the box-plots with MediaPipeInternal producing visibly higher errors at larger distances, we believe that more detailed results could be achieved with a larger sample size of users or recorded data. Nevertheless, the available data supports the results of the main analysis. Analyzing the relationship of error to distance from the tracking camera for each method, we found that the error increases with the distance for MediaPipeInternal (p = 0.002 in Independent-Samples Median Test) and MediaPipeBody (p = 0.029). For Medi- aPipeHand, no statistically significant dependency of an error on the distance could be found. 79 5. Evaluate and Improve an RGB-Based Hand Tracking Solution for Colocated VR Usage Figure 5.11: Box-plots of the mean error at every measured distance in the test with multiple users. The results of the previous experiment are confirmed by these results that the tracking error can be improved by entering the real hand size and thus interactions in 3D space can also be made possible over larger distances. Results - Dynamic Error We used the same approach for calculating the gradients of error change in the near, middle and far range on the evaluation data of multiple users. Figure 5.12 shows the gradients for each method in NearRange, MidRange and FarRange. Figure 5.12: Mean gradient values for each method and distance range in Experiment 2. Gradients are in cm cm . We used Mixed ANOVA with the range as a repeated-measures factor (with three levels) and the method as a between-subject factor (also with three levels). In this analysis, only the range was found to be statistically significant (F = 4.683, p = 0.014), with 80 5.4. 
Demonstration in a Colocated Setup within-subject repeated contrasts showing the increase of gradient from MidRange (mean = 0.035) to FarRange (mean = 0.112), p = 0.002. The result shows us that the increase in error also increases with greater distance for all MediaPipe methods. Figure 5.12 shows mean gradients for each tested range and method. 5.4 Demonstration in a Colocated Setup The goal of this last experiment was to test the usability of our hand tracking method in a real-world colocated VR scenario. To do this, we developed a simple VR application in which two users can see each other’s avatars consisting of the virtual HMD and hands steered by head and hand tracking input. The aim was not to create a detailed qualitative user analysis, but rather a proof-of-concept demonstration. In our setup, each user has an HTC Vive tracker attached to their hand, which provides position and rotation data of the real hand positions in the space of the Lighthouse 2.0 Tracking and serves as ground-truth. Both users have a VR headset on (we used Meta Quest with its own hand tracking feature turned off). User 1 has an RGB camera attached to the HMD (in this scenario a ZED mini camera that only provided the RGB image of one lens to use with MediaPipe). With this camera, all hands were detected and visualized in the shared virtual environment. Since only User 1 had a running hand detection system, it was ensured that only one system tracked all hands and placed them in the virtual space. The users stood facing each other at a distance of about 1.5 meters and held their hands in the tracking area of the RGB camera. As a result, the hands of User 1 were in NearRange and the hands of User 2 were in MidRange during the recording. The camera simultaneously detected and tracked the hands of User 1 and User 2. In the virtual environment, the user’s hands were positioned with the hand tracking input. 5.4.1 Results Figure 5.13 shows a snapshot of the scenario experienced by the users in the experiment. The RGB image of the tracking camera is overlayed with the virtual scene seen in VR to give a better illustration of hand tracking. The plane that can be seen in Figure 5.13 is the floor plane in the virtual environment. Since only User 1 had a hand tracking device attached to his HMD, the view of all virtual hands, his own and those of User 2, is enabled by his hand tracking camera. The same virtual hands were displayed to User 2 and animated by hand tracking data. Hand tracking data was recorded over a period of about 30 seconds, resulting in snapshots of 1242 consecutive frames. The detected hands were assigned in pairs to the corresponding Vive trackers, and the deviation from the position of the ground truth tracker was calculated as a tracking error. The position errors for each user are plotted in Figure 5.14. Peaks and missing values in the plot are moments where corresponding hands were not tracked for a moment. 81 5. Evaluate and Improve an RGB-Based Hand Tracking Solution for Colocated VR Usage Figure 5.13: The point of view of User 1 with the tracking device attached tracking also the hands of User 2. Both users are colocated, the real-world view of the camera can be seen slightly overlayed. Since MediaPipe provides a new set of position data every frame, which is not based on previous results, the actual hand movement in virtual space can look very jumpy. Frame-related short interruptions cause virtual hands to jump around a lot in space. 
To counteract this, we have applied an exponential low-pass smoothing filter that adjusts hand positions based on previous positioning. Short dropouts are thus counteracted and general movements appear much smoother. The implementation is taken from the DigitalRune library and has a time constant of 0.05 ms as a delay 2. Due to this smoothing filter, inaccurate positioning can occur right before losing or after acquiring tracking. The calculated mean error for the hands of User 1 was 5.1 cm with a standard error of 0.154. The 95% confidence interval provides a lower limit of 4.8 and an upper limit of 5.41 cm. For User 2, we obtain a mean of 8.21 cm with a standard error of 0.17, a lower limit of 7.88, and an upper limit of 8.55 cm for the 95% confidence interval. The higher position error values for User 2 were expected, as User 2 was further away from the camera. The values correspond to the expectations of a more realistic scenario based on the results of the dynamic and static tracking error evaluations. Although we calculated the error of the tracked hands in this scenario, it is important to note that a single user pair is insufficient to provide a comprehensive user study for real-world colocated applications. In addition, the use of direct manual manipulation tasks would be an interesting extension in such a study in order to additionally validate the usability. Nevertheless, we aim to demonstrate that our method is not limited to controlled scenarios with fixed machinery but can also be applied in real colocated setups involving two users. To further evaluate the presented method, it would be valuable to conduct a qualitative user study that measures interaction precision, usability, and user 2DigitalRune: https://digitalrune.github.io/DigitalRune-Documentation/html/81 cd4f27-5ce5-4439-9a6c-121f2942f175.htm(Accessed:2024-02-19) 82 5.5. Discussion Figure 5.14: Error of hand pairs in centimeters during the preliminary user test. The hands of User 1 were in NearRange, while the hands of user 2 were in MidRange. Spikes can occur due to tracking losses and wrongly positioning due to a smoothing filter. experience. Such a study would provide an interesting avenue for future research and a more comprehensive evaluation of our approach. 5.5 Discussion The results of Experiment 1 show that Meta Quest and LeapMotion are more accurate at arm’s length range than RGB input-based tracking with MediaPipe, which was to be expected. The tracking errors for Meta Quest and LeapMotion are in line with previous research results ([1][99]). However, if the user’s hand length is used to improve the MediaPipe-based method (either being estimated based on the height or entered from a real measurement), RGB tracking delivers comparable accuracy with a distance error in the range of 12.4mm to 21.3mm (see the lowest and largest error in Figure 5.7 for MediaPipeHand and MediaPipeBody in distances between 25 cm and 75 cm). This shows that in this distance range the improved RGB hand recognition offers interactions with similar precision as the off-the-shelf solutions. At the best value in the close range of 25 cm, the error of the RGB method with a real measured hand is 3.36 times lower than without external input. In the outer range at a distance of 250 centimeters even by a factor of 6. Due to the improvement through external hand size input, this even allows interactions in ranges of 2.5 meters, which is not possible with Meta Quest or LeapMotion. 
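Looking back at the exponential low-pass filter used in the colocated demonstration above: the thesis relies on the DigitalRune implementation, so the following standalone sketch only approximates such a filter, with an assumed parameterization of the time constant; it is not the actual code.

```python
import numpy as np

class ExponentialLowPass:
    """Simple exponential low-pass smoothing of 3D hand positions."""

    def __init__(self, time_constant: float):
        self.time_constant = time_constant  # smaller value = less smoothing
        self.state = None                   # last smoothed position

    def update(self, position: np.ndarray, dt: float) -> np.ndarray:
        # Blend factor derived from the time constant and the frame time dt.
        alpha = 1.0 - np.exp(-dt / self.time_constant)
        if self.state is None:
            self.state = position.copy()
        else:
            self.state += alpha * (position - self.state)
        return self.state

# Usage: smooth a noisy per-frame hand position at ~72 fps.
filt = ExponentialLowPass(time_constant=0.05)  # value from the text; unit as in the app
raw = np.array([0.31, 1.12, 0.48])             # raw tracked position, illustrative
smoothed = filt.update(raw, dt=1 / 72)
```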
Assuming a colocated multi-use scenario, the participating users move at different distances from each other. If they carry out 83 5. Evaluate and Improve an RGB-Based Hand Tracking Solution for Colocated VR Usage joint interactions or tasks at the same location, they move in close proximity, which usually exceeds their own hand length. Detection and interaction ranges of 2.5 meters also qualifies this method for use in such scenarios, although these would need to be investigated in more detail together. The results of the acquired and lost tracking distance show that Meta Quest and LeapMotion are in a similar range at which distance a hand is recognized for the first time. This makes sense in the sense that both systems were designed for interactions in the arm’s length. In comparison, MediaPipe has an almost eight times higher distance at which the hand is recognized for the first time, which also allows for a significantly higher range and more possibilities, for example, to recognize the hands of other users that are further away from the own user. After a hand has been detected once, all methods have a greater distance where the tracking of the hand is lost again. It is striking that LeapMotion only allows hand detection in the areas where tracking quality is guaranteed to be within a certain tracking confidence range (which LeapMotion calculates), whereas the Meta Quest goes beyond this range and continues tracking the detected hand up to a distance of 235 centimeters. These differences between acquired and lost tracking are shown by the purple ’overtracking ranges’ in Figure 5.10. This lack of limitation of the Meta Quest for distances beyond arm’s length also leads to very strongly increasing tracking errors in this case. This makes the Quest unusable due to high tracking errors in Mid- and FarRange detection, even though hand tracking would be possible for the system in this range. This is also reflected in the results of dynamic tracking. Within the tracking range, LeapMotion has significantly fewer discrepancies in tracking error than, for example, the Meta Quest. The latter shows significantly increasing tracking errors, especially in tracking outside the NearRange. MediaPipe is much more consistent here, even though the tracking error still increases as the distance increases. However, this slope is nowhere near as steep as in Meta Quest. Even between the different presented depth estimations for MediaPipe, the increasing error is significantly lower in the variants where the user’s hand size was added externally. Thus, the error is less at larger distances. This is also reflected in the results of the static evaluation. This is a further indication that RGB methods (such as MediaPipe) can be improved and are thus suitable for enabling hand tracking and hand interactions even at greater distances. During the demonstration, we observed that the computation time and the occurrences of temporal tracking losses increased as the number of simultaneously detected hands increased, although we did not collect accurate real-time data. This behavior aligns with expectations for an image-tracking system designed for multi-object recognition. While the presence of four hands within the tracking frame did not cause significant issues, the limitations of the current state of the MediaPipe system became more apparent as the number of hands increased. 
However, it is important to note that these observations did not affect our calculations, and we anticipate that future versions of the framework will 84 5.6. Conclusion address these limitations and enhance stability. As has also been shown, the way the real hand size is determined has a significant impact on tracking accuracy. The more accurately the hand size is determined, the smaller the error. Unfortunately, the 3D coordinates of the finger joints provided by MediaPipe did not accurately match the real hand size and the size of the hand had to be input externally to effectively improve depth calculation. The results of Experiment 2 largely reflect the pattern of the main analysis, although significance was not shown in all comparisons. The error of the 3D hand positioning increases with the distance to the tracking camera. However, the error can be reduced (especially for high distances) by inputting the users’ hand size. We believe that a larger data set and extended user testing (which unfortunately was not possible at this time) would increase the significance and further increase the level of detail in the results. Furthermore, it can be seen that when estimating hand size from body size for 10 users (from Experiment 1 and Experiment 2), the deviation of the calculated hand size to the actual hand length was less than 1 cm for all users. The algorithm can be used as an alternative when the user’s body size is known rather than their hand length. For future applications, it would be interesting to find a method to perform this measurement more universally and without external input. One possibility would be a calibration step at the beginning of the application, where the hand is measured at a certain distance, or where the body size is determined based on the height of the VR headset and then the hand size is derived. For applications where full-body avatars are used, for example, these could also be used to improve the tracking error that still exists. If they are scaled to the user’s size, their maximum arm length can be used as an improvement metric to position the hand more accurately to the avatar. An approach such as ’Virtual Caliper’ by Pujades et al. would be interesting and helpful in this respect, that use a VR controller to take measurements in order to adapt a 3D avatar to the user’s body measurements [94]. The mean hand position errors calculated in Experiment 3 are in line with the results of the controlled static and dynamic tests. With continuously detected hands, consistent positioning can take place. In the test, we noticed that the quality of positioning is also dependent on the quality of the underlying tracking (MediaPipe in this case). A more consistent and better recognition in the future also improves the real error of our algorithm. Coupled with a high-resolution camera, consistent multi-user hand tracking in colocated rooms can be realized. 5.6 Conclusion The study that was presented in this chapter evaluated the tracking error and tracking range of three different hand tracking technologies, one of which works via RGB cameras and is not tied to a specific manufacturer’s hardware. In addition, a method was developed that significantly improves the tracking result of RGB hand tracking by taking hand size into account. 85 5. 
Evaluate and Improve an RGB-Based Hand Tracking Solution for Colocated VR Usage The evaluation shows that although the hand tracking of Meta Quest and LeapMotion is more accurate in the arm’s length range, they are not designed for hand detection outside this range. Therefore, we achieve comparatively higher precision at larger distances with the RGB method. This can be further increased if additional information about the user’s hand length is used in the calculation of the 3D positions. Direct measurement of the hand length is more precise, but deriving the hand length from the user’s height also produces comparable tracking errors. For more general use, the body height input is probably more intuitive, as many people know their body height rather than their hand length. It might even be possible to automatically determine the body height in a virtual reality application through an initialization phase. This would be a use case for future research. The presented results showed that hand tracking using the (improved) RGB method has significantly higher tracking ranges (with usable distance well above arm’s length range) as well as the ability to track more than just two hands at a time. This can be especially useful for hand recognition in larger tracking areas with multiple users. Thus, we have also shown that by inputting the user’s real hand size, a tracking system based on a single image RGB camera can provide results that provide a tracking error that could allow direct virtual interactions, as well as significantly expand the tracking range for hands. This method could be superior to the presented commercial systems such as LeapMotion or Meta Quest for specific scenarios as colocated multi-user scenarios, where one also wants to track the hands of the other users. Therefore, it would be interesting in future experiments to see the effects of such RGB tracking in multi-user VR scenarios and to find out how it can disrupt or improve hand tracking in such applications. Since our setup and use case focuses on a single camera for tracking, using a camera array to improve tracking of colocated users’ hands would be an interesting extension for the future. One limitation of the measurements in the experiments shown was that only the distance of the palm of the hand was measured. A more detailed analysis of the distance errors of all finger joints would be an interesting extension of these results. With the successful implementation of the presented work, we can now create a colocated VR environment using hand tracking. This system allows us to track multiple hands and accurately position them in three-dimensional space within an acceptable margin of error, enabling seamless interaction for each user. However, the introduction of multi-user hand tracking presents a new challenge – the reliable assignment of recognized hands to virtual users. This is crucial for tailoring specific interactions to individual users. The forthcoming chapter will address this challenge and explore various approaches to solving it. 86 CHAPTER 6 Assignment of Tracked Hands in Colocated VR The upcoming chapter will focus on our approach of assigning virtual hands to colocated virtual users. We will present various methods for calculating confidences that a specific hand belongs to a particular user. 
These methods are designed to accommodate different scenarios, utilizing either the distance or rotation of the hand to the user, bounding boxes of the user’s reach area, or employing a machine learning approach for assignment confidence calculations. To enhance assignment accuracy and explore initial steps towards performance improve- ments, we introduce algorithms that build upon the base methods. One algorithm dynamically selects the most suitable base method based on user formations, while another leverages historical cached results to predict future hand assignments, addressing temporary weaknesses in the assignment algorithm. To determine the effectiveness of the proposed methods and improvement algorithms, we conducted an extensive evaluation using simulated colocation scenarios. Our analysis encompasses accuracy and performance requirements, providing insights into the usability of these methods in real-time scenarios. The chapter concludes with a recommended method for hand assignment and offers a future outlook on potential enhancements to the results. 6.1 Motivation In a colocated VR scenario where each user utilizes their own hand tracking system, the assignment of tracked virtual hands is straightforward. Given that many tracking systems, such as LeapMotion, Meta Quest, or Vive Focus, are typically limited to tracking two 87 6. Assignment of Tracked Hands in Colocated VR hands at most [37], the assignment is made based on the tracking system that initially detected the hand. Figure 6.1 provides an illustration of this process. Figure 6.1: A colocated scenario involving two users, each equipped with their own hand tracking system. Virtual hands in the right image are assigned to the system that tracked the corresponding real hand in the left image. With the introduction of hand tracking capable of simultaneously tracking more than two hands, as detailed in chapter 5, the assignment process becomes considerably more complex. This challenge is illustrated in Figure 6.2. In scenarios where a tracking system can detect more than two hands at once, it becomes impractical to assign virtual hands directly to the user equipped with the hand tracking system. To ensure consistent three-dimensional interactions, alternative methods must be explored to reliably assign virtual hands to the correct colocated virtual user. Current research addressing this problem focuses either on image tracking [113][80] to extract additional information, such as overlapping tracking boxes or hand color, or is limited to two-dimensional images [59]. To our knowledge, there is currently no solution available that addresses this assignment problem in a three-dimensional virtual environment where only location and orientation information is available. This is especially difficult for hands in close proximity. Consequently, the objective of our work is to assign all virtual hands in 3D space to their respective users solely based on the location of the tracked hands relative to all present virtual users, and therefore enabling targeted interactions within the virtual environment. In this chapter, we present an algorithm that computes the confidence of a virtual hand belonging to a specific user. We also address the challenging task of accurately assigning hands when they overlap in the virtual scene. To achieve this, we demonstrate various methods for computing this confidence and compare their effectiveness. 
In addition, we introduce a history-based algorithm that utilizes previous assignments and confidences to adjust future assignments (comparable to dead reckoning in multiplayer games [77]). 88 6.2. Estimation Methods Figure 6.2: Two colocated users where only one user is equipped with a hand tracking system that is able to track more than two hands. Virtual hands on the right image cannot be reliably assigned to the correct user. Finally, we evaluate our methods within a colocated virtual reality simulation, considering different user formations involving overlapping hand interactions. Our evaluation focuses on assignment accuracy and the performance of seven method combinations to identify the most reliable and robust approach for assigning hands in all possible colocated virtual reality scenarios. We also assess the effectiveness of the history algorithm in improving assignment accuracy. We aim to offer an algorithm for accurately assigning hands to virtual users without relying on additional image recognition methods, solely using the position and rotation of the hand and the user, which are provided by the used HMD and hand-tracking system. Additionally, we aim to solve the assignment problem that arises when tracked hands in the virtual scene are in close proximity, which is also particularly useful when a user’s hands are obscured (for example by other body parts). 6.2 Estimation Methods Our algorithm aims to enable the most reliable hand assignment to users, taking into account the unique challenges of multi-user colocation scenarios, such as close proximity of the users and overlapping hands. In traditional hand-tracking systems, it is often assumed that only the hands of the primary user will be tracked, as most systems are limited to tracking a maximum of two hands at a time [37][119][107]. However, in colocated scenarios, it is necessary to assign the hands of each user to their respective tracking systems. A naive approach to hand assignment is to assign each virtual hand to the user closest 89 6. Assignment of Tracked Hands in Colocated VR to it. However, this approach can be error-prone in formations where users are close to each other and virtual hands are spatially intermixed. An example of such a scenario would be a surgical or assembly task in which several users perform hand interactions in a confined space. In our algorithm, we consider several algorithms, including distance and rotation, to compute a confidence factor for each virtual hand’s assignment to a user. In addition, we store the previous assignment of hands to users in a history cache to improve the accuracy of future calculations. By considering the history of previous hand assignments, our algorithm can make more reliable predictions about to which user a particular virtual hand belongs to. 6.2.1 Empirical Estimation Methods We developed four approaches to determine the affiliation of a hand to a user, each of which computes a confidence factor p. Distance: We calculate the euclidian distance of the hand to the user’s head and assume that the closer the hand is, the more likely it is to belong to the user. This principle is illustrated in Figure 6.3. If the distance exceeds a threshold value, we no longer assume that the hand belongs to the user. The threshold was chosen so that it is slightly larger than an arm’s length. Figure 6.3: Distance method: The distance between the hand and the user is calculated. The hand is assigned to the user with the shortest distance. 
To calculate confidence values between 0 and 1, we use a logistic curve, as shown in Equation 6.1. We chose a distance range of [0.4 m; 1.3 m], which results in a = 10.22 and b = 0.85. Figure 6.4 shows the resulting logistic curve. This method was selected based on its simplicity and anticipated high accuracy in numerous colocated scenarios where users are sufficiently separated from each other.

p = 1 − 1 / (1 + e^(−a(x − b)))    (6.1)

where:
a = the logistic growth rate
b = the x-value of the function's midpoint

Figure 6.4: Logistic curves for the distance (formula at bottom left) and rotation (formula at bottom right) calculations. The x-axis represents the input position/rotation, while the y-axis denotes the resulting confidence. The red lines indicate the selected threshold values.

Rotation: We calculate the angle between the forward direction of the hand (computed from the connection between the wrist position and the position of the lowest joint of the middle finger) and the direction vector between the head of the user and the hand. We assume that a hand turned away from the user is more likely to belong to that user than a hand turned towards the user (which would require an uncomfortable hand position). This method is further illustrated in Figure 6.5. We calculate p using the logistic curve in Equation 6.1 with values in the angle range [90°; 180°], which results in a = 0.1 and b = 135. The logistic curve can be seen in Figure 6.4.

Figure 6.5: Rotation method: The angle between the forward vector of the hand and the direction vector from hand to user is calculated. The hand is assigned to the user with the smaller angle.

Prerecorded Area: In a preliminary setup, we asked the user to stretch out their arms and perform defined movements in front of, to the side of, and behind their body, resulting in a total of 459 points. A convex hull is then created from these points, and it is checked whether the user's hand lies inside this hull or not. Our focus was on capturing the essential movements rather than a precise number of points, to ensure comprehensive coverage of all relevant sides of the body. Such a convex hull can be seen in Figure 6.6. The fundamental idea is that although this approach is more complex than the distance method, it offers a more accurate representation of the reach radius of the hands. For instance, it acknowledges the limitation that hands cannot be positioned as far behind the user as they can be in front of the user. If there are several hands in this area at the same time, the 'Distance' method can additionally be used for better differentiation (as suggested in section 6.2.3).

6.2.2 Hand Assignment with Machine Learning Agents

In this approach, we used Unity3D's machine learning agents [45][12] for hand assignment using reinforcement learning. Unity's Machine Learning Agents (ML-Agents) is a toolkit that allows developers to integrate machine learning algorithms with Unity's game development platform. It enables the creation of intelligent agents within Unity environments, allowing them to learn and adapt to tasks through machine learning techniques. The technology and algorithms behind Unity ML-Agents involve reinforcement learning, a type of machine learning in which an agent learns to make decisions by interacting with an environment [56]¹. The implementations of these algorithms are based on PyTorch, an open-source Python library focused on machine learning².
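Before turning to the training details, a minimal sketch of the logistic confidence mapping of Equation 6.1 under the stated parameters for the distance and rotation variants; the helper names are illustrative and not taken from the thesis implementation.

```python
import math

def logistic_confidence(x: float, a: float, b: float) -> float:
    """Equation 6.1: maps an input value x to a confidence in [0, 1]."""
    return 1.0 - 1.0 / (1.0 + math.exp(-a * (x - b)))

# Distance method: head-to-hand distance in meters, range [0.4 m; 1.3 m].
def distance_confidence(distance_m: float) -> float:
    return logistic_confidence(distance_m, a=10.22, b=0.85)

# Rotation method: angle in degrees between the hand's forward direction
# and the head-to-hand direction, range [90 deg; 180 deg].
def rotation_confidence(angle_deg: float) -> float:
    return logistic_confidence(angle_deg, a=0.1, b=135.0)

# A nearby hand yields a high confidence, a distant one a low confidence.
print(distance_confidence(0.5))   # ~0.97
print(distance_confidence(1.2))   # ~0.03
```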
During training, an agent received input data, including hand and user locations, and received rewards for correct assignments. Hand and head movements were recorded in advance and played back during the learning process to obtain the most realistic learning data possible. To enrich the learning experience, we introduced random user placements within a radius of 2 meters, with locations changing every frame. Additionally, we enforced specific scenarios with overlapping hands when users were close to train the 1Unity ML Agents: https://github.com/Unity-Technologies/ml-agents(Accessed: 2024-01-25) 2PyTorch: https://pytorch.org/(Accessed:2024-01-25) 92 6.2. Estimation Methods Figure 6.6: Convex Hulls generated from user’s VR controller inputs. The red points correspond to recordings from the right hand, while the blue points represent recordings from the left hand. The resulting convex hulls are depicted in green. agent for challenges. For varying user counts, we created dedicated agents (e.g., one for two users, one for three users) to handle increased decision complexity. Training involved 1.5 million steps, and reward progression can be seen in Figure 6.7, showing smoothed reward curves. In our tests, we trained up to five simultaneous users (which was sufficient for our comparative analysis), but we could easily extend to more simultaneous users. This training data served as training input for Unity’s ML system, enabling the creation of an agent capable of runtime hand assignment using the same type of input data (positions, rotations). In the assignment process, each visible hand in the scene is assigned to a specific user using its own agent, depending on the number of concurrent users. Agents were created for 2-5 concurrent users, but more concurrent user agents can be easily added. A decision is made for hand assignment in each rendering frame. We anticipate that this method can leverage multiple inputs during the learning process to autonomously make real-time decisions. Consequently, we expect it to offer broader coverage of various colocated situations. 93 6. Assignment of Tracked Hands in Colocated VR Figure 6.7: Reward development in Unity3D for ML training. The blue line corresponds to training with 2 users, the grey line represents 3 users, the pink line denotes 4 users, and the yellow line represents 5 users. The lines are smoothed for clarity, while the transparent lines in the background display the unsmoothed results. The variance in unsmoothed rewards may arise from randomized input data. 6.2.3 Dynamic Method Selection This method combines the other methods to dynamically determine which method to use in the current situation. The idea is that if all users to be tested are far apart, the distance methods can give a very reliable indication of the hand’s affiliation. However, if at least two users are spatially very close to each other, the assignment is no longer unambiguous, and therefore a more reliable method is used when hands overlap. Figure 6.8 illustrates such a possible dynamic selection. Figure 6.8: An example of a dynamic selection of hand assignment methods for a hand. The two left users and the hand are in close proximity. The complex ’Machine Learning’ algorithm is used here. The user on the right is further away, which is why the simpler ’Distance’ algorithm is used. 
The idea is that computationally intensive methods (such as 'Machine Learning') are only used in situations where simpler methods are no longer sufficient, thus saving computation time compared to using the complex algorithm alone. The effectiveness of these savings is examined in section 6.4.3. In our test implementations, we use the combination of 'Prerecorded Area' when far away and 'Distance' when near. We use the 'Distance' method in close proximity, since 'Prerecorded Area' alone cannot distinguish between multiple hands in the same area, and we can still expect a comparably low computation time. Additionally, we use the combination of 'Distance' when far away and 'Machine Learning' when near, as we assume that the 'Machine Learning' method provides better results than the 'Distance' method in close proximity, but also has a higher computation time. Therefore, the 'Distance' method is used for the other proximities. This should improve the overall computation time compared to using 'Machine Learning' alone. As the threshold distance, we preliminarily tested assignment accuracy and determined a near distance of 0.8 m to be effective.

6.2.4 Assignment History

In a naive mapping of a hand to a user, the probability that a hand belongs to a particular user is computed independently for each hand and user in isolation. However, the previous assignment of a hand may have a significant impact on future assignments. To address this issue, we investigate whether incorporating historical information can improve the assignment of hands. Our reasoning was that if hands were assigned correctly with high confidence in the past, for example when users were further apart, we can compensate for problems in close scenarios when hands overlap. The idea is that if a hand has been confidently assigned to a user before, then it is likely to belong to that user later. This approach is similar to the basic idea of dead reckoning, where, for example in multiplayer games, previous player network states are used to predict the next states to counteract network latency [77]. In our implementation, we cache the calculated confidences of each hand for each user for each frame. We limit the number of entries to 500 for memory and runtime efficiency, with new entries replacing the oldest ones. From all cached entries, we compute the mean confidence value (ranging from 0 to 1) and use it to calculate the total confidence. However, we only incorporate the history information if the value of the history exceeds a certain threshold. Pilot tests indicate that a confidence threshold of p >= 0.9 (90%) is appropriate. Furthermore, we compute a weighting for the history so that high prior confidences are weighted more heavily in the overall computation, assuming that a frequently occurring high confidence in the past probably indicates a correct assignment. This should counteract low new confidences up to the point where they also occur frequently. As a result, the history confidence can be multiplied by a weighting factor, depending on the previous confidences. Another approach would be to weight newer confidences higher due to their recency, which was not used in this work and would be an interesting comparison for future work. To ensure that the raw input confidence is at most doubled (which we chose to avoid over-weighting the past confidences) and to ensure that the
history algorithm returns to lower confidences after several consecutive frames with low raw input confidences, a weighting factor of 2 was selected. This behavior is specifically intended for such a scenario. The confidence p_h of the history is calculated using the following formula:

p_h = (Σ_{n=0}^{N} p_n / N) · ((Σ_{n=0}^{N} p_n · w_f) / N_max) = w_f · (Σ_{n=0}^{N} p_n)² / (N · N_max)    (6.2)

where:
N = number of entries in the history
N_max = maximum number of entries in the history, set to 500 in our implementation
w_f = maximum weight, set to 2 in our implementation

Therefore, a reliable previous assignment of a hand can also be critical for future calculations if an assignment is no longer straightforward. For instance, in dynamic scenarios where users start far apart but later converge to interact with the same virtual object, relying solely on distance for hand assignment becomes less effective. In such situations, the use of historical data can help to make more accurate hand assignments. This underscores the value of incorporating historical information in dynamic contexts.

6.3 Experiment

To assess the effectiveness of the various hand assignment estimation methods and examine the impact of the history algorithm, we developed an evaluation framework that simulates diverse test scenarios within a colocated VR environment. Each scenario comprises a specific user formation and whether users move or remain stationary. The objective of this evaluation is to provide comparable scenarios for different combinations of assignment methods and to conduct a quantitative analysis of the accuracy and performance of the different methods. Since we want to use hand assignment in a colocated VR environment (which we created in chapter 4) in which a hand recognition system can recognize more than two hands (as established in chapter 5), it is necessary to use it in a real-time environment. Therefore, in this experiment, we also want to find out which methods are best suited for a real-time application regarding performance and accuracy.

6.3.1 Experimental Design and Evaluation

We created a simulated environment that models a colocated VR application, with 2-5 simulated users as needed. For each user, hand movements were recorded by us beforehand over a period of 1500 consecutive frames, representing an activity such as assembly work (using a screwdriver) or typing on a keyboard. These recordings can then be played back for different test scenarios, making the scenarios similar in their hand movements and thus comparable. Each simulated user had different recordings to cover different tracking scenarios. To maintain repeatability and comparability, it was ensured that the recordings in the evaluation are always assigned to the same virtual user and that the virtual hands thus also have the same recorded relative distances and rotations to this user. Each test scenario is defined by the following factors:

• Formation: A predefined formation of how the virtual users are positioned with respect to each other in the virtual space and at different distances.
• Method: An assignment method, as described in section 6.2.
• History: Whether the history algorithm is used, as described in section 6.2.4.

Specifically, we use the methods described in section 6.2 in isolation (Distance (D), Rotation (R), Prerecorded Area (PA), and Machine Learning (ML)). In addition, we use a combination of Prerecorded Area and Distance (D-PA-C) to see whether this combination improves the individual results.
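As a brief aside, a minimal sketch of the history mechanism from section 6.2.4 (Equation 6.2) under the stated parameters (cache of 500 entries, weighting factor 2, threshold of 0.9); the class and variable names are illustrative, and the final blending of the raw and history confidences is only an assumed combination step, not the exact one used in the thesis.

```python
from collections import deque

class AssignmentHistory:
    """Caches per-frame confidences of one hand/user pair (cf. Equation 6.2)."""

    def __init__(self, max_entries: int = 500, max_weight: float = 2.0,
                 threshold: float = 0.9):
        self.cache = deque(maxlen=max_entries)  # oldest entries drop out first
        self.max_entries = max_entries          # N_max
        self.max_weight = max_weight            # w_f
        self.threshold = threshold              # history only used above this

    def add(self, confidence: float) -> None:
        self.cache.append(confidence)

    def mean_confidence(self) -> float:
        return sum(self.cache) / len(self.cache) if self.cache else 0.0

    def history_confidence(self) -> float:
        """p_h = w_f * (sum p_n)^2 / (N * N_max), at most w_f for a full cache of 1s."""
        n = len(self.cache)
        if n == 0:
            return 0.0
        total = sum(self.cache)
        return self.max_weight * total * total / (n * self.max_entries)

    def combine(self, raw_confidence: float) -> float:
        # Only incorporate the history once its mean confidence is reliable.
        if self.mean_confidence() < self.threshold:
            return raw_confidence
        # Assumed blending: scale the raw confidence by p_h (doubling it at most).
        return min(raw_confidence * self.history_confidence(), 1.0)
```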
The selection of the methods was made in a preliminary evaluation run. Two constellations were used for the dynamic method: Area Prerecorded as a base method with Distance as an additional method when users are close to each other (PA-D-DYN); and Distance as a base method and Machine Learning when users are close to each other (D-ML-DYN). The idea is to use a simple algorithm for straightforward scenarios (i.e. hands are further away from the user) and switch to a more complex, potentially slower one when necessary, such as in close proximity when hands are overlapping. This approach aims to improve the average performance while maintaining good results. Therefore, a total of seven method constellations were evaluated. We use the abbreviations mentioned above for the methods and their combinations in the analysis. ’C’ stands for combination of methods and ’DYN’ for a dynamic selection between two methods. In the dynamic selection, the method in the first place is the one that is used at further distances, and the second is the one for close proximity. In the simulation, we assume that all virtual hands are detected by a hand detection system, but remain unassigned to users. With ground truth information about hand assignment (which we have through the recordings and the controlled simulation setup), we can compare assignment results. We calculate an accuracy ratio, representing the percentage of accurate assignments per frame during the evaluation. A ratio of 1 signifies perfect assignment accuracy, serving as a key metric for method comparison in our study. In addition to assessing assignment accuracy, we evaluate the performance (in ms) of each method to determine its applicability in a real-time scenario and the computational time required. To find the overall performance per frame, we sum up the measured times for all frames and divide by the total frames collected. Anticipating decreased performance with more users, we calculate an adjusted performance per user by dividing the frame’s runtime by the number of concurrent users, as elaborated in section 6.4.3. 97 6. Assignment of Tracked Hands in Colocated VR The evaluation covers a total of 73 formations, comprising 14 formations with 2 users, 16 with 3 users, 20 with 4 users, and 23 with 5 users. We aimed to represent diverse formations where users stood both farther apart and close together, resulting in overlapping hands. Some of those formations can be seen in Figure 6.9. Variation in the number of formations is attributed to the increasing number of concurrent users, which in turn leads to a greater diversity of potential arrangements. The following example can illustrate this again: If we have 2 simultaneous users standing facing each other they can either stand close to each other or far away from each other, resulting in two formations. If we now take four simultaneous users, they can face each other (in a square) too. However, they can all stand close together, only some of them (two stand close together and the other two further apart), or all stand far apart. This creates more than two possible formations. Additionally, we selected formations in which users remained fixed in one place and dynamic formations where users moved along a fixed path, representing stationary and moving users in colocated scenarios. Each of the seven methods mentioned above was applied to each formation for the duration of 1500 frames to determine the percentage of correctly assigned hands. 
This evaluation was carried out with and without applying the history algorithm, resulting in 14 conditions (7 method constellations with and without applying the history algorithm). In total, we obtained 1022 results for comparative analysis (for performance and accuracy). Figure 6.9: Some exemplary formations of the evaluation with two to five simultaneous users. Formations include near and far proximities to cover diverse and difficult assignment scenarios. 98 6.4. Results and Discussion 6.3.2 Setup The evaluation was performed on a Windows notebook equipped with an Intel Core i7-12700H CPU and an NVIDIA GeForce RTX 3070 Ti Laptop GPU. We utilized Unity3D 2021.3.9 as the software platform, coupled with the ’EasyHand’ framework for Unity that we presented in chapter 3. This framework served as a layer for hand tracking, and was responsible for recording tracked hands in a preliminary setup, which was done by us, playing back these recordings, and rendering the hands during the evaluation. By conducting these evaluations, we gain insight into the accuracy of hand assignment and the computational efficiency of the methods, ensuring their feasibility in real-time applications. 6.4 Results and Discussion We conducted an analysis of correct assignment ratio and computational efficiency to determine the optimal method for a real colocated VR scenario. Firstly, we evaluated the effectiveness of our history algorithm in improving hand assignment results. In addition, we identified the method or combination of methods that yielded the best results. To evaluate the algorithm and methods, we measured the accuracy of hand assignment for each frame, allowing us to calculate an accuracy ratio for each combination of method, formation, and utilization of the history algorithm. This resulted in a total of 1022 values, each representing the percentage of correctly assigned hands throughout the evaluation. The second metric we examined was the performance of the algorithm. As our objective is to use this estimation method in real-time colocated VR applications, performance is a critical factor in determining the method’s feasibility. The performance was measured in milliseconds per frame for each run. Based on our implementation and the complexity of the methods, we formulated the following hypotheses for our expected results: • H1: The history algorithm significantly improves hand assignments for all assign- ment methods. • H2: The machine learning algorithm (ML) is significant more accurate than the distance (D), rotation (R) or Prerecorded Area (PA) algorithm. • H3: The machine learning algorithm (ML) alone is significant slower than all other individual methods (D, R and PA). • H4: A dynamic combination of the machine learning algorithm (ML) with the distance algorithm (D) can yield results with an accuracy comparable to that of machine learning alone, while also achieving a better performance. By analyzing these metrics and testing our hypotheses, we aim to provide valuable insights into the performance and suitability of different methods for real-time colocated VR 99 6. Assignment of Tracked Hands in Colocated VR scenarios. To select the appropriate test for the comparative analysis in the subsequent sections, we performed a Shapiro-Wilk normality test [102] on each analysis. The test results suggested that the examined data did not adhere to a normal distribution (p < 0.001), leading to the use of the analysis methods mentioned in the following sections. 
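A minimal sketch of this test-selection step and the subsequent comparison used in section 6.4.1, with SciPy; the arrays are illustrative stand-ins for the recorded accuracy values, not our data.

```python
import numpy as np
from scipy.stats import shapiro, mannwhitneyu

# Illustrative accuracy ratios per evaluation run (values in [0, 1]).
acc_without_history = np.random.default_rng(0).uniform(0.3, 0.9, size=511)
acc_with_history    = np.random.default_rng(1).uniform(0.5, 1.0, size=511)

# Shapiro-Wilk: if the data deviates from normality, fall back to a
# non-parametric comparison.
_, p_normal = shapiro(acc_without_history)
if p_normal < 0.05:
    u_stat, p_value = mannwhitneyu(acc_with_history, acc_without_history,
                                   alternative="two-sided")
    # Rank-biserial correlation as an effect size for the U test.
    n1, n2 = len(acc_with_history), len(acc_without_history)
    r_rank_biserial = 1.0 - (2.0 * u_stat) / (n1 * n2)
    print(f"Mann-Whitney U: p = {p_value:.4g}, r = {r_rank_biserial:.3f}")
```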
We presume that the non-normal distribution might be attributed to the varying complexity of the different methods, leading to distinct runtimes and diverse accuracy distributions across different scenarios.

6.4.1 History Algorithm Effectiveness

To assess the impact of the history algorithm, we analyzed all available test results, resulting in a sample size of 1022 entries. Half of the entries (n = 511) were executed with the history algorithm enabled, while the other half was executed without it. The objective was to determine whether the use of the history algorithm leads to a significant improvement in hand assignment accuracy. The results are visualized in Figure 6.10.

Figure 6.10: Accumulated accuracy across all methods with and without the history algorithm. The utilization of the history algorithm leads to an overall higher level of accuracy.

Since all methods, including those with poor assignment accuracy, were included in this analysis, we do not interpret the exact assignment accuracy here, but only whether the history algorithm yields a significant improvement. The detailed assignment accuracy of the individual methods is reported in section 6.4.2. Without employing the history algorithm, an average accuracy of 0.658 (σ = 0.26) was observed across all data. When the history algorithm was included, the accuracy increased to 0.793 (σ = 0.29). We conducted an Independent-Samples Mann-Whitney U test [79] for comparative analysis. The resulting p-value (p < 0.001; U = 79317), together with a rank-biserial correlation coefficient of r = 0.333 (indicating a positive effect size), demonstrates a significant difference, with a notable improvement in hand assignment accuracy when using the history algorithm. These findings confirm H1, as the utilization of our history algorithm led to significantly better assignment results, showing an average improvement of 20% in our evaluations.

6.4.2 Assignment Accuracy

In this analysis, the methods listed in section 6.3.1 were compared based on their accuracy. Since the previous section demonstrated that the history algorithm significantly improves assignment results, we focused on runs in which the history algorithm was applied to achieve the highest possible accuracy in our evaluations. This resulted in a sample size of 511 entries (half as many as for the analysis in section 6.4.1), corresponding to N = 73 per method across the 7 applied methods. Initially, we calculated the means and standard deviations (σ) of the accuracy for all methods, which are listed in Table 6.1. From the table, it is evident that the methods that include machine learning exhibit the highest accuracy with the lowest standard deviation. In contrast, the rotation method yielded the lowest average accuracy, with approximately half of the assigned hands being incorrect on average. The corresponding bar graph can be seen in Figure 6.11.

Table 6.1: Mean and standard deviation of the accuracy of the methods

Method                                            Mean     SD (σ)
Machine Learning (ML)                             0.987    0.033
Distance (D)                                      0.729    0.31
Rotation (R)                                      0.584    0.334
Prerecorded Area (PA)                             0.738    0.265
Prerecorded Area Distance Combination (D-PA-C)    0.782    0.291
Dynamic Prerecorded Area Distance (PA-D-DYN)      0.744    0.30
Dynamic Distance Machine Learning (D-ML-DYN)      0.989    0.033

We conducted an Independent-Samples Kruskal-Wallis test [51] to assess the differences in accuracy among all methods.
The analysis yielded a significant result (p < 0.001; H = 92.9; df = 6) and a medium effect size, as measured by η-squared (η² = 0.172), indicating substantial differences in the distribution of accuracy across the methods. To further investigate these differences, we performed a post-hoc test for pairwise comparisons between the methods. We applied Bonferroni correction (with k = 21) to adjust the resulting p-values. Detailed p-values can be found in Table 6.2. The rotation method exhibited significantly worse results in terms of accuracy compared to all other methods. The machine learning method demonstrated significantly better accuracy compared to the distance and rotation methods. Although a significant accuracy difference between the dynamic methods was initially observed before Bonferroni correction, it was no longer present after adjusting for multiple comparisons.

Figure 6.11: Mean accuracy results for the methods. The two most accurate methods are highlighted.

Table 6.2: Resulting p-values of the pairwise comparison of the methods after applying the Bonferroni correction. Significant results are marked in bold.

p           ML       D        R        PA       D-PA-C   PA-D-DYN
D           .032
R           <.001    .013
PA          .066     1        .006
D-PA-C      .469     1        <.001    1
PA-D-DYN    .141     1        .002     1        1
D-ML-DYN    .424     <.001    <.001    <.001    <.001    <.001

However, it is worth noting that the machine learning method exhibited the least variance and the most consistent accuracy across all runs, as evidenced by its low standard deviation and consistent data distribution. The dynamic combination of distance and machine learning shows an even more favorable outcome, with significant differences compared to all other methods except the standalone machine learning method. It also displayed high consistency with a low standard deviation (SD = 0.033). With an accuracy of 99%, both the machine learning algorithm and the dynamic combination of machine learning and distance emerged as significantly superior algorithms for hand assignment. Therefore, these findings support hypothesis H2, that the machine learning algorithm yields the best accuracy among the individually applied algorithms. The remaining methods fell within a similar range of average accuracy, ranging from 72% to 78%, and did not produce significant differences from each other. This suggests that the choice of method in this range should be based on other factors, such as performance.

6.4.3 Performance Results

For our final decision regarding the optimal hand assignment algorithm, it is crucial to consider the performance to assess the feasibility of a method in a real-time scenario. Since we have already established in section 6.4.1 that the history algorithm significantly improves hand assignments, we also need to investigate whether it introduces a substantial decrease in performance. In addition to accuracy, performance plays a vital role in selecting the best method. To obtain the final performance for the most effective hand assignment method, we calculated and compared the performance of each method when the history algorithm was employed. As the assignment methods are applied to all active hands and simulated users in each frame, the computational load increases with more users and hands. To account for this, we used a user-adjusted performance per computation frame for the analysis, obtained by dividing the measured frame runtime by the number of users.
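A minimal sketch of this user-adjusted runtime measurement follows, assuming a hypothetical method.assign call; it is not the actual 'EasyHand' API, only an illustration of the normalization described above.

    import time

    def user_adjusted_runtime_ms(method, hands, users):
        """Time a single assignment call and normalize the measured frame
        runtime by the number of concurrent users, as used in section 6.4.3.
        `method.assign` is a hypothetical placeholder, not the real API."""
        start = time.perf_counter()
        method.assign(hands, users)                 # hypothetical assignment call
        frame_ms = (time.perf_counter() - start) * 1000.0
        return frame_ms / max(len(users), 1)        # user-adjusted performance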
By analyzing the performance results, we can gain insight into the computational cost of each method and identify any significant differences. These findings aid in selecting the most suitable algorithm for hand assignment, taking into account both accuracy and performance.

History performance. To assess the impact of the history algorithm on performance, we evaluated the performance across all available runs, resulting in a sample size of 1022, evenly split between runs with and without the history algorithm. When using the history algorithm, we observed an average user-adjusted performance of 0.465 ms (σ = 0.22). In contrast, the average user-adjusted performance without the history algorithm was 0.147 ms (σ = 0.24). It is important to note that these mean values do not solely represent the cost of the history algorithm itself, as they incorporate the values of the different assignment methods. However, they indicate that the history algorithm generally results in a poorer performance. A Shapiro-Wilk normality test confirmed that the data did not follow a normal distribution (p < 0.001). Therefore, we conducted an Independent-Samples Mann-Whitney U test and calculated the rank-biserial correlation coefficient to determine the effect size. The test yielded a significant result (p < 0.001; U = 40603), and the coefficient (r = 0.689) indicated a positive effect size, demonstrating a significant difference between using and not using the history algorithm.

Method performance. Considering the significant accuracy improvement achieved with the history algorithm, we proceeded to analyze the performance of each method when combined with it. The mean performance values per number of users, along with the user-corrected value, are presented in Table 6.3 and visually represented in Figure 6.12.

Table 6.3: Mean performance of the methods (in milliseconds, ± σ) by number of users, including user-corrected times

Method        2 Users       3 Users       4 Users        5 Users       User-corrected
ML            1.63 ± .75    2.37 ± .09    3.58 ± .14     4.09 ± .27    .83 ± .17
D             .55 ± .03     1.01 ± .04    1.49 ± .08     1.88 ± .15    .35 ± .04
R             .55 ± .01     1.03 ± .03    1.60 ± .12     2.03 ± .18    .37 ± .06
PA            .54 ± .02     1.01 ± .03    1.47 ± .07     1.88 ± .16    .35 ± .05
D-PA-C        .57 ± .05     1.02 ± .04    1.52 ± .10     1.95 ± .16    .36 ± .05
PA-D-DYN      .55 ± .03     1.03 ± .03    1.54 ± .07     1.97 ± .18    .36 ± .05
D-ML-DYN      1.08 ± .48    1.88 ± .66    2.81 ± 1.02    3.51 ± .94    .65 ± .23

Figure 6.12: Mean performance results for the methods at various user counts. One bar illustrates the mean performance adjusted for the number of users.

Again, we performed a Shapiro-Wilk test to assess the normality of the data. The test yielded a p-value of less than 0.001 for all methods, indicating that the data are not normally distributed. Consequently, we once again conducted an Independent-Samples Kruskal-Wallis test. The analysis revealed significant differences among the user-corrected performances of the methods, with a p-value of less than 0.001 (H = 249.74; df = 6) and a positive effect size (calculated using η-squared) of η² = 0.484. To determine the specific differences, post-hoc pairwise comparisons were performed with Bonferroni correction (with k = 21). The results indicate significant differences between the Machine Learning (ML) method and all other methods, as well as between the dynamic method with machine learning and distance (D-ML-DYN) and all other methods, each with a p-value of less than 0.001.
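The post-hoc procedure is not spelled out beyond the Bonferroni correction with k = 21; the sketch below assumes pairwise Mann-Whitney U tests as one common choice and is intended only to illustrate how adjusted p-values like those in Table 6.2 and in the performance comparison could be obtained.

    from itertools import combinations
    from scipy.stats import mannwhitneyu

    def bonferroni_pairwise(samples, alpha=0.05):
        """Pairwise comparison of per-method distributions with Bonferroni
        correction. `samples` maps a method label (e.g. 'ML', 'D') to its
        list of per-formation values (accuracy ratios or runtimes)."""
        pairs = list(combinations(samples, 2))      # 7 methods -> k = 21 pairs
        k = len(pairs)
        results = {}
        for a, b in pairs:
            _, p = mannwhitneyu(samples[a], samples[b], alternative="two-sided")
            p_adj = min(p * k, 1.0)                 # Bonferroni-adjusted p-value
            results[(a, b)] = (p_adj, p_adj < alpha)
        return results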
Since both of these methods involve Unity3D's Machine Learning Agents, it was expected that they would perform worse, as reinforcement learning agents require more CPU time. These findings confirm our hypothesis H3. However, it should be noted that there was also a significant difference between the two methods that use machine learning (p = 0.047). This indicates that the dynamic combination, with a mean performance of 0.65 ms, is significantly faster than the machine learning method alone, whose average performance of 0.83 ms is 27.7% higher. When considering the results of section 6.4.2, hypothesis H4 can also be confirmed: the dynamic combination of the machine learning and distance algorithms exhibits a significantly better performance while achieving the best accuracy among all the applied methods. No significant differences were observed among the remaining methods that do not utilize machine learning agents, suggesting similar performance for these methods. With an average performance of approximately 0.35 ms, these methods are considerably faster: the dynamic method with machine learning takes about 85.7% longer per frame, and the isolated machine learning method about 137.1% longer. In the next section, a more comprehensive discussion of the applicability of the methods, based on these results and the previous findings, is presented.

6.4.4 Usability in Realtime Applications

The results demonstrate significant differences in the effectiveness of the different algorithms for assigning recognized hands to users in colocated VR scenarios. As expected, the rotation method performs the worst, with approximately half of the virtual hands being incorrectly assigned. This outcome is plausible since hands typically have a characteristic orientation relative to their owner but can still be freely rotated. The assignment of hands becomes challenging when two users have similar orientations in the scene, resulting in ambiguous assignments. Although a combination of the rotation method with other methods could potentially yield improvements, preliminary unstructured tests did not reveal any visible enhancement over other methods. Furthermore, since the rotation method does not offer significantly better performance, it is not suitable for reliable hand assignment.

The distance-based methods, whether utilizing the plain distance or a prerecorded area, deliver similar results across all examined aspects. With approximately three out of four hands correctly assigned over all group sizes, their accuracy is better than that of the rotation method, but still insufficient to ensure a reliable assignment. Contrary to our assumptions, the precise definition of the hand's area of action (prerecorded area) does not yield significantly better results than the pure distance method, which essentially represents a radius around the user. Additionally, the combination of these two methods did not confirm the notion that they could mutually support each other. Given similar accuracy and performance, the distance algorithm is preferable to the prerecorded area algorithm in terms of simplicity. Figure 6.13 again makes evident why the distance methods do not exhibit high overall accuracy: when users are far apart, hands can be assigned with high confidence; however, if users are standing close to each other and their hands overlap, assigning hands based solely on distance becomes significantly more challenging.
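To make this limitation concrete, the following sketch shows a naive nearest-user assignment by distance, together with a proximity check of the kind that a dynamic constellation could use to fall back to the machine learning agent. All names, the tuple-based 3D positions, and the 0.5 m threshold are illustrative assumptions, not the actual 'EasyHand' implementation.

    import math
    from itertools import combinations

    def dist(a, b):
        """Euclidean distance between two 3D points given as (x, y, z) tuples."""
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def assign_by_distance(hand_positions, user_positions, close_threshold=0.5):
        """Assign every detected hand to its nearest user.

        `hand_positions` and `user_positions` map ids to (x, y, z) positions.
        Returns (assignment, ambiguous): `assignment` maps hand id to user id;
        `ambiguous` becomes True when any two users are closer than
        `close_threshold` meters -- the situation in which a dynamic
        constellation would switch to the machine learning agent instead.
        """
        users = list(user_positions.items())
        ambiguous = any(dist(p1, p2) < close_threshold
                        for (_, p1), (_, p2) in combinations(users, 2))
        assignment = {hand_id: min(users, key=lambda u: dist(hand_pos, u[1]))[0]
                      for hand_id, hand_pos in hand_positions.items()}
        return assignment, ambiguous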
Consequently, distance methods are more suitable for scenarios where users in colocated spaces are not in close proximity to each other.

Figure 6.13: Effect of user proximity on the accuracy of the distance method. Reduced accuracy and increased variance can be seen when users are near each other. Figure 6.9 illustrates formations with different proximities.

When examining the methods involving machine learning, the results demonstrate a significantly improved assignment accuracy. Prior reinforcement learning of an AI agent was crucial in effectively determining the assignment factors. This training effort is required only once and is no longer necessary in subsequent applications, which highlights the method's ability to produce good results. With an accuracy of approximately 99%, the machine learning methods exhibit the highest accuracy in hand assignment, regardless of the proximity of users to each other. This distinguishes them clearly from the distance methods and achieves an even higher accuracy than existing algorithms [63]. Nonetheless, as expected, the machine learning methods also have the lowest performance due to the computational complexity involved, with a user-corrected runtime ranging from 0.65 to 0.83 ms. The runtime increases linearly with the number of users and hands present in the scene. Their applicability is thus contingent on the number of users as well as the complexity and rendering demands of the VR scene. However, the performance remains sufficiently high to allow usage in real-time applications.

The dynamic combination of the machine learning and distance algorithms proves to be the optimal approach among the presented methods. The results indicate that the weaknesses of the distance method in assigning hands to users standing close to each other (see Figure 6.13) can be effectively compensated for by the machine learning agent. As a result, the runtime can be reduced by roughly 21% compared to machine learning alone while maintaining a high accuracy. This combination is particularly suitable for complex applications with high computational demands. When users are consistently in close proximity to each other, the worst-case performance is comparable to that of using machine learning alone. On the basis of these findings, this combination is recommended for real-time applications.

Finally, the results highlight the significant improvement in the accuracy of hand assignments achieved by the history algorithm we developed. As expected, applying the algorithm also leads to a poorer performance, which, however, remains at a reasonable level. The 20% improvement in assignment accuracy, coupled with an acceptable decrease in performance, establishes the algorithm as a substantial enhancement to the methods. Consequently, we incorporate it into our methods for optimal results.

6.5 Conclusion and Future Outlook

This research aimed to present and compare algorithms for assigning tracked hands to virtual users in a colocated virtual environment, using only limited information from the tracking system. Unlike other solutions that rely on image detection information [113][59][63][132], we focused on utilizing the location of the hand in the virtual scene. We introduced and evaluated various methods for determining hand assignments, including a history algorithm that enhances assignment robustness by leveraging past assignment information. Our evaluation revealed that assignment methods employing prelearned AI agents achieved the highest accuracy with the lowest variance.
Specifically, the dynamic combination of an AI agent for close user proximity and overlapping hands with a simple distance calculation for greater distances improved the performance while maintaining a high accuracy of 99%. This dynamic combination effectively addressed the challenge of assigning overlapping hands. Although other methods exhibited better general performance, their accuracy fell short of being suitable for real colocated VR scenarios, where every fourth hand would then be assigned incorrectly. In scenarios involving hand tracking systems that track multiple hands, such as those utilizing camera tracking meshes, our solution provides developers with a high-accuracy, real-time hand assignment tool.

The computational load in a colocated VR environment increases with each additional user, which leads us to anticipate that this approach will eventually be constrained by the computational power of the system, ultimately limiting the maximum number of concurrent users before the application becomes impractical. Future work should investigate the limitations concerning the maximum number of users and hands that can be simultaneously present in the virtual scene. Furthermore, it would be worthwhile to explore potential enhancements to the algorithm's performance. For instance, the performance of hand assignment could be improved by reducing the algorithm's execution frequency rather than running it in every frame; a proximity observer could rerun the algorithm only when the proximity between two users changes, which could greatly improve performance. Another avenue for improving assignment accuracy is to integrate image detection cues such as hand color, shape, and classification (as done in other work in the two-dimensional domain [80]), to increase both accuracy and performance in hand assignment. Such explorations have the potential to significantly advance the effectiveness and efficiency of hand assignment techniques.

CHAPTER 7 Conclusion

The focus of this dissertation and the research presented herein centered on the utilization of hand tracking technology, particularly in scenarios involving multiple colocated users. We began by introducing 'EasyHand', our framework for the Unity3D engine, which seamlessly integrates rendering, interaction, and gestures from various hand tracking technologies¹. This framework served as the foundation for the subsequent research and was designed with ease of extensibility in mind. Several problems were presented for which we have introduced solutions, including the creation of colocated virtual reality environments, the simultaneous tracking of multiple hands within such environments to facilitate the sharing of tracked hand data, and the consistent assignment of tracked virtual hands to their respective users, so that interactions are attributed to the right user. These solutions are now integrated into the current version of the 'EasyHand' framework or can be added as a plugin (see section 3.5). The overarching objective is to empower users in colocated multi-user settings to engage in virtual interactions by sharing tracking data, overcoming tracking losses of hands in one's own hand recognition system.
In the following sections, we provide a final summary of the research objectives and outcomes, followed by an outlook on potential future extensions and unresolved research questions.

7.1 Summary

With this dissertation, we have approached several challenges that occur in hand tracking and colocated VR scenarios, as outlined in chapter 1. In the following, we refer to the contributions mentioned in section 1.2 and describe how they were fulfilled.

¹ EasyHand Repository: https://bitbucket.org/Densen90/easyhand

Hand tracking unification. In chapter 3, we introduced the 'EasyHand' framework, which serves as a unification tool for visualization, interaction, and gestures across multiple hand tracking APIs. This framework enables developers to easily extend support for any hand tracking technology, allowing them to deploy software across various VR systems while maintaining consistent behavior. Additionally, 'EasyHand' includes capabilities for networked scenarios, supporting both non-shared and shared virtual environments. This fulfills contribution I. By integrating existing standards such as OpenXR hand tracking [32], 'EasyHand' expands its functionality and offers a versatile solution for hand tracking in VR applications.

Creating colocated scenarios for SLAM-tracked headsets using only hands. In chapter 4, we introduced a novel method for creating colocated virtual environments for SLAM-tracked headsets, using only tracked hands as synchronization anchors, hence affirming contribution II. This approach surpassed existing methods, which often rely on additional AR markers or manual positioning of the headsets, in synchronization accuracy. Importantly, our method is headset-agnostic, as it relies solely on hand tracking and is not limited to a specific headset model. Unlike solutions such as the Vive Focus 3, which require sharing internal environment mapping data between headsets to create colocated environments², our approach offers greater flexibility and interoperability.

Tracking more than two hands in three-dimensional environments. In chapter 5, we addressed the limitation of current state-of-the-art hand tracking systems, which typically track only two hands simultaneously. Our solution uses 2.5-dimensional tracking data from the MediaPipe hand tracking system in three-dimensional space, confirming contribution III. This approach enables VR interactions with a much larger tracking range than existing systems. Through an extensive evaluation, we compared our method with existing state-of-the-art hand tracking systems, demonstrating both comparability and improvements. Importantly, our method not only enables tracking of the user's own hands but also of other users' hands within the tracking frustum of the camera. This capability allows for the sharing of tracking data among users, enhancing consistency in interaction and counteracting tracking losses within colocated scenarios. Therefore, we also fulfill contribution IV.

Assigning virtual hands to users. In chapter 6, we addressed the challenge of hand assignment that arises with the ability to track more than two hands simultaneously. We proposed solutions to this problem by introducing algorithms designed to assign virtual hands to the correct user. Through our evaluation, we demonstrated that we can achieve this assignment with 99% accuracy in real-time scenarios, without the need for additional camera recognition or hardware.
By solely utilizing positional data of hands and users, we enable correct and consistent interactions tailored to the respective user within colocated virtual environments. This finally fulfills contribution V.

² VIVE Environment Mapping: https://business.vive.com/us/solutions/vive-location-based-software-suite/ (Accessed: 2024-02-08)

In summary, with 'EasyHand' we have introduced a comprehensive system capable of creating colocated VR scenarios, tracking both the user's own hands and those of other users within shared environments, and accurately assigning these hands to the correct virtual user. This framework enables the sharing of hand tracking data among users, facilitating tracking and interaction with the virtual environment even when hands are obscured, as long as at least one system can detect the hand. Importantly, all these functionalities are achieved solely through hand tracking, without the need for additional external sensors or hardware. This ensures not only affordability but also mobility and flexibility (as no room preparation is required), as well as ease of use for both developers and end-users alike. With these results, we have achieved our overarching goal of providing ways for hand recognition systems to support each other in colocated VR scenarios. Thus, we see this work as an important improvement in the field of hand recognition and interaction in colocated multi-user VR scenarios.

7.2 Open questions

The research presented in this dissertation demonstrates the utility of hand-tracking technology in creating colocated multi-user environments and in improving the consistency of interactions within such environments through multi-user hand tracking. Additionally, the dissertation addresses the challenge of assigning virtual hands to users in such scenarios. Despite these advancements, several open questions remain and offer opportunities for future research.

Usability and limitations in real user scenarios. Usability and limitations in real user scenarios are crucial aspects that warrant further investigation. While many evaluations and experiments in this work utilized simulated data or involved a limited number of testers³, future research could benefit greatly from extensive user tests. We believe that our results transfer to real user scenarios, as our simulations and tests were designed to be as close as possible to such scenarios. Nonetheless, user tests would provide valuable additional insights into the effectiveness and precision of sharing tracked hand data among users. Moreover, while using our methods to compensate for hand tracking loss in real colocated scenarios is theoretically possible, a practical application in those cases is still pending. In addition, effects such as the camera moving instead of the hands (as in the evaluation in chapter 5) could be measured and analyzed. Lastly, user tests could help identify limitations of the proposed methods and systems, such as the maximum number of users supported under runtime constraints, the scalability of the methods for colocating multiple users, and the impact of sharing hand data on computational load. Addressing these aspects would contribute significantly to the practical applicability and refinement of the proposed solutions.

³ Due to the COVID-19 pandemic and limited availability of potential participants.

Improving multi-user hand tracking.
In our method, we utilized the MediaPipe framework to detect more than two hands simultaneously. While effective, this approach relies solely on RGB cameras and may lack some pose accuracy compared to systems with additional sensors, such as the LeapMotion sensor. We decided to use RGB cameras because of their lower price and better availability. Nonetheless, it would be beneficial to explore hand tracking systems equipped with improved sensor technology for higher accuracy and to make them capable of tracking more than two hands. Such advancements could enhance the positioning of virtual hands, thereby improving hand ownership and facilitating the desired sharing of hand data among users. Integrating such technology into state-of-the-art HMDs like the Meta Quest or VIVE Focus holds promise for delivering a better multi-user hand tracking experience for end-users. Another interesting approach for the future would be to use multiple hand tracking systems per user: each user would have a 'primary' hand tracking system to recognize their own hands and a secondary one to assist other users if their tracking is lost. In this way, the tracking workload could be divided between several tracking systems.

Continuous colocation with multi-hand tracking. The presented colocation method using hand tracking was a one-time process, which makes the accuracy of subsequent tracking dependent on the drift of the HMD. To address this issue, a method could be developed that continuously re-synchronizes the virtual worlds of individual users in multi-user hand tracking scenarios whenever and as long as the hands are visible to the users. This approach could help mitigate potential drift. Additionally, sharing internal tracking data of the environment across systems could enhance colocation for multiple different systems. However, this would require approval from the system manufacturers and agreement on a data standard, making it unlikely to happen in the near future.

In summary, this dissertation has provided initial insights into calibrating multi-user environments, compensating for tracking loss, and assigning hands to users in colocated VR scenarios. While the focus was primarily on virtual reality, the methods presented are not limited to VR alone. Collaborative AR applications, such as multi-user surgery simulations or tabletop interactions where multiple users can interact simultaneously, stand to benefit from these findings as well. It is our hope that this work will have a significant impact on researchers and inspire further development and continuous improvement of hand recognition and multi-user interactivity in colocated scenarios. The natural input of hands makes this area particularly promising for future advancements.

Bibliography

[1] Diar Abdlkarim, Massimiliano Di Luca, Poppy Aves, Sang-Hoon Yeo, R. Chris Miall, Peter Holland, and Joseph M. Galea. A methodological framework to assess the accuracy of virtual reality hand-tracking systems: A case study with the oculus quest 2. bioRxiv, 2022.

[2] Christoph Anthes, Rubén Jesús García-Hernández, Markus Wiedemann, and Dieter Kranzlmuller. State of the art of virtual reality technology. 2016 IEEE Aerospace Conference, pages 1–19, 2016.

[3] Daniel Bachmann, Frank Weichert, and Gerhard Rinkenauer. Evaluation of the leap motion controller as a new contact-free pointing device. Sensors, 15:214–233, 12 2014.

[4] Peter Bauer, Werner Lienhart, and Samuel Jost. Accuracy investigation of the pose determination of a vr system. Sensors, 21(5), 2021.
[5] Steve Benford, Chris Brown, Gail Reynard, and Chris Greenhalgh. Shared spaces: Transportation, artificiality, and spatiality. In Proceedings of the 1996 ACM Conference on Computer Supported Cooperative Work, CSCW ’96, page 77–86, New York, NY, USA, 1996. Association for Computing Machinery. [6] Stina Bengtsson and Sofia Johansson. The meanings of social media use in everyday life: Filling empty slots, everyday transformations, and mood management. Social Media + Society, 8(4):20563051221130292, 2022. [7] M. Borges, A. Symington, B. Coltin, T. Smith, and R. Ventura. Htc vive: Analysis and accuracy improvement. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2610–2615, 2018. [8] Nicholas Brunetto, Nicola Fioraio, and Luigi Di Stefano. Interactive rgb-d slam on mobile devices. In C. V. Jawahar and Shiguang Shan, editors, Computer Vision - ACCV 2014 Workshops, pages 339–351, Cham, 2015. Springer International Publishing. [9] Alvaro Parra Bustos, Tat-Jun Chin, Anders Eriksson, and Ian Reid. Visual SLAM: Why bundle adjust? In 2019 International Conference on Robotics and Automation (ICRA). IEEE, may 2019. 113 [10] Yunlong Che and Yue Qi. Detection-guided 3d hand tracking for mobile ar applications. In 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 386–392, 2021. [11] Jun Cheng, Liyan Zhang, Qihong Chen, Xinrong Hu, and Jingcao Cai. A review of visual slam methods for autonomous driving vehicles. Engineering Applications of Artificial Intelligence, 114:104992, 2022. [12] Andrew Cohen, Ervin Teng, Vincent-Pierre Berges, Ruo-Ping Dong, Hunter Henry, Marwan Mattar, Alexander Zook, and Sujoy Ganguly. On the use and misuse of abosrbing states in multi-agent reinforcement learning. RL in Games Workshop AAAI 2022, 2022. [13] HTC Corporation. "customize 3d hand model". https://hub.vive.com/sto rage/tracking/unity/model.html, 2021. Accessed: 2023-10-06. [14] HTC Corporation. "vive hand tracking sdk overview - supported hardware". https: //hub.vive.com/storage/tracking/overview/hardware.html, 2021. Accessed: 2023-10-23. [15] Connor DeFanti, Davi Geiger, and Daniele Panozzo. Co-Located Augmented and Virtual Reality Systems. PhD thesis, New York University, 2019. [16] A. Del Bimbo, L. Landucci, and A. Valli. Multi-user natural interaction system based on real-time hand tracking and gesture recognition. In 18th International Conference on Pattern Recognition (ICPR’06), volume 3, pages 55–58, 2006. [17] Qicheng Ding, Jiexiong Ding, Jing Zhang, and Li Du. An attempt to relate dynamic tracking error to occurring situation based on additional rectilinear motion for five- axis machine tools. Advances in Mechanical Engineering, 12(10):1687814020967573, 2020. [18] K. C. Dohse, Thomas Dohse, Jeremiah D. Still, and Derrick J. Parkhurst. Enhancing multi-user interaction with multi-touch tabletop displays using hand tracking. In First International Conference on Advances in Computer-Human Interaction, pages 297–302, 2008. [19] Tobias Drey, Patrick Albus, Simon der Kinderen, Maximilian Milo, Thilo Segschnei- der, Linda Chanzab, Michael Rietzler, Tina Seufert, and Enrico Rukzio. Towards collaborative learning in virtual reality: A comparison of co-located symmetric and asymmetric pair-learning. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI ’22, New York, NY, USA, 2022. Association for Computing Machinery. [20] Ylva Ferstl, Rachel McDonnell, and Michael Neff. 
Evaluating study design and strategies for mitigating the impact of hand tracking loss. In ACM Symposium on Applied Perception 2021, SAP ’21, New York, NY, USA, 2021. Association for Computing Machinery. 114 [21] Tiare Feuchtner. Designing for Hand Ownership in Interaction with Virtual and Augmented Reality. PhD thesis, Aarhus Universitet, Aarhus, 2018. [22] Daniel Immanuel Fink, Johannes Zagermann, Harald Reiterer, and Hans-Christian Jetter. Re-locations: Augmenting personal and shared workspaces to support remote collaboration in incongruent spaces. Proc. ACM Hum.-Comput. Interact., 6(ISS), nov 2022. [23] D. Fox, J. Ko, K. Konolige, B. Limketkai, D. Schulz, and B. Stewart. Distributed multirobot exploration and mapping. Proceedings of the IEEE, 94(7):1325–1339, 2006. [24] Valentino Frati and Domenico Prattichizzo. Using kinect for hand tracking and rendering in wearable haptics. In 2011 IEEE World Haptics Conference, pages 317–321, 2011. [25] Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675– 701, 1937. [26] Milton Friedman. A correction. Journal of the American Statistical Association, 34(205):109–109, 1939. [27] Milton Friedman. A Comparison of Alternative Tests of Significance for the Problem of m Rankings. The Annals of Mathematical Statistics, 11(1):86 – 92, 1940. [28] Emmanuel Frécon and Mårten Stenius. Dive: a scaleable network architecture for distributed virtual environments. Distributed Systems Engineering, 5(3):91, sep 1998. [29] Joshua S. Furtado, Hugh H. T. Liu, Gilbert Lai, Herve Lacheray, and Jason Desouza-Coelho. Comparative analysis of optitrack motion capture systems. In Farrokh Janabi-Sharifi and William Melek, editors, Advances in Motion Sensing and Control for Robotic Applications, pages 15–31, Cham, 2019. Springer International Publishing. [30] Paul A. Games and John F. Howell. Pairwise multiple comparison procedures with unequal n’s and/or variances: A monte carlo study. Journal of Educational Statistics, 1(2):113–125, 1976. [31] Liang Gong, Henrik Söderlund, Leonard Bogojevic, Xiaoxia Chen, Anton Berce, Åsa Fast-Berglund, and Björn Johansson. Interaction design for multi-user virtual reality systems: An automotive case study. Procedia CIRP, 93:1259–1264, 2020. 53rd CIRP Conference on Manufacturing Systems 2020. [32] The Khronos® OpenXR Working Group. "the openxr™ specification". https: //registry.khronos.org/OpenXR/specs/1.0/html/xrspec.html, 2023. Accessed: 2023-10-23. 115 [33] Robert Gruen, Eyal Ofek, Anthony Steed, Ran Gal, Mike Sinclair, and Mar Gonzalez-Franco. Measuring system visual latency through cognitive latency on video see-through ar devices. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 791–799, 2020. [34] Ayah Hamad and Bochen Jia. How virtual reality technology has changed our lives: An overview of the current and potential applications and limitations. International Journal of Environmental Research and Public Health, 19:11278, 09 2022. [35] Asim Hameed, Andrew Perkis, and Sebastian Möller. Evaluating hand-tracking interaction for performing motor-tasks in vr learning environments. In 2021 13th International Conference on Quality of Multimedia Experience (QoMEX), pages 219–224, 2021. [36] Jan Hendrik Hammer and Jürgen Beyerer. Robust hand tracking in realtime using a single head-mounted rgb camera. In Masaaki Kurosu, editor, Human- Computer Interaction. 
Interaction Modalities and Techniques, pages 252–261, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. [37] Shangchen Han, Beibei Liu, Randi Cabezas, Christopher D. Twigg, Peizhao Zhang, Jeff Petkau, Tsz-Ho Yu, Chun-Jung Tai, Muzaffer Akbay, Zheng Wang, Asaf Nitzan, Gang Dong, Yuting Ye, Lingling Tao, Chengde Wan, and Robert Wang. Megatrack: Monochrome egocentric articulated hand-tracking for virtual reality. 39(4), aug 2020. [38] Sebastian Herscher, Connor DeFanti, Nicholas Gregory Vitovitch, Corinne Brenner, Haijun Xia, Kris Layng, and Ken Perlin. Cavrn: An exploration and evaluation of a collective audience virtual reality nexus experience. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, UIST ’19, page 1137–1150, New York, NY, USA, 2019. Association for Computing Machinery. [39] Meta: Joel Hesch, Anna Kozminski, and Oskar Linde. "powered by ai: Oculus insight". https://ai.meta.com/blog/powered-by-ai-oculus-insig ht/, 2019. Accessed: 2023-10-09. [40] Valentin Holzwarth, Joy Gisler, Christian Hirt, and Andreas Kunz. Comparing the accuracy and precision of steamvr tracking 2.0 and oculus quest 2 in a room scale setup. 03 2021. [41] Lin Huang, Boshen Zhang, Zhilin Guo, Yang Xiao, Zhiguo Cao, and Junsong Yuan. Survey on depth and rgb image-based 3d hand shape and pose estimation. Virtual Reality & Intelligent Hardware, 3(3):207–234, 2021. [42] The MathWorks Inc. "what is slam? 3 things you need to know". https: //www.mathworks.com/discovery/slam.html, 2023. Accessed: 2023-10- 09. 116 [43] Ultraleap Inc. "uh-003206-tc issue 6 leap motion controller data sheet". https: //www.ultraleap.com/datasheets/Leap_Motion_Controller_Data sheet.pdf, 2020. Accessed: 2023-10-06. [44] Pascal Jansen, Fabian Fischbach, Jan Gugenheimer, Evgeny Stemasov, Julian Frommel, and Enrico Rukzio. Share: Enabling co-located asymmetric multi-user interaction for augmented reality head-mounted displays. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, UIST ’20, page 459–471, New York, NY, USA, 2020. Association for Computing Machinery. [45] Arthur Juliani, Vincent-Pierre Berges, Ervin Teng, Andrew Cohen, Jonathan Harper, Chris Elion, Chris Goy, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627, 2020. [46] Panagiotis T. Karfakis, Micael S. Couceiro, and David Portugal. Nr5g-sam: A slam framework for field robot applications based on 5g new radio. Sensors, 23(11), 2023. [47] Gregory Kessler, Neff Walker, and Larry Hodges. Evaluation of the cyberglove(tm) as a whole hand input device. ACM Transactions on Computer-Human Interaction, 2, 12 1995. [48] Chaowanan Khundam, Varunyu Vorachart, Patibut Preeyawongsakul, Witthaya Hosap, and Frédéric Noël. A comparative study of interaction time and usability of using controllers and hand tracking in virtual reality training. Informatics, 8(3), 2021. [49] Chaowanan Khundam, Varunyu Vorachart, Patibut Preeyawongsakul, Witthaya Hosap, and Frédéric Noël. A comparative study of interaction time and usability of using controllers and hand tracking in virtual reality training. Informatics, 8:60, 09 2021. [50] H. Kortier, Martin Schepers, Victor Sluiter, Peter Veltink, Alberto Leardini, and Rita Stagni. Ambulatory assesment of hand kinematics : using an instrumented glove. Computer Standards & Interfaces - CSI, 01 2012. [51] William H. Kruskal and W. Allen Wallis. Use of ranks in one-criterion variance analysis. 
Journal of the American Statistical Association, 47(260):583–621, 1952. [52] Bor-Woei Kuo, Hsun-Hao Chang, Yung-Chang Chen, and Shi-Yu Huang. A light- and-fast slam algorithm for robots in indoor environments using line segment map. Hindawi Publishing Corporation Journal of Robotics, 12, 01 2011. [53] Pierre-Yves Lajoie, Benjamin Ramtoula, Fang Wu, and Giovanni Beltrame. Towards collaborative simultaneous localization and mapping: a survey of the current research landscape, 08 2021. 117 [54] Eike Langbehn, Gerd Bruder, and Frank Steinicke. Moving towards natural interaction between multiscale avatars in multi-user virtual environments. In International Conference on Artificial Reality and Telexistence and Eurographics Symposium on Virtual Environments 2015, ICAT-EGVE 2015 : International Conference on Artificial Reality and Telexistence and Eurographics Symposium on Virtual Environments, 2015. [55] Eike Langbehn, Hannah Paulmann, Dennis Briddigkeit, Marc Barnes, Malte Husung, Kolja Kirsch, Daniel Neves Coelho, Tim Mayer, and Frank Steinicke. Frozen factory: A playful virtual experience for multiple co-located redirected walking users. In SIGGRAPH Asia 2020 XR, SA ’20, New York, NY, USA, 2020. Association for Computing Machinery. [56] Micheal Lanham. Learn Unity ML-Agents Fundamentals of Unity Machine Learning: Incorporate new powerful ML algorithms such as Deep Reinforcement Learning for games. Packt Publishing, 2018. [57] Steven M. LaValle, Anna Yershova, Max Katsev, and Michael Antonov. Head tracking for the oculus rift. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 187–194, 2014. [58] Kris Layng, Ken Perlin, Sebastian Herscher, Corinne Brenner, and Thomas Meduri. Cave: Making collective virtual narrative. In ACM SIGGRAPH 2019 Art Gallery, SIGGRAPH ’19, New York, NY, USA, 2019. Association for Computing Machinery. [59] Stefan Lee, Sven Bambach, David J. Crandall, John M. Franchak, and Chen Yu. This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 557–564, 2014. [60] Howard Levene. Robust Tests for Equality of Variance, volume 2:, pages 278–292. Stanford University Press, 01 1960. [61] Yue Li, Eugene Ch’ng, Shengdan Cai, and Simon See. Multiuser interaction with hybrid vr and ar for cultural heritage objects. In 2018 3rd Digital Heritage Inter- national Congress (DigitalHERITAGE) held jointly with 2018 24th International Conference on Virtual Systems & Multimedia (VSMM 2018), pages 1–8, 2018. [62] Fanqing Lin, Connor Wilhelm, and Tony R. Martinez. Two-hand global 3d pose estimation using monocular RGB. CoRR, abs/2006.01320, 2020. [63] Jiaojiao Lin, Fei Jiang, and Ruimin Shen. Hand-raising gesture detection in real classroom. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6453–6457, 2018. [64] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, 118 Wan-Teh Chang, Wei Hua, Manfred Georg, and Matthias Grundmann. Mediapipe: A framework for building perception pipelines. CoRR, abs/1906.08172, 2019. [65] Andréa Macario Barros, Maugan Michel, Yoann Moline, Gwenolé Corre, and Frédérick Carrel. A comprehensive survey of visual slam algorithms. Robotics, 11(1), 2022. [66] Jameel Malik, Ahmed Elhayek, Fabrizio Nunnari, Kiran Varanasi, Kiarash Tamad- don, Alexis Héloir, and Didier Stricker. 
Deephps: End-to-end estimation of 3d hand pose and shape by learning from synthetic depth. CoRR, abs/1808.09208, 2018. [67] Cristina Manresa-Yee, Javier Varona, Ramon Mas, and Francisco Perales. Hand tracking and gesture recognition for human-computer interaction. Electronic Letters on Computer Vision and Image Analysis;, ISSN 1577-5097 E:1, 01 2000. [68] Alexander Masurovsky, Paul Chojecki, Detlef Runde, Mustafa Lafci, David Prze- wozny, and Michael Gaebler. Controller-free hand tracking for grab-and-place tasks in immersive virtual reality: Design elements and their empirical study. Multimodal Technologies and Interaction, 4(4), 2020. [69] Fabrice Matulic, Taiga Kashima, Deniz Beker, Daichi Suzuo, Hiroshi Fujiwara, and Daniel Vogel. Above-screen fingertip tracking with a phone in virtual reality. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, CHI EA ’23, New York, NY, USA, 2023. Association for Computing Machinery. [70] Mark McGill, Jan Gugenheimer, and Euan Freeman. A quest for co-located mixed reality: Aligning and assessing slam tracking for same-space multi-user experiences. In 26th ACM Symposium on Virtual Reality Software and Technology, VRST ’20, New York, NY, USA, 2020. Association for Computing Machinery. [71] Google MediaPipe. "hand landmarks detection guide". https://developers .google.com/mediapipe/solutions/vision/hand_landmarker, 2023. Accessed: 2023-12-15. [72] Meta. "from the lab to the living room: The story behind facebook’s oculus insight technology and a new era of consumer vr". https://tech.facebook.com/re ality-labs/2019/8/the-story-behind-oculus-insight-technol ogy/, 2019. Accessed: 2023-10-09. [73] Meta. "set up hand tracking". https://developer.oculus.com/documen tation/unity/unity-handtracking/, 2023. Accessed: 2023-10-06. [74] Maximilian Metzner, Lorenz Krieg, Daniel Krüger, Tobias Ködel, and Jörg Franke. Intuitive, VR- and Gesture-based Physical Interaction with Virtual Commissioning Simulation Models, pages 11–20. 07 2020. 119 [75] Paul Milgram, Haruo Takemura, Akira Utsumi, and Fumio Kishino. Augmented reality: A class of displays on the reality-virtuality continuum. Telemanipulator and Telepresence Technologies, 2351, 01 1994. [76] C. Mizera, T. Delrieu, V. Weistroffer, C. Andriot, A. Decatoire, and J.-P. Gazeau. Evaluation of hand-tracking systems in teleoperation and virtual dexterous manip- ulation. IEEE Sensors Journal, 20(3):1642–1655, 2020. [77] Curtiss Murphy. Believable Dead Reckoning for Networked Games, pages 307–328. 02 2011. [78] Hyun Myung, Hae min Jeon, and Woo-Yeon Jeong. Virtual door algorithm for coverage path planning of mobile robot. In 2009 IEEE International Symposium on Industrial Electronics, pages 658–663, 2009. [79] Nadim Nachar. The mann-whitney u: A test for assessing whether two independent samples come from the same distribution. Tutorials in Quantitative Methods for Psychology, 4, 03 2008. [80] Supreeth Narasimhaswamy, Thanh Nguyen, Mingzhen Huang, and Minh Hoai. Whose hands are these? hand detection and hand-body association in the wild. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4879–4889, 2022. [81] D. Niehorster, L. Li, and M. Lappe. The accuracy and precision of position and orientation tracking in the htc vive virtual reality system for scientific research. i-Perception, 8, 2017. [82] Brendan O’Flynn, Javier Torres, James Connolly, Joan Condell, Kevin Curran, and Philip Gardiner. Novel smart sensor glove for arthritis rehabiliation. 
In 2013 IEEE International Conference on Body Sensor Networks, pages 1–6, 2013. [83] Patrick Aggergaard Olin, Ahmad Mohammad Issa, Tiare Feuchtner, and Kaj Grønbæk. Designing for heterogeneous cross-device collaboration and social in- teraction in virtual reality. In Proceedings of the 32nd Australian Conference on Human-Computer Interaction, OzCHI ’20, page 112–127, New York, NY, USA, 2021. Association for Computing Machinery. [84] Open Source Computer Vision OpenCV. "detection of aruco markers". https: //docs.opencv.org/4.x/d5/dae/tutorial_aruco_detection.html, 2024. Accessed: 2024-01-11. [85] Eva Ostertagova and Oskar Ostertag. Methodology and application of one-way anova. American Journal of Mechanical Engineering, 1:256–261, 11 2013. [86] Kaitlyn M. Ouverson and Stephen B. Gilbert. A composite framework of co-located asymmetric virtual reality. Proc. ACM Hum.-Comput. Interact., 5(CSCW1), apr 2021. 120 [87] Paschalis Panteleris, Iason Oikonomidis, and Antonis A. Argyros. Using a single RGB frame for real time 3d hand pose estimation in the wild. CoRR, abs/1712.03866, 2017. [88] Daniel Passos and Bernhard Jung. Measuring the Accuracy of Inside-Out Tracking in XR Devices Using a High-Precision Robotic Arm, pages 19–26. 07 2020. [89] Jérôme Perret and Emmanuel Vander Poorten. Touching virtual reality: a review of haptic gloves. 06 2018. [90] Stephen Pheasant. Bodyspace: Anthropometry, Ergonomics And The Design Of Work. CRC Press, London, 2 edition, 2003. [91] I. Podkosova, K. Vasylevska, C. Schoenauer, E. Vonach, P. Fikar, E. Bronederk, and H. Kaufmann. Immersivedeck: a large-scale wireless vr system for multiple users. In 2016 IEEE 9th Workshop on Software Engineering and Architectures for Realtime Interactive Systems (SEARIS), pages 1–7, 2016. [92] Iana Podkosova. Walkable multi-user VR: the effects of physical and virtual coloca- tion. PhD thesis, Wien, 2018. [93] N. Pretto and F. Poiesi. Towards gesture-based multi-user interactions in collabo- rative virtual environments. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-2/W8:203–208, 2017. [94] Sergi Pujades, Betty Mohler, Anne Thaler, Joachim Tesch, Naureen Mahmood, Nikolas Hesse, Heinrich H. Bülthoff, and Michael J. Black. The virtual caliper: Rapid creation of metrically accurate avatars from 3d measurements. IEEE Trans- actions on Visualization and Computer Graphics, 25(5):1887–1897, 2019. [95] Aylen Ricca, Amine Chellali, and Samir Otrnane. The influence of hand visualization in tool-based motor-skills training, a longitudinal study. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pages 103–112, 2021. [96] Holger Salzmann, Jan Jacobs, and Bernd Froehlich. Collaborative Interaction in Co- Located Two-User Scenarios. In Michitaka Hirose, Dieter Schmalstieg, Chadwick A. Wingrave, and Kunihiro Nishimura, editors, Joint Virtual Reality Conference of EGVE - ICAT - EuroVR. The Eurographics Association, 2009. [97] Muhamad Risqi U. Saputra, Andrew Markham, and Niki Trigoni. Visual slam and structure from motion in dynamic environments: A survey. ACM Comput. Surv., 51(2), feb 2018. [98] Daniel Schneider, Verena Biener, Alexander Otte, Travis Gesslein, Philipp Gagel, Cuauhtli Campos, Klen Copic Pucihar, Matjaz Kljun, Eyal Ofek, Michel Pahud, Per Ola Kristensson, and Jens Grubert. Accuracy evaluation of touch tasks in commodity virtual and augmented reality head-mounted displays. CoRR, abs/2109.10607, 2021. 
121 [99] Daniel Schneider, Alexander Otte, Axel Simon Kublin, Alexander Martschenko, Per Ola Kristensson, Eyal Ofek, Michel Pahud, and Jens Grubert. Accuracy of commodity finger tracking systems for virtual reality head-mounted displays. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages 804–805, 2020. [100] H. Schupp. Elementargeometrie. Number Bd. 1 in Grundkurs Mathematik. Schön- ingh, 1977. [101] Alexander Schäfer, Gerd Reis, and Didier Stricker. Comparing controller with the hand gestures pinch and grab for picking up and placing virtual objects, 2022. [102] S. S. SHAPIRO and M. B. WILK. An analysis of variance test for normality (complete samples)†. Biometrika, 52(3-4):591–611, 12 1965. [103] Toby Sharp, Cem Keskin, Duncan Robertson, Jonathan Taylor, Jamie Shotton, David Kim, Christoph Rhemann, Ido Leichter, Alon Vinnikov, Yichen Wei, Daniel Freedman, Pushmeet Kohli, Eyal Krupka, Andrew Fitzgibbon, and Shahram Izadi. Accurate, robust, and flexible real-time hand tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ’15, page 3633–3642, New York, NY, USA, 2015. Association for Computing Machinery. [104] Yangming Shi, Jing Du, Sarel Lavy, and Dong Zhao. A multiuser shared virtual environment for facility management. Procedia Engineering, 145:120–127, 2016. ICSDEC 2016 – Integrating Data Science, Construction and Sustainability. [105] TIGA: Suzi Stephenson. "tiga survey reveals that unity 3d engine dominates the uk third party engine market". https://tiga.org/news/tiga-survey-r eveals-that-unity-3d-engine-dominates-the-uk-third-party-e ngine-market, 2019. Accessed: 2023-10-06. [106] Stephan Streuber and Astros Chatziastros. Human interaction in multi-user virtual reality. Proceedings of the 10th International Conference on Humans and Computers (HC 2007), 1-7 (2007), 01 2007. [107] Z. Sun, Y. Hu, and X. Shen. Two-hand pose estimation from the non-cropped rgb image with self-attention based network. In 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 248–255, Los Alamitos, CA, USA, oct 2021. IEEE Computer Society. [108] Philipp Sykownik, Sukran Karaosmanoglu, Katharina Emmerich, Frank Steinicke, and Maic Masuch. Vr almost there: Simulating co-located multiplayer experiences in social virtual reality. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, New York, NY, USA, 2023. Association for Computing Machinery. 122 [109] Unity Technologies. "creating custom packages". https://docs.unity3d.com /Manual/CustomPackages.html, 2023. Accessed: 2023-10-06. [110] Pablo Temoche, Esmitt Ramirez, and Omaira Rodríguez. A low-cost data glove for virtual reality. pages TCG 31–36, 05 2012. [111] Silvia Terrile, Jesus Miguelañez, and Antonio Barrientos. A soft haptic glove actuated with shape memory alloy and flexible stretch sensors. Sensors, 21(16), 2021. [112] Shantanu Tilak, Michael Glassman, Irina Kuznetcova, Joshua Peri, Qiannan Wang, Ziye Wen, and Amanda Walling. Multi-user virtual environments (muves) as alter- native lifeworlds: Transformative learning in cyberspace. Journal of Transformative Education, 18(4):310–337, 2020. [113] Satoshi Tsutsui, Yanwei Fu, and David Crandall. Whose hand is this? person identification from egocentric hand gestures, 2020. [114] Kyriaki A. Tychola, Ioannis Tsimperidis, and George A. Papakostas. On 3d reconstruction using rgb-d cameras. Digital, 2(3):401–421, 2022. [115] Toin Villar. 
"what is the metaverse?". https://www.makeuseof.com/what -is-the-metaverse/, 2019. Accessed: 2023-10-10. [116] Jan-Niklas Voigt-Antons, Tanja Kojic, Danish Ali, and Sebastian Möller. Influence of hand tracking as a way of interaction in virtual reality on user experience. In 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), pages 1–4, 2020. [117] Aleš Vysocký, Stefan Grushko, Petr Oščádal, Tomáš Kot, Ján Babjak, Rudolf Jánoš, Marek Sukop, and Zdenko Bobovský. Analysis of precision and stability of hand tracking with leap motion sensor. Sensors, 20(15), 2020. [118] David Waller, E.R. Bachmann, Eric Hodgson, and Andrew Beall. The hive: A huge immersive virtual environment for research in spatial cognition. Behavior research methods, 39:835–43, 12 2007. [119] Jiayi Wang, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sot- nychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. Rgb2hands: Real-time tracking of 3d hand interactions from monocular rgb video. 39(6), nov 2020. [120] Zheng Wang, Luca Mastrogiacomo, Fiorenzo Franceschini, and Paul Maropoulos. Experimental comparison of dynamic tracking performance of igps and laser tracker. The International Journal of Advanced Manufacturing Technology, 56, September 2011. 123 [121] Frank Weichert, Daniel Bachmann, Bartholomäus Rudak, and Denis Fisseler. Analysis of the accuracy and robustness of the leap motion controller. Sensors (Basel, Switzerland), 13:6380–6393, 05 2013. [122] T. Weissker, P. Tornow, and B. Froehlich. Tracking multiple collocated htc vive setups in a common coordinate system. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages 592–593, 2020. [123] B. L. Welch. The generalisation of student’s problems when several different population variances are involved. Biometrika, 34(1-2):28–35, 01 1947. [124] Brian Williams, Georg Klein, and Ian Reid. Real-time slam relocalisation. pages 1–8, 01 2007. [125] Sen-Zhe Xu, Jia-Hong Liu, Miao Wang, Fang-Lue Zhang, and Song-Hai Zhang. Multi-user redirected walking in separate physical spaces for online vr scenarios, 2022. [126] Yu Xu and Xi’an Zhu. The research and application of data glove in virtual interaction system. Advanced Materials Research, 989-994:2057–2061, 07 2014. [127] Umema Zafar, Shafiq-Ur-Rahman, Naila Hamid, Junaid Ahsan, and Nimra Zafar. Correlation between height and hand size, and predicting height on the basis of age, gender and hand size. Journal of Medical Sciences (Peshawar), 25:425–428, 10 2017. [128] Faisal Zaman. [dc] improving multi-user interaction for mixed reality telecollabora- tion. In 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages 940–941, 2022. [129] José Zariffa and Milos Popovic. Hand contour detection in wearable camera video using an adaptive histogram region of interest. Journal of neuroengineering and rehabilitation, 10:114, 12 2013. [130] Fangfang Zhang, Valentin Bazarevsky, Andrey Vakunov, A. Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. Mediapipe hands: On-device real-time hand tracking. ArXiv, abs/2006.10214, 2020. [131] Jingbo Zhao, Ruize An, Ruolin Xu, and Banghao Lin. Comparing hand gestures and a gamepad interface for locomotion in virtual environments. International Journal of Human-Computer Studies, 166:102868, 2022. [132] Huayi Zhou, Fei Jiang, and Ruimin Shen. Who are raising their hands? hand-raiser seeking based on object detection and pose estimation. 
[133] Thomas G. Zimmerman, Jaron Lanier, Chuck Blanchard, Steve Bryson, and Young Harvill. A hand gesture interface device. SIGCHI Bull., 17(SI):189–192, May 1986.
[134] Christian Zimmermann and Thomas Brox. Learning to estimate 3D hand pose from single RGB images. In IEEE International Conference on Computer Vision (ICCV), 2017. https://arxiv.org/abs/1705.01389.

List of Figures

2.1 An example of SLAM tracking where features of the environment are tracked in a point cloud and the user is positioned in the mapped environment. 8
2.2 One of the first data gloves, developed by ’VPL Research’ for interaction in virtual environments [133]. 11
2.3 Two RGB-D cameras are depicted. Left: Intel RealSense Depth Camera D435; Right: ASUS Xtion PRO LIVE. 12
2.4 Tracked hands using the MediaPipe framework, which is used in our implementation. Left: Tracked finger joints with relative depth to the wrist. Right: Multiple simultaneously tracked hands. 14
2.5 Taxonomy of multi-user VR created by Podkosova [92]. 17
2.6 Two colocated users in both the physical and virtual space, with hand tracking enabled. The point of view (POV) is from the left user within the virtual scene. 19
2.7 Three users interacting with the multi-touch tabletop by Dohse et al. A camera used for hand tracking is attached on top [18]. The image is taken from the authors’ work. 23
2.8 Hand-body association by Narasimhaswamy et al. involves the creation of 2D bounding boxes to assign hands to users in the detected image [80]. 24
3.1 Overview of hand tracking API integration of different systems in one application. Each system comes with distinct visualization, interaction and gestures. 26
3.2 ’EasyHand’ acting as a layer between several hand tracking APIs and the application, unifying visualization (V), interaction (I) and gesture recognition (G). 27
3.3 Hemispherical area of the interaction zone of the LeapMotion sensor. 29
3.4 Left: The LeapMotion controller attached to a VR HMD. Right: Detected joints of the Ultraleap API. 30
3.5 Left: The Meta Quest 2 HMD. Right: Detected hands and joints with the Meta Quest 2. 30
3.6 Left: The detected joints of the Vive Hand Tracking. Right: Recognized gestures of the Vive Hand Tracking. 31
3.7 Joint index mapping and description of an ’EasyHand’ hand skeleton. 32
3.8 The ’Peace’ gesture recognized by the system. Left: Labels indicate which fingers are bent and which are not. Here the system recognizes the gesture when all fingers are bent except the index and middle finger. Right: Angle calculations between the joints are used to determine which fingers are bent and which are not. 33
3.9 Skeletal rendering of detected hands. 34
3.10 Skeletal rendering of detected hands. 35
3.11 Components for the ’EasyHandRig’ template. 36
3.12 The ’EasyHandRig’ inside the Unity engine. The ’EasyHandManager’ component is placed on the parent ’EasyHandRig’ game object. Each hand object has its own visualizer and gesture recognizer. The VR camera represents the user’s HMD and is responsible for user positioning and rendering. In this example, Meta Quest hand tracking is used, which is why an OVRManager is created at runtime to obtain the low-level data of the tracking system for unification. 37
3.13 General overview of the flow of the ’EasyHand’ system. ’System Integration’ shows how base hand tracking systems are integrated. ’Unified Logic’ is the core logic for Visualization, Interactions, and Gesture Recognition of the system. These parts can be synchronized over the network. ’VR Headset Tracking’ is the integration of a VR headset to track the user’s head movement and work in conjunction with tracked hands in the final application. 40
4.1 Sketch of requirements for a colocated scenario. Left: Two users standing in front of each other at a distance d, with their view directions v_1 and v_2. Right: Their virtual representations are aligned to have the fitting distance and view direction of their real-world counterparts. 44
4.2 Left: The users’ headsets are positioned on predefined locations in the real world. Right: Virtual users who are repositioned to U_V, which is the virtual representation of U_R. The distance Δd_U is the same in the real and virtual world. Red arrows represent the view direction of the user. 46
4.3 Left: The users standing in the real world detecting an ArUco marker. Right: Virtual users who are relocated depending on the detected marker. The virtual user is moved by Δp and rotated by α to get to U_V, which is the user’s position p⃗_m and rotation r⃗_m in the marker space. 47
4.4 Left: The users standing in the real world detecting the same hand. Right: The user is relocated by the difference Δp. α is the difference in rotation between the tracked hand and the received reference hand. Rotation is visualized by red arrows. In contrast to the other methods, only the other user is relocated. 49
4.5 Meta Quest with an attached Vive Tracker and ZED-Mini camera used in the evaluation. 50
4.6 Time distribution of the distance error, shown on the example of a dataset from fixed-point calibration. 51
4.7 Distance error box-plots of pilot recordings. 52
4.8 Network communication between admin computer and VR users. 53
4.9 Exemplary experiment setup for the marker-based calibration method. 54
4.10 Box-plots of median distance error for four calibration types. 55
4.11 Mean values of calibration error (median distance error) for each calibration method. 55
4.12 Two colocated users standing in front of each other with hand tracking enabled. 59
4.13 Two seated colocated users using their hands. 60
5.1 A colocated multi-user VR setup. User B’s hands are outside the range of their own hand tracking, but visible to user A. With off-the-shelf solutions, only two hands can be detected at the same time within a short range (left). Our solution allows us to detect and position the hands of other users in 3D space. This way the hands of user B can still be detected (right). 64
5.2 Step-by-step diagram for adjusting and positioning a detected hand from MediaPipe. After detection, the virtual hand is adjusted to the real-world hand size and then positioned with the help of the intercept theorem. 66
5.3 Landmark indices of the MediaPipe framework. Marked landmarks are used for hand length calculations. 67
5.4 Excerpt of body size estimations from Pheasant [90]. 67
5.5 Discrepancy between measured hand length and calculated hand length of the ten users that participated in the evaluation of Sections 5.3.2 and 5.3.3. 68
5.6 Sketch of the experimental setup. 70
5.7 Box-plots for static data collection for each method and distance. 72
5.8 Scatter plots of two example dynamic error distributions for Meta Quest and MediaPipeHand. The x-axis refers to the real-world distance of the hand to the tracking device, the y-axis to the error difference between real-world and virtual-world distance. Three regression lines for the Meta Quest data are illustrated separately for NearRange, MidRange and FarRange due to the rising gradients in each range. One regression line for MediaPipeHand is illustrated over all distances. The linear equations for the regression lines are in cm/cm. The Quest median error rises more steeply at large distances. 74
5.9 Mean gradient values for each method and distance range. Gradients are in cm/cm. 75
5.10 Box-plots of medians for lost and acquired tracking distance for the tracking methods. Labels are median values. 77
5.11 Box-plots of the mean error at every measured distance in the test with multiple users. 80
5.12 Mean gradient values for each method and distance range in Experiment 2. Gradients are in cm/cm. 80
5.13 The point of view of User 1, whose attached tracking device also tracks the hands of User 2. Both users are colocated; the real-world view of the camera is slightly overlaid. 82
5.14 Error of hand pairs in centimeters during the preliminary user test. The hands of User 1 were in NearRange, while the hands of User 2 were in MidRange. Spikes can occur due to tracking losses and incorrect positioning caused by a smoothing filter. 83
6.1 A colocated scenario involving two users, each equipped with their own hand tracking system. Virtual hands in the right image are assigned to the system that tracked the corresponding real hand in the left image. 88
6.2 Two colocated users where only one user is equipped with a hand tracking system that is able to track more than two hands. Virtual hands in the right image cannot be reliably assigned to the correct user. 89
6.3 Distance method: The distance between the hand and the user is calculated. The hand is assigned to the user with the shortest distance. 90
6.4 Logistic curve for distance (formula at bottom left) and rotation (formula at bottom right) calculations. The x-axis represents the input position/rotation, while the y-axis denotes the resulting confidence. The red lines indicate the selected threshold values. 91
6.5 Rotation method: The angle between the forward vector of the hand and the direction vector from hand to user is calculated. The hand is assigned to the user with the smaller angle. 92
6.6 Convex hulls generated from a user’s VR controller inputs. The red points correspond to recordings from the right hand, while the blue points represent recordings from the left hand. The resulting convex hulls are depicted in green. 93
6.7 Reward development in Unity3D for ML training. The blue line corresponds to training with 2 users, the grey line represents 3 users, the pink line denotes 4 users, and the yellow line represents 5 users. The lines are smoothed for clarity, while the transparent lines in the background display the unsmoothed results. The variance in unsmoothed rewards may arise from randomized input data. 94
6.8 An example of a dynamic selection of hand assignment methods for a hand. The two left users and the hand are in close proximity. The complex ’Machine Learning’ algorithm is used here. The user on the right is further away, which is why the simpler ’Distance’ algorithm is used. 94
6.9 Some exemplary formations of the evaluation with two to five simultaneous users. Formations include near and far proximities to cover diverse and difficult assignment scenarios. 98
6.10 Accumulated accuracy across all methods with and without the history algorithm. The utilization of the history algorithm leads to an overall higher level of accuracy. 100
6.11 Mean accuracy results for the methods. The two most accurate methods are highlighted. 102
6.12 Mean performance results for the methods at various user counts. One bar illustrates the mean performance adjusted for the number of users. 104
6.13 Effect of user proximity on the accuracy of the distance method. Reduced accuracy and increased variance can be seen when users are near each other. Figure 6.9 illustrates formations with different proximities. 106

List of Tables

3.1 Overview of all synchronized data that is sent to the Photon server and then broadcast to all connected clients. 37
3.2 Synchronized data for a serialized hand. 38
4.1 Results of post-hoc pairwise comparisons with the Games-Howell test. 56
5.1 The resulting mean and median values for all distances in the static error evaluation. Error values are in centimeters. 73
5.2 The resulting mean value gradients for different ranges in the dynamic error evaluation. Mean error gradient values are in cm/cm. 76
5.3 The resulting mean and median distances for the different tracking methods when tracking is acquired and lost. Distance values are in centimeters. 78
6.1 Mean and standard deviation of the accuracy of the methods. 101
6.2 Resulting p-values of the pairwise comparison of the methods after applying the Bonferroni correction. Significant results are marked in bold. 102
6.3 Mean performance of the methods (in milliseconds) based on the number of users, including user-corrected times. 104