Erklärung zur Verfassung der Arbeit

Annette Mossel
Lindengasse 41/9, 1070 Wien

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit - einschließlich Tabellen, Karten und Abbildungen -, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

(Ort, Datum) (Unterschrift Verfasserin)

Acknowledgements

This PhD thesis would not have been possible without the support of my family, my beloved friends near and far, my advisors and my colleagues from the IMS. First of all, I would like to thank my advisor Priv.-Doz. Dr. Hannes Kaufmann for introducing me to the research field, for providing me with scientific insights into mixed reality, for all the valuable discussions about ideas and projects and for all his help, feedback and support from the beginning of my time at the IMS. Furthermore, I thank Prof. Mark Billinghurst for being my second reviewer and for his valuable feedback on this thesis. I also would like to thank Prof. Christian Breiteneder for giving me the opportunity to work at the IMS, for valuable discussions and for supporting my plans and ideas.

I would like to thank my colleagues who were involved in my research for their feedback and support. In particular, I thank Georg Gerstweiler and Emanuel Vonach for their strong contributions to the results of Part II, my graduate student Benjamin Venditti, whose work significantly contributed to the results presented in Part III, Christian Schönauer, Georg Gerstweiler, Michael Bressler, David Zeller and Mathis Csisinko for their contributions to Part IV, Dr. Matthias Zeppelzauer for valuable research discussions, test assistance as well as proofreading, and Dr. Dalibor Mitrovic for providing me with scientific and non-scientific (sweets and cookies) support. I especially thank Ingrid Lissa for taking a lot of administrative work off my hands and all of my colleagues and tutors who helped me in holding my lectures. Furthermore, I would like to thank Dr. Klaus Chmelina from Geodata Ziviltechniker GesmbH for the good cooperation during our joint research projects. Financial support for this work was obtained from the FFG - Austrian Research Promotion Agency under project no. 822680 (B1) as well as from VRVis Research GmbH (Vienna, Austria) within the EU 7th Framework project I2Mine (no. NMP2-LA-2011-280855).

Besides my colleagues, I wish to express my most heartfelt thanks to my wonderful flatmates for providing me with a cozy and great place to live, especially Saskia Kuhlmann for making Vienna a new home for me and Julia Stockenreiter for all her advice and encouragement at any time. I am deeply grateful to my parents, my sister, my brother-in-law and my complete family for their love, endorsement and support of my ideas and plans throughout my entire life. Finally, I would like to thank Matthias for being such a wonderful partner, friend and colleague in one person. I really appreciate sharing my life with you.
Abstract

Mixed reality has been a focus of research for many years and has recently gained particular importance with the emergence of powerful, low-cost input and output devices as well as processing platforms that foster the applicability of virtual simulations for everyday usage. However, this leads to significant challenges since the creation of compelling mixed reality environments requires knowledge and robust techniques in the areas of tracking, visualization and interaction, and in the non-obligatory areas of distribution and authoring. This thesis focuses on the development of novel techniques and algorithms to contribute to the solution of fundamental problems in the areas of tracking, interaction, and application development of mixed reality systems. Firstly, a novel system for wide-area optical tracking in unconstrained indoor environments is presented that is capable of stereo camera calibration and model-based tracking of rigid-body targets in environments with poor illumination, static and moving ambient light sources, occlusions and harsh conditions, such as fog. The experimental results demonstrate the system's capability to track targets up to 90m and its applicability to act as a mixed reality tracking system as well as a general purpose measurement tool for future (underground) surveying tasks, such as autonomous machine guidance. Secondly, we investigated concepts for intuitive 3D interaction in virtual environments, specifically in one-handed handheld mixed reality. To address the shortcomings of state-of-the-art 3D selection and manipulation techniques, the novel algorithms DrillSample for selection, and 3DTouch and HOMER-S for manipulation are proposed. All three approaches aim at reducing the necessary input through the user's fingers to provide easy-to-understand and straightforward interaction. Therefore, they incorporate the 6-degree-of-freedom pose that is obtained through optical tracking, resulting in a one-finger interaction for precise selection of partly or fully occluded objects with high visual similarity. Thirdly, the novel software framework ARTiFICe is presented that facilitates the development of compelling mixed reality environments. It aims at minimizing the initial hurdles of application development as it is inexpensive and provides a powerful graphical interface to easily access and author tracking, interaction, visualization and distribution. With the presented contribution, we aim at leveraging the applicability of mixed reality into unconstrained everyday environments that are used by non-experts.

Kurzfassung

Mischrealitäten, als durch den Computer simulierte dreidimensionale Umgebungen, sind seit vielen Jahren Gegenstand der Forschung. In jüngster Zeit hat das Aufkommen von leistungsfähigen und kostengünstigen Recheneinheiten sowie Ein- und Ausgabegeräten zu gesteigerten Bemühungen geführt, Mischrealitäten verstärkt in Alltagssituationen einzusetzen. Dies jedoch führt zu einer Vielzahl von Herausforderungen, da für deren Entwicklung robuste Techniken und Kenntnisse in Lokalisation, Visualisierung, Interaktion und optional Verteilung erforderlich sind. Der wissenschaftliche Beitrag dieser Dissertation umfasst neue Techniken und Algorithmen für Lokalisation, Interaktion und Anwendungsentwicklung von Mischrealitäten.
Im ersten Teil dieser Arbeit wird ein neues Lokalisierungssystem vorgestellt, das in Innenräumen auf Distanzen von bis zu 90m die 3D Position von mit visuellen Markierungspunkten ausgestatteten Objekten bestimmen kann. Das System ist hierbei sowohl während Kamerakalibrierung wie auch Lokalisierung robust gegenüber visuellen Störeinflüssen der Umgebung, wie beispielsweise statischen und bewegten Lichtquellen sowie Verdeckungen. Der zweite Teil dieser Arbeit beschäftigt sich mit Mensch-Maschine-Interaktion in dreidimensionalen Systemen, speziell in mobilen Mischrealitäten mit berührungssensitiven Bildschirmen. Um die Schwächen bestehender 3D Interaktionstechniken zu beheben, werden die neuen Algorithmen DrillSample für Objektselektion sowie 3DTouch und HOMER-S für Objektmanipulation vorgestellt. Die drei Techniken zielen alle auf leicht verständliche und einfach zu bedienende Interaktionen ab, indem sie notwendige BenutzerInneneingaben reduzieren und die Position sowie Orientierung des mobilen Endgeräts miteinbeziehen. Dadurch können mit lediglich einem Finger präzise teilweise oder gänzlich verdeckte Objekte ausgewählt werden, und entweder ohne Finger oder mit einem bzw. zwei Fingern Objekte verschoben, gedreht und skaliert werden. Im dritten Teil dieser Arbeit wird das neue Softwareframework ARTiFICe vorgestellt, das die Erkenntnisse der ersten beiden Teile einbezieht und der einfachen Erstellung von hochwertigen Mischrealitäten dient. Es stellt der BenutzerIn hierfür eine übersichtliche Benutzeroberfläche zur Verfügung, über die man auf Techniken und Hardwareschnittstellen für Lokalisation, Interaktion, Visualisierung und Verteilung zugreifen kann. Der vorgestellte wissenschaftliche Beitrag zielt darauf ab, die Erstellung und Bedienung von Mischrealitäten im Alltag zu vereinfachen und zu fördern.

Contents

I Introduction
  1 Introduction to Mixed Reality
  2 Motivation & Contribution
    2.1 Resulting Publications
      2.1.1 Peer Reviewed
      2.1.2 Technical Reports
  3 Thesis Organization

II Wide-Area Optical Tracking
  1 Introduction
    1.1 Motivation & Problem Statement
    1.2 Research Objective
    1.3 Organization
  2 Theoretical Foundations
    2.1 Principles of Optical Tracking
      2.1.1 Accuracy & Performance
        2.1.1.1 Performance Measures
        2.1.1.2 Sources of Error
    2.2 Tracking Pipeline
      2.2.1 Feature Segmentation
        2.2.1.1 Natural Features
        2.2.1.2 Artificial Features
      2.2.2 Model Fitting
        2.2.2.1 2D Domain
        2.2.2.2 3D Domain
      2.2.3 Pose Estimation
    2.3 Projective Geometry
      2.3.1 The Pinhole Camera Model
      2.3.2 Camera Model Extensions
        2.3.2.1 Principal Point Offset
        2.3.2.2 Skew Parameter
        2.3.2.3 Camera Lens Distortions
        2.3.2.4 Camera Rotation & Translation
        2.3.2.5 Intrinsic & Extrinsic Camera Parameters
      2.3.3 Multiple-View Geometry
        2.3.3.1 Epipolar Geometry
        2.3.3.2 Stereo Correspondence Problem
        2.3.3.3 Computing the Camera Projection Matrix
        2.3.3.4 3D Point Reconstruction
      2.3.4 Camera Calibration
    2.4 Summary
  3 Related Work
    3.1 Radio Frequency & Ultra Sound
    3.2 Optical Tracking
    3.3 Laser Measurement Systems
  4 Methodology
    4.1 System Requirements
    4.2 Evaluation of Target Visibility
      4.2.1 Test Setup
      4.2.2 Test Results
    4.3 Methodological Approach
      4.3.1 Vision System
      4.3.2 Target Design Guidelines
      4.3.3 Calibration
        4.3.3.1 Intrinsic Calibration
        4.3.3.2 Extrinsic Calibration
      4.3.4 Interference Filtering
        4.3.4.1 Hardware-based Target Identification
        4.3.4.2 Software-based Target Identification
      4.3.5 3 Degree-Of-Freedom Tracking
      4.3.6 Occlusion Recovery
    4.4 System Development
      4.4.1 Hardware
      4.4.2 Software
      4.4.3 System Costs
  5 Experimental Results
    5.1 Test Platform
    5.2 Test Cases & Performance Measures
      5.2.1 Calibration Performance
      5.2.2 Tracking Performance
    5.3 Tracking for Mixed Reality
      5.3.1 Target Design
        5.3.1.1 Prototype
      5.3.2 Test Environment
      5.3.3 Model Training
      5.3.4 Camera Calibration
      5.3.5 3D Position Accuracy
      5.3.6 3D Position Stability
      5.3.7 Tracking Performance
    5.4 Hand-held Target Tracking for Tunneling
      5.4.1 Target Design
        5.4.1.1 Tracking Scenarios
      5.4.2 System Prototype
      5.4.3 Test Environment
      5.4.4 Model Training
      5.4.5 Camera Calibration
      5.4.6 Accuracy & Stability of 3D Position Estimation
      5.4.7 Tracking Performance
    5.5 Machine Tracking for Underground Guidance
      5.5.1 Shortcoming of Existing Technology
      5.5.2 Test Environment
      5.5.3 Target Design
        5.5.3.1 Evaluation of LED Range
        5.5.3.2 Target Prototype
      5.5.4 Model Training
      5.5.5 Camera Calibration
      5.5.6 Accuracy & Stability of 3D Position Estimation
        5.5.6.1 Influence of Vibrations
      5.5.7 Tracking Performance for Machine Guidance
        5.5.7.1 Tracking under normal Visibility
        5.5.7.2 Tracking with Occlusions and Poor Visibility
    5.6 Conclusion
  6 Summary

III User Interfaces for 3D Interaction
  1 Introduction
    1.1 Motivation & Problem Statement
    1.2 Research Objective
    1.3 Organization
  2 Theoretical Foundations & Related Work
    2.1 User Interfaces in Mixed Reality
      2.1.1 3D Interaction
        2.1.1.1 3D Selection and Manipulation Tasks
        2.1.1.2 3D Selection & Manipulation Metaphors
      2.1.2 3D Selection & Manipulation in Handheld Mixed Reality
    2.2 3D Object Selection
      2.2.1 Virtual Hand Metaphors
      2.2.2 Virtual Pointing Techniques
        2.2.2.1 One-Step Selection Techniques
        2.2.2.2 Two-Step Selection Techniques
    2.3 3D Object Manipulation
      2.3.1 For Immersive Environments
      2.3.2 For 2D Multi-Touch Devices
    2.4 Summary
  3 3D Selection in Handheld Mixed Reality
    3.1 Requirements
    3.2 Design Guidelines
    3.3 The DrillSample Technique
      3.3.1 Selection Design
      3.3.2 Mobile Raycasting
      3.3.3 Algorithm
      3.3.4 Crucial Aspects of the Algorithm
        3.3.4.1 Length of the DrillSample Ray
        3.3.4.2 Z-Position of the DrillSample
    3.4 Performance Studies
      3.4.1 Baseline Techniques
      3.4.2 Adaptions for Handheld Mixed Reality
      3.4.3 Objectives
      3.4.4 Experimental Design and Procedure
      3.4.5 Implementation
      3.4.6 Test Scenarios
    3.5 Experimental Results
      3.5.1 Quantitative Evaluation
        3.5.1.1 Performance Evaluation
      3.5.2 Subjective Evaluation
      3.5.3 Qualitative Evaluation
    3.6 Discussion
      3.6.1 Variations of the Algorithm
  4 3D Manipulation in Handheld Mixed Reality
    4.1 Methodological Approach
      4.1.1 Requirements & Prerequisites
      4.1.2 Design Guidelines
      4.1.3 The 3D Touch Technique
        4.1.3.1 Translation
        4.1.3.2 Rotation
        4.1.3.3 Scaling
      4.1.4 The HOMER-S Technique
        4.1.4.1 6DOF Manipulations
        4.1.4.2 Scaling
      4.1.5 Assistance Design
        4.1.5.1 Mode Switches
        4.1.5.2 Supporting Visualization
      4.1.6 Crucial Aspects
    4.2 Performance Studies
      4.2.1 Prerequisites
      4.2.2 Objectives
      4.2.3 Experimental Design and Procedure
      4.2.4 Subjects & Apparatus
      4.2.5 Test Scenarios
        4.2.5.1 Positioning on a Plane
        4.2.5.2 Positioning in 3D Space
        4.2.5.3 Positioning & Rotation in 3D Space
        4.2.5.4 Non-Uniform Scaling & Positioning in 3D Space
    4.3 Experimental Results
      4.3.1 Quantitative Evaluation
        4.3.1.1 Performance Evaluation
      4.3.2 Subjective Evaluation
    4.4 Discussion
  5 Summary

IV Creating Mixed Reality Environments
  1 Introduction
    1.1 Motivation
    1.2 Organization
  2 Background & Related Work
    2.1 Key Elements of a Mixed Reality Framework
    2.2 Application Development & Scene Management
  3 Framework Architecture
    3.1 Base Infrastructure
      3.1.1 Functionalities of Unity
      3.1.2 Core Concepts of Unity
    3.2 Middleware
      3.2.1 OpenTracker
      3.2.2 Vuforia
      3.2.3 Supported Setups & Hardware
        3.2.3.1 Desktop Mixed Reality
      3.2.4 (Semi) Immersive Mixed Reality
        3.2.4.1 Handheld Mixed Reality
    3.3 Application Layer
      3.3.1 The ARTiFICe Manager
        3.3.1.1 Tracking Module
        3.3.1.2 Interaction Module
        3.3.1.3 Collaboration & Distribution
    3.4 Workflow for Application Development
  4 Developed Mixed Reality Environments
    4.1 Test Setups & Environment
    4.2 Non-Immersive Mixed Reality
      4.2.1 Single & Multi-User Desktop Mixed Reality
      4.2.2 Multi-User Handheld Mixed Reality
    4.3 Combined Non- & Semi-Immersive Mixed Reality
    4.4 Combined Semi- & Full Immersive Mixed Reality
  5 Summary

V Conclusion
  1 Findings & Outlook
    1.1 Wide-Area Optical Tracking
      1.1.1 Open Topics
    1.2 3D Interaction
      1.2.1 Open Topics
    1.3 Creating Mixed Reality Environments
      1.3.1 Open Topics
VI Appendix
  Bibliography
  List of Figures
  List of Tables
  A User Studies

PART I
Introduction

  1 Introduction to Mixed Reality
  2 Motivation & Contribution
  3 Thesis Organization

Chapter 1
Introduction to Mixed Reality

The generation of computer simulated environments that combine virtual and real content has been a focus of research for many years. It is now gaining particular importance with the emergence of powerful, low-cost input and output devices as well as processing platforms that foster the applicability of virtual simulations for everyday usage. Typical application domains for such systems are training, therapy, education and entertainment [98]. The combination of real and virtual content is referred to as Mixed Reality and can be defined as a computer generated 3D simulation with different levels of blending of real and virtual scene objects. These levels are described by the Milgram Continuum [19] that encompasses all possible variations and compositions of real and virtual objects, as depicted in Figure 1.1.

Figure 1.1: The Milgram continuum describing the variations of mixed reality.

While Reality shows a real environment where no augmentation with virtual objects occurs, the observed environment in Augmented Reality mostly consists of real objects that are augmented with a few virtual objects. Augmented Virtuality consists of mostly virtual objects that are augmented with a few real objects, while Virtual Reality completely locks out the real world and only displays virtual objects in the observed environment. Each state of the Milgram continuum can be further categorized depending on the provided amount of immersion, which correlates with the involved input and output devices. A Non-Immersive system mostly consists of a non-stereoscopic screen and 2D discrete user interfaces, such as mouse and keyboard. The user views the virtual scene through the output device, which acts as a window into the virtual world. Thereby, the user is fully aware of the reality that surrounds him. Examples are desktop setups that provide a stationary view into the virtual scene, and handheld mixed reality that allows the user to change the viewpoint by moving the mobile device. Semi-Immersive systems provide an increased amount of immersion by enabling stereoscopic viewing and 3D interaction. This is usually achieved by employing stereo projection walls that are viewed through tracked shutter glasses. The user typically can walk freely in front of the wall, and the involved interaction devices allow for interaction in 3D. Although the user cannot fully immerse into the mixed reality environment, as it does not entirely surround him, the amount of immersion is increased by stereoscopic viewing, natural walking and 3D object interaction. Fully Immersive setups incorporate head mounted displays as well as portable 3D interaction devices, enabling the user to move freely throughout the entire tracking space. For an in-depth review of the different flavors of mixed reality, the reader is kindly referred to [83].

Figure 1.2: Components of a mixed reality system.

The creation of compelling mixed reality environments is built upon the mandatory key components tracking, visualization and interaction, and the non-obligatory module distribution.
Tracking of users as well as of interaction devices is necessary to allow an egocentric scene view and to enable Interaction between the user and the virtual environment; Visualization is required to render the entire 3D scene on an output device, such as a screen, a projection wall or a head mounted display. In addition, Distribution of the scene objects and of the user's interactions allows for a remote mixed reality setup, engaging one or more users to view and interact collaboratively with the virtual simulation. To create, maintain and deploy the mixed reality application, an Authoring module that interfaces with the four mentioned components is a valuable asset, especially for non-experts. It provides means to manage the 3D scene and to set up the entire system before deployment. The components of a mixed reality system are illustrated in Figure 1.2.

This thesis focuses on the development of novel techniques to contribute to the solution of fundamental problems in the areas of tracking, interaction, and mixed reality application development.

Chapter 2
Motivation & Contribution

The major objective of this thesis is the development of novel techniques and systems to leverage the applicability of mixed reality into unconstrained everyday environments that are used by non-experts.

Figure 2.1: Investigated concepts, their relationship and the presented contribution.

For this purpose, we investigated concepts in the areas of tracking, interaction and mixed reality application development, as depicted in Figure 2.1, which resulted in the following contributions.

1. A novel optical tracking system with enhanced robustness against environmental interferences and extended volume coverage that requires a minimal amount of vision hardware.

2. Novel techniques for 3D selection and manipulation that employ the 3D position and orientation of the input device and incorporate real-world metaphors to highly simplify the necessary user input.

3. A novel software framework to develop collaborative and distributed mixed reality applications. It features a powerful graphical user interface for authoring and supports a large number of off-the-shelf input as well as output devices.

Tracking systems determine the position and orientation of an object in space, such as the user's head mounted display or an interaction device. A large number of different tracking technologies exist, and each method has its advantages and disadvantages regarding volume coverage, tracking accuracy, sensitivity to interferences as well as scalability. Thus, there is no general tracking technology that perfectly suits all variations of tracking scenarios. Infrared optical tracking detects targets within camera images in the near infrared spectrum. This technology has been found to be fast, accurate as well as scalable to a certain extent, and is widely used to provide tracking in mixed reality applications. However, state-of-the-art systems suffer from sensitivity to ambient interfering lights during calibration and tracking; furthermore, they only cover standard room-sized environments with a small amount of vision hardware. This results in a lack of tracking support for wide, unconstrained indoor environments and leads to high hardware costs as well as complex setup and maintenance routines when the tracking volume is extended. Thereby, such systems are impractical for everyday usage, especially for non-experts.
To overcome these limitations, a system for model-based optical 3D position tracking of rigid-body targets is presented. The proposed system is capable of covering wide, unconstrained indoor volumes and provides robust calibration and tracking while requiring a minimal hardware setup of two cameras. The experimental results demonstrate the system's capabilities to act as a mixed reality tracking system as well as a general purpose measurement tool for future (underground) surveying tasks, such as autonomous machine guidance. It was successfully applied in three different unconstrained wide-area indoor environments, providing relative millimeter point accuracy up to 30m and centimeter deviation up to 90m. These results clearly improve upon state-of-the-art systems and reveal the system's applicability to use cases that go beyond mixed reality scenarios.

As described, tracking is a fundamental building block of a mixed reality system and is the technological foundation that enables interaction with a virtual 3D scene through the involved interaction devices. Therefore, it is applied to investigate novel techniques to provide intuitive interaction between the user and the 3D simulation. Intuitive interaction can be defined as a means that enables users to interact with a scene object using their real-world knowledge for selection and object manipulation. In a handheld mixed reality system, a user typically holds a portable device in one hand to view the scene on the display, which shows a live camera image that is augmented with virtual scene objects. Throughout this thesis, the handheld device refers to a smartphone with a touch sensitive display that can simultaneously detect multiple finger inputs. The user's second hand interacts with the scene objects using the multi-touch input. However, two problems arise in such a situation: the imprecise finger touch input for selection makes it highly likely that small objects are selected inaccurately, especially when they are partly or fully occluded or surrounded by highly similar virtual scene objects. For object manipulation, such as translating, rotating and scaling, existing methods use complex multi-finger gestures to provide full 3D manipulations. However, most of these gestures are difficult or impossible to apply in a one-handed setup, and their usage additionally requires prior knowledge. To address the shortcomings of state-of-the-art 3D selection and manipulation techniques, three novel methods are proposed and evaluated in user studies. Firstly, the 3D selection technique DrillSample is described, which only requires single touch inputs. Upon selection of multiple objects, the user can indicate the desired object in a refinement step that presents the objects in their original spatial context. Thereby, it allows the user to precisely disambiguate between objects with high similarity in visual appearance and enables the selection of strongly or entirely occluded objects. For a comprehensive evaluation of the DrillSample selection technique, a summative evaluation was conducted by comparing DrillSample with two baseline techniques across three different selection scenarios based on variations of object density and visibility. As demonstrated by the study results, DrillSample overall outperforms the state-of-the-art baseline methods and was found to be the best general purpose selection method for visible as well as partly and fully occluded objects, independent of their visual appearance.
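To illustrate the selection idea, the following is a minimal sketch under simplifying assumptions (spherical object bounds, a given selection ray); it is not the DrillSample implementation detailed in Part III, and all names are illustrative. A ray cast from the touch point collects every pierced object in its original depth order, so a second, refining step can disambiguate occluded or visually similar candidates.

```python
import numpy as np

def drill_sample_candidates(ray_origin, ray_dir, objects):
    """Collect every object pierced by the selection ray, ordered by depth.

    'objects' is a list of (name, center, radius) tuples approximated as spheres.
    Returning all hits in depth order preserves the spatial context for the
    subsequent refinement step.
    """
    ray_dir = ray_dir / np.linalg.norm(ray_dir)
    hits = []
    for name, center, radius in objects:
        oc = center - ray_origin
        t = np.dot(oc, ray_dir)              # distance of closest approach along the ray
        if t < 0:
            continue                         # object lies behind the viewer
        d2 = np.dot(oc, oc) - t * t          # squared distance from center to the ray
        if d2 <= radius * radius:
            hits.append((t, name))
    return [name for t, name in sorted(hits)]

# Illustrative usage: three nearly identical objects stacked along the viewing ray.
scene = [("front", np.array([0.0, 0.0, 1.0]), 0.2),
         ("middle", np.array([0.0, 0.05, 2.0]), 0.2),
         ("back", np.array([0.0, 0.0, 3.0]), 0.2)]
print(drill_sample_candidates(np.zeros(3), np.array([0.0, 0.0, 1.0]), scene))
# ['front', 'middle', 'back'] -- all pierced objects, including fully occluded ones
```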
To overcome shortcomings of state-of-the-art 3D manipulation techniques using 2D multi-touch input, the two novel methods 3DTouch and HOMER-S are proposed. Both support the rigid spatial manipulations translation and rotation as well as the non-rigid manipulation scaling. 3DTouch provides 3D translation and rotation as well as non-uniform scaling by fusing one- or two-finger touch input with the handheld's 6-degree-of-freedom (DOF) pose that is obtained using optical tracking. The integral 6DOF manipulation is decomposed into two separate tasks, so that a single touch input is sufficient to access all three DOF during translation and rotation. A two-finger pinch gesture allows for non-uniform scaling in 3D. HOMER-S provides interaction beyond the (limited) screen dimensions by decoupling the manipulation process from any touch input. It aims at DOF integration and maps the 6DOF device pose onto the object upon selection. Thereby, full 6DOF manipulation as well as non-uniform scaling is performed by employing real-world metaphors. In a comprehensive user study, the performance, accuracy and ease-of-use of both techniques are assessed across four different test scenarios with varying manipulation tasks. The results reveal both techniques to be intuitive for translating and rotating objects. HOMER-S lacks accuracy compared to 3DTouch but achieves a significant performance increase in terms of speed for full 6DOF manipulations.

While tracking and interaction are two key components to develop a mixed reality simulation, a crucial factor to leverage mixed reality for everyday usage is quick application prototyping and development. Since creating a mixed reality application requires knowledge in all of the building blocks depicted in Figure 1.2, the result is a high entry threshold for development. At the time mixed reality frameworks were investigated for this thesis, there were no inexpensive toolkits available that provided interfaces to extend the framework with novel techniques for tracking and interaction and that featured a powerful graphical authoring component. This technological gap fostered the development of the cost-efficient software framework ARTiFICe, which enables quick prototyping of collaborative and distributed mixed reality environments. It features a loosely coupled, modular software architecture that overcomes limitations of state-of-the-art frameworks regarding costs, usability and extensibility. ARTiFICe provides tracking data from several input devices and offers a number of built-in interaction methods, including the novel techniques of this thesis. It enables multi-user collaboration in distributed virtual scenes and incorporates recently emerged, popular off-the-shelf input devices, such as Microsoft Kinect, Razer Hydra and mobile phones running Android and iOS. The framework was employed for proof-of-concept application development to evaluate the investigated concepts of this thesis. Furthermore, ARTiFICe was used by more than 100 students during their university graduate program who were not familiar with mixed reality technology before. It allowed them to develop distributed applications within just a couple of weeks that incorporated different tracking devices as well as interaction techniques. These results indicate that ARTiFICe can act as a foundation to further leverage the simplification of application development and thereby the pervasiveness of mixed reality.
2.1 Resulting Publications

The work presented in this thesis has appeared in the following publications:

2.1.1 Peer Reviewed

[1] Annette Mossel, Christian Schönauer, Georg Gerstweiler, and Hannes Kaufmann. ARTiFICe - Augmented Reality Framework for Distributed Collaboration. In: Presented at Workshop on Off-The-Shelf Virtual Reality, IEEE VR, USA, 2012, published in International Journal of Virtual Reality 11.3 (2012), pp. 1-7.

[2] Annette Mossel, Georg Gerstweiler, Emanuel Vonach, Klaus Chmelina, and Hannes Kaufmann. Robust Long-Range Optical Tracking for Tunneling Measurement Tasks. In: European Geosciences Union - General Assembly 2013. Vol. 15. Vienna, Austria: Geophysical Research Abstracts, 2013, p. 1.

[3] Annette Mossel and Hannes Kaufmann. Wide Area Optical User Tracking in Unconstrained Indoor Environments. In: Proceedings of the 23rd International Conference on Artificial Reality and Telexistence (ICAT). Tokyo, Japan: IEEE, 2013, pp. 108-115.

[4] Annette Mossel, Benjamin Venditti, and Hannes Kaufmann. 3DTouch & HOMER-S: Intuitive Manipulation for One-Handed Handheld AR. In: Proceedings of the Virtual Reality International Conference on Laval Virtual (VRIC '13). Laval, France: ACM Press, 2013, pp. 1-10. isbn: 9781450318754.

[5] Annette Mossel, Benjamin Venditti, and Hannes Kaufmann. DrillSample: Precise Selection in Dense Handheld Augmented Reality Environments. In: Proceedings of the 15th Int. Conf. of Virtual Technologies (VRIC '13). Vol. 00. Laval, France: ACM Press, 2013, p. 10. isbn: 9781450318754.

[6] Annette Mossel, Georg Gerstweiler, Emanuel Vonach, Klaus Chmelina, and Hannes Kaufmann. Vision-based Long-Range 3D Tracking, applied for Underground Surveying Tasks. In: Journal of Applied Geodesy 8.1 (2014), pp. 43-64.

2.1.2 Technical Reports

[1] Annette Mossel, Thomas Pintaric, and Hannes Kaufmann. Analyse der Machbarkeit und des Innovationspotentials der Anwendung der Technologie des Optical Real-Time Trackings für Aufgaben der Tunnelvortriebsvermessung. Tech. rep. Austria: Institute of Software Technology and Interactive Systems, Vienna University of Technology, 2008.

[2] Klaus Chmelina, Egmont Lammer, Annette Mossel, and Hannes Kaufmann. Real-Time Machine Guidance with Tracking Cameras. In: Proceedings of Aachen International Mining Symposia (AIMS). Aachen, Germany, 2014.

[3] Klaus Chmelina, Annette Mossel, and Hannes Kaufmann. Echtzeitvermessung mit Infrarottrackingkameras - Untersuchung einer neuen Messtechnik für untertage. In: Proceedings of 17. Internationaler Ingenieurvermessungskurs. Zürich, Switzerland: Herbert Wichmann-Verlag, Offenbach/Berlin, 2014.

Chapter 3
Thesis Organization

The organization of this thesis follows the identified key components of a mixed reality system, as shown in Figure 2.1. It presents the performed research in three parts.

Part II focuses on wide-area optical tracking in unconstrained indoor environments. After reviewing the principles of optical tracking and multi-view imaging, competing state-of-the-art tracking systems are discussed and compared. The background chapters are followed by the methodological approach that describes the theoretical principles that were investigated and developed to build the proposed robust wide-area tracking system. The system's prototype is then evaluated in depth by testing it in three different use cases: 1) user tracking in a mixed reality setup, 2) handheld target tracking for a tunneling application and 3) tracking for machine guidance in underground environments.
Finally, a summary presents findings and concludes this part.

The work on optical tracking is followed by Part III, which presents the investigated concepts and developed algorithms for 3D object selection and manipulation in a one-handed handheld mixed reality environment. In the first chapter of this part, theoretical foundations of 3D selection and manipulation are given and state-of-the-art techniques are reviewed and discussed. Next, the methodological approach of the novel selection technique is described and the results of the conducted user study are presented. After the study on object selection, two novel approaches for object manipulation are described. Both techniques are examined by a comparative user study and the results are statistically evaluated and discussed. Finally, conclusions on the novel techniques are given and the investigated concepts of this part are summarized.

Part IV presents a software framework that enables the development of collaborative, multi-user, distributed mixed reality applications by integrating different hardware devices, tracking technologies and interaction metaphors. After giving an overview of related work, the design approach of the proposed framework is described. Next, the capabilities of the framework are evaluated by developing example applications that support various input devices and encompass different setups of mixed reality, ranging from desktop, handheld and semi-immersive to fully immersive compositions of real and virtual objects.

Finally, in Part V the author summarizes the thesis and the presented contributions, and discusses open topics in the context of the investigated topics in mixed reality.

PART II
Wide-Area Optical Tracking

  1 Introduction
    1.1 Motivation & Problem Statement
    1.2 Research Objective
    1.3 Organization
  2 Theoretical Foundations
    2.1 Principles of Optical Tracking
    2.2 Tracking Pipeline
    2.3 Projective Geometry
    2.4 Summary
  3 Related Work
    3.1 Radio Frequency & Ultra Sound
    3.2 Optical Tracking
    3.3 Laser Measurement Systems
  4 Methodology
    4.1 System Requirements
    4.2 Evaluation of Target Visibility
    4.3 Methodological Approach
    4.4 System Development
  5 Experimental Results
    5.1 Test Platform
    5.2 Test Cases & Performance Measures
    5.3 Tracking for Mixed Reality
    5.4 Hand-held Target Tracking for Tunneling
    5.5 Machine Tracking for Underground Guidance
    5.6 Conclusion
  6 Summary

Chapter 1
Introduction

In mixed reality environments, accurate and fast tracking of arbitrary points, such as the user's head and hand, is crucial for creating a compelling virtual environment that provides seamless interaction. A number of tracking technologies and approaches exist, as depicted in Figure 1.1.

Figure 1.1: Tracking approaches, with the field of contribution marked bold.

All of them have their advantages and disadvantages regarding volume coverage, tracking accuracy, sensitivity to interferences as well as scalability; thus, there is no general tracking technology that suits every tracking scenario perfectly. The focus of this thesis is optical tracking; therefore, this technology will be discussed and the contribution in this field will be presented within this part of the thesis. For an in-depth discussion of the other tracking technologies, the reader is kindly referred to [83].

1.1 Motivation & Problem Statement

Optical tracking has been proven to be a reliable alternative to competing tracking technologies since it is less susceptible to noise, it allows multiple objects to be tracked simultaneously, trackable optical markers can be individually designed, they are lightweight, re-configurable and wireless, and an optical tracking system can cover large areas. However, state-of-the-art optical tracking systems are mostly designed for standard room-sized environments or require a large number of vision sensors (cameras) to cover larger volumes while keeping the precision high. This yields significant hardware costs as well as complex setup and maintenance routines, making them impractical for general use, especially for non-experts. Thus, low-cost wide-area tracking with high precision remains a challenge but is indispensable to lower the costs of building compelling immersive virtual environments. The increasing demand for such systems is indicated by the success of recently emerged low-cost hardware, such as the head mounted display Oculus Rift, the Razer Hydra for 3D interaction as well as the Microsoft Kinect for full body motion capture. They massively lowered the initial costs to build a fully immersive VE, but only for small tracking volumes. Furthermore, state-of-the-art optical tracking systems are sensitive to environmental interferences such as lights and reflections, especially during target training and camera calibration. This yields limited usability in everyday tracking scenarios as well as error-prone tracking results. Hence, the further employment of virtual reality in applications located in unconstrained environments, such as rooms with wall illumination, entertainment stages, manufacturing workshops or even construction sites, is impeded by the following three limitations: 1) tracking coverage, 2) system sensitivity and 3) system scalability & costs.

1.2 Research Objective

To overcome the limitations of state-of-the-art optical tracking technology, the following research objectives have been defined. Firstly, a thorough evaluation of existing methods, algorithms and hardware systems is conducted to analyze the requirements for a wide-area tracking system for unconstrained environments. Next, existing methods have to be tested, extended and then integrated into a novel system to allow for camera calibration and tracking under heavy interferences.
Finally, the system is required to be evaluated in real-life scenarios to draw a robust conclusion on its capabilities, limitations and possible application scenarios.

1.3 Organization

This part is organized as follows. In Chapter II.2, the optical tracking problem is defined, the theory of multi-view imaging to solve the optical tracking problem is discussed and the most common recognition methods are reviewed. In Chapter II.3, competing state-of-the-art tracking approaches for 3D position estimation in indoor environments are reviewed and compared. In Chapter II.4, a description of the developed robust wide-area tracking system is given, and its capabilities and accuracy are evaluated in Chapter II.5 within three different test scenarios: 1) user tracking in a mixed reality setup, 2) handheld target tracking for a tunneling application and 3) tracking for machine guidance in underground environments. Finally, Chapter II.6 gives conclusions.

Chapter 2
Theoretical Foundations

In this chapter, we describe the fundamental theoretical concepts of optical tracking.

2.1 Principles of Optical Tracking

The term tracking refers to the technology that first detects and then tracks arbitrary features in space over time in order to determine the position as well as orientation of the tracker, which is the object that observes these features. In optical tracking, the tracker object is an imaging device, such as a mono, color or depth-sensing camera.

Pose Tracking In a three-dimensional tracking space, 3D position and orientation can be estimated, constituting a 6 degrees of freedom (DOF) pose of the tracker [62, 83]. 6DOF pose determination is fundamental for view-dependent visualization as well as 3D interaction; thus, it is the crucial underlying technology for a mixed reality system.

Tracking Scenarios In an Outside-Looking-In tracking scenario, the tracker is fixed and observes a scene to track features (see Section 2.2.1.2) that are attached to an arbitrary object, such as a user. On the contrary, in an Inside-Looking-Out scenario, the tracker is attached to the tracked object and observes and tracks fixed features [62].

2.1.1 Accuracy & Performance

The overall capabilities of a tracking system can be expressed by the performance measures described in the following Section 2.1.1.1. The system's performance is thereby influenced by various internal and external sources of error, as specified in Section 2.1.1.2.

2.1.1.1 Performance Measures

Latency describes the time delay between a change in tracker pose and the moment the system has estimated and outputs the new tracker pose [62, 83]. It involves
To avoid error prone tracking results, it must be periodically zeroed by using a secondary tracker of a type that does not have a drift [62, 83]. In case of an optical tracking system, Tracker Jitter decreases with increasing the imaging sensor resolution as well as decreasing the distance between tracker and observed feature. Tracker Drift can be decreased to zero if position and orientation are estimated with every new incoming image frame. Robustness expresses the capabilities of the tracking system to uniquely identify the tracker object and to correctly estimate its pose [85]. Robustness relies on the system's ability to deal with the various sources of error, on a proper hardware setup for the intended tracking volume and on a properly designed tracker target model. 2.1.1.2 Sources of Error Optical tracking systems are very sensitive to the reliability of their inputs. According to [57], overall lighting conditions and estimated camera model (see Section 2.3.4) are two sources of errors. The ndings of [57] can be extended and furthermore split into internal and external sources of errors. Internal sources of error encompassed errors that are implicitly given in optical tracking due to the underlying sensor hardware and data processing. External sources of errors are caused by external circumstances that are present in the tracking volume. An optical tracking system has to cope with the following internal sources of error. Optical Aberrations & Camera Model Optical tracking systems require a precise estimation of the camera model's parameter to allow for accurate 3D point com- putation. The intrinsic camera parameters are required to provide a correct per- spective transformation between points in 3D space and points in the 2D camera plane. Since every object lens has (at least minimal) optical aberration that results in distorted camera images, theses distortions can be minimized by applying the intrinsic image distortion (radial and tangential) coecients. The extrinsic cam- era parameters describe the spatial relationship of the tracking system's cameras that encompasses position and orientation; the parameters highly inuences the accuracy of the 6DOF Pose Estimation (see Section 2.2). 20 2.2 Tracking Pipeline Image Processing Aberration The target model points must be robustly and pre- cisely segmented within all camera images. Since an image sensor consists of dis- crete pixels, rasterization causes inaccuracies during Feature Segmentation (see Section 2.2). The magnitude of rasterization artifacts depends on imaging sen- sor resolution, sensor noise as well as tracking distance. Thus, depending on the intended tracking coverage, imaging hardware must be properly selected. Sensor Noise Thermal deviation inuences the amount of noise on the image sensor and causes jitter on the image. Depending on pixel size and density, the sensor temperature and thus jitter can increase. High sensor noise decreases the quality of feature segmentation. In addition to the internal sources of error, the following external factors can reduce the performance of the tracking system. Interfering Lights Various light sources, such as sun light, wall illumination and mov- ing light sources can exist in an everyday optical tracking scenario. They can massively interfere with the estimation of the camera model as well as the unique identication of the target model during tracking, resulting in inaccurate pose es- timates. 
Occlusion Partially occluded target models can result in a complete loss of tracking, or can lead to inaccurate and false-positive Feature Segmentation and hence Pose Estimation.

Target Model The applied target model must be properly designed depending on the intended tracking system coverage to allow for accurate Model Fitting and Pose Estimation (see Section 2.2). Inaccurate target models result in systematic pose estimation errors.

2.2 Tracking Pipeline

Figure 2.1 shows the optical tracking pipeline that processes the incoming images (frames) to provide the target's 6DOF pose to the system. As illustrated in Figure 2.1, the pipeline consists of the following four main sub-tasks:

Feature Segmentation To detect and segment the observed optical feature in a camera image, image processing techniques are applied. They depend on the used optical feature (see Section 2.2.1). To account for image aberrations, the underlying Camera Model (see Section 2.3.4) is incorporated into the segmentation process.

Model Fitting To determine the correspondence between the segmented 2D features and the underlying Target Model, a fitting routine is performed based on the tracking model properties. Depending on the model, the fitting is performed in 2D or in 3D; for 3D model fitting, the camera model is integrated to transform 2D feature points back into 3D space.
A wide variety of algorithms exists for feature detection, description, and matching. Prominent examples of feature detectors are the Harris Corner detector [12] and FAST (Features from Accelerated Segment Test) [74, 110]. Popular methods that comprise feature detection, description, and matching are SIFT (Scale-Invariant Feature Transform) [36] and SURF (Speeded Up Robust Features) [88], which outperforms SIFT in terms of speed and robustness against different image transformations, as claimed by its authors. Another well known feature descriptor is BRIEF (Binary Robust Independent Elementary Features) [105], which targets real-time applications and allows running feature point matching at low computational cost and memory load. Although it has weaknesses for robust matching in case of large changes in rotation and scale, it performs faster for feature description calculation and matching compared to SIFT. Another fair alternative to SIFT and SURF in terms of computational cost and matching performance is ORB (Oriented FAST and Rotated BRIEF) [117], which is rotation invariant and resistant to noise. Internally, it uses FAST for feature detection and a modified BRIEF descriptor to enhance the performance. The recently presented descriptor FREAK (Fast Retina Keypoint) [118] is computationally more efficient, computes faster, has a lower memory load and is also more robust than SIFT and SURF. Thereby, it is a competitive alternative, in particular for embedded applications.

Choosing an adequate feature, and thus an appropriate detector, descriptor and matcher, heavily depends on the given application scenario and its requirements. In general, feature descriptors are too slow to be applied in applications that require high update rates, such as real-time tracking. Therefore, solely applying a feature detector such as FAST is a good choice to estimate the 6DOF pose in an Inside-Looking-Out tracking scenario, as it is computationally efficient and provides a high number of detected features. In application scenarios such as the estimation of external parameters for camera calibration (see Section 2.3.4), which do not necessarily require real-time performance but highly robust features, more computationally complex algorithms might be employed. However, the computation of any kind of natural features requires sufficient illumination and distinct geometrical structure within the observed environment; non-textured surfaces, repeating structures, glass as well as poor illumination yield few, no or unstable features. In case of tracking, this leads to error-prone results or even loss of tracking. As our intended tracking environments might not necessarily offer constant illumination and distinct geometrical structures, we focus in this thesis on optical tracking using artificial features.

2.2.1.2 Artificial Features

Artificial feature tracking is based on the detection of predefined, prominent features that are inserted into the tracking volume. These features are then considered as optical markers that need to be detected for tracking. Due to the prior knowledge of their properties and their distinctive visual appearance, it is more likely that the tracking system is able to detect them with increased robustness, accuracy and speed. They can either be arranged on a planar surface (see Section 2.2.2.1) or consist of spherical optical markers whose 2D representations in the camera image are circles whose centroids are computed.
If multiple markers are rigidly grouped together, they form a Rigid Body Target that can be used for Model Fitting (see Section 2.2.2). An optical marker can contain a known pattern, it can have a specific shape or color, and it can be retro-reflective or light emitting, as illustrated in Figure 2.2.

Figure 2.2: Types of optical markers: (a) passive, (b) active.

Passive markers reflect infrared light that is strobed into the tracking volume back to the camera, while active markers directly emit light towards a camera. Passive markers require a special retro-reflective surface coating as well as an additional light emitter to illuminate the whole tracking volume, while in case of active markers, multiple light emitting diodes must be individually powered. Spherically shaped optical markers result in circular pixel blobs (Blobs) in the camera image whose centroids are computed for model fitting and pose recognition.

2.2.2 Model Fitting

The process of Model Fitting describes the problem of determining the correspondences between the detected 2D image features and the optical features of the tracked object. It can be accomplished by matching and fitting to the underlying Target Model that describes the structure of the features on the tracked object.

Figure 2.3: Taxonomy of model fitting depending on domain and property.

As depicted in Figure 2.3, methods for model fitting can be divided into techniques that are applied either in the 2D or in the 3D domain.

2.2.2.1 2D Domain

The target model of the tracked objects can be completely identified in 2D by processing the imaging data from a single camera. This is generally accomplished by exploiting properties that are invariant under perspective projection. There are a number of such approaches, namely the Cross Ratio, Graph Topology and Planar Bitmap Targets, which all share the idea of projective invariant properties. The three approaches are briefly described in Table 2.1.

Cross Ratio: When projecting 3D points onto a 2D camera plane, neither distances nor ratios of distances are preserved [56]. However, the Cross Ratio, as a ratio of distances, as well as the collinearity of point sets is preserved [17].

Graph Topology: When projecting a 2D graph structure as depicted in the figure (source: [86]) onto a camera image plane, its topology remains constant, as long as the parts of the graph do not overlap. Then, model fitting and pose estimation can be performed, as proposed in [76, 86].

Planar Bitmap Systems: Planar bitmap systems, such as [35, 163], encode information into a bitmap that can be retrieved after perspective projection. Using a planar pattern, optical aberrations can be accurately removed for robust pattern recognition by using correlation techniques.

Table 2.1: Projective invariant features in the 2D domain.

Graph topology and planar bitmap patterns are useful for many applications. While the binary patterns of ARToolkit and Vuforia [35, 163] must be fully visible and cannot cope with occlusions, ARTag [63] introduced an error correcting code as bitmap to reduce the occlusion problem. Graph topology [76] is more robust and can cope with partial occlusions. However, to detect targets based on graph topology or planar bitmap patterns at larger distances, large targets would be required. This reduces usability and increases manufacturing effort. In contrast to planar targets, target models that exploit the cross ratio of their markers can be designed more flexibly since only a minimum of four points is required.
The tracking system that is presented in Chapter II.4 accomplishes model fitting by evaluating the cross ratio. Therefore, the underlying approach is described in detail in the next paragraph.

Cross Ratio As described in [54, 85], the Cross Ratio, as a ratio of ratios of distances, can be computed based on four collinear points, labeled as A, B, C, D.

Figure 2.4: After perspective projection of the four points, the projective invariant properties of the cross ratio are expressed by $\lambda(A,B,C,D) \,\hat{=}\, \lambda(A',B',C',D')$. The points' collinearity is preserved as well, as $l \,\hat{=}\, l'$.

The cross ratio is defined as the real number $\lambda$ by

\lambda = \frac{|AB| / |BD|}{|AC| / |CD|}, \qquad (2.1)

where $|AB|$ denotes the length of the line segment between points A and B. Its projective invariant properties are illustrated in Figure 2.4. As can be seen in Equation 2.1, the computation of the cross ratio depends on the order of the four points (quadruple), resulting in $4! = 24$ possible orderings. Instead of comparing the cross ratio of the detected features during model fitting with all possible permutations, p2-invariants according to [30] can be computed. These are representations of point sets that are insensitive to projective transformations and to permutations of the labeling of the quadruple. These p2-invariants use $\lambda$ as argument for the projective and permutation invariant function $J(\lambda)$ that is determined as follows:

J(\lambda) = J_2(\lambda) / J_1(\lambda), \qquad (2.2)

where $J_1$, $J_2$ are the functions of $\lambda$ denoted in Equation 2.3:

J_1(\lambda) = \frac{\lambda^6 - 3\lambda^5 + 3\lambda^4 - \lambda^3 + 3\lambda^2 - 3\lambda + 1}{\lambda^2(\lambda - 1)^2} \qquad (2.3)

J_2(\lambda) = \frac{2\lambda^6 - 6\lambda^5 + 9\lambda^4 - 8\lambda^3 + 9\lambda^2 - 6\lambda + 2}{\lambda^2(\lambda - 1)^2}

Training Before tracking, the properties of the pattern i must be obtained once during a training phase. Therefore, the target's points are detected in the camera image and checked for collinearity. However, collinearity and cross ratio are sensitive to noise (see Section 2.1.1.2) that influences the accuracy of point segmentation. To account for noise when computing the points' collinearity, the following metric, as introduced in [14] and further described in [54], is used for determining the collinearity of three points (triple): for a triple of homogeneous points $p_1, p_2, p_3$, where $p_j = (x, y, 1)$, $j = 1, \dots, 3$, define a moment matrix $M_{123} = \sum p_i p_i^T$ and calculate its smallest eigenvalue $ev_{123}$. For three "perfectly" collinear points, $ev_{123} = 0$, indicating their linear dependency. If the point coordinate computation is influenced by noise, $ev_{123} \neq 0$, but it still provides an approximation of the three points' collinearity. During training, the smallest eigenvalue of the moment matrix is calculated for all three triples of the quadruple and the maximum smallest eigenvalue $ev^{\max}_i$ of all triples is stored. To account as well for noise during the p2-invariant calculation, the minimum $J^{\min}_i$ and maximum $J^{\max}_i$ values of the pattern's p2-invariant are stored, denoted as

p^2_{range} = [J^{\min}_i, J^{\max}_i]. \qquad (2.4)

Summarizing, the pattern's p2-invariant properties encompass $ev^{\max}_i$ and $p^2_{range}$, which are subsequently used to determine the target at the model recognition stage.

Model Recognition During tracking, model fitting is performed by employing the following two steps.

1) For each detected quadruple $Q_j$, compute the maximum smallest eigenvalue $ev^{\max}_j$ and perform a collinearity check to find all possible quadruple candidates $Q_{cand,1} \dots Q_{cand,n}$, by

Q_j = \begin{cases} Q_{cand,n} & \text{if } ev^{\max}_j \leq ev^{\max}_i \\ \emptyset & \text{otherwise} \end{cases}

2) For each candidate $Q_{cand,1} \dots Q_{cand,n}$, compute its p2-invariant $p^2_{cand,n}$ and perform the $p^2_{range}$ test to identify the quadruple of the target model $Q_{model}$, by

Q_{cand,n} = \begin{cases} Q_{model} & \text{if } J^{\min}_i \leq p^2_{cand,n} \leq J^{\max}_i \\ \emptyset & \text{otherwise} \end{cases}
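The training and recognition quantities can be sketched in a few lines of Python. The following is a minimal illustration based on Equations 2.1 to 2.4, assuming the four blob centroids are given as 2D pixel coordinates; the function names and the set of evaluated triples are illustrative choices, not the thesis implementation.

```python
import numpy as np
from itertools import combinations

def collinearity_ev(p1, p2, p3):
    """Smallest eigenvalue of the moment matrix M = sum p_j p_j^T of three
    homogeneous points p_j = (x, y, 1); close to zero for collinear points."""
    P = np.array([[p1[0], p1[1], 1.0],
                  [p2[0], p2[1], 1.0],
                  [p3[0], p3[1], 1.0]])
    return np.linalg.eigvalsh(P.T @ P)[0]      # eigenvalues in ascending order

def cross_ratio(a, b, c, d):
    """lambda = (|AB|/|BD|) / (|AC|/|CD|), Equation 2.1."""
    dist = lambda p, q: float(np.linalg.norm(np.asarray(p) - np.asarray(q)))
    return (dist(a, b) / dist(b, d)) / (dist(a, c) / dist(c, d))

def p2_invariant(lam):
    """Permutation- and projection-invariant J(lambda) = J2/J1 (Eq. 2.2, 2.3)."""
    den = lam**2 * (lam - 1.0)**2
    j1 = (lam**6 - 3*lam**5 + 3*lam**4 - lam**3 + 3*lam**2 - 3*lam + 1) / den
    j2 = (2*lam**6 - 6*lam**5 + 9*lam**4 - 8*lam**3 + 9*lam**2 - 6*lam + 2) / den
    return j2 / j1

def max_collinearity_ev(quad):
    # Here all triples of the quadruple are evaluated; the thesis stores the
    # maximum smallest eigenvalue over the triples considered during training.
    return max(collinearity_ev(*t) for t in combinations(quad, 3))

def train(quads_of_target):
    """Training: ev_max and the [J_min, J_max] range over noisy observations."""
    ev_max = max(max_collinearity_ev(q) for q in quads_of_target)
    js = [p2_invariant(cross_ratio(*q)) for q in quads_of_target]
    return ev_max, (min(js), max(js))

def is_target(quad, ev_max_trained, j_range):
    """Recognition: collinearity test (step 1), then p2-range test (step 2)."""
    if max_collinearity_ev(quad) > ev_max_trained:
        return False
    j = p2_invariant(cross_ratio(*quad))
    return j_range[0] <= j <= j_range[1]
```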
To summarize, by exploiting the projective invariant properties of a rigid body target that is equipped with four optical markers, a computationally lightweight 2D model fitting approach is provided. The main advantage of model fitting in the 2D domain over the 3D domain is that no stereo correspondence is required; hence it can be performed without knowledge of the external camera parameters, as described in Section 2.3.2.5.

2.2.2.2 3D Domain

To detect a target model that features a three-dimensional geometric constellation, as illustrated in Figure 2.5, multiple views of the scene as well as the cameras' stereo geometry are required. Features across multiple views are first matched by applying stereo correspondence matching (see Section 2.3.3.2) and then transformed into 3D by applying projective triangulation (see Section 2.3.3.4). Within the resulting 3D point cloud, the target model can be fitted using geometric hashing, by exploiting the Euclidean distances between the 3D points, or by combining both base techniques [84].

Figure 2.5: An example of a passive 3D rigid body target.

3D Distance Techniques 3D distance fitting methods exploit the Euclidean distances between unique points within the target model's geometric constellation. In [33, 48], trackable objects are equipped with three optical markers that form a non-regular triangle in which all inter-marker distances have to be unique. In a preprocessing stage, the model of this triangle is obtained and then applied to the detected 3D points during tracking by minimizing the sum of differences between the model marker distances and the measured distances. As a generalization of [33], model fitting based on the distance property is used by applying point patterns [57]. By measuring the 3D distance between each marker pair of the pattern, a pattern distance matrix P is constructed. During the recognition step, a distance matrix C of all detected 3D points is calculated. In C, all sub-matrices Ci are determined that fit P. A least squares fitting metric is applied to find the sub-matrix within Ci that best matches P.

Geometric Hashing The geometric hashing approach, introduced by [1], identifies the target model's features in a set of detected 3D features based on a lookup table. The model's features are represented in an affine invariant as well as redundant way (to account for occlusions) and are stored in a hash lookup table, which is generated in a preprocessing stage. Here, for each set of three non-collinear model features, a coordinate system basis with respect to the three features is defined, then the features are parametrized with respect to this basis and stored in a 3D hash table. To detect the target model during tracking, three detected features are selected, parametrized with respect to their coordinate system basis, and then a hash table lookup is performed; matches in the hash table vote for this model. For each candidate model the affine transformation is recovered, the candidate is transformed and tested against the target model; if the match is not sufficient, another three detected features are picked and the process is repeated. Due to the required transformation of data points into a reference frame, the effectiveness of geometric hashing highly depends on the amount of candidates that need to be examined.
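The distance-based fitting of [57] can be illustrated with a brute-force sketch: the pairwise distance matrix of the trained pattern is compared against every candidate subset of the reconstructed 3D points using a least squares metric. This is a simplified illustration (exhaustive search); the function names and the acceptance threshold are not taken from the cited work.

```python
import numpy as np
from itertools import permutations

def distance_matrix(points):
    """Pairwise Euclidean distance matrix of an (N, 3) point array."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def fit_pattern(model_points, detected_points, max_error=1e-2):
    """Find the subset (and ordering) of detected 3D points whose pairwise
    distances best match the trained pattern distance matrix P."""
    P = distance_matrix(model_points)
    k = len(model_points)
    best_idx, best_err = None, np.inf
    # Exhaustive search over ordered k-subsets of the detected points;
    # feasible for the small marker counts used in rigid body targets.
    for idx in permutations(range(len(detected_points)), k):
        C_sub = distance_matrix([detected_points[i] for i in idx])
        err = np.sum((C_sub - P) ** 2)        # least squares fitting metric
        if err < best_err:
            best_idx, best_err = idx, err
    return (best_idx, best_err) if best_err <= max_error else (None, best_err)
```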
2.2.3 Pose Estimation

Pose Estimation describes the problem of determining the position and orientation of the tracked target model in 3D space. For that, the detected model points, the target model points and the multiple view geometry that is encapsulated in the camera model must be given. As shown in Figure 2.6, the pose can be estimated either by using techniques for computing the 3D rigid transformation between two sets of point correspondences or, if a one-to-one correspondence between the detected points and the target model points could not be obtained, by applying optimization techniques.

Figure 2.6: Taxonomy of pose estimation.

A general purpose, representation-independent optimization approach is the Iterative Closest Point (ICP) algorithm, introduced by [15], which matches a set of obtained points to the points of a model, either in 2D or 3D. In the tracking system that is presented in Chapter II.4, a pose estimation by a one-to-one point correspondence can be obtained. It is described in detail in the next paragraph.

One-to-One Correspondences Before estimating the pose, the 3D representation of the detected points must be computed. Hence, for 2D model fitting, a transformation of the detected 2D model points must first be performed. This is accomplished by determining the 2D point correspondences across multiple views and by calculating their 3D positions using multiple view geometry (see Section 2.3.3). As soon as a one-to-one point correspondence between the obtained 3D points and the model points is given, the pose estimation problem can be reduced to the absolute orientation problem. The 3D rigid transformation between these two point sets can be generally expressed by Equation 2.5:

p_i = R m_i + T + V_i \qquad (2.5)

Given the 3D data point set $p_i$ and the corresponding model points $m_i$, $i = 1, \dots, N$, the rotation $R$, the translation vector $T$ and the noise vector $V_i$ shall be obtained in order to optimally map $m_i \rightarrow p_i$. Solving for the optimal transformation $[\hat{R}, \hat{T}]$ typically requires minimizing a least squares error criterion $\varepsilon^2$ that is given by

\varepsilon^2 = \sum_{i=1}^{N} \| p_i - \hat{R} m_i - \hat{T} \|^2. \qquad (2.6)

There are a number of closed-form solutions to this problem, such as a quaternion-based approach [11] or computing the singular value decomposition of a derived matrix [10]. An overview of the four major techniques is given in [25], concluding that none of the four algorithms is superior in all cases. The only truly distinguishing factor was found to be execution time, which also depends on data set size as well as computer hardware and configuration. Thus, the choice of the algorithm depends mostly on data set size.
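As an illustration of the SVD-based family of closed-form solutions mentioned above, the following sketch recovers R and T from corresponding point sets. It is a generic Kabsch-style implementation under the assumption of noisy but correctly matched points, not the exact formulation of [10]; the self-check values are arbitrary.

```python
import numpy as np

def absolute_orientation(model_pts, data_pts):
    """Closed-form least squares estimate of R, T such that data ~ R @ model + T.

    model_pts, data_pts: (N, 3) arrays with one-to-one correspondence
    (N >= 3, non-collinear). Returns rotation matrix R and translation T."""
    M = np.asarray(model_pts, dtype=float)
    D = np.asarray(data_pts, dtype=float)
    m_mean, d_mean = M.mean(axis=0), D.mean(axis=0)
    Mc, Dc = M - m_mean, D - d_mean

    # Cross-covariance matrix of the centered point sets.
    H = Mc.T @ Dc
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the least squares solution.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = d_mean - R @ m_mean
    return R, T

# Minimal self-check with a synthetic rigid motion (illustrative values).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = rng.normal(size=(4, 3))
    angle = np.deg2rad(30.0)
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle),  np.cos(angle), 0.0],
                       [0.0,            0.0,           1.0]])
    data = model @ R_true.T + np.array([0.5, -1.0, 2.0])
    R_est, T_est = absolute_orientation(model, data)
    assert np.allclose(R_est, R_true, atol=1e-6)
    assert np.allclose(T_est, [0.5, -1.0, 2.0], atol=1e-6)
```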
2.3 Projective Geometry

Projective geometry serves as a mathematical framework for 3D multi-view imaging and can be applied to model the mapping between 3D world points and 2D image points, known as the image formation process, as well as to reconstruct 3D objects from multiple images. Thus, projective geometry can serve as the underlying mathematical framework to solve the following requirements of the thesis' optical tracking pipeline:

1. An abstract camera model, including a description to model the relationship between a 3D world point and its corresponding 2D image point, and vice versa.

2. A geometric foundation to search for and describe point correspondences across multiple camera views as well as to reconstruct 3D geometry.

In 3D space, lines, planes and points are usually described using Euclidean geometry. A point $X \in \mathbb{R}^3$ in Euclidean space is represented in so-called inhomogeneous coordinates by a 3-element vector $(x, y, z)^T$. To avoid the disadvantage of Euclidean geometry when projecting a 3D point onto an image plane (this projection requires a perspective scaling operation, i.e. a division by a scale factor, and is therefore non-linear), projective geometry can be applied, representing $X$ in homogeneous coordinates as a 4-element vector $(x_1, x_2, x_3, x_4)^T$, such that

x = \frac{x_1}{x_4}, \quad y = \frac{x_2}{x_4}, \quad z = \frac{x_3}{x_4}, \quad \text{where } x_4 \neq 0. \qquad (2.7)

The general mapping of a point from an n-dimensional Euclidean space to an (n+1)-dimensional projective space employs the homogeneous scaling factor $\lambda$ and can then be described as

(x_1, x_2, x_3, \dots, x_n)^T \rightarrow (\lambda x_1, \lambda x_2, \lambda x_3, \dots, \lambda x_n, \lambda)^T, \quad \text{where } \lambda \neq 0. \qquad (2.8)

Projective geometry is used in conjunction with the Basic Pinhole Camera, as described below. It is the most specialized and simplest camera model and acts as a mathematical foundation for the presented tracking approach of this thesis. For a comprehensive and in-depth review of single and multiple view projective geometry, the reader is referred to [56].

2.3.1 The Pinhole Camera Model

Generally speaking, a camera model is represented by matrices and describes a mapping between the 3D world (object space) and a 2D image (image space). The pinhole camera model performs this 3D → 2D mapping as a perspective projection. The geometry of the pinhole camera is illustrated in Figure 2.7.

Figure 2.7: The pinhole camera geometry with the camera center C coinciding with the coordinate system's origin; the image plane is placed at distance f in front of C. (a) Overview. (b) Geometric relations.

The center of the perspective projection C is the point in which all incoming rays intersect and is denoted as the camera center (or optical center). With the pinhole camera model, a point $X = (x, y, z)^T \in \mathbb{R}^3$ is mapped to the point $x = (u, v)^T \in \mathbb{R}^2$ on the image plane (or focal plane) where the line from X to C meets the image plane. The principal axis (or optical axis) is the line perpendicular to the image plane passing through C. The principal point p is the point where the principal axis intersects the image plane. The plane through C that is parallel to the image plane is called the principal plane.

Let C be the origin of the Euclidean coordinate system with the principal axis being collinear to the Z-axis. Consider the image plane placed at $Z = f$, where f denotes the focal length. As illustrated in Figure 2.7b, the point $(x, y, z)^T$ is mapped to $(f x/z, f y/z, f)^T$ on the image plane. Ignoring the final coordinate, the above mapping can be expressed as a projective mapping by

\begin{pmatrix} u \\ v \end{pmatrix}: \quad \begin{pmatrix} x \\ y \\ z \end{pmatrix} \rightarrow \begin{pmatrix} f x / z \\ f y / z \end{pmatrix}. \qquad (2.9)

As mentioned in Section 2.3, such a non-linear division operation should be avoided. Using projective geometry and homogeneous coordinates, the relation from Expression 2.9 can be reformulated in matrix notation as

\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \rightarrow \begin{pmatrix} f x \\ f y \\ z \end{pmatrix} = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}, \qquad (2.10)

where the homogeneous scaling factor $\lambda = z$ and where the homogeneous 3 x 4 matrix is called the Camera Projection Matrix P. Expression 2.10 can be written compactly as

x = P X. \qquad (2.11)
Deriving from Expression 2.10, P is defined for the pinhole model as

P = K[I \,|\, 0], \qquad (2.12)

with K denoting the Camera Calibration Matrix, expressed by

K = \begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (2.13)

2.3.2 Camera Model Extensions

The basic pinhole camera models the 3D → 2D point mapping for a system that does not suffer from aberrations caused by the employed optical components and the imaging sensor. In practice, however, these aberrations occur, and thus the underlying camera model must describe these properties as well to allow for a precise projective mapping.

2.3.2.1 Principal Point Offset

The expression from Section 2.3.1 assumes that the origin of coordinates in the image plane coincides with the principal point p. In practice, imaging systems often define the origin of the pixel coordinate system at the top-left pixel of the image, as depicted in Figure 2.8.

Figure 2.8: The principal point offset.

Thus, a conversion of coordinate systems is necessary. Let $(p_x, p_y)^T$ be the coordinates of p; then Expression 2.10 can be extended by integrating the principal point position into K, resulting in

\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \rightarrow \begin{pmatrix} f x + z p_x \\ f y + z p_y \\ z \end{pmatrix} = \begin{pmatrix} f & 0 & p_x & 0 \\ 0 & f & p_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}. \qquad (2.14)

2.3.2.2 Skew Parameter

In Equation 2.10, it was further implicitly assumed that the pixels of the image sensor have equal scales $m_x$, $m_y$ in both axial directions with a square aspect ratio (i.e. 1:1) and are not skewed. In practice, however, this might not be the case. To account for both imperfections of the imaging system, the parameters $m = (m_x, m_y)$ and $s$ can be employed to model non-square and skewed pixels. The projective mapping is then denoted as

\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f m_x & s & p_x & 0 \\ 0 & f m_y & p_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}. \qquad (2.15)

The imaging hardware that is employed throughout this thesis has square and non-skewed pixels. Thus, we assign $m_x = m_y = 1$ and $s = 0$.

2.3.2.3 Camera Lens Distortions

In practice, distortion effects can be observed in most camera lenses. Incorporating these distortions adds non-linear components to the linear transformations defined by Equation 2.10.

Radial lens distortion maps straight lines to curves, with increasing magnitude towards the image edges. It is generally stronger in wide-angle lenses and is the most prevalent form of lens distortion. Two types of radial distortion can be distinguished; both are depicted in Figure 2.9. Barrel radial distortion maps lines curved outwards from the image center, while pincushion radial distortion maps lines pinched towards the image center.

Figure 2.9: Two common types of radial distortion: (a) barrel distortion [96], (b) pincushion distortion [97].

Tangential lens distortion is caused by imperfect centering of the lens components and by other manufacturing defects. It results in the lens not being exactly parallel to the imaging plane. According to [27], the overall lens distortion can be accurately modeled by the sum of the radial and the tangential distortion vector to map the image coordinates $\langle u, v \rangle$ to their distorted counterparts $\langle \hat{u}, \hat{v} \rangle$. To describe the radial distortion, $k_i$ denotes the radial distortion coefficients and $r = \sqrt{u^2 + v^2}$. The radial distortion vector is then defined as

\begin{pmatrix} \hat{u} \\ \hat{v} \end{pmatrix} = \begin{pmatrix} u(k_1 r^2 + k_2 r^4 + \dots) \\ v(k_1 r^2 + k_2 r^4 + \dots) \end{pmatrix}. \qquad (2.16)

The tangential distortion vector with the coefficients $p_1$, $p_2$ is defined by

\begin{pmatrix} \hat{u} \\ \hat{v} \end{pmatrix} = \begin{pmatrix} 2 p_1 u v + p_2(r^2 + 2u^2) \\ p_1(r^2 + 2v^2) + 2 p_2 u v \end{pmatrix}. \qquad (2.17)
For computing $\langle u, v \rangle$ based on $\langle \hat{u}, \hat{v} \rangle$, usually called inverse mapping or normalization, no general algebraic expression exists [27] because of the high degree of the distortion model. However, a number of approximative solutions exist, such as a numerical approach [133] or recovering the real pixel coordinates from the distorted ones by a non-linear search for implicit parameters [27].

2.3.2.4 Camera Rotation & Translation

In the above equations, which model the basic pinhole camera and its extensions describing the additional internal camera parameters, it was assumed that the origin of the camera coordinate system coincides with the origin of a Euclidean coordinate system and that the principal axis points straight down the camera's z-axis (see Figure 2.7), so that a 3D point X can simply be expressed in the camera coordinate system by Equation 2.12, that is $x = PX$ with $P = K[I \,|\, 0]$. To remove this constraint, points in 3D space are generally expressed in terms of a different Euclidean system, the world coordinate system. World and camera coordinate systems are related to each other by a rotation R and a translation t, as depicted in Figure 2.10.

Figure 2.10: The Euclidean transformation between the world and the camera coordinate system.

Given a point $\tilde{X}$ in the world coordinate system, the same point in the camera coordinate system $X_{cam}$ is obtained using homogeneous coordinates by

X_{cam} = \begin{pmatrix} R & -R\tilde{C} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} R & -R\tilde{C} \\ 0 & 1 \end{pmatrix} \tilde{X}, \qquad (2.18)

where $\tilde{C}$ is the position of the camera center C in world coordinates and R describes the orientation of the camera coordinate system with respect to the world coordinate system. $\tilde{C}$ is determined by translating C with t to the world coordinate origin O and then rotating it by R. To obtain the pixel coordinates $x = (u, v, 1)^T$ of $\tilde{X}$, the updated spatial relations are fused into Equation 2.12, so that

x = P\tilde{X}, \qquad (2.19)

with $P = K[R \,|\, t]$ and $t = -R\tilde{C}$.

2.3.2.5 Intrinsic & Extrinsic Camera Parameters

Summarizing Sections 2.3.1 and 2.3.2, the mathematical description of an abstract camera model with its extensions is given.

Intrinsic Camera Parameters comprise focal length, principal point offset, pixel scale as well as skew and are expressed by the Camera Calibration Matrix K. The intrinsic parameter lens distortion is defined by the radial and tangential coefficients $k_i$ and $p_1$, $p_2$, respectively. All intrinsic camera parameters remain constant unless the optical setup is modified.

Extrinsic Camera Parameters describe the external position and orientation of the camera with respect to the 3D world and are expressed by the homogeneous 4 x 4 matrix $[R \,|\, t]$. As soon as the camera is moved in the world space, the extrinsic parameters must be recomputed.

K and $[R \,|\, t]$ are encapsulated in the Camera Projection Matrix P; thus P relates 3D space measurements to image measurements. Consequently, P depends on both the world coordinate and the image coordinate system. Determining the intrinsic and extrinsic camera parameters is the process of Camera Calibration. Estimating internal and external camera parameters in one process is known as strong calibration; determining only one parameter set at a time is called weak calibration. Camera calibration methods are reviewed in Section 2.3.4. Before optical tracking with multiple cameras can be performed, the camera parameters must be known.
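To make the chain of Equations 2.12 to 2.19 concrete, the following sketch projects a 3D world point into distorted pixel coordinates. It applies the radial and tangential terms of Equations 2.16 and 2.17 additively to the normalized image coordinates, which is a common convention but an assumption here; all parameter values are purely illustrative and not taken from the thesis' hardware.

```python
import numpy as np

def project_point(X_world, K, R, t, dist):
    """Project a 3D world point to distorted pixel coordinates.

    K:    3x3 camera calibration matrix
    R, t: extrinsic rotation (3x3) and translation (3,)
    dist: (k1, k2, p1, p2) radial and tangential distortion coefficients
    """
    k1, k2, p1, p2 = dist

    # World -> camera coordinates, then perspective division (pinhole model).
    X_cam = R @ np.asarray(X_world, dtype=float) + t
    u, v = X_cam[0] / X_cam[2], X_cam[1] / X_cam[2]

    # Radial and tangential distortion applied to the normalized coordinates.
    r2 = u * u + v * v
    radial = k1 * r2 + k2 * r2 * r2
    u_d = u + u * radial + 2 * p1 * u * v + p2 * (r2 + 2 * u * u)
    v_d = v + v * radial + p1 * (r2 + 2 * v * v) + 2 * p2 * u * v

    # Apply the camera calibration matrix K (focal length, principal point).
    pixel = K @ np.array([u_d, v_d, 1.0])
    return pixel[:2] / pixel[2]

# Illustrative parameters only.
K = np.array([[2400.0, 0.0, 812.0],
              [0.0, 2400.0, 612.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(project_point([0.5, -0.2, 30.0], K, R, t, (-0.12, 0.03, 1e-4, -2e-4)))
```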
2.3.3 Multiple-View Geometry

After reviewing the abstract model of a single camera to describe the relationship between 3D world points and corresponding 2D image points, this section reviews the geometric foundation for reconstructing a 3D point's coordinates from corresponding image points across multiple camera views. The 3D reconstruction of a point is an indispensable task in the tracking pipeline.

2.3.3.1 Epipolar Geometry

The geometric model used to search for point correspondences across multiple camera views and to model the spatial camera constellation in order to estimate the 3D position of a corresponding point pair is known as Epipolar Geometry. As depicted in Figure 2.11, it is constituted between the non-coincident optical centers C, C′ of two pinhole cameras and a 3D point $\tilde{X} \in \mathbb{R}^3$. $\tilde{X}$ is projected onto both image planes, resulting in the corresponding 2D point pair $x, x' \in \mathbb{R}^2$.

Figure 2.11: The epipolar geometry. (a) Epipolar plane. (b) Epipoles and epipolar lines.

The epipolar geometry is then expressed by:

- The baseline, as the line going through C and C′.
- The epipolar plane, which is defined by $\tilde{X}$ and C, C′.
- The epipolar line l, which is determined by the intersection of the image plane with the epipolar plane. It passes through the projected point and the epipole e of the first image plane (i.e. x and e) and is the projection of the optical ray that runs through the optical center and the projected 2D point of the second image plane (i.e. C′ and x′).
- The epipole e, as the 2D image point where the baseline intersects the image plane. All epipolar lines in an image pass through the epipole, which also corresponds to the projection of the optical center of the other camera onto the image plane, i.e. e′ is the projection of C.

The epipolar geometry is algebraically represented by the Fundamental Matrix F, a homogeneous 3 x 3 matrix of rank 2 that satisfies

x'^T F x = 0 \qquad (2.20)

for all corresponding points $x \leftrightarrow x' \in \mathbb{R}^2$. After estimating F, as described in Section 2.3.4, the geometric model of the epipolar geometry can be exploited to estimate the unknown coordinates of a 3D point $\tilde{X}$ by performing the following steps:

Solve Correspondence Problem For x, its corresponding point x′ is constrained to lie on the epipolar line l′. Using this epipolar constraint, x′ can be determined by performing a search along the epipolar line l′. However, correspondence ambiguities can occur and must be robustly resolved before determining the corresponding 2D point pair, as described in Section 2.3.3.2.

Compute Camera Projection Matrices After solving the stereo correspondence, P and P′ must be derived for both cameras, as described in Section 2.3.3.3. They encapsulate both the internal and the external parameters of each camera. The external parameters describe the position and orientation of each individual camera in relation to each other, respectively in relation to the world coordinate system.

3D Point Reconstruction Based on the known camera projection matrices and the point correspondence x, x′, the unknown 3D coordinates $\tilde{X}$ in the world coordinate system can be reconstructed by performing a Projective Triangulation, as described in Section 2.3.3.4.
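The epipolar constraint of Equation 2.20 directly yields a cheap consistency test for a hypothesized correspondence. The following sketch evaluates the constraint and the distance of x′ to the epipolar line l′ = Fx; it assumes F and the point pair are given in pixel coordinates, and the acceptance threshold is illustrative.

```python
import numpy as np

def epipolar_residuals(F, x, x_prime):
    """Evaluate x'^T F x and the distance of x' to the epipolar line l' = F x.

    F:            3x3 fundamental matrix
    x, x_prime:   corresponding 2D points (u, v) in the first / second image
    """
    x_h = np.array([x[0], x[1], 1.0])
    xp_h = np.array([x_prime[0], x_prime[1], 1.0])

    l_prime = F @ x_h                       # epipolar line in the second image
    algebraic = float(xp_h @ l_prime)       # x'^T F x, zero for a perfect match
    # Geometric distance of x' to the line a*u + b*v + c = 0.
    distance = abs(algebraic) / np.hypot(l_prime[0], l_prime[1])
    return algebraic, distance

def is_consistent(F, x, x_prime, max_dist_px=2.0):
    """Accept a candidate x' if it lies within a few pixels of the epipolar line."""
    return epipolar_residuals(F, x, x_prime)[1] <= max_dist_px
```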
2.3.3.2 Stereo Correspondence Problem

As defined by the epipolar geometry, for an image point x, its corresponding point x′ lies on the epipolar line l′. This search along the line can be reduced to a one-dimensional search problem when all epipolar lines are parallel. This can be achieved by rectifying the image pair, as described in [56]. Image rectification is advisable for images taken from widely different viewpoints. However, even when reducing the dimension, the search along the epipolar line can be ambiguous, since multiple features in the right image may lie on the same epipolar line of a feature in the left image. To solve this stereo correspondence problem, further matching constraints that exploit the features' properties, such as Similarity, Uniqueness, Continuity and Ordering of Points, can be applied [51]. As the tracking approach of this thesis is based on infrared optical tracking of spherically shaped markers, the resulting blobs in the camera images do not contain enough information to use the above mentioned characteristics. The blobs provide practically identical characteristics in both images; thus, model fitting and recognition methods, as described and discussed in Section 2.2.2, must be applied in conjunction with the stereo correspondence search to solve correspondence ambiguities.

2.3.3.3 Computing the Camera Projection Matrix

Based on F, the camera projection matrices P, P′ of both cameras can be derived. However, since this results in a projective ambiguity, it is more advisable to use the Essential Matrix E to extract P, P′ up to scale. E is a specialization of F using normalized image coordinates; thus, the camera calibration matrices (K, K′) of both cameras must be known. E can then be obtained by

E = K'^T F K. \qquad (2.21)

A pair of corresponding 2D image points x, x′ is normalized by computing $\hat{x} = K^{-1}x$ and $\hat{x}' = K'^{-1}x'$, re-expressing the defining equation of F from Equation 2.20 as

\hat{x}'^T E \hat{x} = 0. \qquad (2.22)

Using normalized image coordinates, Equation 2.19 can then be reformulated as

\hat{x} = [R \,|\, t]\tilde{X}, \quad \text{where } P = [R \,|\, t]. \qquad (2.23)

This can be thought of as the projection of $\tilde{X}$ onto the image plane with respect to a camera $[R \,|\, t]$ whose calibration matrix K is the identity matrix I. Since K, K′ are given, only the rotation and translation from one camera to the other need to be determined. As is the case for F, a pair (P, P′) uniquely determines E; however, the inverse is not true. Thus, it is common to define (P, P′) as

P = [I \,|\, 0], \quad P' = [R \,|\, t] \qquad (2.24)

and to compute P′ by factorizing E into the product SR, where S is a skew-symmetric matrix and R is a rotation matrix, using Singular Value Decomposition (SVD). As reviewed in detail in [56], this results in a four-fold ambiguity, meaning that there are four possible geometrical combinations of translations and rotations, giving four possibilities for P′, as illustrated in Figure 2.12.

Figure 2.12: The four possible solutions for P′, as combinations of rotations and translations. Between 2.12a and 2.12b, respectively 2.12c and 2.12d, the translation vector is reversed (baseline reversal); between 2.12a and 2.12c, respectively 2.12b and 2.12d, camera B rotates 180° about the baseline.

As shown, only in one of the four solutions is the point T in front of both cameras. Thus, it is sufficient to test with a single point whether it is in front of both cameras to resolve the four-fold ambiguity. Therefore, a test point from the data is taken, its 3D coordinates are reconstructed with each combination of (P, P′), then the 3D point's depth in both cameras is determined and, finally, the pair (P, P′) that yields a positive depth for both cameras is chosen.
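The factorization of E and the subsequent depth test can be sketched as follows, using normalized image coordinates and a linear triangulation (see Section 2.3.3.4). This is a generic textbook-style sketch, not the thesis implementation; the sign handling of the rotation candidates is one common convention.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of normalized image points x1, x2."""
    A = np.vstack([x1[0] * P1[2] - P1[0],
                   x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

def decompose_essential(E, x1, x2):
    """Return the (R, t) pair of the four candidates that places the test
    correspondence (x1, x2), given in normalized coordinates, in front of
    both cameras."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    # Force proper rotations (det = +1); -R is also a valid rotation matrix.
    R1 = R1 if np.linalg.det(R1) > 0 else -R1
    R2 = R2 if np.linalg.det(R2) > 0 else -R2
    t = U[:, 2]

    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])            # P  = [I | 0]
    for R, tc in [(R1, t), (R1, -t), (R2, t), (R2, -t)]:
        P2 = np.hstack([R, tc.reshape(3, 1)])                 # P' = [R | t]
        X = triangulate(P1, P2, x1, x2)
        depth1 = X[2]                                          # depth in camera 1
        depth2 = (R @ X + tc)[2]                               # depth in camera 2
        if depth1 > 0 and depth2 > 0:
            return R, tc
    raise ValueError("no candidate places the point in front of both cameras")
```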
2.3.3.4 3D Point Reconstruction

The process of reconstructing the unknown coordinates of a 3D point $\tilde{X}$ from two corresponding 2D image points x, x′ is known as Back-Projection or Triangulation. The triangulation problem is defined as determining the intersection of the two rays in space that correspond to x and x′; this intersection is then $\tilde{X}$. These two rays meet in space if and only if x, x′ satisfy the epipolar constraint from Equation 2.20, respectively 2.22. In the absence of 2D point measurement inaccuracies, the triangulation problem can then be easily solved, and there is a point $\tilde{X}$ that projects to $x \leftrightarrow x'$ and thus exactly satisfies $x = P\tilde{X}$ and $x' = P'\tilde{X}$. However, digitalization errors such as sensor noise result in erroneously measured points x, x′ that in general do not satisfy the epipolar constraint. In this case, a pair of optimized image points $\hat{x} \leftrightarrow \hat{x}'$ must be determined that reproduces the erroneously measured points $x \leftrightarrow x'$ as closely as possible by minimizing the residual errors between the reprojected and measured image points [82] while satisfying the epipolar constraint $\hat{x}'^T F \hat{x} = 0$. Once $\hat{x} \leftrightarrow \hat{x}'$ are found, their corresponding rays meet precisely in space and $\tilde{X}$ can be obtained by any triangulation method, such as Linear Triangulation, Linear Least Squares Triangulation or Bundle Adjustment [26, 82].

2.3.4 Camera Calibration

Determining both the intrinsic and the extrinsic camera parameters is the process of Camera Calibration. For each camera that is involved in a multiple view setup, both parameter sets are described by the Camera Projection Matrix P. A number of calibration approaches exist, and all share the common principle of determining the cameras' parameters by initially obtaining a specific number of 3D world → 2D image point relations, to later use these relationships in an optimization procedure. The existing approaches for multiple view camera calibration can be characterized by the applied calibration object and its dimensionality, as illustrated in the taxonomy of Figure 2.13.

Figure 2.13: A calibration taxonomy by dimension of the applied apparatus.

Calibration based on a 2D or 3D reference target usually observes an object that is shown at only a few different orientations [38, 40] undergoing an unknown translation. The object's 2D, respectively 3D, geometry is known with high precision. In 2D, this is typically a planar pattern (see Figure 2.14a), and in 3D two or three planar patterns in an orthogonal geometric arrangement to each other. With such a reference object, each camera's internal (focal length, principal point offset, aspect ratio, radial and tangential distortion coefficients) and external parameters (position and orientation) can be computed efficiently [17]. The required calibration setup can be easily constructed; however, this approach suffers from declining ease of use, as the necessary pattern size increases with the calibration distance to the camera and with the baseline.

Figure 2.14: Reference targets for intrinsic and extrinsic camera calibration: (a) 2D planar pattern, (b) a bar with two optical markers.

Multiple view camera calibration can also be performed with a 0D object, such as corresponding points across the views. Either these points are manually generated by waving a single point [45, 69], such as a light emitting diode or a retro-reflective sphere, through the volume, or they are obtained by extracting natural features [102, 116, 99, 125] from the observed scene, which is referred to as Auto-Calibration.
Since these single point methods cannot account for the estimation of distortion coefficients, they are mostly intended to recover only the extrinsic parameters. To overcome this limitation, a 2D planar pattern calibration can be applied beforehand to recover all internal parameters. Using a moving point or extracting natural features from the image, a sufficient number of corresponding image pairs (a minimum of seven is required) can be computed to estimate the Fundamental Matrix F, e.g. by performing the Normalized 8-Point Algorithm [9, 56]. It estimates F by constructing a set of linear equations or, in the presence of noise, by solving a linear least squares minimization problem; a sketch is given at the end of this section. As described in the computation of the camera projection matrix (Section 2.3.3.3), the extrinsic parameters can be derived through this estimated epipolar geometry, up to a scale factor. To overcome the limitations of the calibration based on multiple single points, an extension is presented in [48]: a bar with optical markers at both ends, whose physical distance is known, is used as calibration target (see Figure 2.14b). Thereby, internal and external camera parameters can be determined linearly in an initialization step and then refined with a non-linear least squares optimization method. Furthermore, the scale factor can be determined from the real and known distance between both spheres.

As stated in [60], there is no calibration technique that suits all use cases best. However, the following recommendations are given that influenced the design of the calibration approach of this thesis:

- Whenever possible, calibrate the cameras in a single or multi-view setup with a 2D or 3D reference object. Calibration with single point correspondences usually cannot achieve an accuracy comparable to a calibration using a higher dimensional reference object.

- Whenever possible, calibrate as many parameters with a calibration object as possible. Thereby, the number of parameters to be estimated can be reduced for any subsequent self-calibration.
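The Normalized 8-Point Algorithm referenced above can be sketched as follows: given at least eight point correspondences in pixel coordinates, it returns a rank-2 estimate of F. This is a generic textbook-style illustration, not the calibration code of the presented system.

```python
import numpy as np

def _normalize(pts):
    """Similarity transform T so that T*pts has zero mean and mean distance sqrt(2)."""
    pts = np.asarray(pts, dtype=float)
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def fundamental_8point(pts1, pts2):
    """Estimate F from >= 8 correspondences such that x2^T F x1 = 0."""
    x1, T1 = _normalize(pts1)
    x2, T2 = _normalize(pts2)

    # Each correspondence contributes one row of the linear system A f = 0.
    A = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(x1))])
    F_norm = np.linalg.svd(A)[2][-1].reshape(3, 3)   # least squares solution

    # Enforce the rank-2 constraint by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F_norm)
    F_norm = U @ np.diag([S[0], S[1], 0.0]) @ Vt

    # Undo the normalization: x2^T F x1 = (T2 x2)^T F_norm (T1 x1).
    F = T2.T @ F_norm @ T1
    return F / F[2, 2]
```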
2.4 Summary

This chapter has introduced the fundamental concepts of optical pose tracking and projective geometry. Projective geometry uses homogeneous coordinates to represent the positions of 2D image and 3D world points and is able to describe the projection of a 3D point onto a 2D image plane with a linear camera projection matrix P that comprises the intrinsic and the extrinsic camera parameters. In a multiple view setting, the projection matrices of both cameras can be computed from the Fundamental Matrix F that constitutes the epipolar geometry, which describes the geometric relationship between multiple camera images. The epipolar geometry is required to reconstruct a 3D point's coordinates from corresponding image points, which is an indispensable task in the optical tracking pipeline.

Chapter 3

Related Work

The main objective of this part of the thesis is to develop a novel approach to track objects in large, unconstrained indoor environments. Therefore, the tracking system must be capable of coping with ambient interfering lights, infrared radiation, temporary occlusions and even harsh environmental conditions, such as fog and dust. We aim at tracking at large distances with a small amount of hardware to minimize the necessary preconditioning of the tracking environment.

To track objects in space and especially in large volumes, different techniques exist, from commercially available products to on-going research prototypes. Extensive research has been performed to develop indoor location systems (ILS) for enabling context-aware applications, user tracking and surveillance [44]. Since this work focuses on positioning in indoor environments, we do not discuss related work based on global navigation satellite systems (GNSS) or tracking solely based on inertial sensors, as inertial measurements suffer from significant drift over time, especially for position estimation. Moreover, we do not incorporate magnetic tracking into the discussion of related technologies, as it is subject to interference from ferromagnetic materials in the tracking volume and magnetic fields generated by other electronic devices, and it is sensitive to conductive materials that are placed near emitters or sensors. These factors tremendously limit the potential tracking environments and make it impractical for our intended test setups. Regarding optical tracking, techniques based on natural features are not reviewed either, since they require prominent and distinctive structures for pose estimation, as described in Section 2.2.1.1. These distinct features must either be found on the tracked object in an Outside-In scenario, or have to be distributed throughout the volume in an Inside-Out tracking setup. For both scenarios, a reliable feature distribution and an adequate illumination cannot be guaranteed in the intended tracking environments that have been investigated within this thesis, as described in Chapter II.5.

To summarize, the most relevant tracking technologies for the intended wide area indoor environments are radio frequency (RF), ultra-sonic and model-based optical systems. Since they all have advantages and disadvantages regarding accuracy, latency, reliability, scalability and cost, no de-facto standard has been established yet. Thus, we outline state-of-the-art ILS techniques and discuss their advantages and disadvantages.

3.1 Radio Frequency & Ultra Sound

Radio frequency systems based on Wi-Fi infrastructure or radio-frequency identification (RFID) [34] require a number of readers within the measurement volume to enable object tracking with low latency in large volumes [73]. However, Wi-Fi signals tend to be extremely noisy, and signal strength highly depends on surrounding building structures and materials. Thus, precise position estimation cannot be guaranteed even with multiple readers in the volume. In addition, the extensive pre-conditioning of the tracking volume is cost-intensive due to the amount of necessary hardware.

Figure 3.1: Tracking of a smartphone using Google Indoor Maps [156].

Recently, a number of commercially available ILS applications, such as Google Indoor Maps [156], SensionLab [164] as well as Indoo.Rs [157], emerged to localize a smartphone (and thus its user) by fusing mobile cellular data, Wi-Fi and inertial measurements to minimize position jitter from the Wi-Fi data. Google Indoor Maps, which is depicted in Figure 3.1, optimizes the position accuracy by pre-measuring and mapping the signal strength of the Wi-Fi spot within the volume. However, this process takes time before the actual tracking can start. Furthermore, all systems require pre-built indoor floor plans for position visualization and only provide - in the best case - several meters of accuracy.
Ultra-sonic location systems such as [67, 50] rely on time-of-flight measurements of ultra-sonic signals, calculated using the velocity of sound. Such systems are scalable and can track multiple moving objects. However, current systems offer, in the very best case, meter-level accuracy under optimal conditions for 3D position estimation [136]. Furthermore, precision and range are not reliable since the velocity of sound in air is highly dependent on environmental conditions, especially humidity and temperature. Especially at long ranges, ultra-sonic systems are often extremely noisy and therefore not a proper solution for our system's objectives.

Compared to ultrasound, the RF-based Ultra Wide Band (UWB) technology enables distance measurements without line-of-sight requirements. An example of such a system is Ubisense [166], which employs TDoA (Time-Difference-of-Arrival) and AoA (Angle-of-Arrival) measurements between mobile tags and a minimum of four fixed base stations, as shown in Figure 3.2.

Figure 3.2: A simple four sensor Ubisense system [166].

It offers fast signal speed and hence high sample rates (approximately 135 Hz) and provides an accuracy down to 0.2 m. The LPM system by Abatec [61] offers a sample rate of 1 kHz with an accuracy down to 0.15 m. It measures the distance between fixed base stations and mobile tags based on the frequency modulated continuous wave principle [4]. Although large distances can be covered, the ultrasound and RF-based systems are expensive and the resulting accuracy is not sufficient for precise user tracking for virtual reality applications.

3.2 Optical Tracking

Model-based optical tracking systems require the target to be within the line-of-sight of one or more cameras to estimate its 3D coordinates from the 2D image projections, as described in Chapter II.2. They are robust against magnetic, electric and acoustic interference and work with light-emitting (active) or retro-reflective (passive) targets.

One camera is sufficient for tracking in an Inside-Out scenario, which is the intended use of the InterSense IS 1200 system [151]. It offers a scalable, cost-effective solution for wide area tracking as it fuses optical tracking of planar bitmap patterns (see Section 2.2.2.1) with inertial measurement data. Therefore, an inertial measurement unit is combined with a single camera and attached to the trackable object to observe passive markers that have to be distributed throughout the volume. While this setup offers high update rates with very low latency (max. 8 ms), it requires sufficient illumination and a large number of targets that additionally have to be in close range to the camera to ensure robust tracking. These prerequisites make this system impractical and even impossible to apply in our intended environments. As the implicit nature of Inside-Out tracking requires well-distributed visual features throughout the volume, it can be concluded that using active targets would also not be a sufficient approach for our research objectives, since it would violate the goals of omitting pre-conditioning of the environment and of minimizing the necessary amount of hardware components.

Outside-In optical tracking systems require the target to be within the line-of-sight of two or multiple cameras. In the following, a number of state-of-the-art Outside-In model-based tracking systems are presented.
Figure 3.3: Multiple target tracking using iotracker with 4 cameras [84]: (a) calculating the 3D position of each point, (b) estimating the pose of each target.

Near infrared (NIR) spectrum based systems, such as Vicon [145], A.R.T [132] or iotracker [84, 141], offer (sub-)millimeter accuracy in standard room sized environments (4 x 4 x 3 m) and provide tracking of multiple targets with very low latency, as depicted in Figure 3.3. To enlarge the tracking volume, those systems increase the number of employed cameras (up to 50 in A.R.T). However, this causes a tremendous growth of cost and setup complexity. The PPT-E system [146] is able to cover areas up to 20 x 20 m with a minimum of four cameras, but sub-millimeter tracking accuracy is guaranteed only for volumes up to 3 x 3 x 3 m. No accuracies are provided for larger volumes.

Figure 3.4: A tracking setup using the Prime41 system [138].

The Prime41 system [138] offers multiple user tracking by detecting passive targets up to 30 m, using a perimeter setup with multiple cameras, as shown in Figure 3.4. However, no further details are given on the accuracy or on the number of cameras required to cover this volume. Furthermore, although it is the most cost efficient of the above mentioned systems, one Prime41 camera still costs about €5,000. A minimal 4-camera perimeter setup results in pure camera costs of €20,000 (without software), which is a multiple of our complete system costs.

For tracking in larger, unconstrained indoor environments, such as tunnels and mines, examples of the application of optical tracking systems are rare and only exist for highly specialized measurement purposes. As depicted in Figure 3.5, one example is the application of a hand-held digital camera in combination with fixed installed visual markers for monitoring tunnel wall displacements by close-range photogrammetry [29, 129]. The system requires a huge installation effort and is therefore not practical for daily application. A further example is the use of a tracking camera and retro-reflecting targets to track the relative position between the two shields of a double shield tunnel boring machine as part of a guidance system. The system is in use in several tunnel projects and is reported to function properly [135]. However, both optical tracking systems are not designed to simultaneously track several targets over longer distances in real-time.

Figure 3.5: The AICON DPA-Pro System [129]: (a) the Aicon DPA kit, (b) large scale measurements.

Summarizing, existing Outside-In optical systems rely on artificial features for model-based tracking and are thus robust against environments with non-distinctive geometric structure and poor illumination. However, for wide area tracking they require a complex system setup and are thus cost intensive. Furthermore, existing NIR tracking technology remains highly sensitive to ambient interfering lights and infrared radiation, especially during camera calibration, making those systems incapable of being deployed in unconstrained indoor environments.

3.3 Laser Measurement Systems

For determining the 3D position of objects with very high accuracy, classical surveying methodology such as laser measurement systems is widely applied in research and industry. The employed instruments (total stations, terrestrial laser scanners and laser trackers) simultaneously measure the horizontal and vertical angle to the target point together with the slope distance by using laser distance measurement.
Based on these polar observations, the 3D coordinates of the target point are then processed. Depending on the specific surveying task, the target point is either a geodetic prism or a non-signalized point directly located on the object surface (reflector-less measurement). The most frequently used instrument type is the total station [91, 100]. In the application field, it can be found manually operated as well as integrated in automatic measurement and mobile multi-sensor systems. Advanced total stations have the capability to automatically search for, recognize, measure and even lock onto a prism and are thus able to follow a slowly moving object. These options are primarily used to facilitate manual operation and increase the speed of work, and they are indispensable when kinematic surveying is to be performed. Total stations are highly accurate for large distances of 100 m and more. They are used for setting out, network measurements, tunnel heading control, machine guidance and displacement monitoring. However, specialized personnel are required for instrument control, and several (kinematic) visual objects cannot be simultaneously sighted and measured.

The technology of laser scanning by use of Terrestrial Laser Scanners (TLS) [13] is also broadly common in underground construction. It is routinely applied for a variety of purposes, such as tunnel profile control, volume determination and checking of tunnel surface quality [77, 89, 90]. Recent research work [120] aims to use the technology for monitoring of tunnel wall displacements. As with total stations, laser scanners are operated either manually or automatically when integrated in tunnel laser scanning systems. They can perform static and kinematic scanning. However, the technology requires extensive post-processing of 3D point clouds and does not allow for efficient measurement of defined points or objects with low latency. So far, the technology does not provide real-time capability.

Figure 3.6: Leica Absolute Tracker AT901 with T-Probe [148].

Recently, Leica Geosystems introduced an approach that integrates an optical tracking system with a laser tracker [93, 158]. It offers automatic lock-on and tracking of the 3D position (by the laser tracker) and 3D orientation (by the optical tracking system) of a hand-held target [159] with high precision and low latency up to 18 m. The system is shown in Figure 3.6. As a portable system, it is designed for industrial applications (e.g. prototyping and reverse engineering, tooling inspection and part mating, positioning and aligning of machines). By using a special corner cube reflector, the range can be extended up to 160 m, but only for the laser tracker, not for the optical tracking system. However, up to now, this system is only used for very particular measurement tasks in tunnel construction. The only example of regular use is the check of tunnel segment geometry, a daily task performed in the segment factory. Up to now, laser trackers cannot be found underground as they are expensive and not considered robust enough to operate in harsh environments. Besides, they cannot simultaneously track multiple targets.

Chapter 4

Methodology

To overcome the limitations of existing optical tracking systems, as described in Section 1.1, a robust wide-area Outside-Looking-In optical tracking system for position tracking is described that requires only two cameras to track targets up to distances of 30-100 m, depending on the tracking task.
It provides high tracking accuracy while being robust against interfering lights during calibration and tracking.

Figure 4.1: Key properties of the proposed optical tracking system: (a) coverage of the stereo cameras, (b) robust calibration and tracking.

In Figure 4.1, the properties and capabilities of the proposed system are shown. Figure 4.1a illustrates the system's hardware setup and the resulting tracking coverage. Figure 4.1b depicts a successfully detected target of our system in the camera image that can subsequently be employed for calibration and tracking. The tracking is achieved despite heavy interfering lights, as they might occur in an unconstrained indoor environment. By heavily minimizing the amount of necessary vision hardware, the system is highly cost effective and easy to set up. Although current infrared optical tracking systems lack the capability of robust wide-area 3D position estimation, the underlying technology is very promising since it offers high precision with very low latency. Therefore, we heavily extend this technology to overcome limitations in terms of distance coverage, sensitivity in harsh environments and the amount of simultaneously trackable targets.

4.1 System Requirements

To achieve the research objective from Section 1.2, the following requirements were specified to be fulfilled by the tracking system:

Cover Wide Tracking Volume: Target(s) shall be tracked with two cameras up to distances of 100 m. To account for varying real-life tracking scenarios, the distance between both cameras (baseline) may vary. Both cameras are connected to one processing unit; thus, data exchange interfaces are required that support long distance cable transmission.

Accurate Camera Calibration: To optimally compensate optical aberrations, the intrinsic and extrinsic calibration must be able to be performed with the complete camera encasement. The extrinsic calibration has to be capable of being performed during on-going activities in the tracking volume and thus must be able to cope with heavy interferences.

Unique Target Identification: Interfering light sources must be filtered to allow for a robust target detection during calibration and tracking, as illustrated in Figure 4.1b.

Continuous & Accurate 3D Position: The hardware and software algorithms have to ensure precise target detection at large distances and in environments with poor visibility due to particles (dust, dirt) in the air. Continuous 3D position estimation must be provided within the whole tracking volume.

Robust Hardware Casing: To ensure system reliability in real-life environments, the hardware components (cameras, lenses, target, processing unit) have to be encased to be dust- and dampness-proof. Nevertheless, the system must be easy and quick to set up and the target should be usable even with thick gloves. Furthermore, side effects on the camera's field-of-view (FOV) as well as optical aberrations must be considered when encasing the vision parts of the system.

4.2 Evaluation of Target Visibility

Based on the system's requirements, a preliminary study was conducted [94] to define an appropriate hardware setup to perform wide area 3D position measurements in varying, unconstrained and even harsh indoor environments using infrared optical markers. Therefore, we first took into account our previously outlined factors that influence the accuracy of 3D position estimation as well as tracking performance (see Section 2.1.1.2) to derive a test hardware setup.
Next, we practically evaluated combinations of tar- get types and camera setups to determine obtainable tracking distances. To be able to 52 4.2 Evaluation of Target Visibility perform tracking in even harsh indoor environments, we performed measurements in a tunnel during on-going construction to evaluate the best operating distances when using (1) passive or (2) active markers as targets (see Figure 2.2). Furthermore, we tested dierent object lens congurations to determine an optimal balanced optical setup for the intended tracking system. Therefore, we tested dierent focal lengths during the distance measurements. An optimal conguration has minimal optical aberrations while providing high light throughput and a sucient eld-of-view (FOV) to cover the intended tracking volume. Lenses with short focal length have stronger optical aberrations char- acteristics but provide larger FOVs than lenses with longer focal lengths. Furthermore, the optics system must provide sucient depth-of-eld (DOF) to ensure that the target appears sharp in an image taken within the intended tracking volume. DOF increases as focal length and aperture decreases. However, since accurate 3D position estimation relies on robust blob centroid computation, pixels of the target blobs ideally are bright and clearly distinguishable from the surrounding pixels. For that reason, large aperture must be employed to provide maximal light throughput emitted from distant targets. 4.2.1 Test Setup A high resolution machine vision camera (1/1.8" Mono CCD, 1624x1224px) with a vari- focal lens (focal lengths f = 12−36mm, aperture = F2.8−16) and a long-wave pass lter, placed in front of the camera, was selected. As passive target, retro reective foil targets in combination with a 850nm illuminator were employed. The active target comprised an infrared light diode with a peak wavelength at 850nm and a viewing half-angle of 23◦. 4.2.2 Test Results Test images have been captured with 8bit pixel depth at distances of 30m, 50m and 70m, employing open aperture (f/2.8). We dened a blob to be robustly detectable if it features 80%-100% of the maximal luminance [84]. Figure 4.2: Blobs at 50m distance with minimal/maximal focal length of f = 12 / 36mm. 53 4. METHODOLOGY As illustrated in Figure 4.2, passive as well as active targets were robustly segmented in the camera image up to a distance of 50m. The diminished blob's brightness of the passive target is well illustrated in Figure 4.2. For testing, we manually increased shutter speed and gain of the camera. The brightness of the passive target increased but likewise did image noise. This should be avoided to provide accurate feature segmentation. Fur- thermore, brightness of other light sources or reective material (i.e. construction vests) has increased as well. This can result in blooming (and hence tracking loss) of the target when getting in close range to these interfering areas. At a distance of 70m, blobs of passive targets could not be robustly detected while active targets were still visible and could be accurately segmented despite dust and dirt in the air. Consequently, active targets are suitable to fulll the proposed system's objectives. A focal length of 25mm proved to allow the best balance between optical aberrations, sucient blob brightness (and thus accurate feature segmentation and 3D estimation) and adequate FOV as well as DOF to cover the entire intended tracking volume with objects being in focus. 
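The lens trade-off discussed above can be made concrete with the standard thin-lens relations for field of view and depth of field. The sketch below is purely illustrative: the sensor width and circle-of-confusion value are assumptions and not specifications taken from this thesis, which settled on f = 25mm empirically.

```python
import math

def horizontal_fov_deg(focal_mm: float, sensor_width_mm: float) -> float:
    """Horizontal field of view of a rectilinear lens (thin-lens approximation)."""
    return 2.0 * math.degrees(math.atan(sensor_width_mm / (2.0 * focal_mm)))

def hyperfocal_mm(focal_mm: float, f_number: float, coc_mm: float) -> float:
    """Hyperfocal distance; everything from roughly H/2 to infinity appears acceptably sharp."""
    return focal_mm + focal_mm ** 2 / (f_number * coc_mm)

if __name__ == "__main__":
    sensor_width_mm = 7.2   # assumption: roughly a 1/1.8" sensor
    coc_mm = 0.01           # assumption: acceptable circle of confusion (about 1-2 px)
    for focal_mm in (12.0, 25.0, 36.0):
        fov = horizontal_fov_deg(focal_mm, sensor_width_mm)
        h_m = hyperfocal_mm(focal_mm, 2.8, coc_mm) / 1000.0
        print(f"f = {focal_mm:4.0f} mm   FOV = {fov:5.1f} deg   hyperfocal ~ {h_m:5.1f} m (at f/2.8)")
```

Running it shows the expected trend: the short focal length buys a wide field of view and a close hyperfocal distance, while the long focal length trades coverage for reach, which is why focal length and aperture have to be balanced against blob brightness.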
4.3 Methodological Approach The overall system's work-ow is depicted in Figure 4.3. The projective invariant proper- ties of the target model are trained and subsequently employed for 2D model recognition during calibration and tracking. Hence, the same target model can be used to perform extrinsic calibration and an additional calibration apparatus can be avoided. Figure 4.3: Overview over the system's workow. 4.3.1 Vision System The vision component of the proposed tracking system comprises two cameras, lenses and lters. Following our preliminary study, we derived an optimal balanced optical setup (sensor size, focal lengths, aperture) for the intended tracking volume that minimizes optical aberration and rasterization eects while providing a sucient eld-of-view (FOV) as well as depth-of-eld to cover the intended tracking volume with objects in focus. The coverage depends on focal length f , the distance between the cameras (baseline) as well as the amount of yaw-rotation β of each camera, as depicted in Figure 4.4. 54 4.3 Methodological Approach Figure 4.4: Coverage of stereo cameras Our system uses high-resolution machine vision cameras in combination with low- distortion lenses that feature large aperture and minimal optical aberrations, as described in Section 4.4.1. The high quality cameras provide low heat evolution and large image sensors yielding little sensor noise, so jitter in the camera image can be minimized. Together with high resolution image sensors, precise segmentation can be provided even at long distances. The cameras oer high global shutter speed to allow for low-latency tracking and to minimize motion blur when the target is moving fast. Both cameras form a Stereo Camera Rig and are shutter-synchronized by an external trigger signal to guarantee temporal synchronous image pairs. To enhance robust target identication, a long-wave pass lter is inserted into the optical path to ensure light transmission only in the NIR spectrum. To provide wide area tracking in width and depth, the baselines can heavily vary in the intended tracking environment. Thus, we propose to use the GigE Vision standard [130] to guarantee lossless image transmission while providing long cable lengths. Both cameras are connected to one workstation for image processing and tracking. 4.3.2 Target Design Guidelines The geometric constellation of our target design constitutes a line approach, as illustrated in Figure 4.5. Figure 4.5: The 2D model design features projective invariant properties. 55 4. METHODOLOGY It consists of four collinear optical markers which are attached in xed distances d1, d2, d3 to each other; thereby, cross-ratios and their projective invariant properties can be exploited for robust target identication, as described in Section 4.3.4, as well as occlusion recovery can be performed, as explained in Section 4.3.6. Within the whole intended tracking volume, the target must be reliably visible in the cameras' images to ensure robust feature segmentation. As described in Section 2.2.1.2 and illustrated in Figure 2.2, two common types of articial features exist that can be applied to targets for infrared optical tracking. Since precise feature segmentation in scenarios with interferences as well as at large distances can only be assured using active markers, we use infrared light emitting diodes (IR-LED) as optical markers. 
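Since the target identification hinges on the cross-ratio of the four collinear markers being invariant under perspective projection, a small numerical check illustrates the property. This is a generic cross-ratio sketch, not the exact p2-invariant formulation of Section 2.2.2.1; the marker spacings and the homography standing in for a camera view are made up.

```python
import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross-ratio of four collinear 2D points, computed from 1D coordinates along the line."""
    pts = np.array([p1, p2, p3, p4], dtype=float)
    direction = pts[3] - pts[0]
    direction /= np.linalg.norm(direction)
    t = pts @ direction
    return ((t[2] - t[0]) * (t[3] - t[1])) / ((t[3] - t[0]) * (t[2] - t[1]))

def apply_homography(H, pts):
    """Project 2D points with a 3x3 homography (stand-in for a perspective camera view)."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

if __name__ == "__main__":
    # four collinear markers with spacings d1, d2, d3 (illustrative values in millimetres)
    d1, d2, d3 = 130.0, 180.0, 250.0
    x = np.cumsum([0.0, d1, d2, d3])
    markers = np.stack([x, 0.2 * x], axis=1)          # points on a line in the plane
    H = np.array([[0.9, 0.1, 5.0],
                  [0.05, 1.1, -3.0],
                  [1e-4, 2e-4, 1.0]])                 # arbitrary projective map
    projected = apply_homography(H, markers)
    print("cross-ratio before projection:", cross_ratio(*markers))
    print("cross-ratio after projection: ", cross_ratio(*projected))
```

Both printed values agree up to floating-point error, which is exactly what allows the same trained invariants to be recognized in images taken from arbitrary viewpoints and distances.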
To protect the IR-LEDs and to prevent optical aberrations (are artifacts on the blob edges in the camera images), each IR-LED is covered with a translucent diuse plastic sphere, as shown in Figure 4.6. Figure 4.6: LED is coated with a translucent diuse plastic sphere. With this simple design, multiple unique constellations can be easily designed to simultaneously track multiple targets in the same tracking volume. Existing 3D rigid body targets (e.g [95]) also oer permutation invariant geometric constellations to track multiple targets. However, our line approach has three advantages over 3D targets that are crucial for our intended research goals. 1. We can re-purpose the tracking target as calibration apparatus by detecting the two outermost IR-LEDs during extrinsic camera calibration. Thereby, the amount of necessary hardware for setup and maintenance can be reduced. 2. Even during calibration, the target can robustly be tracked despite interfering lights, since the 2D characteristics of the target allows for Model Fitting already in the image domain instead of in 3D space, as it is common in competing ap- proaches [84, 145, 132]. 3. Fixing the IR-LEDs in a 2D manner increases the physical robustness of the target against accidental breaking o when touching the target during usage; this is es- pecially an issue for tracking at larger distances since the target requires enlarged dimensions as well. Accidental breaking o is a common problem with the sensi- tive 3D rigid targets (see Figure 2.5) that need frequent replacement or repair by experts. 56 4.3 Methodological Approach 4.3.3 Calibration As described in Section 2.3, the camera's intrinsic and extrinsic parameter must be known to perform precise feature segmentation and to provide 3D point reconstruction of the target model's IR-LEDs. Due to the large baselines and the intended range of our proposed tracking system, a 2D or 3D calibration apparatus is not applicable for both intrinsic and extrinsic parameter estimation. Following the calibration guidelines from Section 2.3.4, we split internal and external calibration into two separate steps, estimating the intrinsics with a planar calibration target (2D feature) and the extrinsics based on points (0D feature). 4.3.3.1 Intrinsic Calibration The Camera Calibration Toolbox [133] was used for intrinsic parameter estimation; it requires a 2D planar chessboard pattern for determination of the Camera Calibration Matrix K (see Section 2.3.2.5). To enhance the estimation of the parameters, all optical components (camera with lens and lter) of the nal tracking setup should be included in the calibration procedure. However, with such a setup, a normal black/white chessboard pattern would not be visible in the camera image. Therefore, we extended the standard intrinsic calibration setup by developing a chessboard plane made of retro-reective foil that is illuminated with an infrared light source to provide chessboard images in the NIR1 spectrum. The complete intrinsic setup is illustrated in Figure 4.7. (a) Infrared light (b) Reective pattern (c) Camera IR image Figure 4.7: Intrinsic camera calibration with a retro-reective pattern. Since the lens conguration must not change after intrinsic calibration and the track- ing will be at large distances, the focus settings are set to unlimited that results in a blurred pattern at close ranges. With this setting, the images of the tracking system's cameras and lenses (see Section 4.4.1) are in focus from 4m onwards. 
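The intrinsic parameters were estimated with the MATLAB Camera Calibration Toolbox [133]; as an illustration of the same chessboard-based procedure, the sketch below uses OpenCV's equivalent routines. The image folder, board dimensions and square size are placeholders, not the values used in this thesis.

```python
import glob
import cv2
import numpy as np

# Placeholder board geometry for the retro-reflective chessboard (inner corners).
BOARD_COLS, BOARD_ROWS = 9, 6
SQUARE_MM = 60.0  # assumed square size

# 3D reference coordinates of the corners on the Z = 0 plane, in millimetres.
objp = np.zeros((BOARD_ROWS * BOARD_COLS, 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD_COLS, 0:BOARD_ROWS].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points, image_size = [], [], None
for path in glob.glob("intrinsic_images/*.png"):     # placeholder image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        continue
    found, corners = cv2.findChessboardCorners(gray, (BOARD_COLS, BOARD_ROWS), None)
    if not found:
        continue
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)
    image_size = gray.shape[::-1]                    # (width, height)

if not obj_points:
    raise SystemExit("no usable calibration images found")

# K is the camera calibration matrix, dist the lens distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, image_size, None, None)
print("re-projection RMS error [px]:", rms)
print("camera matrix K:\n", K)
```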
Thus, the pattern must have a sucient size to cover the entire camera image at a distance of 4m. Further- 1Near Infrared 57 4. METHODOLOGY more, the sharpness of calibration images was enhanced by closing the aperture (f/8) for increased depth of eld. 4.3.3.2 Extrinsic Calibration After the tracking system with its two cameras is physically set up, the geometric relation between the cameras is estimated by the extrinsic calibration process, yielding the def- inition of the two Camera Projection Matrices P, P ′ (see Section 2.3.3.3). As described in Section 2.3.4, calibration apertures of varying dimensionality can be used for extrinsic parameter estimation. Toolboxes such as [133, 140] estimate (P, P ′) by using a 2D pat- tern. For our calibration scenario, such a pattern would have to be extremely large to be visible at distances of 10 − 70m while being planar to provide precise corner extrac- tion. Furthermore, its surface would have to be composed of retro-reective foil, which is sensitive and requires additional hardware for pattern illumination. Such a target would neither be transportable nor suitable. Therefore, we exploit methods that use 0D features (points) for Fundamental Matrix F estimation as extrinsic calibration. Auto- calibration approaches that are based purely on natural features [102, 116, 99, 125] are not applicable since they require well-distributed features throughout the entire tracking volume to function robustly. This can be easily true in cluttered and well-illuminated environments but is hard to achieve in rather dark environments or scenarios with little geometric structures. Re-using existing light sources or reectors require manual selec- tion of correspondences in each image and a fair distribution cannot be guaranteed as well. Hence, this approach is omitted as well. Using Articial Points for P-Matrix Computation The calibration approach of this thesis thus follows the idea of using articial points that are created by manually waving the calibration target through the volume to achieve a high amount of detectable features. To allow for calibration in unconstrained environments with interfering lights, methods using a single point [45, 69, 84, 141] are not sucient. Those approaches, as depicted in Figure 4.8, require the background to be trained and to manually mask inter- ferences in the camera images to avoid false positive feature correspondences; obviously, those techniques cannot cope with moving interfering lights. Figure 4.8: Trained background (left) and manual masking (right), [141]. 58 4.3 Methodological Approach The stereo camera calibration approach [48] tries to overcome this limitation by eval- uating the screen-space coordinates of two blobs  that corresponding physical markers have a known distance  over a sequence of camera images (see Figure 2.14b). To nd the image correspondences, the algorithms seeks for the two longest paths of possible marker motion in each camera image and assumes that no other reections or markers are moved through the entire working volume in a similar manner as the calibration apparatus. Us- ing the corresponding image points, the approach estimates the Essential Matrix E by performing the Nominalized 8-Point Algorithm (see Sections 2.3.3.3 and 2.3.4). While being more robust against interferences than the single point approaches, this method has another advantage. 
The affine transformation to obtain real-world distance units [mm] is not only computed once (as in existing approaches such as [69, 84]), which can result in inaccurate tracking at larger distances, but takes into account the measured distance between both optical markers in each processed camera frame. The scale is then obtained by

scale = d_real / d_mean,    (4.1)

where d_real is the known real-world distance between the two markers and d_mean is the mean of all distances measured between the two markers over all observed image frames. This scale is then applied to Equation 2.24, re-formulating t as

t_metric = t · scale,   P′ = [R | t_metric].    (4.2)

However, as described above, certain criteria must be fulfilled for [48] to function correctly. To drop any assumptions about marker movement and to allow short tracks or even point-pair correspondences without any spatial connection to each other, we developed a pipeline that extends the approach of [48], as illustrated in Figure 4.9. A line target, as described in Section 4.3.2, is used as calibration apparatus. Since its pattern can be recognized in a 2D camera image, no epipolar geometry is necessary to provide correct point correspondences for the estimation of E. During calibration, interferences are filtered and the target is identified (Model Identification) using the pipeline developed in Section 4.3.4. This pipeline returns a set of four ordered points p for each camera L and R of a frame at time t,

S^t_L = {p^t_{L,1}, p^t_{L,2}, p^t_{L,3}, p^t_{L,4}},   S^t_R = {p^t_{R,1}, p^t_{R,2}, p^t_{R,3}, p^t_{R,4}},    (4.3)

where p^t_{L,i}, p^t_{R,i} ∈ R², i = 1...4. Although the model fitting is reliable, the matching in each image is still performed independently. Thereby, errors can occur, such as a false positive identification in camera 1 and a hit in camera 2, or a hit in camera 1 and no detection in camera 2 (due to occlusions). Such erroneous input data would decrease the stability of the estimation of E and thus should be avoided. Therefore, a Similarity Check between both sets S^t_L, S^t_R is performed.

Figure 4.9: Extrinsic calibration pipeline.

It is based on the idea that the detected target has a similar orientation in both images at time t, up to a threshold that depends on the camera setup. For the similarity evaluation, the target in the left image is considered as a vector v_L = p_{L,1}p_{L,4}, and as v_R in the right image, respectively. The angles (φ_x, φ_y) between v and the x-axis and the y-axis, respectively, are determined for the left and the right image. Outliers are detected if the angles differ by more than a given threshold λ, as in Equation 4.4; the same test is applied for the y-axis. Thereby, the algorithm can be used on images taken from both horizontally and vertically aligned cameras.

outlier = { (v_L, v_R)   if |φ_{x,L} − φ_{x,R}| > λ
          { 0            otherwise                      (4.4)

If outliers have been detected, the point sets (S^t_L, S^t_R) are rejected; if not, the sets are considered correct target blobs and are fed into the calibration routine of [48]. Since K is known from Section 4.3.3.1, the Normalized 8-Point Algorithm is applied for the computation of the Essential Matrix E to enhance the stability of the epipolar geometry estimation [56]. To obtain a metric scale for Equation 4.2, the distance between the two outermost IR-LEDs of the calibration target is measured to sub-millimeter accuracy with a high-precision total station, yielding d_real. The resulting world coordinate system is illustrated in Figure 4.10.
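A minimal sketch of the similarity check (Equation 4.4) and the metric scaling (Equations 4.1 and 4.2) may help to make the calibration pipeline concrete. The point sets, the threshold λ and the reconstructed distances below are synthetic; the actual estimation of E follows the routine of [48] and is not reproduced here.

```python
import numpy as np

def orientation_angles(p_first, p_last):
    """Angles [deg] of the target vector p_first -> p_last with the x- and y-axis."""
    v = np.asarray(p_last, float) - np.asarray(p_first, float)
    phi_x = np.degrees(np.arctan2(v[1], v[0]))   # angle to the x-axis
    phi_y = np.degrees(np.arctan2(v[0], v[1]))   # angle to the y-axis
    return phi_x, phi_y

def similarity_check(S_L, S_R, lam_deg=10.0):
    """Eq. 4.4: reject the frame if the target orientation differs too much between the views."""
    phi_xL, phi_yL = orientation_angles(S_L[0], S_L[3])
    phi_xR, phi_yR = orientation_angles(S_R[0], S_R[3])
    return abs(phi_xL - phi_xR) <= lam_deg and abs(phi_yL - phi_yR) <= lam_deg

if __name__ == "__main__":
    # ordered blob sets returned by the model identification for one frame (synthetic)
    S_L = [(100.0, 200.0), (180.0, 210.0), (290.0, 222.0), (430.0, 240.0)]
    S_R = [(90.0, 450.0), (170.0, 462.0), (282.0, 477.0), (420.0, 498.0)]
    print("accept frame for calibration:", similarity_check(S_L, S_R))

    # metric scaling (Eqs. 4.1 and 4.2)
    d_real_mm = 687.0                       # outermost-LED distance measured with a total station
    d_recon = [0.910, 0.915, 0.908, 0.912]  # same distance in the unscaled reconstruction, per frame
    scale = d_real_mm / np.mean(d_recon)    # Eq. 4.1
    t = np.array([0.31, -0.02, 0.95])       # unit-norm translation from the decomposition of E
    print("t_metric [mm]:", np.round(t * scale, 1))   # Eq. 4.2
```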
With our described pipeline, we achieve a robust calibration procedure that can be performed in the presence of static and moving light sources. No pre-conditioning of the volume is necessary and background training as well as manual masking can be omitted, which increases the system's ease of use during setup and maintenance. Furthermore, 60 4.3 Methodological Approach Figure 4.10: Resulting camera coordinate system for tracking. by re-using a tracking target for extrinsic calibration and scale estimation, additional equipment can be minimized. 4.3.4 Interference Filtering To provide robust target identication at each stage of a optical tracking system work- ow (extrinsic calibration, target tracking), static and moving interfering lights must be robustly ltered out. In unconstrained tracking environments, as described in Section 1.1, a varying number of ambient light sources (wall illumination, spot lights, reections, vehicle lights, ...) might exist. Figure 4.11: Wavelengths of various light sources. To evaluate the wavelength emission, we measured frequently occurring standard illumination sources with a spectrograph. Their emission curves are illustrated in Fig- ure 4.11. As depicted, almost all ambient light sources show infrared radiation. A portion 61 4. METHODOLOGY of the interferences can be ltered by inserting a longwave pass lter with a cut-on value of 780nm into the optical path. However, most of the interfering lights are still visible in the camera images and result in bright circular blobs, similar to the IR-LEDs from the target model. To robustly detect the target amongst static and moving interfering lights, we inves- tigated dierent concepts based on hardware and software ltering that are presented in the following. 4.3.4.1 Hardware-based Target Identication The main idea of the presented hardware-based ltering approaches is to detect the blobs of the target without requiring the knowledge of the target's geometric structure. Thereby, also point-based targets, consisting of a single LED, can be robustly segmented and tracked. The rst concept aims at changing the target's LED state (on/o) in two subse- quent frames. This can be accomplished by remotely controlling the LEDs via a wireless communication. The dierence in luminance in both frames can then be evaluated (Lu- minance Filtering) to robustly detect the LED's position. To change the LED's state, we rst evaluated a number of wireless communication technologies, such as RFID [70], ZigBee [155] and radio chips in the GHz band. Due to its low price, high data through- put and small form factor, the 2,4GHz Nordic nRF24L01+2 chip was nally chosen for wireless data transmission. The target control unit consists of the open-source platform Arduino Nano 3.0 [131], that features a ATmega328 micro-controller, and the circuit to interface with the Nordic nRF24L01+ radio chip. (a) First prototype of receiver (b) Receiver integrated in tracking target Figure 4.12: Radio module for target communication for luminance-based ltering. To extend the radio frequency reception range, the radio chips for both base station and target are equipped with 2.4GHz dipol-antennas with a power gain of 5 dBi3 and 2.2 2NordicSemiconductor: http://www.nordicsemi.com 3dBi: decibels-isotropic 62 4.3 Methodological Approach dBi, respectively. Thereby, a communication range of 120m with an estimated round- trip-time of 5ms can be provided. In Figure 4.12, the development process of the radio module is shown. 
Due to the implicit nature of the radio connection, LED state changes and image cap- turing cannot be precisely synchronized in time by a hardware trigger. For that reason, we further investigated a concept of ltering interfering lights by applying wavelength ltering using a motorized lter unit. As luminance ltering, it aims at detecting the target's blobs without requiring a predened and well-known target geometric structure by evaluating two subsequently captured frames. Since a single infrared longwave pass lter inserted in the optical path of the camera removes only those parts of the am- bient illumination that solely emit in the visible light spectrum, we used a motorized lter-wheel to be able to change the applied lters at run-time. (a) A motorized lter wheel [153] (b) The wheel tted into the casing Figure 4.13: Using a motorized lter wheel for wavelength-based ltering. The employed lter wheel4 is therefore equipped with two lters, a shortwave pass lter (VIS) to transmit all wavelengths shorter than the cut-o length of 780nm, and a longwave-pass lter (IR) to transmit all wavelengths longer than the cut-on length of 780nm. A stepper motor5 is used to control the lter wheel and is connected to the workstation over USB 2.0. With this setup, the change time between two adjacent lters is 200ms. To robustly couple the lter wheel with the camera, a solid casing was designed that xates the wheel with a customized apparatus in front of the camera. In Figure 4.13, the lter wheel and the developed camera encasement are depicted. To access both radio and lter wheel to control the LED state and to change between lters during run-time, a software module was developed, as depicted in Figure 4.14. As illustrated, the software processing accesses either the radio- or the wheel control to change the LED's state or the employed lter between two subsequently captured 4Thorlabs Motorized Fast-Change Filter Wheel FW103S/M 5Thorlabs T-Cube TST001 63 4. METHODOLOGY frames Ik, Ik+1. For luminance ltering, Ik is captured and recorded while the LEDs are switched on, Ik+1 with LEDs are switched o, respectively. For wavelength ltering, Ik is recorded with IR lter, Ik+1 with VIS lter, respectively. Figure 4.14: Pipeline to detect target features using hardware-based ltering. Thereby, the LEDs are only visible in Ik for both ltering approaches. In the next step, a binary lter mask M ∈ {0, 1} is computed. Therefore, Ik, Ik+1 are converted to binary images with a given threshold α to lter noise and areas with low luminance, resulting in m xn matrices B(k), B(k + 1) ∈ {0, 1}. To identify the LEDs in I(k), a negated pairwise Logical Implication6 is applied to each element i, j of B(k), B(k + 1), as denoted in Equation 4.5. Mij = ¬(B(k)→ B(k + 1)) := ¬(¬bij(k) ∨ bij(k + 1))i=1,...,m; j=1,...,n (4.5) Thereby, only the areas that show the LEDs are marked in M with a logical true. The mask is then applied to Ik to segment the LED's blobs and to dene the region-of-interest (ROI) that is subsequently used for binary mask processing. Finally, the blob's centroids are computed using the feature segmentation algorithms from Section 4.3.4.2. Our initial tests indicated promising results for both approaches to detect static LEDs in the presence of ambient interfering lights. However, as soon as the target was quickly moved a robust LED detection could not be provided due to the latency introduced by the round trip of the radio connection and by the time the wheel requires to change between two adjacent lters. 
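Setting the synchronization issue aside for a moment, the masking step of Equation 4.5 itself is simple: by De Morgan's law, ¬(B(k) → B(k+1)) reduces to "bright in frame k and not bright in frame k+1". The sketch below demonstrates this on two tiny synthetic frames; the threshold value and frame contents are made up.

```python
import numpy as np

def led_mask(frame_on, frame_off, alpha=128):
    """
    Binary mask M of Eq. 4.5: M = NOT(B(k) -> B(k+1)) = B(k) AND NOT B(k+1),
    where B(k) is frame k binarised with the luminance threshold alpha.
    frame_on : image captured while the LEDs are switched on (or with the IR filter)
    frame_off: the subsequent image with the LEDs off (or with the VIS filter)
    """
    b_on = frame_on >= alpha
    b_off = frame_off >= alpha
    return b_on & ~b_off

if __name__ == "__main__":
    frame_on = np.zeros((8, 8), np.uint8)
    frame_off = np.zeros((8, 8), np.uint8)
    frame_on[2, 3] = frame_off[2, 3] = 255   # static interfering light: bright in both frames
    frame_on[5, 6] = 255                     # LED blob: bright only while switched on
    M = led_mask(frame_on, frame_off)
    print("marked pixels (row, col):", [tuple(p) for p in np.argwhere(M)])
    # only (5, 6) survives; the static interference at (2, 3) is masked out
```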
To reduce the latency for wavelength ltering, a high current stepper motor or even a multi-spectral camera would be an interesting option to provide robust 6Logical implication is also known as Material conditional or Logical conditional. 64 4.3 Methodological Approach target tracking of a single LED or of a multi-LED target without requiring to evaluate the target's geometric properties. These ideas are subject of future research. Based on the initial evaluation, we decided to perform tracking solely using a software- based target identication pipeline to be able to provide fast, reliable and robust tracking of static and moving targets. This pipeline is presented in the next section and is em- ployed throughout the following chapters of this part. 4.3.4.2 Software-based Target Identication To robustly detect the target amongst static and moving ambient interfering lights, we developed a software-based identication pipeline, as depicted in Figure 4.16. It is built around a 2D model tting approach that exploits the permutation and perspective in- variant properties (see Section 2.2.2.1) of the target design (see Section 4.3.2). The target model must be thus trained once before it can be recognized during calibration and tracking. Model Training To obtain the unique properties of a target pattern, it is trained once in an o-line process to determine its Model, as illustrated in Figure 4.15. Figure 4.15: Pipeline to obtain the target's model. The following steps are performed: 1. The distances d1, d2, d3 (see Figure 4.5) between the target's LEDs are precisely measured using a total station. 2. Based on d1, d2, d3, the cross ratio λ is computed and used as input argument for the function J to obtain an initial estimate for p2range. The eigenvalue of the 65 4. METHODOLOGY moment matrix M as a measure for collinearity is set to an initial value such as ev 6= 0, ev < 0. 3. The target is captured at all intended tracking distances to obtain a sucient number of samples (images) for the complete tracking volume. 4. Each of the captured images is then processed and blob candidates are obtained by performing feature segmentation and classication (see Figure 4.16). p2range and ev are applied to the blob candidates and subsequently rened to account for noise of cross ratio and collinearity. 5. After the renement phase, the minimum and maximum length of the target in the 2D images over all images are measured to obtain a threshold thrange. 6. Finally, the obtained model is stored, containing p2range as the minimum and max- imum values of the pattern's p2-invariants, ev, as the collinearity error model and thrange. Model Identication After a new image (frame) is captured from the camera with the attached long-wave pass lter, all blobs are segmented (Feature Segmentation) as proposed in [84]. First, the camera image is transformed to a binary image using a dynamic threshold. Blobs are created by applying a connected component analysis as well as a circular Hough transform [49]. Figure 4.16: Pipeline for model identication. 66 4.3 Methodological Approach Next, the center of each blob (centroid) is determined using a luminance-weighted average of the connected pixels, which describe the blob's 2D position with sub-pixel accuracy. For further processing, the centroids are undistorted based on the Camera Calibration Matrix K (see Section 2.3.2.3). In the next step, each resulting blob is classied by performing shape- and size-based classication (Feature Classication). 
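Returning briefly to the training procedure outlined in the steps above, the following sketch illustrates the idea of collecting per-image invariants and deriving the stored ranges. It is a simplified stand-in: a plain cross-ratio and the smallest eigenvalue of the centroid moment matrix are used in place of the exact p2-invariant and collinearity error model of Section 2.2.2.1, and the blob coordinates are synthetic.

```python
import numpy as np

def collinearity_error(points):
    """Smallest eigenvalue of the 2x2 moment matrix of the centroids (0 for perfect collinearity)."""
    pts = np.asarray(points, float)
    centered = pts - pts.mean(axis=0)
    moment = centered.T @ centered / len(pts)
    return float(np.linalg.eigvalsh(moment)[0])

def cross_ratio(points):
    """Cross-ratio of four (nearly) collinear points, used here in place of the p2-invariant."""
    pts = np.asarray(points, float)
    d = pts[3] - pts[0]
    t = pts @ (d / np.linalg.norm(d))
    return ((t[2] - t[0]) * (t[3] - t[1])) / ((t[3] - t[0]) * (t[2] - t[1]))

def train_model(training_blob_sets):
    """Collect the per-image invariants and return the refined model ranges (steps 4-6 above)."""
    crs = [cross_ratio(s) for s in training_blob_sets]
    evs = [collinearity_error(s) for s in training_blob_sets]
    lengths = [float(np.linalg.norm(np.asarray(s[3], float) - np.asarray(s[0], float)))
               for s in training_blob_sets]
    return {
        "p2_range": (min(crs), max(crs)),         # tolerance band of the invariant
        "ev_max": max(evs),                       # collinearity error model
        "th_range": (min(lengths), max(lengths))  # min/max 2D target length [px]
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = np.array([[0, 0], [130, 26], [310, 62], [560, 112]], float)  # synthetic target blobs
    # the same target seen at three distances, with a little centroid noise
    samples = [base * s + rng.normal(0.0, 0.3, base.shape) for s in (1.0, 0.6, 0.25)]
    print(train_model(samples))
```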
The minimum and maximum values for the size-lter can be manually dened to provide quick conguration for dierent tracking ranges. The classication results in circular-shaped blobs (Blob Candidates) that diameters lie within the specied range. In practice however further ltering must be performed since interfering lights can have a similar size as the target's IR-LED blobs. Based on approaches [54, 76, 75, 81], a 2D Model Fitting within the set of remaining blob candidates is performed. As described in Section 2.2.2.1, the p2- Invariants of the blob candidates as well as their collinear properties are computed and compared to the pre-calculated target model. Thereby, false positive blob candidates are rejected and the target's blobs are determined. Due to the permutation invariant properties of the computed p2-invariants, an ordered set of blobs St = {pti}, i = 1...N, p ∈ R2 for each image at time t is output to be further used for calibration or tracking. 4.3.5 3 Degree-Of-Freedom Tracking To track optical markers in 3D space, the following two problems have to be solved: 1) the 2D blobs have to be identied throughout all camera views and then transformed to 3D marker locations, and 2) the 3D markers need to be tracked through time. The online image-processing pipeline for tracking is depicted in Figure 4.17. Figure 4.17: Tracking pipeline. 67 4. METHODOLOGY Given an intrinsically and extrinsically calibrated, shutter-synchronized stereo cam- era rig, the tracking is performed as follows. After a new frame is received from each camera, blob candidates are segmented and classied in both frames (see Section 4.3.4.2). Our approach uses the projective invariant properties that were obtained during model training (see Section 4.3.4.2) to search for a pattern within an image. The 3D position of the pattern's optical marker are only computed if and after the pattern was found in the image. To minimize computational load, the model identication is only performed in Image 1 by applying model tting within the set of all blob candidates. After the target blobs have been determined in Image 1, their correspondences have to be identied in Image 2 amongst all blob candidates that result from the feature classication by ex- ploiting the epipolar geometry, which is encapsulated in E (see Section 2.3.3.1). For each target blob in Image 1, a search for its corresponding blob is performed along its epipolar line (Stereo Correspondence) in Image 2 (see Section 2.3.3.2). Thereby, corresponding features over multiple camera views can be identied (Multiple-View-Correlation). De- pending on the camera setup of the tracking system, the baseline might be large and image pairs thus have been taken from widely diering viewpoint. Following [26, 56], it is advisable in those cases to perform image rectication to produce a pair of matched epipolar projections before stereo correspondence analysis. By applying model tting within the 2D projections of the target's IR-LED not only a drastically reduced set of correspondence candidates and ambiguities is obtained but the combinatorial complexity of the multiple-view correlation problem can be consid- erably decreased as well. By performing a projective triangulation between each cor- related 2D blob-tuple (Projective Reconstruction), the 3D-coordinate of each optical marker can be reconstructed. 
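The epipolar search just described can be sketched as follows. For a calibrated rig, the essential matrix E is mapped to a fundamental matrix F = K^{-T} E K^{-1} so that epipolar lines can be expressed directly in pixel coordinates; the F below is constructed for an idealised rectified rig (identical intrinsics, pure horizontal translation), so the epipolar lines degenerate to image rows, and all coordinates are synthetic.

```python
import numpy as np

def epipolar_line(F, x1):
    """Epipolar line l' = F x1 in image 2 for a pixel x1 = (u, v) of image 1."""
    l = F @ np.array([x1[0], x1[1], 1.0])
    return l / np.linalg.norm(l[:2])   # normalise so that |a*u + b*v + c| is a pixel distance

def point_line_distance(line, x2):
    a, b, c = line
    return abs(a * x2[0] + b * x2[1] + c)

def match_along_epipolar(F, target_blobs_im1, candidates_im2, max_dist_px=2.0):
    """For each target blob of image 1, pick the candidate of image 2 closest to its epipolar line."""
    matches = []
    for x1 in target_blobs_im1:
        line = epipolar_line(F, x1)
        dists = [point_line_distance(line, x2) for x2 in candidates_im2]
        best = int(np.argmin(dists))
        matches.append(candidates_im2[best] if dists[best] <= max_dist_px else None)
    return matches

if __name__ == "__main__":
    K = np.array([[2000.0, 0.0, 700.0], [0.0, 2000.0, 512.0], [0.0, 0.0, 1.0]])
    t_x = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])  # skew matrix of t = (1, 0, 0)
    F = np.linalg.inv(K).T @ t_x @ np.linalg.inv(K)   # rectified rig: epipolar lines are image rows

    target_blobs_im1 = [(400.0, 300.0), (520.0, 310.0)]               # found by model fitting
    candidates_im2 = [(80.0, 600.0), (395.0, 300.4), (518.0, 309.7)]  # all classified blobs
    print(match_along_epipolar(F, target_blobs_im1, candidates_im2))
```

In the example, each target blob from image 1 is matched to the candidate in image 2 whose distance to the corresponding epipolar line stays below the threshold, while the unrelated interference blob is ignored.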
Following [84], we apply the standard Singular Value Decomposition (SVD) to obtain the initial 3D estimate for each blob-tuple, followed by bundle adjustment [39] with a Levenberg-Marquardt non-linear least squares algo- rithm for renement. This results in a 3D point cloud of the reconstructed model points T = {P1, P2, P3, P4}, P ∈ R3. To further increase the algorithm's robustness against out- liers of the model tting, the model points T are validated with a threshold to account for noise against the target's geometric constraints d1, d2, d3 (see Section 4.3.4.2) and volume. Based on T and a given distance depi as the real distance between the outermost IR-LED and the epicenter of the target, the target's epicenter C ∈ R3 can be calculated (Position Estimation) as follows. C = P4 − (depi ∗ m̂) (4.6) Therefore, we normalize the vectors ~a = P2P1, ~b = P3P2, ~c = P4P3, resulting in â, b̂, ĉ. By calculating the arithmetic mean of â, b̂, ĉ, we determine the mean direction m̂ which is applied according to Equation 4.6. Thereby, an arbitrary point along the line can be determined, resulting in the 3D pose of the target. In order to enhance the robustness when tracking the target through time, the result- ing target pose can be fed into a recursive lter (Predictive Filtering). Thereby, jitter can be reduced and the system's intrinsic latency can be compensated. Since we currently aim for position tracking, the non-extended Kalman Filter [3, 20] is therefore employed. 68 4.4 System Development 4.3.6 Occlusion Recovery If a target's IR-LED and an interfering light source lie on the same line of sight of the camera, their corresponding blobs can overlap in the images. Furthermore, parts of the target can be occluded, i.e. when the target gets partly hidden behind an object in the scene. Our model tting approach requires four optical markers. Currently, the proposed target identication pipeline can compensate one occluded marker while retaining the capability of detecting the target within the set of blob candidates. After projective reconstruction, the 3D positions of occluded markers can be reconstructed based on the target's geometric model and the resulting 3D point cloud. The recovery of occluded IR-LEDs optimizes the accuracy of the 3D position estimate of the target's epicenter. With this recovery functionality, loss of tracking can be reduced in cases of occlusions or over-blooming by (stronger) interfering light sources. 4.4 System Development Based on the methodological approach, we developed a hardware- as well as software system to test our tracking system in large, unconstrained indoor environments. 4.4.1 Hardware Our hardware prototype comprises targets, the vision system and a notebook as main processing unit. The schematics of the hardware components as well as cabling and power supply are illustrated in Figure 4.18. Figure 4.18: The cabling of the hardware prototype. 69 4. METHODOLOGY Each target consists of a minimum of four IR-LEDs that can be remotely controlled via 2,4GHz radio hardware module. It has to be noted that the wireless LED control is not used for target identication during tracking, as it was proposed in Section 4.3.4.1 but rather for convenience during the evaluation that is presented in Chapter II.5. To be able to remotely switch the target on and o, the hardware setup from Section 4.3.4.1 is used. Since further target specications depend on the given wide area tracking task, additional target design details are given in Chapter II.5. 
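Returning to the reconstruction step of Section 4.3.5, the sketch below shows the linear (SVD-based) triangulation of one blob pair and the epicenter computation of Equation 4.6. The bundle-adjustment refinement [39] is omitted, the projection matrices are synthetic, and the sign convention of the mean direction m̂ is one plausible reading of the equation.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (SVD-based) triangulation of one corresponding blob pair; returns the 3D point."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def target_epicentre(L, d_epi):
    """
    Eq. 4.6: C = P4 - d_epi * m_hat, with m_hat the mean unit direction of the segments
    P1->P2, P2->P3, P3->P4 of the reconstructed line target L = [P1, P2, P3, P4].
    (Direction convention assumed here: from P1 towards P4.)
    """
    L = np.asarray(L, float)
    segments = np.diff(L, axis=0)
    units = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    m_hat = units.mean(axis=0)
    m_hat /= np.linalg.norm(m_hat)
    return L[3] - d_epi * m_hat

if __name__ == "__main__":
    K = np.array([[2000.0, 0.0, 700.0], [0.0, 2000.0, 512.0], [0.0, 0.0, 1.0]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-6000.0], [0.0], [0.0]])])  # 6 m baseline [mm]

    def project(P, X):
        x = P @ np.append(X, 1.0)
        return x[:2] / x[2]

    X_true = np.array([1200.0, -300.0, 25000.0])     # a marker 25 m in front of camera 1 [mm]
    X_hat = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
    print("triangulated point [mm]:", np.round(X_hat, 2))

    L = [X_true + i * np.array([150.0, 30.0, 0.0]) for i in range(4)]  # four collinear LEDs
    print("target epicentre [mm]:  ", np.round(target_epicentre(L, d_epi=200.0), 2))
```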
The system's vision system consists of two Dalsa Genie HM1400/XDR cameras which feature low heat evolution and a global-shutter 1 mono CMOS-sensor with high NIR7 spectral sensitivity. Low heat evolution and large image sensors yield little sensor noise to minimize jitter in the camera image. Together with the high resolution image sensors, precise segmentation can be provided even at longer distances. The cameras oer high global shutter speed to minimize motion blur when the target is moving fast. It is capable of delivering 60 frames per second (fps) with a resolution of 1400x1024 pixels. It provides external trigger functionality and uses the GigE Vision [130] standard. Thereby, lossless image transmission while providing long cable lengths can be guaranteed. Following the results from Section 4.2, both cameras are equipped with a EdmundOptics NT63-246 high-resolution and fast (f/1.4-f/16) xed focal lens (f = 25mm). To lter light from the visible wavelength spectrum, we attached a Heliopan RG-780 long wave pass lter allowing only wavelength above 780nm to transmit. Both cameras are powered by an external 12 VDC (1,5A) supply. Both cameras are shutter-synchronized from a square-wave current loop signal that is generated by the trigger unit with a built-in programmable oscillator. The trigger unit comprises two BNC connectors8 and the trigger signal, generated by an Arduino Uno board [131]. Similar to the target, the Arduino Uno interfaces with the 2.4GHz radio module, consisting of a Nordic nRF24L01+ chip and a 5dBi dipol-antenna. Via USB 2.0, the Arduino board connects to the mobile workstation for communication with the tracking software as well as for power supply. The workstation runs the software prototype and features two Gigabyte Ethernet host adapters (1x built-in, 1x ExpressCard) to interface via ISO/IEC 11801 (Category 6) cable with the cameras. The components of the base station are centrally powered by one external 240 ACV supply. 4.4.2 Software The developed software framework follows a three-tier-architecture comprising hardware abstraction, a processing layer and data visualization on a graphical user interface, as shown in Figure 4.19. The processing core consists of loosely-coupled modules for the oine processes intrinsic calibration and model training, as well as for the online pro- cesses target identication, extrinsic calibration and tracking. The modules and their 7Near Infrared 8BNC: Bayonet Neill Concelman connector 70 4.4 System Development functionalities are centrally accessed by the controller component that delivers data from the processing layer to the GUI. Figure 4.19: Software architecture and modules. Our software framework prototype is implemented in C/C++ and MATLAB. For the intrinsic camera calibration, the open-source MATLAB Camera Calibration Tool- box [133] was integrated. With the open-source Arduino IDE [131], we developed the embedded component for camera synchronization and radio communication. Training and intrinsic calibration are performed in an oine process and are im- plemented as stand-alone software packages. The graphical user interface of the model training component is shown in Figure 4.20. Figure 4.20: User interface of semi-autonomous Model Trainer. 71 4. METHODOLOGY Based on a selected model training set, the model properties are automatically ex- tracted and the user is informed about problems during autonomous model identication. 
In case a problematic image is detected, the user can manually adjust the collinearity and p2-invariant range or discard the image from the training set. If no problematic training image was found, the estimated model properties are stored in an XML model file.

(a) Collinearity threshold is too large. (b) Insufficient p2-invariant range.
Figure 4.21: Examples of incorrect model recognition during training.

Figures 4.20, 4.21a and 4.21b show instructive examples of the model detection in training images that were captured in unconstrained settings. For visualization purposes, the camera images in all figures have been inverted. The image in Figure 4.20 was captured in an outdoor test environment during night (see Section 5.5). In this example, target reflections in a water puddle cause the model training to detect the target twice in the image based on the given projective invariant settings, as indicated by the red arrows. Since the model detection performs correctly with the provided collinearity and p2-invariant range, no manual adjustment of the values is desired and the training image can simply be discarded from the set. In Figure 4.21, two examples are given for incorrect model recognition because of insufficient model properties. In Figure 4.21a, the collinearity threshold is too large, resulting in an incorrect identification of non-collinear blobs. In Figure 4.21b, the p2-invariant range is incorrect for the applied model, thus no model was identified in the depicted image. In both cases, the system proposes improved values for collinearity and p2-invariant range that the user can either apply directly or adjust manually to increase the accuracy of the model detection.

The graphical user interface of the Controller module for analyzing the input data during calibration and tracking is depicted in Figure 4.22. In this example, the same situation as in Figure 4.20 is shown. However, due to filtering and correspondence analysis, the blobs that are reflected in the water (indicated by the red arrow) are not considered for model fitting and subsequent tracking, demonstrating the robustness of the model identification pipeline.

Figure 4.22: User interface of the Controller to analyze data during calibration and tracking.

All parameters for feature segmentation, model fitting and tracking are centrally stored in one XML configuration file that can be edited and is read during system start-up. The parameters for feature segmentation and model fitting are shown in Listing 4.1.

Listing 4.1: Configuration for image processing and model fitting. (The XML markup was lost during text extraction; the surviving values specify the image resolution of 1392x1024 px, the intrinsic calibration files ..\intrinsic_calibration_204.xml and ..\intrinsic_calibration_216.xml, the calibration data path ..\calibration\wheelLoader\baseline_3m, the output format TXT, several numeric segmentation and model-fitting parameters, and the model directory ..\appliedModels\.)

Information about the hardware abstraction and access is stored in the configuration file as well, as illustrated in Listing 4.2.

Listing 4.2: Configuration for hardware access. (The XML markup was lost during text extraction; the surviving values specify the two DalsaXDR1400HM cameras, one with serial number S4405216, with the IP addresses 192.168.1.2 and 192.168.2.2, as well as the radio interface on COM8 at 57600 baud together with further serial and timing parameters.)

4.4.3 System Costs

As stated in Section 1.1, cost efficiency is one of the objectives of the presented tracking system. Therefore, we minimized the amount of necessary hardware and focused on off-the-shelf components as well as open-source hardware and software. The current hardware prototype costs in total ∼ €7300, excluding camera and target casings.
The price includes both cameras (each e2000 with IR lter), lenses (each e600), notebook (e2000), the synchronization unit (e30 for Arduino, BNC adapters and cabling) and technical parts for the target (e60 for Arduino, radio chip, battery, wires, IR-LEDs and target material). 74 Chapter 5 Experimental Results Based on the methodological approach from Chapter II.4 and the implemented prototype, the system's capabilities were experimentally evaluated within three dierent application scenarios that share the requirements of wide area tracking in an unconstrained and even harsh indoor environment: 1. User tracking for mixed reality applications 2. Handheld target tracking for tunneling 3. Machine guidance for mining In each scenario, the robustness of target identication and the accuracy of the relative 3D position estimation was tested with the platform from Section 5.1 and evaluated using the performance measures as described in Section 5.2. 5.1 Test Platform We tested our system on a Lenovo W520 notebook, featuring an Intel Quadcore i7 2820QM at 2,3GHz, 8 GB memory and Windows7 (64bit). The notebook acts as pro- cessing core unit that runs the software prototype. It features two Gigabyte Ethernet host adapters (1x built-in, 1x ExpressCard) to interface via Category 6 cable with the cameras. 5.2 Test Cases & Performance Measures As described in Section 2.1.1.2, the sources of error for an optical tracking system origi- nate from a combination of optical aberrations, image processing inaccuracies as well as varying lighting situations. Since these factors potentially inuence both the estimation of the external camera parameters as well as the position tracking, we separated them into two test cases in each of the three scenarios. 75 5. EXPERIMENTAL RESULTS 5.2.1 Calibration Performance Calibration performance was measured by evaluating the target identication robustness and the subsequent accuracy of the estimated relative 3D positions. Therefore, the detected blob centroids p ∈ R2 in both cameras images are plotted as a function of 2D measurements over time, as dened in Equation 5.1. f(x, y) = px,y(tk), k = 1, ..., n. (5.1) Thereby, false positive and loss of calibration target identication, target occlusions and the feature distribution across the image are visualized and can be evaluated. The calibration performance is further examined by evaluating the relative accuracy of the estimated 3D positions. Their implicit dependency on the determined camera parameters allow for conclusions to be drawn about the quality of the calibration. 5.2.2 Tracking Performance Following the performance measures from Section 2.1.1.1 that are applied to measure the capabilities of a tracking system, the following measures are evaluated during testing the tracking performance of the three dierent tracking scenarios. Relative Position Accuracy To obtain a valid ground truth for evaluating the rela- tive position accuracy of the estimated 3D target position, the geometric distance between the two outermost target's IR-LEDs is rstly measured to millimeter precision using the Leica TPS700. Thereby, ground truth dbar is determined. During tracking, the position of target's IR-LEDs L1..L4 ∈ R3 are calculated for each frame i and used for obtaining d̂bar,i = ‖L4, L1‖, where ‖ denotes the Euclidean norm. To avoid distortion of the 3D position reconstruction, no predictive ltering is applied for testing. 
The estimated bar length d̂_bar is then used to obtain the arithmetic mean µ̂_bar and standard deviation σ̂_bar over all processed frames i = 1...n; its absolute arithmetic mean deviation |ε̂_bar| and root mean square are denoted as follows.

d̂_bar(RMS) = sqrt( (1/n) · Σ_{i=1}^{n} d̂_bar,i² )    (5.2)

d̂_bar(RMS) is subsequently employed to obtain the deviation x_RMS(bar), as an accuracy measure of the distance between the two outermost LEDs, and x_RMS(P), as a measure of the relative accuracy of a single LED. Both measures are obtained as follows:

x_RMS(bar) = d_bar − d̂_bar(RMS)    (5.3)

x_RMS(P) = x_RMS(bar) / 2    (5.4)

Thereby, the relative 3D position accuracy of a single target point can be evaluated against a ground truth throughout the tracking volume.

Position Stability: Based on the estimated target IR-LEDs L1..L4, the target's epicenter C = C_{x,y,z} ∈ R³ is determined during tracking, as described in Section 4.3.5. To evaluate the static jitter of the system, and thus the stability (inner accuracy) of the 3D point estimation, the standard deviation σ̂ of C_x, C_y, C_z as well as of C over a sequence of consecutive frames is calculated and used to evaluate the system's intrinsic tracking performance.

Tracking Latency: To obtain a measure for time-dependent tracking performance, the system's latency is measured as the time delay between a change of the tracker pose and the moment the system outputs the newly estimated pose.

5.3 Tracking for Mixed Reality

Figure 5.1: Wide area user tracking in a mixed reality setup.

Wide area user tracking can be applied in a number of application scenarios, such as user tracking for mixed reality in environments using redirected walking approaches [144, 154], tracking of artists on stage, or tracking of personnel in workshops and factories. Figure 5.1 depicts an example scenario for user tracking in mixed reality environments that is characterized by static and moving light sources and distances up to 30m.

5.3.1 Target Design

Following the design guidelines from Section 4.3.2 to allow for robust target identification (Section 4.3.4) and occlusion recovery (Section 4.3.6), we developed a line target. It offers continuously adjustable positioning of the IR-LEDs by fixing each LED separately with nuts on a rigid bar. This ensures a rapid arrangement of the required IR-LEDs in a permutation invariant geometric constellation.

Figure 5.2: Wide area user tracking in a mixed reality setup.

Applying the proposed target design to a semi-immersive VR scenario in which the user is tracked in front of a projector wall, a single line target is sufficient to determine the user's (head) 3D position.

Figure 5.3: Target design for head tracking.

In a fully immersive VR environment, the user moves freely in space and wears a head-mounted display for visualization. In such a scenario, using a single line target for tracking in combination with two cameras results in occlusions as soon as the user turns around. Since we want to minimize the amount of (costly) vision hardware, the occlusion problem can be compensated by applying a redundant target setup with unique targets for user head tracking, as depicted in Figure 5.3.

5.3.1.1 Prototype

The target prototype has a total length of 687mm and is equipped with four IR-LEDs OSRAM 4850 E7800 in a permutation invariant constellation. Each IR-LED emits a peak wavelength of 850nm with a radiant intensity of 40mW/sr and features a viewing half angle of ±23°.
Thereby, robust feature segmentation up to a distance of 30m 1mW/sr: milli watts per steradian 78 5.3 Tracking for Mixed Reality can be performed. With the employed vision hardware setup from 4.4.1, that features a 1 CMOS sensor with a resolution of 1400x1024 pixels, a minimum distance of 130mm between two neighboring LEDs is advisable with a shutter speed of 100µs to avoid blob overlaps in the camera image during rotations and at large distances. With this proto- type, tracking in the intended volume can be provided. (a) Tracking target (b) IR-LED with sphere Figure 5.4: Target prototype attached on a HMD. Tracking in a smaller volume automatically leads to a decreased target size with the above mentioned setup. To further reduce the physical target size for volumes up to 30m, LEDs with dierent radiant intensity properties are applicable. 5.3.2 Test Environment Since we were lacking access to an indoor environment that features the intended track- ing ranges, we deployed the prototype in an outdoor environment during twilight and night. We added light sources (neon lights, halogen spots up to 1500W) to simulate wall illuminations, reections and locomotive interfering lights. Thereby, we established a controllable realistic simulation of the intended tracking scenario. Both calibration and tracking were performed in an environment with static as well as moving interfering lights. We employed a baseline dbase ≈ 10m and tracking distances between the vision system and target dtrack of 7, 5 − 30.0m. 5.3.3 Model Training As the target's prototype from Section 5.3.1.1 is used for calibration and tracking, its model was obtained in an oine process, as described in Section 4.3.4.2. First, the real distances d1, d2, d3 between the target's LEDs were precisely measured with millimeter precision using a Total Station (Leica TPS700). Afterwards, the target's projective invariant properties were calculated by evaluating 110 captured camera images across the entire tracking volume from 5 to 30m. 79 5. EXPERIMENTAL RESULTS 5.3.4 Camera Calibration Before setup, both cameras were intrinsically calibrated in an oine process, as described in Section 4.3.3.1, using 34 images that captured the retro-reective chessboard pattern from dierent angles and distances. For extrinsic calibration and subsequent tracking, the stereo camera system was setup with the following parameters to account for tracking distance and poor lighting situation: ˆ Real baseline dbase ≈ 10m ˆ Yaw-rotation βcam1 = 30◦, βcam2 = −30◦ ˆ Lens focus ∞ ˆ Aperture 1.4/f ˆ Shutter speed 1000µs Using the tracking target from Section 5.3.1.1, we performed the calibration at a distance around 15.0m from the cameras. (a) Right view (b) Left view Figure 5.5: Corresponding blob traces used for extrinsic calibration. We ran three dierent calibration tests with ∼ 1200 frames each to evaluate the ro- bustness of the calibration procedure. As depicted in Figure 5.5, our system robustly identies the target despite static and locomotive interfering lights, resulting in con- tinuous blob traces of the two outermost IR-LEDs. As illustrated, the blob trace was interrupted at some points due to complete occlusion of the target because of obstacles in the environment. Despite the unconstrained test calibration environment, our system robustly estimated the Essential Matrix E at each run. In average, E was determined with a duration of ∼ 110s. The second factor for evaluating the calibration are the tracked 3D points. 
We found the calibration yielding consistent 3D point estimates for all tracking distances, as pre- sented in detail in Section 5.3.5. 80 5.3 Tracking for Mixed Reality 5.3.5 3D Position Accuracy To evaluate the accuracy of relative 3D position estimation, we performed measurements at six dierent distances between camera and target, denoted as dtrack for each calibration procedure. At each accuracy run, the 3D coordinate of each target's IR-LED L1..L4 as well as of the target's epicenter C = Cx,y,z was estimated based on 300 consecutive frames. Thereby, accuracy and stability were evaluated for the entire tracking volume. The obtained xRMS(P ) values for each calibration run and each tracking distance dtrack are listed in detail in Table 5.1. Calibration 1 Calibration 2 Calibration 3 dtrack xRMS(P ) xRMS(P ) xRMS(P ) 5m 3.39 [mm] 2.99 [mm] 1.78 [mm] 10m 4.12 [mm] 3.91 [mm] 2.63 [mm] 15m 4.76 [mm] 4.54 [mm] 4.58 [mm] 20m 6.08 [mm] 6.23 [mm] 7.47 [mm] 25m 6.64 [mm] 6.97 [mm] 8.92 [mm] 30m 7.44 [mm] 7.96 [mm] 9.22 [mm] Table 5.1: Relative accuracy xRMS(P ) of three independent calibrations. In Figure 5.6, the arithmetic mean of xRMS(P ) over all three calibration runs with respect to the tracking distance is depicted. Figure 5.6: Mean of relative accuracy xRMS(P ) over all three calibrations. 81 5. EXPERIMENTAL RESULTS 5.3.6 3D Position Stability To evaluate static jitter of the system and thus the stability (inner accuracy) of the system, we xated the target and tracked it over a sequence of 200 consecutive frames. In each frame, Cxyz was calculated to determine the empirical standard deviation σ̂x, σ̂y and σ̂z of the target's center of gravity. Throughout the entire tracking volume and across the three calibration runs, we found sub-millimeter deviation for 3D position estimation with σ̂x = 0.05mm, σ̂y = 0.03mm, σ̂z = 0.11mm, resulting in an overall mean standard deviation of σ̂ = 0, 06mm for C. 5.3.7 Tracking Performance To determine the system's capability to continuously track a target throughout the en- tire tracking space, we moved it through the whole volume. The resulting 3D position reconstruction of each target's IR-LED is illustrated in Figure 5.7. Figure 5.7: 3D position tracking from 5 − 30m. Depending on the number of interfering lights, our system identies and tracks a target with a latency of ~69ms within the unconstrained test environment. 5.4 Hand-held Target Tracking for Tunneling To further exploit the capabilities of the developed tracking system beyond application scenarios for mixed reality, it was tested in an underground scenario, using a hand- held target to track the 3D position of arbitrary static points or the moving target over time. As described in Section 3.3, existing technology lacks the ability of tracking a fast moving target, tracking of multiple targets as well as tracking without manual sighting. 82 5.4 Hand-held Target Tracking for Tunneling Figure 5.8: Tracking situation in an underground environment. The intended underground tracking scenario, such as a tunnel or a mine, is illustrated in Figure 5.8. Two cameras are directed towards the tracking volume and connected to one processing unit. As soon as the hand-held target comes into sight of the cameras, tracking of the target's 3D position automatically starts. 
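Before turning to the underground results, it is worth recalling how stereo depth uncertainty scales with distance and baseline; this first-order relation is consistent with the growth of x_RMS(P) in Table 5.1 and with the benefit of the larger baseline reported later in Section 5.4.6, although it is not a derivation given in this thesis. The pixel pitch and sub-pixel disparity noise below are assumptions, so only the quadratic growth with Z and the 1/B dependence should be read from the numbers.

```python
def depth_uncertainty_mm(Z_m, baseline_m, focal_mm=25.0,
                         pixel_pitch_um=7.4, disparity_noise_px=0.2):
    """First-order stereo depth error: sigma_Z ~ Z^2 / (f * B) * sigma_d."""
    f_m = focal_mm / 1000.0
    sigma_d_m = disparity_noise_px * pixel_pitch_um * 1e-6
    return (Z_m ** 2) / (f_m * baseline_m) * sigma_d_m * 1000.0   # result in millimetres

if __name__ == "__main__":
    for Z in (10, 30, 50, 70):
        print(f"Z = {Z:2d} m   sigma_Z ~ {depth_uncertainty_mm(Z, 6.0):5.1f} mm (6 m baseline)"
              f"   {depth_uncertainty_mm(Z, 12.0):5.1f} mm (12 m baseline)")
```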
Compared to the previous scenario from Section 5.3, the tracking system does not only need to be able to cope with static and moving interferences, such as wall illumination and (strong) vehicles lights, but also with larger distances and harsh environmental conditions, such as dust or dirt. Dust, as a large number of small particles in the air, can inuence the visibility of the target, especially at long distances, and hence decrease the quality of feature segmentation during calibration and tracking. To account for these additional challenges, a specialized hand-held target was developed and all vision components were carefully encased to enable tracking from 30 − 70m. 5.4.1 Target Design Following the design guidelines from Section 4.3.2, the core geometric constellation of our target design constitutes a line approach. Figure 5.9: Multiple unique target constellations. 83 5. EXPERIMENTAL RESULTS As depicted in Figure 5.9, our target design provides an array of holes at xed dis- tances, in which one or multiple IR-LEDs can be mounted. This allows for the rapid arrangement of multiple IR-LEDs in a permutation invariant geometric constellation. Furthermore, multiple unique constellations can be easily designed to simultaneously track one or more targets in the same tracking volume. To be able to test the setup with planar patterns in the future as well, it provides a rectangular area at one end. 5.4.1.1 Tracking Scenarios With the proposed design for the tracking target, the 3D position of a static point can be measured. This is a common tunneling task. Since the target features a 20cm long tip without any optical markers attached, also points that are not visible to the cameras can be tracked, as shown in Figure 5.10. Thereby, the disadvantage of vision-based tracking systems that require a line-of-sight between cameras and measured point can be compensated to a certain extent. (a) Static (b) Moving Figure 5.10: 3D position estimation of visible or invisible static and moving target's tips. As the target is freely moved in space, as depicted in Figure 5.10b, the 3D position of the target's tip is continuously tracked. 5.4.2 System Prototype The target prototype was developed in cooperation with Geodata Ziviltechniker GesmbH, Austria and is depicted in Figure 5.11. Figure 5.11: Developed target prototype. 84 5.4 Hand-held Target Tracking for Tunneling The maximal distance between the two outermost IR-LEDs is 820mm, while the targets total length is 120, 0cm. The target is equipped with six IR-LEDs OSRAM 4850 E7800 to be able to construct a planar pattern in future as well. However, all experimental results are based on four collinear LEDs. As in the previous test scenario, each IR-LED emits at a peak wavelength of 850nm with a radiant intensity of 40mW/sr and features a viewing half angle of ±23◦. A minimum distance of 175mm between two neighboring LEDs with a shutter speed of 1000µs is required to ensure robust feature segmentation up to a distance of 70m. This distance was empirically determined with the given hardware setup, as described in Section 4.4.1, that features a 1 CMOS sensor with a resolution of 1400x1024 pixels. In Figure 5.12, details of the prototype are shown, including the coating of the IR-LED as well as dampness-proof cabling. (a) Single IR-LED (b) Cabling (c) Control box (d) Electronics Figure 5.12: Details of the developed target prototype. 
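The 175mm minimum LED spacing was determined empirically; a simple pinhole projection makes the order of magnitude plausible. The pixel pitch used below is an assumed value, not a specification quoted in this thesis.

```python
def projected_led_separation_px(spacing_m, distance_m, focal_mm=25.0, pixel_pitch_um=7.4):
    """Pixel distance between two neighbouring LEDs under a simple pinhole model."""
    separation_on_sensor_mm = focal_mm * spacing_m / distance_m
    return separation_on_sensor_mm * 1000.0 / pixel_pitch_um

if __name__ == "__main__":
    for spacing_m, distance_m in ((0.130, 30.0), (0.175, 70.0)):
        px = projected_led_separation_px(spacing_m, distance_m)
        print(f"{spacing_m * 1000:.0f} mm LED spacing at {distance_m:.0f} m -> {px:.1f} px in the image")
```

At 70m the two neighbouring blobs are still separated by a handful of pixels, which leaves room for the blob diameters and for foreshortening when the target is rotated.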
All electronic components for LED control, radio and power supply are robustly encased in the control box, which features feedback LEDs to inform the user about the current tracking state. Furthermore, each camera was encased separately to be protected against dampness and dust. The components of the base station, comprising a notebook with power supply, camera trigger and radio, were encased as well for protection and transportability.

Figure 5.13: Robust and dampness-proof encasement of cameras and base station: (a) encased camera, (b) camera, (c) base station, (d) cabling.

5.4.3 Test Environment

We deployed the prototype in an underground metro station that offers a long-range tracking volume with static illumination characteristics, similar to a tunnel construction site. Furthermore, we dynamically applied moving light sources by hand, i.e. halogen lights of up to 1500W, to establish a controllable and realistic simulation of the application scenario, as shown in Figure 5.14. Again, both calibration and tracking were performed in an environment with static as well as moving interfering lights.

Figure 5.14: Test environment in a metro underground station: (a) cameras facing into the test environment, (b) light situation during calibration.

With respect to underground measurement scenarios, we performed calibration and tracking tests with baselines dbase from 6-12m and distances between the vision system and target dtrack from 30-70m. Therefore, we prepared our test volume by measuring and marking fixed spatial points on the ground within the tracking volume at distances of dtrack = 30, 40, 50, 60, 70m, using a Leica TPS700.

5.4.4 Model Training

As the target prototype from Section 5.4.2 is used for calibration and tracking, its model again was obtained in an offline process, as described in Section 4.3.4.2. First, the real distances d1, d2, d3 between the target's LEDs were measured with millimeter precision using a total station (Leica TPS700), and dbar = 820mm was obtained. Afterwards, the target's properties were calculated by evaluating 205 captured camera images across the entire tracking volume from 30-70m. To enhance the robustness of the obtained model, the target was also rotated during training.

5.4.5 Camera Calibration

Before setup, both cameras were intrinsically calibrated in an offline process, as described in Section 4.3.3.1, using 44 images captured from different angles and distances. For extrinsic calibration and subsequent tracking, the stereo camera system was set up with the following parameters to account for the tracking distance and the poor lighting situation:

• Real baselines dbase ≈ 6-12m
• Lens focus ∞
• Aperture f/1.4
• Shutter speed 1000µs

Upon each physical re-configuration of the stereo camera system, we performed extrinsic calibrations at various distances dcalib between camera and target with a total number of ~1400 frames at each run.

Figure 5.15: Calibration with dbase ≈ 6m at distances of (a) 30m, (b) 50m and (c) 70m.

Again, our system had to continuously identify the target despite the interfering lights in the tracking volume. Figure 5.15 depicts the continuous feature segmentation and the resulting blob traces of the two outermost IR-LEDs for a baseline dbase ≈ 6m. As shown in Figure 5.15, for all dcalib our system robustly detects the target and can provide continuous blob traces.
It is furthermore shown how the coverage of blob traces in the camera images decreases as the distance between cameras and target increases. With decreasing blob coverage, a decrease in the accuracy of the estimated extrinsic parameters could be observed. The calibration tests indicate the importance of well-distributed blob coverage in the image to obtain an accurate extrinsic calibration result.

5.4.6 Accuracy & Stability of 3D Position Estimation

To evaluate the accuracy and stability of the relative 3D position estimation, we fixed the target's tip onto the previously measured spatial markers on the ground. Next, we performed yaw (α), pitch (β) and roll (γ) rotations around the fixed tip over a sequence of consecutive frames. Applying these movements, we obtained data for an extensive and robust evaluation of the entire tracking pipeline. In Figure 5.16, the reconstructed 3D positions of all IR-LEDs L1..L4 and the target's tip (epicenter) C are visualized; the sub-figures show the collected data from different perspectives.

Figure 5.16: Target movement during accuracy and stability measurements, shown in the (a) x/y-plane, (b) x/z-plane, (c) y/z-plane and (d) in 3D.

We performed six runs at varying distances, dtrack = 30-70m, with two different baselines, dbase6 = 6m (approximated distance 5.95m) and dbase12 = 12m (approximated distance 12.29m), and dcalib = 30m. Each test ran over 300 consecutive frames with α, β, γ ranging from 0-45°. For each run, the 3D coordinates L1..L4 as well as C were estimated to evaluate the relative position accuracy, by analyzing µ̂bar, σ̂bar and |ε̂bar|, and the stability (inner accuracy) of the 3D point, using σ̂(C).

Relative Position Accuracy. To evaluate the accuracy of the relative 3D position estimation, we performed measurements at three different distances between camera and target, denoted as dtrack, for each baseline. At each run, the 3D coordinate of each of the target's IR-LEDs L1..L4 as well as of the target's epicenter C = Cx,y,z was estimated based on 300 consecutive frames. εbar with respect to both baselines dbase and all tracking distances dtrack is depicted in Figure 5.17. As can be seen for both baselines, |ε̂bar| increases as dtrack increases. This is due to a less accurate feature segmentation at larger distances, since blob size and luminance diminish. This causes bigger rasterization artifacts than at close range, which reduces the accuracy of the blob centroid computation.

Figure 5.17: |ε̂bar| for all dbase and dtrack.

Furthermore, the distances between the blobs in the camera images decrease, especially when large rotations of α, β = 45° are applied. With dbase12, more accurate results at larger distances can be achieved compared to dbase6. Triangulation, as described in Section 2.3.3.4, can be performed more robustly as the baseline dbase increases, since the intersection of the two rays becomes less glancing. All results for the deviations and the error of dbar are listed in detail in Table 5.2.

                dbase ≈ 6m                    dbase ≈ 12m
dtrack [m]      |ε̂bar| [mm]   σ̂bar [mm]      |ε̂bar| [mm]   σ̂bar [mm]
30              0.95          5.29           0.94          1.54
50              13.58         14.24          9.56          3.46
70              21.98         11.04          18.06         10.09

Table 5.2: Deviations and error of dbar.

Up to 30m with dbase ≈ 6-12m, the system is able to provide relative 3D accuracy with a sub-millimeter deviation of 0.95mm for dbase6 and 0.94mm for dbase12. At 70m, the system achieves 3D accuracy with a maximal deviation of 21.98mm for dbase6 and 18.06mm for dbase12.
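The role of the baseline in this error behavior can be made concrete with a standard linear triangulation routine. The following Python sketch is an illustration, not the implementation used in this thesis; the 3x4 projection matrices P1, P2 and the pixel correspondences x1, x2 are assumed to come from the extrinsic calibration and the blob segmentation steps. With a short baseline the two back-projected rays intersect at a glancing angle, so small centroid errors translate into large depth errors; a longer baseline widens the intersection angle and stabilizes the estimate.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two calibrated views.

    P1, P2 : 3x4 projection matrices of the stereo pair (intrinsics * extrinsics).
    x1, x2 : corresponding blob centroids (u, v) in pixel coordinates.
    Returns the 3D point in the world frame of the calibration.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```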
Hence, accuracy decreases as distance increases, and larger baselines result in better accuracy, especially at large distances. However, our evaluation for dbase6 also reveals 3D position outliers in the result sets for 30m and 50m, which is reflected in the larger σ̂bar values. This does not indicate an overall lack of 3D position robustness, since σ̂bar is low at 30m with dbase6 and at all distances with dbase12. No filtering was applied in order not to distort the 3D position estimation results; such outliers and their influence can be minimized during application tracking by using predictive filtering. Overall, our proposed system provides a relative 3D measurement accuracy with an absolute maximal error |ε̂bar| = 21.98mm (σ̂bar = 11.04mm) for baselines dbase ≈ 6-12m throughout the entire volume. This accuracy has been achieved under constant movement and changes in rotation of α, β, γ up to 45°.

Stability. After evaluating the accuracy of the relative position estimation, we evaluated the stability of the relative position estimation over 300 consecutive frames. Again, we continuously rotated the target by α, β, γ = 0-45°. The results are shown in detail in Table 5.3 with respect to dbase and dtrack.

                dbase ≈ 6m                              dbase ≈ 12m
dtrack [m]      σ̂x [mm]   σ̂y [mm]   σ̂z [mm]          σ̂x [mm]   σ̂y [mm]   σ̂z [mm]
30              4.07      3.61      12.80             4.92      4.04      5.57
50              4.62      4.49      24.32             6.09      3.09      11.94
70              4.18      6.98      44.92             7.50      5.29      29.61

Table 5.3: Standard deviations σ̂(C) at different tracking distances dtrack.

The deviation of C correlates with the accuracy results and findings reported above (Section 5.4.6). Above all, σ̂z increases most as dtrack increases, while σ̂x and σ̂y remain rather constant and are ≤ 7.5mm for the entire tracking volume. Thus, tracking of the target tip's 3D position is very stable along the x/y-axes with both baselines dbase6 and dbase12. Our optical setup as well as the software processing result in millimeter deviation for Cx,y with both baselines up to 70m. These results can be improved by using image sensors with higher resolution. σ̂z varies most at 70m with dbase6, with a maximal deviation of 44.92mm. With larger baselines, the 3D position estimation of Cz becomes more stable (σ̂z decreases for dbase ≈ 12m).

5.4.7 Tracking Performance

Besides the accuracy and stability evaluation, we performed tests to determine the system's capability to continuously track the target in the intended tracking space. Therefore, we moved and rotated the target through the whole volume for dtrack = 30-70m and inserted static and moving interfering lights into the tracking volume. Currently, the system provides ten 3D position estimates per second (10fps). Those rates allow for interactive tracking of static and moving objects. Figure 5.18 illustrates the target tracking and depicts the 3D positions of L1..L4 as well as C. As illustrated, the target is robustly and continuously tracked with various rotations through the entire tracking volume.

Figure 5.18: 3D position tracking of a moving target through the entire volume.

5.5 Machine Tracking for Underground Guidance

Figure 5.19: Examples of modern underground machinery: (a) jumbo, (b) roadheader, (c) rock support drill.

Besides the ability to measure static or moving points using a handheld target, there is a huge demand in underground construction to track machines in order to enable remote control.
Machines such as roadheaders, jumbos and dredgers, as shown in Figure 5.19, contribute to significant cost reductions and increase the safety and efficiency of underground works. For an efficient control of these machines, the continuous and precise determination of their 3D position and orientation in the underground space is mandatory, as the productivity of such machines depends on their efficient control. Therefore, an on-board machine control system is required that is able to measure, process and provide quickly, accurately and reliably all data that is needed for an optimal machine operation. One important subsystem of any such control system is the machine guidance system (navigation system), which is responsible for determining the absolute 3D position and orientation of a given machine and (more importantly) its different tools (e.g. booms, cutting heads) in the underground space.

5.5.1 Shortcomings of Existing Technology

As described in Section 3.3, classical surveying methodologies such as laser measurement systems are widely applied to determine the 3D position of objects with very high accuracy. Existing automatic systems use conventional tunnel lasers in combination with active laser targets/laser receivers that are installed on the machine (e.g. for jumbos). Other approaches apply classical surveying methods such as tachymetry, where computer-controlled, robotic total stations automatically and periodically measure to shutter prisms mounted on the machine (e.g. as used for roadheaders). However, the existing technologies suffer from the following shortcomings:

• They are highly specialized and designed for particular types of machines only; therefore, they lack universal applicability to other machine types.
• They can only measure, and thus control, one machine at a time and lack the capability of tracking multiple machines as well as machine parts that operate simultaneously.
• They can only be used for the purpose of machine guidance but not for other measuring and surveying tasks such as setting out, profile control or deformation monitoring.
• They lack real-time tracking capability, especially when using total stations.
• They are expensive, in particular their sensor hardware.

5.5.2 Test Environment

As a first approach to overcome the shortcomings of existing underground machine guidance systems, the developed tracking system from Chapter II.4 was tested by tracking two line targets that are rigidly attached to a wheel loader.

Figure 5.20: Details of the test environment: (a) environment with uncased camera, (b) wheel loader with two line targets.

The tests were conducted in cooperation with Geodata Ziviltechniker GesmbH and Sandvik Mining and Construction Central Europe GmbH, Austria. The loader was tracked in the open air at twilight and at night, during standstill and in motion, as well as under the influence of moving interfering lights and artificial smoke. The described test environment is illustrated in Figure 5.20; the images are manually brightened by 20% for enhanced visualization. In this environment, the tracking system has to cope with additional challenges compared to Section 5.4, such as heavy vibrations of the wheel loader during movement and at standstill with the engine idling, as well as with an increased tracking volume ranging from 20-110m.
With respect to underground measurement scenarios in tunnels and mines, we performed calibration and tracking tests with baselines dbase from 3-9m and distances between the vision system and target dtrack from 20-110m.

5.5.3 Target Design

To account for the additional environmental challenges from Section 5.5.2, the robust encasement from Section 5.4.2 was reused and re-configurable machine targets were developed. Following the design guidelines from Section 4.3.2, the geometric constellation of our target design constitutes a line approach.

5.5.3.1 Evaluation of LED Range

To enable reliable tracking throughout the extended tracking range, robust feature segmentation and blob centroid determination must be ensured. Therefore, different LED types from various suppliers were evaluated at distances from 30-110m, featuring radiant intensities from 40-230mW/sr. The aim was to find the IR-LED with the best balance between an appropriate intensity for long-distance feature segmentation and a minimal distance between two neighboring LEDs. For all tests, the vision setup from Section 4.4.1 was employed. We ran the LEDs with VF = 1.5V, IF = 100mA and an operating voltage of 5V, and used the vision system from Section 4.4.1 for comparison. Images were captured with 8bit depth, a shutter speed of 1000µs, focus set to infinity and open aperture (f/1.4). Over all tests, the IR-LED Vishay TSHG6210 with 230mW/sr and a half angle of ±10° achieved the best blob quality at large distances.

Figure 5.21: Comparison of blob quality at 110m with an inter-LED distance of 34cm: (a) OSRAM 4850 E7800, (b) Vishay TSHG6210.

In Figure 5.21, the blobs of the Vishay TSHG6210 and the OSRAM 4850 E7800 (used for the target prototypes from Sections 5.3 and 5.4) are illustrated. The difference in luminance quality and even distribution is clearly visible.

5.5.3.2 Target Prototype

For the first machine tracking prototype, a target was constructed in cooperation with Geodata Ziviltechniker GesmbH that consists of multiple Vishay TSHG6210 IR-LEDs. Each LED is encased in a plastic hemisphere which acts as a light diffuser (see Figure 5.22a) and is installed in the center of a retro-reflecting tape target (see Figure 5.22b).

Figure 5.22: A single optical target comprising the encased IR-LED attached to a reflective geodetic foil target.

The diffuser provides an optimal light diffusion and feature segmentation and protects the IR-LED. The target design enables simultaneous geodetic measurement and optical tracking; thereby, the camera system's world coordinate system can be transformed into a geodetic reference system for comparison as well as for real-life use. Since the coordinate system transformation is future work, this part is not covered and discussed within the thesis.

Figure 5.23: The IR-LED line target prototype for machine tracking: (a) the line target, (b) power supply.

To follow the overall target design guidelines from Section 4.3.2, four to five of the single IR-LEDs are combined to form a line target, as shown in Figure 5.23. Each single target is mounted on a 160.0cm square steel bar and its position can be freely adjusted along the bar. The minimal LED distance is 22cm to be able to distinguish between two neighboring LEDs at a distance of 120m, given the vision hardware from Section 4.4.1. A geodetic prism can be attached as well, as shown in Figure 5.23a, to also measure the target with a theodolite.
We developed two of these line targets to test multiple constellations as well as simultaneous tracking. All IR-LEDs of both targets are centrally powered by one main unit (see Figure 5.23b), featuring a battery as well as a 240Hz power supply.

5.5.4 Model Training

As the target prototype from Section 5.5.3.2 is used for calibration and tracking, its model again was obtained in an offline process, as described in Section 4.3.4.2. First, the single LEDs of each target were set to a unique geometric constellation; then the real distances d1, d2, d3 between the target's LEDs were measured using a total station (Leica TPS700), resulting in the following distances: d1 = 25.0cm, d2 = 40.0cm, d3 = 85.0cm for Target 1, and d1 = 25.0cm, d2 = 55.0cm, d3 = 70.0cm for Target 2. Hence, for both targets, the distance between the two outermost IR-LEDs is dbar = 150.0cm. Afterwards, the properties of each target were calculated by evaluating 255 captured camera images across the entire tracking volume from 20-110m. This results in the following p2-Invariant ranges, defined by [J_i^min, J_i^max]:

p2range(Target 1) = [2.2270, 2.5200]
p2range(Target 2) = [2.1108, 2.1696]

As can be seen, the chosen geometric constellations of both targets result in different, non-overlapping p2-Invariant ranges. This is important for robust model identification.

5.5.5 Camera Calibration

Before setup, both cameras were intrinsically calibrated in an offline process, as described in Section 4.3.3.1, using 42 images captured from different angles and distances. The stereo camera system was set up with the following parameters to account for the constrained baselines of a later application environment, the intended tracking distance as well as the poor lighting situation. At each run, the system was calibrated with ~1100 images.

• Real baselines dbase ≈ 3m, 9m
• Lens focus ∞
• Aperture f/1.4
• Shutter speed 1000µs

5.5.6 Accuracy & Stability of 3D Position Estimation

To evaluate the accuracy and the stability of the relative 3D position estimation, we performed measurements of both targets at different distances dtrack during standstill of the wheel loader with the engine shut off. At each accuracy run, the 3D coordinate of each of the target's IR-LEDs L1..L4 as well as of the target's epicenter C = Cx,y,z was estimated based on 180 consecutive frames at 10fps. Thereby, accuracy and stability were evaluated for the entire tracking volume. The obtained xRMS(P) values as well as the empirical standard deviations σ̂(C) of the horizontal target with a baseline dbase ≈ 9m are listed in Table 5.4.

dtrack [m]   xRMS(P) [mm]   σ̂x [mm]   σ̂y [mm]   σ̂z [mm]
20           7.28           0.19      0.12      0.73
30           17.19          0.18      0.09      0.59
40           29.04          1.57      1.28      5.86
50           42.62          0.94      0.27      3.51
60           49.04          0.56      0.23      3.16
70           49.60          0.65      0.37      4.33
80           60.72          1.05      0.59      4.71
90           78.88          0.90      0.50      5.91
100          89.31          3.70      1.28      23.53

Table 5.4: Relative point accuracy and standard deviation σ̂(C) for dbase ≈ 9m.

The results of the relative point accuracy show deviations in the low cm-range throughout the volume and, up to 80m, a very high distance-invariant stability (a good repeatability of measurement results) in the low mm-range, and even below 1mm in the X/Y-plane (vertical cross section). As is to be expected and explicable by theory (see Section 2.3.3.4), reconstruction accuracy and stability decrease with the distance of the target to the cameras, as the intersection angle for 3D point reconstruction becomes smaller. For measuring distances higher than approximately
100m, the low cm-level is exceeded in the stability results, leading to unreliable point measurements as well as increased system jitter. For dbase ≈ 9m, measurements above 100m could not be performed due to immobile objects that were in the line of sight of Camera 1. As shown in Table 5.5, similar stability results were found for dbase ≈ 3m throughout the volume. Since no objects were in the line of sight, target identification and tracking could be obtained up to a distance of 120m. However, we observed instabilities in the calibration process, leading to unreliable point measurements for dbase ≈ 3m and higher point accuracy deviations compared to the previous experiments for dbase ≈ 9m. This was found to be due to an insufficient blob coverage of only about 50% in both camera images in the specific test environment. Since we could not repeat the field test, we further investigated this issue, as described in Section 5.6.

dtrack [m]   σ̂x [mm]   σ̂y [mm]   σ̂z [mm]
20           0.03      0.03      0.15
30           0.13      0.10      1.74
40           0.14      0.08      2.22
50           0.32      0.17      4.57
60           0.39      0.14      4.08
70           0.72      0.15      5.90
80           2.79      0.41      7.47
90           2.97      0.54      10.72
100          2.71      0.43      18.21
110          6.31      0.85      26.77
120          4.24      0.82      31.60

Table 5.5: Empirical standard deviation σ̂(C) for dbase ≈ 3m.

To summarize our findings on the overall tracking performance, the target prototype has a maximum measuring range of approx. 120m under good conditions (clear atmosphere, good visibility, IR-LEDs oriented directly towards the camera). At greater distances, the IR-LEDs cannot be reliably segmented by the Model Identification pipeline anymore. Up to 80m, the points' stability is reliable, resulting in robust point measurements and small system jitter. Increasing the distance between the LEDs, combined with a higher radiant intensity of each LED, would provide improved target visibility and tracking stability at larger distances.

5.5.6.1 Influence of Vibrations

To evaluate the influence of heavy vibrations, such as those of the wheel loader engine, on relative point accuracy and stability, the targets were measured at 20, 40 and 60m distance during standstill with the engine shut off (Test 1) and at standstill while the machine's motor was running (Test 2). For each run at each distance, about 200 frames were evaluated at 10fps. The comparison of the tracking results of the horizontal wheel loader target is given in Table 5.6.

                       Test 1                                        Test 2
dtrack [m]   xRMS(P) [mm]  σ̂x [mm]  σ̂y [mm]  σ̂z [mm]       xRMS(P) [mm]  σ̂x [mm]  σ̂y [mm]  σ̂z [mm]
20           7.36          0.10     0.09     0.33           7.37          0.68     0.42     2.30
40           32.63         0.17     0.16     0.68           32.51         0.54     0.39     3.40
60           53.40         0.81     0.41     3.52           53.23         1.07     0.47     4.61

Table 5.6: Comparison of relative point accuracy xRMS(P) and standard deviation σ̂(C) without (motor shut off) and under heavy vibrations (motor running).

Since no predictive filtering was applied during the evaluation, the table shows the unaltered results of the influence of external vibrations. The tests reveal that the system's jitter increases from sub-millimeter to low millimeter deviation when the wheel loader's engine is running. However, the increased jitter was not found to be strong enough to have a significant influence on relative point accuracy. This is due to the fast shutter speed of 1000µs, which should be decreased further to account for these very fast movements of the target; thereby, the standard deviation could be reduced as well.
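The accuracy and jitter figures reported in Tables 5.4-5.6 (and in the analogous evaluations of Sections 5.3 and 5.4) are per-sequence statistics over the reconstructed 3D points. As a minimal sketch, assuming that xRMS(P) denotes the root-mean-square deviation of the per-frame point estimates from their reference position and that the static jitter is the per-axis empirical standard deviation, such values could be computed as follows; this is an illustration, not the evaluation code used for the thesis.

```python
import numpy as np

def sequence_statistics(points, reference=None):
    """Per-sequence accuracy and jitter statistics for a tracked 3D point.

    points    : (N, 3) array of per-frame estimates of, e.g., the epicenter C.
    reference : known reference position; if None, the sequence mean is used.
    Returns (x_rms, sigma), where sigma holds the empirical standard
    deviations along x, y and z (static jitter).
    """
    points = np.asarray(points, dtype=float)
    ref = np.mean(points, axis=0) if reference is None else np.asarray(reference, float)
    # RMS of the Euclidean deviation from the reference position.
    x_rms = np.sqrt(np.mean(np.sum((points - ref) ** 2, axis=1)))
    # Per-axis empirical standard deviation around the sequence mean.
    sigma = np.std(points, axis=0, ddof=1)
    return x_rms, sigma
```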
5.5.7 Tracking Performance for Machine Guidance

As described in Section 5.5.2, a wheel loader was equipped with the two line targets (Figure 5.20b) and tracked during operation to gain practical experience regarding the performance and capability of the system prototype for machine guidance applications. Currently, the system provides ten 3D position estimates per second (10fps). For mining and tunneling applications such as machine guidance, this update rate is already sufficient.

5.5.7.1 Tracking under Normal Visibility

First, we tracked the wheel loader under normal visibility conditions during driving operation from 20-110m, as depicted in Figure 5.24, where only the horizontal target is shown for better visualization.

Figure 5.24: Kinematic tracking of the horizontal target from 20-110m with dbase ≈ 3m.

The wheel loader was tracked over a sequence of 2560 frames; in only two frames of this data set tracking was not successful. All tracking results are directly plotted, as no filtering to remove outliers is applied. Thereby, the robustness and accuracy of the entire tracking pipeline could be objectively evaluated, resulting in robust and continuous tracking.

5.5.7.2 Tracking with Occlusions and Poor Visibility

Next, disturbing infrared light sources were held in the line of sight and fog was produced artificially by a machine to simulate difficult environmental conditions and other disturbing influences. In the following, three example images are given to illustrate both environmental interferences. In each example, both targets are simultaneously tracked and the data output of the tracking pipeline indicates successful target tracking of the horizontal target with blue crosses and of the vertical target with green crosses. In case of occlusions, a successful target model identification is marked by yellow crosses.

To test the system's robustness in filtering interfering lights, both targets were tracked over a sequence of 499 subsequent frames while the wheel loader was positioned at a distance dtrack = 23m in front of the cameras with dbase ≈ 9m. The horizontal target could be tracked in each frame of the sequence, as it was not heavily affected by the interfering lights. The vertical model was subject to heavy occlusions and interferences. Its model could be fully identified in more than 50% of the frames. In 206 frames, occlusions of one or more IR-LEDs occurred. In case of one occluded LED, the target model could still be identified and its 3D epicenter was estimated after occlusion recovery was performed. In the accuracy and stability data, we found results comparable to the measurements obtained during the accuracy and stability evaluation in Section 5.5.6 (see Table 5.4), with xRMS(P) = 8.90mm and σ̂x/y/z = 0.09/0.11/0.32mm. This demonstrates the robustness of model identification and recovery of the tracking pipeline. As can be seen in Figures 5.25 and 5.26, the manually inserted disturbing lights only affect the model identification if the interfering lights are directly in front of or very close to the target's IR-LEDs. This case is given in Figure 5.25, where the heavy interference leads to the occlusion of one LED of the vertical target. However, the tracking pipeline is still able to correctly identify the target model, as indicated by the yellow crosses in Figure 5.25b. Thereby, the system is able to subsequently recover the missing LED in 3D for epicenter estimation.
Figure 5.25: The vertical target is partly occluded by an interfering light but can still be successfully identified, as indicated by the yellow crosses: (a) view of the scene in the visible light spectrum, (b) IR scene view with tracking state output.

The heavy light interference that is illustrated in Figure 5.26 did not lead to occlusions of the target due to the LEDs' properties. Both targets' models are fully identified and tracked by the tracking pipeline despite the interference.

Figure 5.26: Both targets' models are fully identified and tracked despite heavy interfering light: (a) view of the scene in the visible light spectrum, (b) IR scene view with tracking state output.

Poor visibility due to fog or dust clearly reduces the measuring range and the tracking update rate. This is a common disadvantage of all optical tracking systems as well as of geodetic total stations. However, as depicted in Figure 5.27, our system is able to cope even with dense fog in front of the IR-LEDs, since their radiant intensity is strong enough. Tracking loss was only temporary for a few frames, and system readiness was immediately and automatically re-established as soon as visibility improved. This is a huge advantage compared to, e.g., geodetic total stations, where tracking loss requires additional sighting of the target.

Figure 5.27: Both targets' models are fully identified and tracked during the fog tests: (a) view of the scene in the visible light spectrum, (b) IR scene view with tracking state output.

5.6 Conclusion

We have experimentally evaluated the tracking system's performance properties in three different wide-area tracking environments, all featuring unconstrained lighting conditions and two of them additionally featuring harsh characteristics. The proposed system provides a quick setup since it needs only a minimal hardware setup consisting of two high-quality machine vision cameras and a standard (portable) workstation for data processing. Besides the stereo camera setup, pre-conditioning of the tracking volume is not required, since interfering lights during camera calibration and tracking are filtered out and partly occluded targets can be recovered. Targets are designed to be re-configurable and are equipped with standard infrared light emitting diodes. We demonstrated the system's capabilities to extrinsically calibrate the stereo camera system as well as to track targets despite heavy interferences (lights, fog). Thus, the tracking system can operate during on-going activities in the volume, making it highly unobtrusive. The system offers tracking with interactive frame rates, providing centimeter precision of the relative 3D position estimates up to 100m.

We proposed a wide-area tracking prototype that can be used for user tracking in mixed reality applications. Our results demonstrate a relative 3D point accuracy of xRMS(P) < 9.22mm with sub-millimeter static position jitter σ̂ = 0.0675mm throughout the entire tracking volume, ranging from 5-30m. We tested our system with several different target constellations, which can be detected within both camera views with yaw and pitch rotations from 0-45° as well as roll from 0-360°. To the best of our knowledge, no competing approach provides comparable accuracy for this range, especially not with the minimal number of only two cameras. Therefore, the presented system goes clearly beyond the state of the art. We demonstrated the capabilities of optical tracking to be applicable to measurement scenarios beyond mixed reality environments.
By providing a robust hardware encasement and a simple but flexible target design, it can be used in underground scenarios such as tunnels and mines. It can be used simultaneously for a large variety of independent underground surveying tasks, such as setting out, profile control, deformation monitoring, personnel tracking for safety and machine tracking. It provides relative 3D point accuracy with a deviation of ≤ 21.98mm throughout the tracking volume of 12 x 8 x (30-70)m. Up to 80m, we demonstrated a relative point accuracy of xRMS(P) < 60.72mm with a very high distance-invariant stability, indicated by the (sub-)millimeter static position jitter (σ̂x = 1.05mm, σ̂y = 0.59mm, σ̂z = 4.71mm). Compared to state-of-the-art underground measurement systems, our approach has the capabilities of 1) automatically starting to track one or multiple targets as soon as the target is within the view of the vision system, so that manual sighting can be omitted, 2) tracking moving as well as partly occluded targets, 3) providing a flexible target design that allows general usage for various tracking and measuring tasks, and 4) addressing the need for highly automated positioning systems [68, 77].

During our experimental tests and extensive evaluation, the following aspects have been identified to further optimize the system. 1) The software prototype of the proposed tracking pipeline offers interactive frame rates. However, the MATLAB image processing components [137] should be replaced by C/C++ modules, and parallelization should be exploited to decrease the tracking latency. This reduces this shortcoming to a pure software development task. 2) Like every optical technology, the proposed system requires good visibility. In the presence of strong fog and dust, the achievable measuring range is reduced; however, this effect can be partly mitigated by using LEDs with higher radiant intensity as well as LED arrays. 3) Furthermore, a free line of sight must be provided for both (all) cameras. For mixed reality tracking, this shortcoming can be reduced by mounting the cameras high up on the wall to avoid occlusions by users. In underground tracking scenarios, this can be problematic in limited space and in crowded situations, especially close to tunnel walls.

Compared to indoor tracking technologies such as RFID that support multiple targets in a large volume, our proposed system does not require pre-conditioning of the tracking volume, providing cost- and time-efficiency. Comparing the presented system to state-of-the-art infrared optical tracking systems in terms of range coverage and accuracy, it significantly extends the available tracking range up to 100m while requiring only two cameras and providing a relative 3D point accuracy with sub-centimeter deviation up to 30m and low-centimeter deviation up to 100m, as shown in Tables 5.1, 5.2 and 5.4. To the best of our knowledge, none of the existing systems described in Section 3.2 gives accuracy specifications for distances greater than 10m. Due to the implicit line characteristic of the target design, orientation can only be provided with up to two DOFs. However, as depicted in Figure 5.3 for user head tracking, this can be compensated by combining several line targets into one composite target. Tracking accuracy in terms of orientation has not been part of this thesis and will be evaluated in the future.
For underground surveying tasks, the achieved relative 3D point accuracy is adequate for machine guidance but was found not to be accurate enough for tasks such as setting out. However, the following aspects were identified to increase the accuracy. Extending the baseline results in better depth accuracy, while using an image sensor with higher resolution minimizes segmentation inaccuracies, which also leads to enhanced precision. The main potential for optimization was found in the extrinsic calibration approach.

The evaluation of our proposed calibration method indicates promising results. Despite interfering lights, the target's LEDs are robustly segmented to ensure sufficient and reliable camera parameter estimation. However, the tests revealed some limitations of the current approach. The manual movement of the target through the volume keeps the tracking system independent from additional (permanently installed) visual features. However, not all areas of the camera image can be covered, and most blobs are found in the center of the camera images, which results in an unbalanced blob distribution, as depicted in Figures 5.5 and 5.15. Especially in the vertical direction, the distribution is limited by human size and the length of the calibration target as well as by the natural boundaries of the physical environment, such as the ceiling and the ground. The distribution can be improved by using a longer calibration apparatus, but as stated only to a certain extent. Therefore, a future aspect of the research is to use additional visual features that are extracted from the environment and to fuse them with the blob features to increase the feature distribution along the edges and in the corners of the images. In a well-illuminated environment, i.e. for mixed reality tracking, natural features can be extracted from the environment. In an underground environment, where illumination is poor and geometric structures are mostly found around the front face, natural feature extraction would not significantly enhance the feature distribution in the camera images. Here, the installation of additional single IR-LED markers would serve as an adequate solution. They could be evenly distributed within the tracking volume and autonomously detected and subsequently extracted using the hardware interference filtering approaches from Section 4.3.4.1. Thereby, the system's unique ability to function in an unconstrained environment while requiring a small amount of hardware and little user interaction would be retained.

Chapter 6

Summary

In this part, a robust wide-area optical tracking approach was presented that estimates the 3D position of model-based targets. The approach extends state-of-the-art optical tracking systems by proposing a robust extrinsic stereo camera calibration, by presenting a highly re-configurable target design, and by providing a software-based processing pipeline that enables the system to cope with large tracking distances, static and moving interfering lights, partly occluded targets as well as disturbances such as fog and dust during calibration and tracking. We employ projective invariant property matching to robustly identify the model-based optical apparatus (target) that is used for extrinsic calibration and tracking. For estimating the external camera parameters, the apparatus is used to artificially generate 0D image features that are crucial in poorly illuminated environments with little geometric structure.
Furthermore, the target's properties support reliable correspondence matching without requiring the epipolar geometry for correspondence analysis. During tracking, the approach allows model fitting already in the 2D image domain, which results in a drastically reduced set of correspondence candidates. This in turn considerably decreases the combinatorial complexity of the multiple-view correlation problem.

We performed experiments with the developed software and hardware prototype in three different tracking scenarios that all feature large distances and unconstrained indoor environments. From the experiments we observe that model identification is robust against strong interfering lights, partial occlusions as well as fog that is often present in harsh environments. Furthermore, the experiments showed minimal system jitter and millimeter deviation of the relative 3D point accuracy up to 30m and centimeter deviation up to 110m. This outperforms competing optical tracking systems in terms of volume coverage, point accuracy and robustness. Furthermore, only a minimum of two cameras is required, which significantly reduces the system's cost and complexity. This eases the necessary efforts for setup and maintenance of a mixed reality system and thereby makes it more suitable for non-experts.

In addition, we demonstrated the system's abilities to act as a wide-area tracking system for underground surveying tasks, which significantly pushes the boundaries of state-of-the-art optical tracking approaches that are exclusively designed for, and thus only applicable to, mixed reality applications. Compared to competing measurement technology for underground environments, our system is re-configurable to track handheld targets as well as any kind of machine, it omits manual target sighting and it allows tracking of fast movements as well as of multiple targets at a time. This clearly extends state-of-the-art optical tracking technology. In terms of accuracy, it cannot compete with existing laser measurement technologies but can be a first foundation for automated guidance for underground machine control.

The experimental data showed that the generated blob features for extrinsic camera calibration can be insufficient in terms of image coverage, caused by physical limitations of the environments through which the calibration apparatus is moved. This has led to further research for the future to provide more reliable calibration parameters for stereo camera setups with large baselines in poorly illuminated and non-cluttered environments.

Summarizing, the demonstrated system properties allow for robust and cost-efficient wide-area tracking in mixed reality and beyond. By overcoming limitations of existing optical systems, it can foster the further emergence of mixed reality into the mainstream. A broad range of wide-area tracking scenarios can be envisioned, such as user tracking in virtual environments, at entertainment stages, in manufacturing workshops as well as automated control of machines in underground environments.

PART III
User Interfaces for 3D Interaction

1 Introduction
  1.1 Motivation & Problem Statement
  1.2 Research Objective
  1.3 Organization

2 Theoretical Foundations & Related Work
  2.1 User Interfaces in Mixed Reality
  2.2 3D Object Selection
  2.3 3D Object Manipulation
  2.4 Summary

3 3D Selection in Handheld Mixed Reality
  3.1 Requirements
  3.2 Design Guidelines
  3.3 The DrillSample Technique
  3.4 Performance Studies
  3.5 Experimental Results
  3.6 Discussion

4 3D Manipulation in Handheld Mixed Reality
  4.1 Methodological Approach
  4.2 Performance Studies
  4.3 Experimental Results
  4.4 Discussion

5 Summary

Chapter 1

Introduction

As outlined in Chapter I.1, tracking is one of the mandatory key components to create a mixed reality environment. Furthermore, tracking is the crucial foundation for interaction in a mixed reality environment, enabling a user to explore and interact with the virtual simulation. As depicted in Figure 1.1, interaction can be grouped into the categories 3D Selection, 3D Manipulation, Navigation with the subtasks Travel and Wayfinding, System Control, Symbolic Input and Modeling [62].

Figure 1.1: Interaction categories, with the fields of contribution marked in bold.

By employing techniques for 3D selection and manipulation, the user is provided with means to select virtual objects and subsequently position, rotate (spatial rigid object manipulation) and scale (spatial non-rigid object manipulation) them. Navigation allows for moving in and around a virtual environment and incorporates travel and wayfinding. While travel enables the user to explore a mixed reality environment by employing techniques and devices for locomotion, wayfinding describes the cognitive process of defining a path through the environment aided by natural or artificial cues. System control tasks aim at changing the state of the system, usually through a graphical user interface or a command. Symbolic input addresses the task of processing symbolic input data, such as text and numbers. Finally, modeling aims at creating 3D objects and modifying their properties, including their spatial and visual appearance.

Within this thesis, contributions to 3D Selection and 3D Manipulation are presented by demonstrating novel techniques to select and manipulate objects in one-handed handheld mixed reality scenarios. In contrast to previously published work about portable handheld devices for mixed reality [64, 78], in this thesis the handheld device refers to a smartphone with a touch-sensitive display that can simultaneously detect multiple finger inputs. As described in Section 2.1 in Part II, pose tracking is the crucial foundation to enable 3D selection and manipulation through the involved interaction devices. While a novel system for wide-area Outside-Looking-In tracking was presented in Part II, existing methods for Inside-Looking-Out 6DOF pose tracking are used as a technological prerequisite in this part.
1.1 Motivation & Problem Statement

Recently emerged mobile hardware devices enable real-time rendering of a large number of 3D models. To interact with such a dense virtual scene, precise object selection and manipulation (translate, rotate, scale) are required. Existing interaction techniques for handheld mixed reality usually use the multi-touch capabilities of the device for interacting with the virtual scene. Since the user usually has only one hand available for interaction while the other is holding the device, several problems arise for object selection and manipulation.

3D Selection Using the imprecise finger touch input for selection yields a high probability of inaccurate selection of small objects, especially when they are partly or fully occluded or surrounded by highly similar virtual scene objects. To increase the accuracy of the selection process, state-of-the-art approaches usually propose two-handed techniques, which cannot be applied to a selection task for which only one hand is available. Furthermore, in the case of multi-object selection, existing approaches do not provide context information about the original spatial layout of the selected objects, making it impossible to precisely select a desired object amongst visually similar ones.

3D Manipulation As there is just one hand available for object manipulation, only simple touch gestures of one hand are suitable. In addition, since the implicit characteristics of a mobile touch screen provide only 2D data, all three coordinate axes can never be simultaneously addressed for object manipulation. To cope with these input limitations, state-of-the-art methods use complex multi-finger gestures to provide an integral way of performing 3D manipulations. However, their usage not only requires prior knowledge, which reduces overall intuitiveness, but also relies on multi-finger gestures that are either impossible to apply in a one-handed setup or difficult to perform on a mobile device. Thus, the interaction space is limited to the physical screen size and usability can suffer because users occlude the object with their fingers [28].

1.2 Research Objective

To overcome the limitations of state-of-the-art 3D selection and manipulation techniques in one-handed handheld mixed reality, the following research objectives were formulated:

• To address the requirements of selection, novel methods have to be developed to enable precise object selection with spatial context preservation while only requiring one-finger touch input. Thereby, disambiguating and selecting strongly occluded objects or objects with high similarity in visual appearance becomes possible.
• To reduce the amount of finger touch input for full 6DOF object manipulation, algorithms have to be developed that provide an intuitive user interface. The focus lies on exploiting the possibilities of available tracking pose data as well as degree-of-freedom decomposition.
• All novel interface methods are to be evaluated in comprehensive user studies to explore their performance, usability and accuracy, and to be able to draw a reliable conclusion on their benefits over state-of-the-art techniques.

1.3 Organization

In Chapter III.2, an overview of the theoretical foundations of 3D selection and manipulation in mixed reality environments is given, and competing state-of-the-art approaches are discussed and compared. In Chapter III.3, the methodological approach of the developed selection technique and its evaluation in a thorough user study are presented.
In Chapter III.4, two novel 3D manipulation techniques are described and evaluated in a comparative study. Finally, Chapter III.5 draws conclusions on the developed 3D interaction techniques.

Chapter 2

Theoretical Foundations & Related Work

This chapter gives an overview of the theoretical foundations of 3D selection and manipulation in mixed reality environments and presents related work that is relevant for the performed research.

2.1 User Interfaces in Mixed Reality

Over the last decades, computer users have become familiar with a specific set of 2D user interface components, comprising input hardware such as mouse and keyboard and output hardware such as the monitor. Furthermore, they have become used to interaction techniques such as selecting a file by double-clicking and drag and drop, as well as to interaction metaphors such as the desktop metaphor¹. However, these interface components are inappropriate for non-traditional computer environments [62], such as the various kinds of mixed reality. These represent virtual 3D environments where traditional 2D interaction techniques and metaphors lack the capabilities to appropriately function in space.

Figure 2.1: An excerpt of 3D interaction devices: (a) handheld multi-touch, (b) 3D mouse, (c) 3D interaction pen.

¹The screen space on the monitor is treated as a conventional desktop where folders and documents can be placed.

To enable a user to interact with virtual 3D objects, the interaction device (input) needs to be tracked and the virtual scene must be visualized (output). For the various gradations of mixed reality, a large number of possible input devices exist, ranging from multi-touch pads, 3D mice and joysticks to data gloves, 3D interaction pens as well as full body motion capturing. The input devices provide different degrees-of-freedom, thus their suitability depends on the required interaction task. In Figure 2.1, an excerpt of devices for 3D object selection and manipulation is depicted. Depending on the flavor of mixed reality, the virtual scene can be visualized to the user on a standard monitor, on a stereo projection wall, within a stereoscopic head-mounted display or on a mobile screen. For a comprehensive overview and discussion of existing in- and output technology, the reader is kindly referred to [62, 83].

2.1.1 3D Interaction

As depicted in Figure 1.1, 3D interaction for virtual environments can be divided into the categories Selection, Manipulation, Navigation, System Control, Symbolic Input and Modeling. In the following sections, theoretical foundations and related work in the field of 3D selection and manipulation are described.

2.1.1.1 3D Selection and Manipulation Tasks

According to [62], 3D manipulation describes an interaction task that can be decomposed into the three basic canonical tasks Selection, Translation and Rotation. They act as building blocks and can be used to compose more complex scenarios [6]. These canonical tasks are used to define a taxonomy for this thesis that extends [62] and defines the following two major tasks:

3D Selection is the compound process of Indication, Confirmation and Feedback to select a desired virtual object in space. In the presence of multiple objects that have been indicated for selection, the confirmation process comprises a refinement task (two-step selection process). In case of a single selection, the indicated object is usually automatically confirmed by the system and subsequently used for manipulation.
3D Manipulation comprises translation (positioning), rotation and scaling of a previously selected object. All three manipulations are also referred to as RST manipulations. Rotation is described by the three angles yaw, pitch and roll around the axes x, y, z. All three manipulation tasks can be performed independently, resulting in three separate tasks with a maximum of 3 degrees-of-freedom each. When integrating 3D translation and rotation into one compound task, it comprises a full spatial rigid 6DOF manipulation [62].

As stated, 3D manipulation has been extended by Scaling, as it is a common and thus important manipulation task in real-world applications. Therefore, it was included in the formal definition of 3D manipulation for this thesis. In contrast to the spatial rigid object manipulations translation and rotation, scaling does not preserve the shape of the object. It either scales the 3D object uniformly, meaning its size is changed equally in each dimension, or non-uniformly, changing its size for each axis separately according to the user input.

2.1.1.2 3D Selection & Manipulation Metaphors

Many existing 3D selection and manipulation techniques are related to an interaction metaphor. Such a metaphor can comprise an action, an object or a combination of both and exploits the user's familiar knowledge to fulfill a specific interaction task. According to [31, 37], selection and manipulation techniques for immersive virtual environments can be classified as follows:

Exocentric Metaphors, also known as the God's eye viewpoint. Here, users interact with the virtual environment from outside of it.

Egocentric Metaphors, which allow users to interact from inside the environment. Thus, they embed the user and are most common in mixed reality applications.

For egocentric interaction, the two major metaphors Virtual Hand and Virtual Pointer exist. The classical Virtual Hand is the virtual avatar of a physical interaction device and visualizes the device's real position and orientation in the virtual space. With techniques using the Virtual Hand metaphor, users can reach and grab objects by "touching" and "grasping" them with their virtual hand. This metaphor can be used for object selection as well as manipulation, as described in Sections 2.2 and 2.3. Techniques based on the Virtual Pointer metaphor allow the user to point at an object to indicate it for further interaction. To determine the pointing direction, usually the user's head orientation and the virtual hand's position are incorporated. Hence, tracking of head and interaction device is required. Many state-of-the-art selection techniques, as described in Section 2.2, are based on this metaphor and are characterized by the virtual pointer's direction, its shape and the method to disambiguate the target object.

2.1.2 3D Selection & Manipulation in Handheld Mixed Reality

In a handheld mixed reality environment, input and output comprise a single device. The user can interact with the virtual scene using the screen's multi-touch capabilities, and the 6DOF pose (position and orientation) of the device can be estimated using the device's built-in camera. This can be characterized as an optical Inside-Looking-Out tracking system with the mobile device as the single tracked object, as described in Section 2.1 in Part II. State-of-the-art mobile devices provide real-time 3D rendering of dense virtual scenes and act as a "window into the virtual world", as illustrated in Figure 2.2.
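The "window" metaphor can be illustrated with a small sketch: the tracked 6DOF device pose directly defines the virtual camera through which the scene is rendered. The following Python fragment is an illustration under assumptions (names and conventions are not taken from the thesis); it simply inverts the device-to-world pose obtained from inside-looking-out tracking to obtain the view matrix.

```python
import numpy as np

def view_matrix_from_device_pose(R_wd, t_wd):
    """Virtual camera (view) matrix from the tracked 6DOF device pose.

    R_wd, t_wd : rotation (3x3) and translation (3,) of the device in world
                 coordinates, e.g. from inside-looking-out tracking.
    The virtual camera is placed at the device pose, so the screen shows the
    virtual scene from the device's physical viewpoint ("window" metaphor).
    """
    V = np.eye(4)
    V[:3, :3] = np.asarray(R_wd, float).T           # inverse rotation: world -> camera
    V[:3, 3] = -np.asarray(R_wd, float).T @ np.asarray(t_wd, float)  # inverse translation
    return V
```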
Regarding the applicability of interaction techniques classified by metaphor, both exocentric and egocentric approaches are applicable in a handheld mixed reality scenario. However, since the user can freely move the mobile device in space and thus gains an egocentric view into the virtual world through this "window", egocentric metaphors are more suitable.

Figure 2.2: A mobile phone acting as a window into the virtual world.

Furthermore, the mobile device can be understood not only as a window into the virtual world but also as the virtual hand to interact with 3D scene objects. Through the linear and direct mapping between the physical and virtual world in terms of perspective and interaction device, intuitive user interface components can be provided for 3D selection and manipulation. Therefore, we focus on state-of-the-art techniques based on egocentric interaction metaphors, which are discussed in the following sections.

Figure 2.3: Taxonomy for egocentric object interaction in handheld mixed reality.

Following the theoretical concepts from Sections 2.1.1.1 and 2.1.1.2, a taxonomy for 3D object selection and manipulation in a one-handed handheld mixed reality setup was derived, as illustrated in Figure 2.3. It depicts only those concepts that are relevant for the research performed in this part. It classifies selection and manipulation techniques by egocentric metaphors and subsequently by task decomposition, and acts as a theoretical foundation for the techniques proposed in Chapters III.3 and III.4. After this overview of user interfaces in mixed reality, the foundations of 3D selection and manipulation are outlined in the following sections by presenting background and related work in both fields.

2.2 3D Object Selection

Selection is one of the universal interaction tasks in 2D as well as 3D and has been extensively studied [62]. As shown in the literature, the performance and usability of a selection technique vary greatly, depending on specific task requirements (e.g. object size and distance) and the environment's layout, such as scene density and object occlusions. To indicate the desired object, the user can occlude the target, touch it or point at it.

Figure 2.4: Taxonomy of immersive selection techniques classified by metaphor.

In Figure 2.4, an excerpt of immersive egocentric selection techniques is shown; they can be divided into Virtual Hand and Virtual Pointing metaphors, as described in Section 2.1.1.2.

2.2.1 Virtual Hand Metaphors

With Virtual Hand metaphors, such as Virtual Hand [62] and Go-Go [22], the user selects objects in space by touching them; thereby, the desired object can also be fully occluded. To visualize the position and orientation of the virtual hand in the mixed reality environment, Virtual Hand techniques use a 3D pointer. To select an object, the 3D pointer is required to intersect with the desired object, and the confirmation of the selection can be triggered with a designated command, e.g. a button press [62]. Using traditional Virtual Hand selection, the physical location p ∈ R³ of the interaction device is directly used as the 3D pointer's position. However, this direct mapping introduces a spatial limitation, since the physical length of the user's arm limits the range within which a desired object can be selected. Go-Go is similar to the traditional Virtual Hand but extends the selection radius, and thus the virtual arm, by applying a non-linear mapping function to p (see the sketch below).
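As an illustration of this non-linear mapping, the commonly published Go-Go formulation extends the real hand distance beyond a threshold D by a quadratic term. The sketch below uses illustrative parameter values and is not taken from the thesis; the torso reference point, D and the gain k are assumptions.

```python
import numpy as np

def gogo_hand_position(p_device, p_torso, D=0.45, k=1.0 / 6.0):
    """Non-linear Go-Go mapping of the physical hand/device position.

    p_device : physical interaction device position (3,) in world coordinates.
    p_torso  : reference point on the user's body (e.g. torso/chest).
    D, k     : threshold distance and gain of the quadratic extension
               (values here are illustrative, not from the thesis).
    Within reach (r <= D) the mapping is 1:1; beyond D the virtual arm grows
    quadratically, extending the selection radius.
    """
    offset = np.asarray(p_device, float) - np.asarray(p_torso, float)
    r = np.linalg.norm(offset)
    if r == 0.0 or r <= D:
        return np.asarray(p_device, float)
    r_virtual = r + k * (r - D) ** 2
    return np.asarray(p_torso, float) + offset * (r_virtual / r)
```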
However, the Virtual Hand selection metaphors are not suitable in a handheld mixed reality setup because the same physical device is used for interaction input and visualization output.

2.2.2 Virtual Pointing Techniques

Virtual Pointing techniques, such as Ray-Casting, Flashlight and Aperture [8, 62, 18, 21], are generally considered to be more precise than Virtual Hand techniques and provide a natural way to indicate an object [62]. They either employ one single step for object indication and confirmation, or two steps, using the second step to refine the set of objects that have been indicated for selection in the first step.

2.2.2.1 One-Step Selection Techniques

Ray-Casting is a simple yet powerful selection technique for objects with a decent size on the image plane and also copes well with partly occluded objects at close distance [62]. To interact via Raycasting, the user indicates an object to interact with by simply pointing at it. A virtual ray along the user's arm performing the pointing gesture is cast into the virtual environment, and the object closest to the user that is intersected by the ray is selected and used for subsequent interaction [62]. In an immersive environment, the direction p⃗ of the virtual ray can be defined either 1) by the vector going through the user's head position and the interaction device or 2) by the user's gazing direction. The direction of the virtual ray is then attached to the position of the user's virtual hand h ∈ R3, resulting in the definition of the virtual ray p(α), as given in Equation 2.1.

p(α) = h + α·p⃗   (2.1)

In a non-immersive desktop setup, Raycasting can also be used by casting a ray from the 2D screen point perpendicular to the display plane into the scene. When selecting small objects at larger distances, Raycasting can lack precision. This problem is mostly introduced by the high angular accuracy necessary for selecting small objects, so that a small angular change induced, e.g., by tracker or hand jitter causes a large spatial digression at far distances [62].

Handheld Raycasting Adaptations. State-of-the-art techniques for selecting 3D objects in a handheld setup usually use a simple pointing metaphor, triggered by a single tap event on the mobile screen [101, 114, 124]. However, in a cluttered virtual environment, these approaches lack precision due to the size of the user's fingertip. To enhance the accuracy of object selection, a set of 3D interaction techniques on mobile devices with a touch screen is presented in [143]. The major objectives were to precisely select partly occluded objects as well as to improve the limited precision caused by the area a fingertip covers on the comparably small screen of a mobile device. To overcome these problems for object selection, the authors of [143] propose two techniques that use multi-touch input and are based on Raycasting. The Dual-Finger Midpoint Ray-Casting technique is performed with three fingers. Between the two simultaneous touch points f1, f2 ∈ R2, the midpoint Cmid ∈ R2 is calculated, from which a ray is cast perpendicular to the device screen into the virtual environment. During the selection, a crosshair is displayed at Cmid for user assistance. The first object hit by the ray is highlighted to indicate a selection candidate. The selection is confirmed with an arbitrary third touch on the screen.
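A compact sketch of this midpoint-based indication step is given below; the unproject helper and the scene interface are assumptions used for illustration and do not reproduce the API of [143].

```python
import numpy as np

def midpoint_raycast(f1, f2, unproject, scene):
    """Sketch of Dual-Finger Midpoint Ray-Casting indication.

    f1, f2:    the two simultaneous 2D touch points on the screen.
    unproject: assumed helper mapping a 2D screen point to a world-space ray
               (origin, direction) perpendicular to the display plane.
    scene:     assumed interface offering intersect(origin, direction), which
               returns hit objects sorted by distance from the origin.
    """
    # Crosshair position: midpoint between the two touch points.
    c_mid = (np.asarray(f1, float) + np.asarray(f2, float)) / 2.0
    origin, direction = unproject(c_mid)
    hits = scene.intersect(origin, direction)
    # The first object hit by the ray becomes the highlighted selection
    # candidate; an arbitrary third touch would then confirm the selection.
    return hits[0] if hits else None
```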
To increase the precision for highly occluded objects, the view can be zoomed at the midpoint by increasing or, respectively, decreasing the distance between both fingers. The Dual-Finger Offset Raycasting technique is performed with only two fingers. One finger is used to indicate the position f of the crosshair on the screen from which the ray is cast into the scene, as described above. The crosshair position Coff is calculated with a pre-defined offset o ∈ R2 as follows:

Coff = (fx + ox, fy + oy)   (2.2)

The second finger is used to either change the zoom level of the view, modify the offset o or confirm the selection. Both techniques tackle and overcome the problems of partly occluded targets as well as the precise selection of small objects that can occur when performing Raycasting triggered by an imprecise finger tap on a mobile handheld device. However, both Raycasting adaptations are hardly suitable for one-handed handheld scenarios for the following reasons. Close to the screen's corners and borders, both methods are impractical or even impossible to apply. Furthermore, if multiple fingers are required for interaction, large portions of the touch screen are occluded and important information about the desired object's surroundings is not visible. The limited interaction space on the handheld's touch screen is further reduced as all fingers have to fit on the screen.

Volumetric Object Casting. The Flashlight technique (often also called Spotlight or Conecasting) extends the idea of Raycasting by replacing the ray with a cone-shaped selection volume [18]. All objects that fall completely or partly within this selection volume can be selected; thereby, the technique enables easy selection of small and distant objects without requiring the pointing precision of Raycasting. To resolve ambiguities that can occur if more than one object falls within the conic volume, the following two rules are applied [62]. First, the object that is closer to the center line of the selection cone is selected. Second, if the angle between the center line of the selection cone and two or more objects is the same, then the object closer to the interaction device is selected. However, this approach has its weakness if small and tightly coupled objects in a dense scene are to be uniquely selected, since the angle of the cone cannot be adjusted. As an extension of Flashlight, Aperture [21] allows the user to interactively control the angle of the selection cone by using a second interaction device. Although this is an intuitive extension and allows for more precise selection, this method is not applicable in a handheld scenario where only one interaction device is present.

2.2.2.2 Two-Step Selection Techniques

As described, Ray-Casting, Flashlight and Aperture can select partly occluded objects but cannot cope with fully occluded objects in a single selection step. To select entirely occluded objects, all objects that lie within the conic selection volume are considered selection candidates (Identification). A second, additional refinement step is required to let the user manually resolve all ambiguities and confirm the selection of the desired object (Confirmation). Several two-step selection techniques based on this volumetric casting exist; two promising ones are discussed in the following paragraphs.
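The identification step of such volumetric casting, together with the two Flashlight disambiguation rules described above, can be sketched as follows; the object interface and the angle tolerance used to detect "equal" angles are assumptions for illustration.

```python
import numpy as np

def cone_cast(apex, axis, half_angle, objects, angle_tol=1e-3):
    """Sketch of Flashlight/cone-cast identification and disambiguation.

    apex:       cone apex (position of the interaction device).
    axis:       pointing direction of the cone.
    half_angle: opening half-angle of the selection cone in radians.
    objects:    iterable of objects with a .position attribute (assumed).
    Returns (all candidates inside the cone, disambiguated pick).
    """
    axis = np.asarray(axis, float)
    axis = axis / np.linalg.norm(axis)
    candidates = []
    for obj in objects:
        to_obj = np.asarray(obj.position, float) - np.asarray(apex, float)
        dist = np.linalg.norm(to_obj)
        if dist < 1e-9:
            continue
        angle = np.arccos(np.clip(np.dot(to_obj / dist, axis), -1.0, 1.0))
        if angle <= half_angle:
            candidates.append((angle, dist, obj))
    if not candidates:
        return [], None
    # Rule 1: prefer the object closest to the cone's center line (smallest angle).
    # Rule 2: if angles are (nearly) equal, prefer the object closer to the device.
    candidates.sort(key=lambda c: (round(c[0] / angle_tol), c[1]))
    return [c[2] for c in candidates], candidates[0][2]
```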
SQUAD. The Sphere-Casting Refined by QUAD-Menu selection technique [115] (SQUAD) was designed as a rapid and accurate method using a sphere-cast for identifying candidate objects, followed by a multi-step progressive refinement. Sphere-Casting extends the idea of simple Raycasting and performs object identification and confirmation in a two-step process. Firstly, simple Raycasting is performed, and the first intersection with a scene object determines the position at which a sphere is additionally cast. Its size is calculated based on the distance between the interaction device and the intersected object. All objects intersecting the sphere are subject to refinement in the next step. For refinement, the image plane is split into four equally sized areas, the quad-menu, in which all candidate objects are evenly distributed, disregarding their original spatial positions. Afterwards, the user can progressively narrow down the candidate objects by manually choosing a quadrant from the menu. Each time a quadrant is selected, all objects of this quadrant are rearranged amongst all four quadrants. Therefore, a minimum of log4(n) selection steps is necessary to select the desired object out of n candidates. Although SQUAD overcomes the difficulty of precisely selecting small objects by employing a volumetric cast, the progressive refinement has been shown to be cumbersome to use, especially in dense virtual scenes. Furthermore, SQUAD does not preserve the original spatial context during refinement, resulting in false selections if the desired object is not uniquely distinguishable from its surrounding objects by its visual appearance.

Expand. Expand [119] was proposed motivated by the problems that SQUAD induces when it removes the objects from their original context during refinement. It is designed to work under dense conditions when multiple objects may be subject to an indicated selection. It provides two-dimensional spatial context preservation and precise selection of objects that are partly or completely occluded. The major difference between SQUAD and Expand is the usage of a dynamically sized grid instead of a fixed QUAD-menu. This ensures a spatially correct relocation of the selected objects, resembling their original spatial arrangement. Therefore, more information is given to the user to identify the desired object for selection. Furthermore, SQUAD's progressive refinement is omitted and an animation is introduced that visualizes the original spatial context. The object identification is performed using Flashlight selection, as described above. For the refinement step, all objects intersecting the conic volume are cloned and moved from their original to their designated positions on the virtual grid; this process is visualized through an animation. The arrangement of the clones in the grid reflects the spatial context of their original counterparts, as depicted in Figure 2.5. For confirmation, the user can manually select the desired object by pointing at it. The possibly cumbersome progressive refinement step of SQUAD is drastically simplified by Expand displaying all candidate objects at once on a dynamically sized grid. For selecting an object from a set of well-arranged and previously visible objects, Expand should work well, as it was designed for conditions where many objects lie within the cursor position. Unfortunately, no further details of the mapping process f : R3 → R2 of the selected 3D objects onto the grid are given in [119].
It is uncertain whether and how the mapping f resembles the original 3D arrangement so that false selections of objects with a similar or identical visual appearance can be avoided, especially when the objects are partially or fully occluded.

Figure 2.5: The Expand refinement view, courtesy of [119].

2.3 3D Object Manipulation

As soon as an object is selected, it can be used for subsequent manipulation. According to [62], positioning and rotating an object are universal manipulation tasks. A plurality of techniques exists, ranging from 3D immersive methods to 2D multi-touch techniques, that all aim at transforming the selected object in space. 3D manipulation using large multi-touch 2D displays in tabletop environments has gained interest, e.g. by Hancock [79]. However, methods for these large-scale environments are limited to the tabletop metaphor and hence not suitable for 3D manipulation using general-purpose multi-touch displays. On handheld devices, recent approaches [114, 124, 80, 109] explore the capabilities of a user-facing camera for gesture-based manipulation using markerless finger and hand tracking. The position and orientation of the finger or hand are mapped to the virtual object for manipulation. However, tracking the hand lacks accuracy compared to estimating the pose of a handheld device by employing natural feature tracking or by using the handheld's built-in inertial unit. Thus, the related work of this part is focused on techniques designed for manipulation in mixed reality environments that can be adapted to handheld scenarios, as well as techniques using multi-touch finger input.

2.3.1 For Immersive Environments

The simple Virtual Hand metaphor that was described for selecting objects in Section 2.2.1 can also be directly used, as well as extended, for 3D object manipulation.

Virtual Hand. When using the simple Virtual Hand technique, a user can manipulate a virtual object by directly mapping the movement and rotation of the interaction device - thus its 6DOF pose - onto the virtual hand object [62]. The relationship between the state of the interaction device Sr and the virtual hand object Sv is described by the zero-order transfer functions in Equation 2.3.

Pv = αPr,  Rv = Rr   (2.3)

The position of the virtual hand Pv ∈ R3 is directly derived from the position of the interaction device Pr ∈ R3, multiplied by a scaling factor α to match possibly different scales of the real and virtual coordinate systems. The rotation of the interaction device Rr is applied at a 1:1 ratio to the virtual hand's rotation Rv. Both zero-order transfer functions are also called linear mappings, resulting in an intuitive manipulation because they directly simulate our interaction with everyday objects. In the following, this is referred to as Real World Metaphors. Due to the linear mapping, Virtual Hand methods are classified as isomorphic interaction techniques.

HOMER. To overcome the limitations of selecting objects using the Virtual Hand (see Section 2.2.1), the hybrid technique HOMER (hand-centered object manipulation extending ray-casting) [24] uses simple Raycasting for selection and the Virtual Hand for manipulation. Upon selection, the virtual hand travels to the object and is attached to it. Until de-selection, the interaction device pose is mapped onto the selected object.
After the user triggers de-selection, the virtual hand object travels back to its original position, which again coincides with the interaction device position. For manipulation, a scaling constant αh is calculated according to Equation 2.4.

αh = Do / Dh   (2.4)

It is defined as the ratio of the distance Do between the user and the virtual object upon selection to the distance Dh between the user and the real interaction device (real hand). During manipulation, the position of the virtual hand rv is linearly scaled using αh, as defined in Equation 2.5.

rv = αh · rr   (2.5)

Thereby, a user is allowed to position virtual objects within a large range during the manipulation.

2.3.2 For 2D Multi-Touch Devices

Touch input with multiple fingers for 2D object manipulation has become a de-facto standard on smartphones for transforming objects in 2D [72]. The direct mapping between finger touches and 2D object manipulation is straightforward and thus easy for users to understand. Various manipulation techniques have recently been designed for multi-touch displays to rotate, translate and scale objects in a three-dimensional manner. However, the implicit characteristics of a two-dimensional input device lead to several drawbacks regarding 3D manipulation. In contrast to, e.g., the Virtual Hand, full 6DOF manipulation as an integral process of positioning and rotation in one step is a tedious and not straightforward task to solve with 2D multi-touch input. As stated in [126], a multidimensional object can be characterized by its attributes and classified into two categories: integral structure and separable structure. According to the theory of the perceptual structure of visual information [5], visual object attributes are separable if their dimensions are perceptually distinct and identifiable. They yield an integral structure if they can be perceptually combined to form a unitary whole. For example, the position and rotation of a 3D object are two integral attributes; thus, full 6DOF manipulation can be defined as an integral task. According to this theory, relevant state-of-the-art methods for 3D object manipulation are coarsely classified by their characteristic of either performing the RST tasks separately (Manipulation Separability) or integrating translation and rotation into one compound task while treating only scaling separately (Manipulation Integrability).

In [79], one-, two- and three-touch input interaction techniques are presented to manipulate 3D objects on any kind of multi-touch display. By using three-touch interactions, simultaneous translation and rotation can be performed. This approach is limited to 5DOF and requires a large number of simultaneous touch inputs, which is not applicable to one-handed interaction on a mobile device. The Z-Technique [108] uses multi-touch input with two fingers and adjusts the depth position of the object by moving both fingers on the screen. This method requires prior knowledge of the specific two-finger gesture and does not provide 3D orientation manipulation. To handle full 6DOF manipulation, in [103] all DOFs are integrated and the technique allows the user to directly manipulate 3D objects with three or more touch points. This approach takes perspective into account and is promising, but requires at least three points and usually two hands for interaction input to access all 6DOF.
Instead of integrating all 6DOF, in [126] it is proposed to separate the 3D manipulation into translation and rotation, resulting in a 3DOF problem using 2D touch input. The approach combines the Z-Technique [108] to control the 3D position with [103] for orientation control. In [143], approaches are presented to separately translate, rotate and scale virtual objects with two fingers. Each technique decomposes the 3DOF tasks into subtasks with reduced degrees of freedom. Although only two fingers are required to provide 3D RST manipulation, a larger set of gestures needs to be known, which does not make the technique intuitive to use. In addition to these multi-touch techniques, manipulation metaphors that have been particularly designed for handheld MR are introduced in [101, 71]. Both approaches freeze the current real-world view for touch manipulation and aim at reducing faulty user input due to a shaky handheld environment. In [113], multi-modal input is used for 6DOF object manipulation. Translation is performed via touch sliders, and the handheld's inertial sensor data is directly mapped to the object to change its orientation. Scaling of the object is done through a pinch gesture using two fingers. The mobile device's inertial unit is also used in [124] to provide object translation in space. [23] investigates the use of the device's tilt as input for small-screen interfaces to control menus, scroll bars and the viewpoint. This early work is promising and can be extended to work as a 3D object viewer, but does not offer full 6DOF manipulation control of an object. In [65, 66], natural feature tracking is used to estimate the 6DOF device pose. The authors compare the usage of keypad buttons with one-handed physical movement of a phone in order to move and rotate a selected object. The rotation of the selected object is chosen based on the orientation of the phone in space after the selected object has been released. Intuitive 3D rotation of an object was the main motivation in [107]. This approach extends the virtual trackball metaphor by using a second phone as a rear input device. This allows for accessing the full sphere to control 3D rotation using simple touch gestures. This work is very interesting, but does not offer translation and scale operations and requires a special hardware setup. Travel techniques for mobile virtual environments using touch input for viewpoint translation and the built-in sensors to control the viewpoint's orientation are explored in [112], while in [121] an approach for sensor-based interaction with 3D data on a mobile device is proposed. It provides interaction techniques for gaming environments for translation and rotation, using touch input and the device orientation simultaneously for object manipulation. The proposed rotation requires touching an object with a finger and then rotating the device. Thereby, the object is fixed and the scene is rotated around it. This does not allow for intuitive manipulation. Since no detailed information about the rotation and translation algorithms and their limitations is given, this approach cannot be further evaluated in comparison to the approach proposed in Chapter III.4.

2.4 Summary

This chapter presents an introduction to and an overview of the theoretical foundations of 3D selection and manipulation, classified by tasks and metaphors.
Following the theory, a taxonomy for 3D object selection and manipulation in a handheld mixed reality scenario is introduced that is further applied throughout this part of the thesis. Subsequently, related state-of-the-art techniques for 3D object selection and manipulation that are suitable for one-handed handheld mixed reality environments are presented.

Chapter 3
3D Selection in Handheld Mixed Reality

To address the limitations of existing selection techniques for handheld mixed reality scenes, as described in Section 1.1, DrillSample is presented as a novel technique for 3D object selection in dense virtual scenes. DrillSample is a two-step technique providing precise selection and disambiguation of visible, partly occluded or invisible objects, which can also be highly similar in appearance. To cope with imprecise finger input, it allows the user to confirm the object indication in an optional second refinement step if more than one object has been selected in the initial step. In the refinement step, all indicated objects are presented to the user as 3D virtual clones for confirmation, which is again achieved using a single tap input. The original 3D spatial context of the selected objects is preserved in this detailed visualization view. Compared to competing approaches, DrillSample only requires single-tap input for object indication and confirmation while being fully 3D context preserving.

(a) Object indication by single tap  (b) Confirmation with preserved context
Figure 3.1: The two-step DrillSample technique.

In Figure 3.1, an example of selecting an object in a highly occluded scene using DrillSample's indication and refinement capabilities is illustrated. Within this chapter, the methodological approach to developing the novel technique is first presented, followed by a thorough user study to enable an in-depth evaluation and comparison with competing approaches.

3.1 Requirements

To achieve the research objective from Section 1.2, requirements were specified that must be fulfilled by the 3D selection technique. When designing a selection technique for handheld mixed reality, there are important factors that influence performance and ease of use. Since precise selection in dense one-handed handheld mixed reality environments should be achieved, the application scenario's specific characteristics must be taken into account during selection design as well as for the choice of a baseline technique to guarantee a fair evaluation. The requirements can be summarized as follows:

Single I/O Device: Input and output comprise a single device. Thus, independent tracking of the user's interaction and output devices is not available, in contrast to other mixed reality scenarios. The handheld device's pose needs to be tracked by appropriate techniques, such as Inside-Looking-Out optical tracking.

Limited Gesture Complexity: Touch input by fingers can be imprecise due to the large area the user's fingertip covers on the screen. Since there is only one hand available for interaction, complex multi-hand and multi-finger gestures cannot be applied to improve selection precision.

3.2 Design Guidelines

Based on our motivation and the outlined requirements, we developed the following design guidelines to enable precise selection in a one-handed, dense, handheld AR environment.

Keep Direct Touch Abilities: One of the most appealing aspects of touch displays is the ability to directly "touch" an object in order to select it.
We aim to support this direct manner and do not introduce an offset to the cursor, due to the disadvantages mentioned in Section 2.2.2.1.

Keep Interaction Simple: Since multi-finger interaction is not a straightforward metaphor and requires prior knowledge of specific gestures, we aim to reduce the complexity of the user's touch input for object selection. Only one-finger input should be applied to allow precise object selection. Two-finger input using a single hand should only be applied for optional interaction, such as detailed inspection of selected objects.

Enable Disambiguation and Unique Selection: Since objects can be partly occluded or even invisible in dense virtual scenes, it is important to provide a technique that supports the selection of these objects. Furthermore, objects can be highly similar in visual appearance. Thus, it is important to present multiple selected objects in the correct spatial context to assist object disambiguation while taking the limited screen size into account.

3.3 The DrillSample Technique

Inspired by Raycasting and Expand (see Section 2.2.2), the novel selection technique DrillSample was designed in an iterative fashion according to the outlined guidelines while meeting the specified requirements. The workflow of DrillSample is illustrated in Figure 3.2.

(a) One-finger target indication  (b) DrillSample visualization  (c) Zero-finger inspection (rotate)  (d) One-finger inspection (browse)  (e) One-finger inspection (zoom)  (f) One-finger target selection
Figure 3.2: DrillSample's two-step selection process.

It requires single-device tracking and only one-finger input to select an object in a two-step interaction process. The selection method provides one initial indication step (see Figure 3.2a) and an optional refinement step (see Figure 3.2b) for selection confirmation in case multiple objects are indicated. By visualizing the indicated objects in a spatial-context-preserving manner during the refinement step, disambiguation of partly or fully occluded objects as well as of objects with very high similarity is provided. This type of visualization, and thus the technique's name, was motivated by taking a drill (soil) sample for geological measurements to visualize and analyze the different segments. In case only one object is indicated, it is immediately confirmed by the DrillSample technique, as proposed in the original Raycasting method. Thereby, simple selection tasks do not suffer from additional interaction steps.

3.3.1 Selection Design

DrillSample starts with a single tap on the screen, which triggers Mobile Raycasting, as described in Section 3.3.2. Instead of selecting only the first (and closest) scene object, it returns all objects that have been intersected by the virtual ray. In the second, refinement step, this set of objects is presented to the user as 3D virtual clones by visualizing them as if they were "pulled" out of the virtual scene, thus constituting the drill sample. All clones are rendered on a solid gray background (see Figure 3.1b) with the live tracking turned off. The drill sample is aligned parallel to the horizontal axis of the image plane and the clones are arranged on a horizontal line. This is referred to as the DrillSample Visualization. The x- and y-positions of the clones' centers correspond to the hit points of the ray with the original objects, while the depth information is represented by the clone's position on the horizontal line of the DrillSample visualization.
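One plausible reading of this arrangement is sketched below; it is an illustration only and not the thesis implementation: the depth of each hit along the ray is spread out along the horizontal line, while the simple depth_scale parameter stands in for the ray-length optimization discussed in Section 3.3.4.1.

```python
import numpy as np

def layout_drill_sample(hits, depth_scale=1.0):
    """Hedged sketch of the DrillSample clone arrangement.

    hits: list of (hit_point_3d, distance_along_ray) tuples, ordered as the
          virtual ray intersected the objects (first hit first).
    Returns the clone positions in the visualization view, arranged on a
    horizontal line; depth_scale is a simplified stand-in for the ray-length
    optimization of Section 3.3.4.1.
    """
    first_distance = hits[0][1]
    positions = []
    for hit_point, distance in hits:
        x, y, _ = np.asarray(hit_point, float)
        # x/y follow the original hit point; the former depth along the ray is
        # re-mapped onto the horizontal line of the DrillSample visualization.
        positions.append(np.array([x + (distance - first_distance) * depth_scale,
                                   y, 0.0]))
    return positions
```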
The spatial context of the indicated objects from the original scene layout is preserved upon casting by the virtual ray, which extends the idea of Expand into the depth domain. Thereby, simple disambiguation of selected objects that are occluded or of similar visual appearance is provided. The DrillSample visualization allows for a detailed inspection of the indicated objects through the following interactive options:

- By using the handheld's built-in Inertial Measurement Unit (IMU)¹, the user can rotate the drill sample around a pivot point at the screen center to inspect objects from different angles (see Figure 3.2c).
- By applying a horizontal one-finger swipe gesture, the user can browse through the clones by traveling along the horizontal line (see Figure 3.2d).
- With a vertical one-finger swipe gesture (or an undirected two-finger pinch gesture), the virtual camera can be zoomed in and out to provide a detailed view or an overview of the DrillSample visualization (see Figure 3.2e). This interaction is especially helpful on small displays to gain a quick overview if many objects have been selected.

¹ Depending on the hardware, the Inertial Measurement Unit consists of an accelerometer, gyroscope and magnetometer.

The refinement step is finished by confirming the desired object from the drill sample using a single-finger tap gesture (see Figure 3.2f). Upon confirmation, the user is informed about the selection, the 3D clones are removed from the scene and the live tracking view is rendered on the display again. A formal description of the DrillSample state flow is given in Figure 3.3, where the specific gestures throughout the different phases are depicted.

Figure 3.3: State diagram for DrillSample selection.

DrillSample is especially useful in dense environments but also works well in sparse scenes where only single objects are selected. While the selection process requires additional time in case of multiple object indication, an object is immediately confirmed and selected if only one object was hit by the virtual ray in the first step.

3.3.2 Mobile Raycasting

Original Raycasting (Section 2.2.2.1) requires the user's head and interaction device to be tracked to calculate the pointing direction.

Figure 3.4: Ray-Casting adapted for use in handheld mixed reality.

This is not applicable to handheld mixed reality, as there is only one tracked object (the mobile device) available. However, the approach for desktop mixed reality scenarios can easily be extended to a handheld scenario by applying a tap gesture, as depicted in Figure 3.4. In the absence of two points in space to calculate the virtual ray's direction, Mobile Raycasting can be seen as the transformation of the 2D tap point pt = (x, y) ∈ R2 from screen space to world space so that the ray is shot into the virtual scene perpendicular to the screen with the direction d⃗ray. As the virtual camera's parameters (image plane dimensions, field of view) are known and its 6DOF pose is given by the optical tracking, the 3D point PT ∈ R3 corresponding to pt can be obtained by applying a standard screen-to-world-space re-projection. Mobile Raycasting can then be performed along r⃗(Pc, d⃗ray) ∈ R3, with Pc ∈ R3 being the virtual camera's position and d⃗ray the direction from Pc to the back-projected point PT. After the ray is shot, all objects are tested for intersection and one of the following options is applied:

1. All hit objects are returned as a set of clones S(Oi), i = 1...N.
This is applied during DrillSample selection.

2. The first hit object S(O1) is selected, as in the original Raycasting approach.

Each object Oi contains the object's orientation upon selection osel ∈ R4, the ray hit point's position phit ∈ R3 as well as the object's geometry G.

3.3.3 Algorithm

To formalize the illustrated selection process, the proposed DrillSample technique is described in pseudo code in Algorithm 3.1.

3.3.4 Crucial Aspects of the Algorithm

Initial tests of the algorithm revealed that certain rotations during refinement (see Figure 3.2b) should be restricted, since they were found not to be beneficial for the users' perception of the spatial context or were even confusing. Most importantly, all rotations around the roll axis and rotations around the pitch axis in [180°, −180°] should be discarded. Thereby, the DrillSample is always aligned with the horizontal screen axis, with the first object that was hit positioned on the left side (see Figure 3.1b). Furthermore, the 1:1 mapping between device and DrillSample orientation proved to be too cumbersome for inspecting the objects from their back sides. Therefore, it was found to be useful to speed up the rotation around the yaw axis by a factor of 3 and around the pitch axis by a factor of 1.5. To provide a reasonably refined visualization of the virtual clones, there are two critical aspects that are discussed in the following sections:

1. The length of the DrillSample line needs to be optimized while preventing intersection of the clones and preserving their relative distances.
2. The optimal Z-position of the DrillSample relative to the virtual camera must be obtained.

Algorithm 3.1: DrillSample selection technique in pseudo code.
Set: selectedObject ← NULL
Set: objectConfirmed ← false
Set: list of hit objects S(O) ← ∅
Step 1: Target Indication
    Detect tap point pt ∈ R2 and perform raycast along r⃗(Pc, d⃗ray)
    Obtain S(Oi), i = 1...N
if |S(O)| > 1 then
    Step 2: DrillSample Construction
        Calculate and optimize DrillSample length l (Section 3.3.4.1)
        Calculate pivot point ppiv at the center of all hit points phit,i
        Rotate objects in S(Oi) around ppiv so that l is parallel to the image plane's horizontal axis
        Perform Z-positioning of the DrillSample (Section 3.3.4.2)
    Step 2a: DrillSample Inspection
    while objectConfirmed = false do
        if rotation of mobile device then
            Map orientation_device to the DrillSample around ppiv
        end
        if horizontal swipe gesture then
            Obtain tap point ps ∈ R2 and swipe direction d⃗s
            Use d⃗s to travel along the DrillSample if it spans multiple screens
        end
        if vertical swipe gesture then
            Obtain tap point ps ∈ R2 and swipe direction d⃗s
            Use d⃗s to zoom in/out (Section 3.3.4.2)
        end
        Step 3: Object Confirmation
            Detect tap point pt ∈ R2 and perform raycast along r⃗(Pc, d⃗ray)
            Set: objectConfirmed ← true
            Set: selectedObject = S(Osel)
    end
    Set: S(O) ← ∅
    Set: DrillSample = NULL
else
    Set: selectedObject = S(O1)
end

3.3.4.1 Length of the DrillSample Ray

Since the relative distance of the objects to each other is sufficient to preserve the spatial context, the real length of the ray should be scaled for visualization to provide an optimal overview. If the objects are far away from each other, the ray might be shortened, or stretched to reveal objects that are inside of another (e.g. a ball in a bucket). The optimal amount by which the ray should be scaled depends on the shortest distance between the convex hulls of two neighboring objects along the direction of the ray.
For objects with overlapping hulls, the distance is specified as a negative value, and as a positive value otherwise. Assuming n objects on the DrillSample, with the shortest distance between the (n − 1) neighboring pairs denoted by di, the length of the ray x is then computed by

x = −di · (n − 1)   (3.1)

The precise calculation of these distances can be computationally costly, especially in dense environments with complex shapes. To minimize the computational load, we chose an approximation with linear complexity by treating all objects as spheres (see Figure 3.5), with the maximum extent of the object's bounding box used as the radius and the hit point as the center.

Figure 3.5: Sphere approximation of the clones' size to calculate the optimal ray length.

For objects whose center point is not close to the ray or which have a complex concave shape, this may not be visually pleasing, as it overestimates the real distance between neighboring objects. More elaborate algorithms can be employed in the future to enable an optimal adjustment of the length.

3.3.4.2 Z-Position of the DrillSample

The proposed algorithm visualizes an overview of all ambiguously indicated objects. Depending on the spatial properties of the objects and their relation to each other, this can result in the following challenges.

1. The more the distances between the clones vary, the less the DrillSample ray can be compressed. To provide an overview of all clones on one screen, the ray must be positioned at a greater distance from the virtual camera. This could result in small clones being barely visible.
2. The more the size of the clones varies, the less likely there is a distance to the virtual camera at which all clones are nicely visible. Small objects may appear too small, or big objects might be clipped at the near image plane.
3. The more objects are selected, the less likely the overview provides a meaningful starting point for refinement, as the clones in the overview appear too small, as in 1.

Thus, the distance between the virtual camera and the DrillSample depends on the size of the clones and their relative distances to each other. The distance Dov of the virtual camera required to obtain an adequate overview can be calculated as denoted in Equation 3.2.

Dov(BDSS) = exp / tan(fov · 0.5) + BDSS(z)   (3.2)

where

exp = BDSS(y) if RB < Rfov,  exp = BDSS(x) if RB ≥ Rfov
fov = fov(y) if RB < Rfov,  fov = fov(x) if RB ≥ Rfov

Here, BDSS ∈ R3 is the DrillSample's axis-aligned bounding box represented as an expansion vector, Rfov is the aspect ratio of the virtual camera's field of view, RB is the aspect ratio of the bounding box's side facing the camera and fov is the field of view of the virtual camera. Additionally, it has to be ensured that neither the near nor the far clipping plane of the virtual camera is violated. The interval in which users may modify the distance of the camera to the DrillSample with a vertical swipe gesture (see Figure 3.2e) is then limited to [Dov(BC), Dov(BDSS)], with BC being the bounding box of the biggest clone on the DrillSample. It should be noted that, depending on the specific application, other schemes for the computation of Dov might be suitable.

3.4 Performance Studies

For a comprehensive evaluation of the proposed selection technique, a summative evaluation was conducted by comparing DrillSample with the two baseline techniques Mobile Raycasting and Expand, as described in Section 3.4.1, across three different selection scenarios with different variations of object density and visibility.
3.4.1 Baseline Techniques

Most of the virtual pointing techniques discussed in Section 2.2.2 were originally not designed for handheld mixed reality environments, while popular multi-touch selection techniques aim at selecting 2D objects. Thus, a direct comparison of these techniques is hard to obtain. Related work [143] introduces a qualitative evaluation of 3D selection techniques in handheld 3D environments. For performance analysis, they propose an adaptation of Go-Go using swipe gestures to adjust the virtual arm length and multi-touch input to select an object. However, this adaptation changes the direct mapping between virtual hand, arm length and target object of the original algorithm and therefore does not allow for a clean and fair performance evaluation of selection techniques in handheld mixed reality. For the summative evaluation of DrillSample, Raycasting [62] and Expand [119] were chosen as baseline techniques since they are both applicable in dense environments, they can be adapted to function in one-handed handheld MR without changing the original mapping characteristics during interaction, and both fulfill the requirements from Section 3.1. Furthermore, Expand is a two-step technique as well and thus acts as a valid baseline for performance measurements regarding selection speed and number of interaction steps.

3.4.2 Adaptations for Handheld Mixed Reality

For the study, the adapted Mobile Raycasting from Section 3.3.2 was employed in its single-selection mode. As described in Section 2.2.2.2, Expand is a two-step technique in which virtual scene objects are selected using Cone-Casting [18] and are presented in a second refinement step, aligned on a virtual grid, for object confirmation. To use it for one-handed mixed reality, one-finger tap gestures are employed within the three phases of our Mobile Expand adaptation, which can be described as follows. For object indication, a cone cast is performed, similar to the Mobile Raycasting approach, using a single tap on the device's screen to indicate the cone's direction. All objects intersecting the cone are subject to a second refinement step. This second step is preceded by a non-interactive animation showing the objects moving from their original positions upon cone-casting to their designated positions on the virtual grid. During the refinement step, all cast objects are presented aligned on a grid in front of a solid gray background. The selection is confirmed by a tap gesture on the desired object. Since the original publication [119] does not provide detailed information about the grid alignment, the following positioning onto the grid was performed for Mobile Expand:

1. For i objects intersecting the cone, a dynamically sized grid is created with m ≥ 4i cells to provide a sufficient number of positions to resemble the original context of the cast objects. For visualization, each cell is represented by its center point ci ∈ R2 on the screen.
2. For each object i, its projected 2D screen position pi ∈ R2 is calculated.
3. For each screen position pi, the closest ci is determined by evaluating the Euclidean distance ||pi − ci||, and object i is placed into the cell with center point ci.

For purposes of simplification, the mobile adaptations are hereinafter referred to as Raycasting and Expand.

3.4.3 Objectives

The main goal of the experiment is to evaluate the performance and ease of use of DrillSample compared to competing techniques.
In this study, we focus on the selection of objects at closer range in dense environments. A second objective is to examine the performance of the spatial context preservation of our proposed algorithm in environments with objects of high visual similarity. In designing the experiment, we formulated the following hypotheses:

H1: Raycasting will be best suited for non-occluded objects.
H2: Expand and DrillSample will perform considerably better than Raycasting, in terms of speed and precision, in environments with overlapping, partly occluded or invisible objects that differ significantly in appearance.
H3: Expand will suffer in environments with objects of high visual similarity. Accordingly, DrillSample will perform considerably better than Expand in terms of speed and precision.

3.4.4 Experimental Design and Procedure

We conducted the study using a 3x3 within-subjects factorial design where the independent variables are selection technique and task scenario.

Figure 3.6: User study procedure.

The selection techniques are Raycasting, Expand and DrillSample, while the scenarios comprised three different experimental tasks with varying selection conditions at close range. The dependent variables are Task Completion Time and Number of Selection Steps. Task completion time represents the time it takes to successfully finish a specific scenario from the moment the user started it, while the number of selection steps comprises the number of object selections necessary to successfully finish a selection task. This measure indicates the precision of the applied technique. Furthermore, we measured user preferences for each technique in terms of speed, accuracy, and ease of use. The user study procedure is depicted in Figure 3.6. The user study consists of a pre-questionnaire followed by a practical test and a post-questionnaire. The material of the study is presented in Appendix VI.A. It took each participant approximately 25 minutes to finish the user study.

Q1  What is your gender?
Q2  How old are you?
Q3  About how often do you play video games?
Q4  What percentage of your gaming is playing mobile 3D games?
Q5  Do you have a multi-touch smartphone?
Q6  Do you have any flexibility or pain issues with your primary hand, fingers or arm?

Table 3.1: Pre-Questionnaire

Q1  How adequate do you feel the time allotted for practice was?
Q2  How comfortable were you with using a smartphone for task completion?
Q3  How would you rate the RAYCAST selection technique in terms of usability? Speed? Accuracy?
Q4  How would you rate the EXPAND selection technique in terms of usability? Speed? Accuracy?
Q5  How would you rate the DRILLSAMPLE selection technique in terms of usability? Speed? Accuracy?
Q6  Rank the three selection techniques in order of desired use (with 1 being the most desired).
Q7  When determining how much you like using a selection technique, how important an influence on your decision was usability? Speed? Accuracy?
Q8  Regarding the visualization during the refinement process of the DRILLSAMPLE technique, how helpful and useful was the linear arrangement for spatial visualization?

Table 3.2: Post-Questionnaire

At the beginning of the study, each participant was asked to read and sign a standard consent form and to complete a pre-questionnaire, as described in Table 3.1. Upon completion, the participant was given a detailed description of the practical part about "Selection in Handheld Mixed Reality".
A tutor coached them on how to use the handheld device and how to perform selection in the testing environment. Afterwards, each participant had five minutes to practice the three selection techniques. Once they started the study, they were not interrupted or given any help. Upon completion of the practical part, they were asked to fill out a post-questionnaire (see Table 3.2). Of the 28 participants, ranging from 23 to 38 years of age, 12 were female and 16 male. Twelve users had no experience with playing mobile 3D games and 7 had no experience with smartphones. One person reported occasional severe pain issues in her/his primary hand's wrist. All 28 participants yielded successful simulation trials, and all of this data was used for analysis.

3.4.5 Implementation

All computations - tracking, rendering, selection and manipulation of virtual objects - are performed on a smartphone running Android OS. For developing and testing the proposed interaction techniques, the Virtual and Augmented Reality Framework ARTIFICe is used, which is described in more detail in Part IV of this thesis. To access touch inputs on the mobile device screen for triggering interaction, ARTIFICe uses Unity's [167] built-in Android interface to access the hardware layer. 6DOF pose data from Vuforia [163] is processed by the specific interaction technique (IT) and handed to the ARTIFICe interaction framework. Using ARTIFICe's interaction interface, all required selection techniques (DrillSample, Mobile Raycasting, Mobile Expand) as well as the manipulation techniques (3DTouch, HOMER-S) for the performance study in Section 4.2 have been implemented. The practical test ran on a Samsung Galaxy S II I9100, featuring an ARM Cortex-A9 dual-core processor, a 4.27" WVGA multi-touch display and an 8 megapixel camera. The Galaxy S II weighs 116 g and has physical dimensions of 125.3 x 66.1 x 8.49 mm. The phone was protected with a commercially available hard cover to minimize the risk of canceling the simulations by mistakenly pushing the buttons on the side.

3.4.6 Test Scenarios

We built three different scenarios to cover different selection situations in dense 3D environments. They ranged from unique and un-occluded to non-distinguishable and fully occluded object selection tasks. Thus, we used occlusion and visual similarity as variables for task design. As the underlying building block [6] for interaction design, we applied the canonical task Selection, which refers to the task of acquiring a particular object from the entire set of objects available (see Section 2.1.1.1).

All scenarios are based on the same virtual working ground (a black & white textured plane) that was printed on paper at 56x40 cm and acted as a visual planar marker for the natural feature tracking toolkit [163]. The marker was placed on a table positioned at the center of a room so that users had around 150 cm of obstacle-free space to work within. All 28 users completed the three scenarios in random order. Each scenario featured a simple description of the upcoming task. The participants could inspect the scenario, without being able to manipulate it, in order to understand the task according to its description before starting the actual test.

(a) Scenario 1  (b) Scenario 2  (c) Scenario 3
Figure 3.7: The three test scenarios of the performance user study.
The three scenarios are depicted in Figure 3.7 and are defined as follows:

Scenario 1: Unique Object & No Occlusion. The user was challenged to select a green cube in the middle of the working ground, which was cluttered with around 80 other cubes of the same size but of different color (see Figure 3.7a). The targeted object was easy to distinguish and not occluded by any of the objects in the scene. As soon as the user selected or confirmed the selection of the green cube, the task finished automatically.

Scenario 2: Unique Object & Strong Occlusion. The user had to select a green brick in the lower right corner of a wooden-textured box (see Figure 3.7b). The box contained four stacks of differently colored, equally sized bricks. The targeted object was located at the very bottom of the last stack and was the only brick colored green. Although it was easy to distinguish, it was hardly visible due to the strong occlusion by the bricks stacked on top of it and the box's walls. Again, on selection of the targeted object, the task finished automatically.

Scenario 3: Not-Unique & Strong Occlusion. In this scenario, the user again had to select a brick from a wooden-textured box (see Figure 3.7c). The box contained four stacks of equally sized bricks. All bricks were colored light blue except for the bricks of the second stack, which had a magenta-colored texture. The targeted object was located at the very bottom of the magenta-colored stack. It was only distinguishable by its position in the stack and was hardly visible due to strong occlusion by the bricks stacked on top of it and the box's walls. The number of bricks on top of the targeted object varied randomly for each participant, from four to seven pieces.

3.5 Experimental Results

Based on the performance study, we conducted an evaluation of the quantitative data to examine the performance of the three techniques and a subjective evaluation regarding the users' preferences and feedback.

3.5.1 Quantitative Evaluation

The quantitative data gathered from the questionnaires and the automatically collected data of the test application were analyzed with Friedman's χ² test [123, 106] and repeated-measures single-factor ANOVA [55], on both Task Completion Time and Number of Selection Steps (see Section 3.4.4) as well as for each scenario (see Section 3.4.6). When suitable, we further employed post-hoc analyses using pairwise t-tests or the Wilcoxon signed-rank test [123, 106] with Holm's sequential Bonferroni correction [7]. We focused on two different aspects during data analysis. Firstly, the data of all participants regarding selection techniques was evaluated, and secondly, we analyzed the techniques' performance depending on the tasks.

3.5.1.1 Performance Evaluation

The evaluation of the completion time shown in Figure 3.8 indicates significant differences between the three interaction techniques with (F2,54 = 6.74, p < 0.00243) for all tasks on average, but also with (F2,54 = 9.27, p < 0.00035), (F2,54 = 21.84, p < 1.1e−7) and (F2,54 = 4.91, p < 0.011) for tasks 1-3 separately.

Figure 3.8: Mean completion time per task and on average.

The pairwise t-test shows that only DrillSample is significantly faster than Raycasting, with (t27 = 4.33, p < 0.00018) in the overall mean completion time. For task 1, the techniques Raycasting and DrillSample score significantly better than Expand with (t27 = −3.82, p < 0.0007) and (t27 = 2.65, p < 0.0134).
This is most likely because Expand uses a cone-cast to select objects, which results in a refinement step more often than DrillSample, which casts a ray. No significant difference was measured between Raycasting and DrillSample. For task 2, the techniques with an additional refinement step proved to be faster than Raycasting, with Expand at (t27 = 7.8545, p < 1.9e−8) and DrillSample at (t27 = 3.73, p < 0.0009); however, no significant difference between DrillSample and Expand could be found. Here, Raycasting forces the user to successively select and put objects away until the desired object is easily accessible, which is very time-consuming. In task 3, the users required significantly less time when using DrillSample, compared to Raycasting (t27 = 3.24, p < 0.0031) or Expand (t27 = 2.6, p < 0.0148). Raycasting fails as it did in task 2, because both problems force the user to move objects out of view step by step. Expand scores much worse than in task 2 because the targeted object cannot be distinguished outside of its spatial context and because Expand only aligns the objects on a two-dimensional grid. Between Raycasting and Expand no significant difference could be found.

Significant differences can be seen in Figure 3.9 for the results of the number of selections for task 2 and task 3 as well as on average, each with (F2,54 = 10.98, p < 0.0001). Task 1 shows no significant differences at (F2,54 = 0.491, p < 0.615), suggesting that all selection techniques perform well in the simplest case.

Figure 3.9: Mean selection steps per task and on average.

The pairwise comparison for selection steps on average found the techniques Expand (t27 = 15.29, p < 8.04e−15) and DrillSample (t27 = 18.83, p < 4.7e−17) to be significantly better than Raycasting, but showed no significance between the two at (t27 = 1.31, p < 0.2). Similar to the task completion time, the number of selection steps in task 2 was significantly smaller for Expand at (t27 = 18.4512, p < 7.78e−17) as well as for DrillSample with (t27 = 13.93, p < 7.55e−14) compared to Raycasting, but Expand (t27 = −2.2, p < 0.036) also appears to be slightly less error-prone than DrillSample. Expand benefits in task 2 from the fact that the targeted object is easily distinguishable, but also from its coarse selection volume, whereas techniques casting a ray may have a hard time hitting an object that is only slightly visible. In task 3, as for the average completion time, we found DrillSample to have fewer false selections than Raycasting (t27 = 16.87, p < 7.29e−16) and Expand (t27 = 2.61, p < 0.0146). Additionally, Expand is significantly better than Raycasting at (t27 = 8.34, p < 6.01e−9), too. A possible cause for Expand scoring worst in terms of completion time, but not in the number of false selections, could be that each refinement step costs extra time for the visualization but also gives users another chance to choose the targeted object. On average, DrillSample outperforms Raycasting and Expand, both in completion time and in the number of selection steps.

3.5.2 Subjective Evaluation

Besides the performance measures based on quantitative data, we also examined the users' subjective evaluation of the speed and accuracy of each technique. Furthermore, we also include the abstract performance value "ease-of-use" [32] to further evaluate the capabilities of the underlying technique.
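The subjective ratings reported below were analyzed with the nonparametric part of the pipeline from Section 3.5.1, i.e. Friedman's test followed by pairwise Wilcoxon signed-rank tests with Holm's sequential Bonferroni correction. A minimal sketch of that procedure on hypothetical placeholder ratings (not the study's actual data) could look as follows:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical 7-point Likert ratings (rows: participants, columns: techniques).
ratings = np.array([[6, 4, 7],
                    [5, 3, 6],
                    [6, 5, 7],
                    [4, 4, 6]])          # placeholder data only
raycast, expand, drill = ratings.T

# Omnibus comparison across the three techniques (Friedman's chi-square test).
chi2, p_friedman = stats.friedmanchisquare(raycast, expand, drill)

# Post-hoc pairwise Wilcoxon signed-rank tests.
pairs = [("Raycasting vs. Expand", raycast, expand),
         ("Raycasting vs. DrillSample", raycast, drill),
         ("Expand vs. DrillSample", expand, drill)]
p_values = [stats.wilcoxon(a, b).pvalue for _, a, b in pairs]

# Holm's sequential Bonferroni correction over the pairwise p-values.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
```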
When answering questions Q1-Q5, Q7 and Q8, users were able to choose from a 7-point Likert scale [2]. While all questions have their highest rating at seven and their lowest at one, Q1 has its best rating at four (appropriate). The participants found the time allotted for practice appropriate (µ = 3.93, σ = 0.25 at α = 0.05). Using a smartphone to complete the different tasks was rated as moderately comfortable (µ = 5.72, σ = 0.98 at α = 0.05). As depicted in Figure 3.10, all three techniques were rated at least above average, but with significant differences regarding speed (χ²₂ = 10.48, p < 0.0053), ease of use (χ²₂ = 9.53, p < 0.0085) and accuracy (χ²₂ = 15.27, p < 0.00048).

Figure 3.10: Users' average rating of Q3, Q4 and Q5.

Only DrillSample was found to be significantly faster than Expand in the pairwise comparison (Z = −2.63, p = 0.0085). Due to the Bonferroni adjustment, Raycasting failed to be significantly faster than Expand with (Z = −2.088, p = 0.0368). Raycasting was not found to be significantly different from DrillSample (Z = −1.0558, p = 0.29108). Expand was likely rated lower than the other techniques because it triggers refinement too often, while DrillSample only asks for refinement if objects overlap. Using Raycasting, users are not interrupted by a refinement step and might therefore consider it faster. Users' ratings on ease of use found DrillSample significantly better than Raycasting and Expand at (Z = −2.84, p < 0.0045) and (Z = 2.91, p < 0.0036). Raycasting was not significantly different from Expand with (Z = −0.89, p = 0.371), even without the Bonferroni adjustment. Similarly, users found DrillSample significantly more accurate than Raycasting (Z = −2.69, p < 0.007) and Expand (Z = −3.17, p < 0.0015). Likewise, Raycasting showed no significant difference to Expand at (Z = −1.23, p = 0.218). Both Raycasting and Expand are neither easy to use nor accurate if objects are occluded or look very similar. Hence, both factors result in a tedious, and in the case of Expand even confusing, sequence of interactions to select the desired object.

For question Q6, which asked the participants to rank the selection techniques in order of desired use, significant rankings for 1st (χ²₂ = 18.5, p < 9.6e−5), 2nd (χ²₂ = 12.29, p < 0.0021) and 3rd (χ²₂ = 9.91, p < 0.007) could be found, as shown in Figure 3.11.

Figure 3.11: Users' rating of Q6.

Rank one was clearly given to DrillSample, significantly outranking Raycasting and Expand with (Z = −3.54, p < 0.00039) and (Z = −3, p < 0.0027). Rank two was given to Raycasting, significantly outranking DrillSample and Expand with (Z = −2.98, p < 0.0028) and (Z = −2.45, p < 0.014). Rank three seems to be given to Expand; however, it only significantly outranks DrillSample with (Z = −2.83, p < 0.004) but not Raycasting at (Z = −2.04, p = 0.041), due to the Bonferroni adjustment. All other pairwise interaction technique tests show no significant difference. When answering Q7, users rated all its aspects as evenly important, with 6 (important) or higher. Addressing in Q8 how helpful the spatial visualization is, the participants found it useful to very useful (µ = 6.5, σ = 0.1 at α = 0.05).

3.5.3 Qualitative Evaluation

Based on the 3D formalization principles of [32], we outline a number of factors for the canonical interaction task 3D Selection that influence performance in virtual environments.
Since all three evaluated selection techniques are suited or explicitly designed for dense environments, we do not include "density" as a performance factor. The specified factors are:

1. Object Size: This object property is related to the geometric area a 3D object covers on the output device screen. A selection technique must be capable of selecting objects of varying size.

2. Occlusion: In any virtual environment, but especially in a dense environment, objects can partially or fully occlude each other, which may result in invisible objects. In such environments, selection must be precise and provide an assisting visualization to identify occluded objects.

3. Visual Appearance: The visual appearance of virtual objects can be highly similar. Identifying the desired target object can then become a problem in dense environments with occluded objects. In such environments, selection must provide an assisting visualization to disambiguate the desired object.

Based on the results from the quantitative as well as the subjective evaluation, we summarize our findings with respect to the proposed parameters in Table 3.3.

Parameters      Object Size     Occlusion   Appearance
Raycasting      − [119, 62]     −           ◦
Expand          + [119]         +           −
DrillSample     −               +           +

Table 3.3: Evaluation of selection techniques in handheld mixed reality.

Previous work [119, 62] reports that Raycasting performs badly for objects covering only a small portion of the screen, while Expand performs well in the same case by casting a volume instead of a single ray. Beyond that, our findings indicate that Raycasting is well suited for selecting non-occluded objects, which may also be similar in appearance. However, if the desired object is small and located amongst similar looking objects, imprecise touch input can cause wrong selections. Compared to Raycasting, Expand is well suited to select visible or fully occluded objects of varying size. But the grid representation during the refinement step does not provide full spatial correspondence to the original positions of the selected objects; hence, precise selection of an object from a set of similar looking objects can be difficult and can result in wrong selections. DrillSample also lacks accuracy when selecting small objects due to the underlying use of Raycasting in combination with imprecise single touch input. However, since DrillSample selects all objects hit by the ray, overlapping or occluded objects can be precisely selected due to DrillSample's refinement step. Here, spatial context preservation provides a full overview that allows object disambiguation, which is especially of interest when selecting from a set of similar looking objects.

3.6 Discussion

We designed the experiment to compare three different techniques in terms of speed, precision and ease-of-use for performing 3D selection tasks with a multi-touch handheld device in a dense mixed reality scene. Many of the outcomes of our performance study were statistically significant, which enables us to draw multiple meaningful conclusions. In H1 we proposed Raycasting to be best suited for the selection of non-occluded objects. The completion time results for task 1 support H1, since Raycasting significantly outperforms Expand. H1 is further strengthened by the subjective evaluation, where users considered Raycasting to be fast. DrillSample also performed significantly better than Expand for task 1.
This indicates the strength of techniques casting a ray instead of casting a cone for visible object selection in close range, since a ray selects fewer objects. Thereby, just a few objects need to be presented at Drill- Sample's renement step, while Cone-Casting is always coarser. There, more objects are presented during a renement step, which takes more time for a user to get an overview before indicating the desired object. Therefore H1 can be supported to be true in terms of speed. Regarding precision, neither performance nor subjective evaluation revealed statistical signicance to back up H1. Therefore, we must state H1 to be not true in terms of precision. Results for evaluating speed and precision, when selecting almost fully occluded ob- jects, clearly reveal Expand's and DrillSample's strengths. Both perform signicantly faster and need less selection steps than Raycasting, which supports H2. Since no signi- cant dierence in completion time and interaction steps between Expand and DrillSample could be found, H2 can be backed up further. These results indicate that Expand and DrillSample are both equally suited for selecting an occluded object, which highly dif- fers in appearance from the surrounding ones. Regarding precise selection of occluded objects with high visual similarity, DrillSample signicantly outperforms both baseline techniques in terms of completion time and number of interaction steps. Based on these results, H3 can clearly be supported. It proves the advantage of our proposed spatial context preservation compared to the grid representation that Expand provides. The disadvantage of Expand's detailed visualization becomes even more apparent, since no signicant dierence in completion time could be found between Expand and Raycasting. Regarding users' preference, the subjective evaluation clearly reveals users' being in favor of DrillSample. It signicantly outranked both baseline techniques when users were asked for an overall ranking. This rst rank can further be conrmed when looking at the details. Users ranked DrillSample highest in terms of speed, precision and ease- of-use. It signicantly outperformed Expand in terms of speed, but not Raycasting. Since Raycasting does not provide a renement step, it tends to be considered fast and "direct". The DrillSample's capability to precisely select the desired object over all three test scenarios was ranked signicantly best in terms of precision. Finally, the users ranked DrillSample signicantly best in ease-of-use. Based on these results and ndings, we have developed a set of basis guidelines regarding object selection in closer range: ˆ Raycasting remains a good alternative selection technique for sparse selection tasks and as long as objects are fully visible. ˆ Expand remains a good alternative for visible or occluded objects of varying object size, as long as they dier in visual appearance. 144 3.6 Discussion ˆ For visible or occluded objects, independent of their visual appearance, DrillSample is the best general purpose method. 3.6.1 Variations of the Algorithm DrillSample is originally designed for multi-touch displays which allow for one-nger or two-nger processing. Tracking two independent contacts of the surface is only necessary for optional interactions in the DrillSample visualization view, thus the algorithm can be applied in various kinds of virtual environments with just one 2D or 3D interaction device. 
For example in a fully immersive environment, the user's interaction device can be used for Raycasting and object conrmation. Since the DrillSample visualization does not depend on display size but on the eld of view (FOV) of the user's output device, such as a Head Mounted Display (HMD), the Image-Plane technique [62] can be applied to show the indicated objects in front of the user in space. Furthermore, the rotation of the interaction device can be mapped to rotate the DrillSample for inspection. As described, only a few minor changes of the original algorithms are necessary to apply the technique in another type of mixed reality environment without changing is original mapping characteristics. 145 Chapter 4 3D Manipulation in Handheld Mixed Reality After a virtual object has been selected it can be subsequently transformed by translating, rotating and scaling it. However, using 2D touch input for 3D manipulation induces several problems, as described in Section 1.1. To address the limitations of existing 3D manipulation techniques for handheld mixed reality scenes, the two novel methods 3DTouch and HOMER-S are presented which both support RST manipulations. (a) View on user (b) Mixed reality view Figure 4.1: Touchless full 6DOF object manipulation using HOMER-S. 3DTouch provides 3D translation and rotation as was well as non-uniform scaling by combining simple 2D touch gestures with the handheld's current 6DOF pose. The 6DOF manipulation is decomposed into two separate tasks where one-nger is sucient to access all three 3DOF during translation and rotation. Scaling requires only a two- nger pinch gesture while provide non-uniform transformation in all three dimensions. HOMER-S pushes the idea of enabling intuitive 3D manipulation in handheld mixed reality further and aims on interaction beyond the (limited) screen dimensions by de- coupling the manipulation process from any touch input. It is based on the immersive 147 4. 3D MANIPULATION IN HANDHELD MIXED REALITY VR technique HOMER (see Section 2.3.1) and maps the handheld's pose, regarded as the virtual hand, onto the object upon selection. The 6DOF access is exploited for full 6DOF manipulation, as illustrated in Figure 4.1, and 3D non-uniform scaling. Compared to existing state-of-the-art methods, the novel techniques aim at improving intuitiveness and ease-of-use by reducing user touch input complexity and adapting real- world metaphors for object manipulation. 4.1 Methodological Approach There are dierent ndings in recent literature regarding DOF separation and integra- tion to improve intuitiveness and ease-of-use for object manipulation. [104] states that DOF integration does not necessarily mean that the performance for orientation tasks is increased. This, however, contradicts the ndings in [126]. The authors observe reduced interaction performance and user satisfaction for DOF integration for translation and rotation tasks. It is rather proposed to follow the structure of the input device than the task structure when designing the interaction technique. As the applied interaction device oers two dierent input structures, this proposition is the fundamental founda- tion of the two novel manipulation techniques. 3D Touch follows DOF separation by employing the 2D multi touch structure of the input device while HOMER-S aims at DOF integration re-using the 6DOF information of the device pose. 
4.1.1 Requirements & Prerequisites To achieve the research objective from Section 1.2, the same requirements as for 3D selection have to be met by the 3D manipulation technique, as specied in Section 3.1. For both techniques, prior objects selection using i.e. Mobile Raycasting or DrillSample is assumed. 4.1.2 Design Guidelines Based on the presented motivation and requirements, the following design guidelines were specied which were applied during algorithm development of both techniques: Keep Direct Touch Abilities The probably most appealing aspect of touch displays is the ability to directly "touch" an object in order to interact with it. We aim on preserving this ability and do not introduce any osets or non-direct gestures. Simplify Touch Input Since multi nger interaction requires prior knowledge for cor- rect usage of the touch gesture and can be hard to apply with only one hand, we aim to simplify touch gesture complexity for object manipulation. If necessary, we introduce degree-of-freedom separation to fulll this guideline as well as mode switches to perform RST operations. Furthermore, we aim at adapting real world metaphors for touchless object manipulation. 148 4.1 Methodological Approach 4.1.3 The 3D Touch Technique Following the design guidelines, the direct mapping between nger touches and virtual touch points is preserved in the proposed 3D Touch technique. According to [126], the separated structure of the input device is matched to the technique design by separat- ing integral 3D manipulations into 3DOF entities for rotation, scaling and translations (RST). A mode switch is employed to change between the three manipulation entities at run-time, as described in Section 4.1.5.1. To comply with the requirement of limited gesture complexity when manipulating the remaining DOFs, simple 2D multi-touch ma- nipulation gestures as in Hancock [72] are combined with the 6DOF device pose. Inspired by Reisman [103], the 2D screen coordinates of the touch input are transformed to 3D space. Thereby, 3DTouch is able to solely rely on one-nger (translate and rotate) or two-nger (scale) gestures to allow non-uniform scaling. In contrast to [143], one gesture for each 3DOF entity is sucient to enable non-uniform manipulations without requiring a manual switch to address each dimension. Thereby, our proposed approach results in a minimal set of necessary gestures, each having a low complexity. With the described methodology, our proposed approach features the seamless tran- sition between the dierent DOF subtasks to fulll each 3DOF manipulation task. To access all 3DOF of each RST task, neither an abstract switch, such as a button, nor applying a distinct gesture for each subtask is necessary. Since the user naturally chang- ing his viewpoint in a handheld mixed reality setup, the provided handhand's pose and resulting perspective onto the virtual objects can be seamlessly exploited to obtain the accessible DOF at a moment in time. In the following paragraphs, algorithmic details of the described manipulation process are given. Upon selection, the 6DOF pose of the selected object obj(R, T ) ∈ R3 is stored and the handheld's device pose pose(R, T ) ∈ R3 is continuously updated. 4.1.3.1 Translation 3D translations are performed using single touch inputs that are combined with the cur- rent pose(R, T ). 
First, at two moments in time t, consecutive touch points p(t1), p(t2) ∈ R2 are projected from 2D screen into 3D world space, as described in Section 3.3.2, but with a specific distance d, resulting in the 3D points P(t1), P(t2) ∈ R3. The distance d = ||pose(T) − obj(T)|| is obtained upon selection, where ||·|| denotes the Euclidean norm. Both points form the vector ~v(P(t2), P(t1)) that is subsequently normalized, denoted as v̂. To determine the current interaction dimension, the collinearity between v̂ and the normalized basis vectors of the target coordinate system in world coordinates, êi ∈ R3, i = x, y, z, is calculated by ci = êi · v̂. The basis vector êi with the highest resulting scalar |cmax| ∈ ci indicates the dimension êmax that is subsequently used for translation. The sign s of cmax determines the direction of the manipulation. Given the object's position obj(T), the manipulated position obj(T)′ is obtained by

obj(T)′ = obj(T) + (s · ||~v|| · êmax).    (4.1)

Subsequently, the 3D position of the selected object is adjusted. Figure 4.2 illustrates some example translations using the 3DTouch algorithm.

Figure 4.2: Examples of translations using 3DTouch: (a) x-axis, (b) y-axis, (c) z-axis.

Moving the finger right or left in Figure 4.2a causes a translation along the x-axis. Analogously, moving the finger up and down in Figure 4.2b, respectively 4.2c, results in a translation along the y- or z-axis.

4.1.3.2 Rotation

Similar to translations, 3D rotations are performed using single touch and the device pose.

Figure 4.3: Examples of rotations using 3DTouch: (a) x-axis, (b) y-axis, (c) z-axis.

The algorithm is based on the proposed translation algorithm and extended by the following steps. Instead of back-projecting the two touch points, a line perpendicular to line(p(t1), p(t2)) ∈ R2 is calculated and the two new points line(p⊥(t1), p⊥(t2)) are back-projected, resulting in P(t1), P(t2) ∈ R3. To calculate the angle of rotation, the factor fr is determined as described in Equation 4.2:

fr = (360 · s · ||~v|| · êmax) / U    (4.2)

Here, s is the scalar taken from the translation algorithm, indicating a positive or negative rotation, and ||~v|| regulates the angle as a fraction of the circumference U of the bounding sphere of the manipulated object. The factor fr is then applied to the current rotation in the object's local coordinate system. Figure 4.3 illustrates examples of the resulting 3D rotations using the proposed 3DTouch algorithm. Moving the finger up and down in Figure 4.3a causes a rotation around the x-axis. Analogously, moving the finger right or left in Figure 4.3b, respectively 4.3c, results in a rotation around the y- or z-axis.
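To make the preceding steps more concrete, a minimal code sketch of the translation case (Equation 4.1) is given below. It assumes the Unity engine as the runtime; all class, field and method names are illustrative placeholders and do not necessarily correspond to the actual prototype. The rotation case of Equation 4.2 differs only in that the perpendicular drag line is back-projected and the resulting magnitude is converted into an angle via the bounding-sphere circumference U.

using UnityEngine;

// Illustrative sketch of the 3DTouch translation step (Equation 4.1).
// Names and structure are placeholders; the thesis prototype may differ.
public class ThreeDTouchTranslation : MonoBehaviour
{
    public Camera cam;               // camera rendering the mixed reality view
    public Transform selectedObject; // object selected beforehand (e.g. via DrillSample)

    float d;                         // camera-to-object distance, stored upon selection
    Vector3 previousPoint;           // back-projected touch point of the previous frame
    bool hasPreviousPoint;

    void OnEnable()
    {
        d = Vector3.Distance(cam.transform.position, selectedObject.position);
        hasPreviousPoint = false;
    }

    void Update()
    {
        if (Input.touchCount != 1) { hasPreviousPoint = false; return; }

        // Back-project the 2D touch point into world space at distance d (cf. Section 3.3.2).
        Vector2 touch = Input.GetTouch(0).position;
        Vector3 point = cam.ScreenToWorldPoint(new Vector3(touch.x, touch.y, d));

        if (hasPreviousPoint)
        {
            Vector3 v = point - previousPoint;   // ~v(P(t2), P(t1))
            if (v.sqrMagnitude > 1e-8f)
            {
                // Collinearity of the normalized drag vector with the basis vectors;
                // the world axes are used here, the target's local axes could be substituted.
                Vector3[] e = { Vector3.right, Vector3.up, Vector3.forward };
                Vector3 vHat = v.normalized;
                int max = 0; float cMax = 0f;
                for (int i = 0; i < 3; i++)
                {
                    float c = Vector3.Dot(e[i], vHat);
                    if (Mathf.Abs(c) > Mathf.Abs(cMax)) { cMax = c; max = i; }
                }
                // Equation 4.1: translate along the dominant axis, signed by cMax.
                selectedObject.position += Mathf.Sign(cMax) * v.magnitude * e[max];
            }
        }
        previousPoint = point;
        hasPreviousPoint = true;
    }
}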
4.1.3.3 Scaling

The proposed algorithm supports non-uniform scaling. Therefore, a two-finger pinch-like gesture is used and applied with an adapted version of the proposed algorithm from Section 4.1.3.1. The touch points of both fingers pi ∈ R2 at two moments in time t(i) are back-projected into 3D, resulting in a set of points Pi(ti) ∈ R3, i = 1, 2. The sign of the scaling and its amount depend on the direction and magnitude of the pinch gesture: moving both fingers together results in negative scaling, moving them apart determines a positive scaling. The scaling factor fs ∈ R3 is then calculated as denoted in Equation 4.3, where ~v(ti) denotes the vector between the two back-projected finger points at time ti:

fs = (||~v(t2)|| − ||~v(t1)||) · êmax(t1)    (4.3)

Finally, the sign for scaling is determined by Equation 4.4 and fs is added to the current scale in the object's local coordinate system:

fs = { fs, if fs > 0;  fs · (−1), else }    (4.4)

4.1.4 The HOMER-S Technique

The mapping between touch input and object manipulation of 3DTouch is straightforward and simple. However, the touch abstraction layer still exists, and manipulation is limited to the screen size of the handheld device. Therefore, the novel HOMER-S technique is introduced, which integrates all 6DOF of a translation and rotation task by directly mapping the handheld's pose onto the selected object. Scaling, as a spatial non-rigid transformation, is designed as a separate 3DOF task and re-uses the device's position information for non-uniform object manipulation. Thereby, real-world metaphors for translation, rotation and scaling are imitated, touch input during manipulation is eliminated, and the interaction space is extended to the user's physical space.

4.1.4.1 6DOF Manipulations

The proposed technique HOMER-S was designed inspired by [66] and using the immersive 3D method HOMER [24] as its foundation. The original HOMER algorithm uses the 6DOF pose of the user's torso and that of the interaction device to manipulate an object. Since a handheld setup features different characteristics, we adapted HOMER to be applicable in handheld mixed reality environments using a tablet or smartphone (hence HOMER-S). The full 6DOF manipulation of HOMER-S is depicted in Figure 4.4.

Figure 4.4: 6DOF translation and rotation using HOMER-S.

Rotations of the selected object around arbitrary axes are controlled independently. An isomorphic mapping between the handheld's orientation and the virtual hand is applied to rotate an object around the hit point that serves as the pivot point. Thereby, the physical movement and rotation of the mobile device directly influence the transformation of the selected object. By performing Mobile Raycasting, the object is released and the virtual hand moves back to the handheld's position. The proposed HOMER-S algorithm is summarized in Algorithm 4.1.

4.1.4.2 Scaling

To scale an object, the virtual hand's position ~p(vh) ∈ R3 is used. At each frame at time t, ∆p ∈ R3 is obtained as described by

∆p = ~pt(vh) − ~pt−1(vh).    (4.5)

∆p is subsequently mapped onto the selected object O to update its scale ~s(O) ∈ R3 in a frame-wise manner, as described by

~s(O) = s · (∆p + ~s(O)),    (4.6)

where the scalar s denotes a scaling factor that controls the amount of the frame-wise scaling and that can be adjusted to the specific application requirements. Thus, moving the virtual hand in the positive direction of an axis scales the object up along that axis; moving it in the negative direction scales it down. Thereby, a straightforward non-uniform scaling along all axes is achieved.
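As an illustration of Equations 4.5 and 4.6, the following minimal sketch applies the frame-wise scaling update to a selected object. It again assumes a Unity-style scene representation; the identifiers are placeholders rather than the names used in the actual implementation.

using UnityEngine;

// Illustrative sketch of the HOMER-S scaling update (Equations 4.5 and 4.6).
public class HomerSScaling : MonoBehaviour
{
    public Transform virtualHand;    // follows the tracked 6DOF pose of the handheld
    public Transform selectedObject; // object selected beforehand
    public float s = 1.0f;           // application-specific scaling gain

    Vector3 previousHandPosition;

    void OnEnable()
    {
        previousHandPosition = virtualHand.position;
    }

    void Update()
    {
        // Equation 4.5: frame-wise displacement of the virtual hand.
        Vector3 deltaP = virtualHand.position - previousHandPosition;
        previousHandPosition = virtualHand.position;

        // Equation 4.6: add the displacement to the current scale, weighted by s.
        // With s = 1 this reduces to a purely additive, per-axis change of the scale vector.
        selectedObject.localScale = s * (deltaP + selectedObject.localScale);
    }
}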
Algorithm 4.1: Algorithm of 6DOF manipulation with HOMER-S in pseudo code.

Data: handheld's 6DOF pose (hp, ho ∈ R3)
Init VirtualHand (vhp, vho ← 0 ∈ R3);
Init Object (Op, Oo ← 0 ∈ R3);
Set uponSelection ← false;
while (hp & ho) = true do
    vhp ← hp;
    vho ← ho;
    if O is selected then
        if uponSelection = false then
            vhsel ← hp;
            uponSelection ← true;
        end
        // a rotation is performed
        Oo ← ho;
        // a translation is performed
        vhcurr ← hp;
        calculate distance: d(vhcurr, vhsel);
        normalize vector: ~vnorm ← ~v(vhsel, vhcurr);
        set: vhp ← vhsel + d · ~vnorm;
    else
        uponSelection ← false;
    end
end

4.1.5 Assistance Design

To allow changing the manipulation task at run-time as well as to support the user with visual feedback about the currently accessible axes for interaction, the following assistance modalities have been incorporated into both manipulation techniques.

4.1.5.1 Mode Switches

Since 3DTouch offers RST by decomposing each transformation into a separate 3DOF task, mode switches between the manipulation entities are required. This is realized through a simple button interface, as illustrated in Figure 4.5a. The mode switch introduces an additional input modality compared to previous work [103, 143]. However, as reported in the literature [126, 104], DOF separation of the manipulation task leads to better results than trying to use the separated DOFs of a multi-touch display in an integral way, as demonstrated in [103]. Thereby, the additional input modality can be compensated by enhanced performance and ease of use.

When using HOMER-S, no DOF separation is required for the integral task of translating and rotating an object. However, to provide the same structure for the later evaluation, translation and rotation can also be performed as separate manipulation entities (see Figure 4.5b). HOMER-S takes advantage of exploiting real-world metaphors for translating, rotating and scaling an object in space. However, the metaphors for translating and scaling are akin in movement and hence hard to distinguish if only the device pose is examined. Instead of introducing another, more complex metaphor for scaling, a mode switch between the 6DOF manipulation task and non-uniform scaling is proposed. For this, the simple button interface is applied, as described for 3DTouch.

Figure 4.5: Floating GUIs of both techniques upon selection: (a) 3DTouch manipulation switches, (b) HOMER-S manipulation switches.

To summarize, HOMER-S provides the following manipulation entities: (1) translation, (2) rotation, (3) translation & rotation (6DOF), and (4) scaling.

4.1.5.2 Supporting Visualization

To increase the ease of use during object manipulation, supportive information is provided to users. 3DTouch and HOMER-S draw axes during translation and scaling, as well as gimbals during rotation, to visualize the accessible interaction axes and angles according to the current device pose, as illustrated in Figure 4.6.

Figure 4.6: Supporting visualization depending on manipulation task and currently accessible interaction axes: (a) 3DTouch: translations are currently performed in the x/y-plane; (b) HOMER-S: during 6DOF manipulation, all axes and angles are accessible.

4.1.6 Crucial Aspects

The nature of the proposed techniques offers intuitive handling of 3D manipulation tasks but introduces some crucial aspects as well.

Loss of Tracking Since both methods are designed for handheld mixed reality setups, a valid device pose is required. Loss of tracking thus results in malfunction of object manipulation.
Currently optical tracking is proposed to estimate the device pose due to its accuracy, low latency and non-drift characteristics (see Section 2.1.1.1 in Part II). To increase the tracking robustness, fusing of optical tracking data with the measurements of the handheld device's built-in inertial measurement unit can be applied. Thereby, a complete loss of tracking can be omitted in case of (temporary) occlusions or inconsistent light situations. Rotation When performing rotational tasks with HOMER-S, a drawback is caused by the direct mapping of the device orientation onto the selected object. Given the implicit binding of input- and output device in handheld mixed reality, rotations around the pitch-axis are limited. This is especially true as soon as only one physical feature is used for optical tracking. 360◦ rotations around the yaw-axis can be applied by real world movements of the user, while rotations around the roll-axis are straightforward and employ the steering wheel metaphor. 4.2 Performance Studies For a comprehensive evaluation of the two proposed manipulation techniques, a sum- mative evaluation was conducted across four dierent manipulation scenarios based on variations of the employed interaction tasks. 4.2.1 Prerequisites Object Selection The interaction task Object Selection was not reviewed within the following study. However, selection is required for subsequent manipulation. Therefore, Mobile Raycasting from Section 3.3.2 is employed across the four test scenarios. Baseline Technique The immersive 3D manipulation techniques presented in Sec- tion 2.3.1 are not originally designed for handheld mixed reality setups and require sep- arate tracking of the user's head and the input device. This directly conicts with the Single I/O Device requirement and can therefore not be applied without further adap- tion. However, any adaption needs to be carefully reviewed to ensure that the original 155 4. 3D MANIPULATION IN HANDHELD MIXED REALITY characteristics of the technique remain. As described in Section 3.4.1, the adaption to use Go-Go with a 3D multi-touch device clearly alters the original non-linear mapping approach and thus is not applicable as baseline technique. Multi-touch techniques such as [108, 143] use DOF-separation to obtain manipulation tasks with reduced degree- of-freedom. These subtasks are then performed with specic 2D multi-touch gestures. However, [108] only enables 3D translations. Although [143] provides gestures for RST manipulation, the necessary multi-touch gestures are inconsistent. For instance, a verti- cal movement of two ngers causes a translation along the y-axis but for scaling a change along the z-axis. These inconsistencies in combination with the required prior knowledge of the underlying multi-touch gestures do not allow for a valid comparison. In [103], RST operations are provided by employing two-handed three-nger gestures. However, using two handed multi-touch input violates the requirement of Limited Gesture Complexity, resulting in a dicult applicability for the one-handed handheld setup. As the existing techniques do not apply for a clean and fair performance evaluation of manipulation techniques in one-handed handheld mixed reality, 3DTouch and HOMER-S are compared within the following study. Thereby, the characteristics of DOF-seperation in contrast to DOF-integration according to the interaction task can be robustly evalu- ated. 
4.2.2 Objectives The main goal of the experiment was to evaluate the performance and usability of 3DTouch and HOMER-S. Since 3DTouch matches the separated structure of the multi- touch input device and HOMER-S adapts real-world metaphors by applying the integral structure of the given device pose, both techniques apply for straightforward manipula- tion. Thus, a second objective was to compare both techniques and to examine intuitive handling. In designing the experiment, the following hypotheses were formulated: H1 3DTouch and HOMER-S are both designed to provide intuitive manipulation. Thus, both techniques will perform similar in terms of speed and ease-of-use for 3DOF tasks. H2 Since HOMER-S oers full 6DOF manipulation, it will perform considerably faster than 3DTouch for compound translation and rotation tasks. H3 Touch gestures enable a higher precision than free movements in 3D. Thus, 3DTouch performs better for ne manipulation tasks that require precise input. H4 Regarding prior knowledge, users with experience using multi-touch devices will perform equally or better with 3DTouch than with HOMER-S. Likewise, the de- sign of HOMER-S enables better performance for users with no prior multi-touch knowledge. 156 4.2 Performance Studies 4.2.3 Experimental Design and Procedure We conducted the study using a 2x4 within-subjects factorial design where the indepen- dent variables are manipulation technique and task scenario. In a second order evaluation, user experience was the third independent variable. The manipulation techniques were 3DTouch and HOMER-S, while the scenarios included four dierent experimental tasks with varying types of canonical manipulation tasks and combinations. The dependent variables were Task Completion Time and Number of Interaction Steps. Task completion time represents the time it takes to successfully nish a specic scenario while number of interaction steps comprises the amount of necessary mode switches to successfully nish an (compound) manipulation task. Furthermore, we measured user preferences for both techniques in terms of speed, accuracy, and ease of use. The user study was analogously designed to the Selection Study from Section 3.4. Thus, the same procedure as in Figure 3.6 was applied. The material of the study is presented in Appendix VI.A. At the beginning of the study, each participant was asked to read and sign a standard consent form as well as to complete the pre-questionnaire from Table 3.1. No. Question Q1 How adequate do you feel the time allotted for practice was? Q2 How comfortable were you with using a smartphone for task completion? Q3 How would you rate the 3DTouch manipulation technique in usability? Speed? Accuracy? Q4 How would you rate the HOMER-S manipulation technique in usability? Speed? Accuracy? Q5 How would you rate intuitiveness of 3DTouch for 2D-translate, 3D-translate, rotate, move & rotate, scale an object? Q6 How would you rate intuitiveness of HOMER-S for 2D-translate, 3D-translate, rotate, move & rotate, scale an object? Q7 Which manipulation technique do you prefer to 2D-translate, 3D-translate, rotate, move & rotate, scale an object. Q8 Rank the two manipulation techniques in order of desired use (with 1 being the most desired). Q9 When determining how much you like using a manipulation technique, how important in inuence on your decision was ease-of-use? Speed? Accuracy? 
Table 4.1: Post-Questionnaire Upon completion, the participant was given a detailed description of the practical part about "Manipulation in handheld Mixed Reality". A tutor coached them on how to use the handheld device and how to perform 3D manipulation in a test environment. Afterwards, each participant had ve minutes time to practice both techniques. Once they started the study, they were not interrupted or given any help. Upon completion 157 4. 3D MANIPULATION IN HANDHELD MIXED REALITY of the practical part, they were asked to ll out a post-questionnaire (see Table 4.1). It took approximately 25 minutes for each participant to nish the user study. All 28 participants yielded successful simulation trials from which all data was used for analysis. 4.2.4 Subjects & Apparatus Of the 28 participants ranging from 23 to 38 years, 12 were female and 16 male. 12 participants stated not to have any mobile 3D gaming experience at all, while 7 reported no experience with multi-touch smartphones. Table 4.2 gives on overview of users based on their prior experience. Group Inexperienced Experienced a) Mobile 3D Gaming 12 16 b) Smartphone 7 21 Table 4.2: Users grouped by prior experience All computations  tracking, rendering, selection and manipulation of virtual ob- jects  were performed on a smartphone using Android OS; more details are given in Section 3.4.5. 4.2.5 Test Scenarios We built four dierent scenarios to simulate typical 3D manipulation situations. Accord- ing to [62], the basic canonical tasks position, rotation and scaling were used to design the four test tasks of varying complexity. To manually identify the desired object for subsequent manipulation, another canonical task selection is required. Since the selec- tion task is performed by all users in the same way and is equally designed over all four tasks, the necessary time does not inuence the performance metrics. All scenarios are based on the same virtual working ground (black & white textured plane) that was printed to paper at 56x40cm and acted as a visual planar marker for the natural feature tracking toolkit [163]. The 28 participants completed the four scenarios in a random order. Each scenario featured a simple description of the upcoming task. Before starting the actual tests, users could inspect the scenario without being able to interact with in order to understand the task according to its description. The four scenarios are depicted in Figure 4.7 and are dened in the following. 4.2.5.1 Positioning on a Plane The rst task comprises the canonical task positioning. The user was challenged to translate a pink cube in the lower left corner to the center of a green area in the upper right corner, as depicted in Figure 4.7a. The distance between the targeted object and its destination was 35cm on the horizontal plane. It was sucient to complete the task with the cube partly overlapping the designated target. 158 4.2 Performance Studies (a) Scenario 1 (b) Scenario 2 (c) Scenario 3 (d) Scenario 4 Figure 4.7: The three test scenarios of the performance user study. 4.2.5.2 Positioning in 3D Space The second task extends the rst scenario by requiring positioning in all three dimensions. The user was challenged to translate a pink cube in the lower left corner on top of a small tower in the upper right corner (see Figure 4.7b). The distance between the targeted object and its destination was 35cm on the horizontal plane and 20cm vertically. The destination area was again a square. 
If it was partly overlapped by the target object, the task was completed. 4.2.5.3 Positioning & Rotation in 3D Space For better simulation of manipulation requirements in mixed reality applications, we applied an integral task design for the third scenario comprising a combination of posi- tioning and rotation. The user was challenged to rotate a red barrel in the lower left corner by 45◦ around its vertical axis and translate it on top of an inclined plane (see Figure 4.7c). From there the barrel was supposed to roll down the plane and over a square at its bottom. The test was successfully completed if the barrel was let loose on the top of the inclined plane rolling down its full length and at least partly hitting the center of the destination area. 4.2.5.4 Non-Uniform Scaling & Positioning in 3D Space A second integral task was designed for the fourth scenario. Here, the user was rst challenged to scale a blue cube by a fth in length and a third in width of its original 159 4. 3D MANIPULATION IN HANDHELD MIXED REALITY size and then move the cube into a glass positioned at the center of the scene (see Figure 4.7d). The distance between the targeted object and its destination was 38cm horizontally and 10cm vertically. The destination was the circular shaped bottom of the glass. Users needed to let the cube fall into the glass from above and as soon as it hit the bottom, the task was completed. 4.3 Experimental Results Based on the performance study, we conducted an evaluation on the quantitative data to examine performance of the two techniques and a subjective evaluation regarding user's preferences and feedback. 4.3.1 Quantitative Evaluation The quantitative data gathered from the questionnaires and automatically collected by the test application was analyzed with Friedman's χ2 test1 and repeated measures single factor ANOVA accordingly on both Task Completion Time and Number of Interaction Steps (see Section 4.2.3) as well as for each scenario (see Section 4.2.5). We focused on three dierent aspects during data analysis: 1. Data of all participants regarding the manipulation techniques is evaluated. 2. The techniques' performance was analyzed depending on tasks. 3. Data of selected participants - according to the user experience listed in Table 4.2 - was analyzed for each manipulation technique and task separately. 4.3.1.1 Performance Evaluation Analyzing the overall mean completion time, no signicant dierence was found between HOMER-S and 3DTouch (F1,27 = 0.00299, p = 0.957), as illustrated in Figure 4.8. When inspecting the mean completion time for each task separately, again no signicant dierences could be found for both positioning tasks 1 (Positioning on a Plane) and 2 (Positioning in 3D Space) at (F1,27 = 1.4, p = 0.2468) and (F1,27 = 0.814, p = 0.375), respectively. However, task 3 (Positioning & Rotation) was performed signicantly faster with HOMER-S (F1,27 = 7.379, p < 0.0114). In contrast to that, HOMER-S took signicantly more time to complete task 4 (Scaling & Positioning) at (F1,27 = 7.379, p < 0.0114), as illustrated in Figure 4.9. Analyzing the task completion time, grouped by users' knowledge according to Ta- ble 4.2 revealed no further signicant dierences other than the overall ones illustrated in Figure 4.9. No signicant dierences could be found for both positioning tasks when analyzing the users' experience. For task 3 (Positioning & Rotation), the signicantly better performance of HOMER-S was never independent of the users' experience. 
The 1Since the degree-of-freedom is k = 2 for this analysis, we denote χ2 k−1 = χ2 1 as χ2 in the following. 160 4.3 Experimental Results inexperienced users of the mobile gamer group (a) as well as of the smartphone group (b) performed signicantly faster with HOMER-S than with 3DTouch. The experienced users of both groups performed faster with HOMER-S as well, but not signicantly. Fur- thermore, only the experienced groups of a) and b) had signicant results for task 4 (Scaling & Positioning), since they were signicantly faster using 3DTouch. No signi- cant dierence in performance between 3DTouch and HOMER-S could be found for the inexperienced users of both groups in task 4 (Scaling & Positioning). Figure 4.8: Mean completion time and mean number of interaction steps. Figure 4.9: Mean completion time per task. The results of the evaluation for the overall mean number of interaction steps exposed that 3DTouch enabled users to perform manipulations in signicantly less steps than HOMER-S (F1,27 = 4.552, p < 0.0421), as illustrated in Figure 4.8. However, the evaluation of the number of interaction steps per tasks found only a signicant dierence in task 2 (Positioning in 3D Space) at (F1,27 = 4.374, p < 0.046) and in task 4 (Scaling & Positioning) at (F1,27 = 12.81, p < 0.0013), both in favor of 3DTouch. Figure 4.10 indicates no signicant dierence for both task 1 (Positioning on a Plane) or task 3 (Positioning & Rotation), both with (F1,27 = 0.685, p < 0.415). 161 4. 3D MANIPULATION IN HANDHELD MIXED REALITY The evaluation of the mean number of interaction steps, grouped by users' experience, revealed with one exception for task 3, no deviant results than those illustrated in Figure 4.10. Figure 4.10: Mean number of interaction steps per task. The signicantly better performance of 3DTouch in task 2 (Positioning in 3D Space) could only be conrmed for the experienced users in a) and b). For task 3 (Positioning & Rotation), the inexperienced group of a) achieved signicantly better results with HOMER-S than with 3DTouch. For all other groups no signicance could be found for that task. For task 4 (Scaling & Positioning), only the experienced users of both groups had signicantly better results with 3DTouch than with HOMER-S. No signicant dierence could be found for the inexperienced users of both groups. 4.3.2 Subjective Evaluation When answering the questions Q1-Q6 and Q9, users were able to choose from a 7-point Likert scale [2]. Figure 4.11: Users' average rating of Q3 & Q4. 162 4.3 Experimental Results While all questions feature the highest rating at seven, and the lowest at one, Q1 states the best rating with four (appropriate). Our participants found the time allotted for practice appropriate (µ = 4 and σ = 0.46 at α = 0.05). Using a smartphone to complete the dierent tasks was rated to be moderately comfortable (µ = 5.9 and σ = 1.14 at α = 0.05). As illustrated in Figure 4.11, the questions Q3 and Q4 revealed both to be average or good, but 3DTouch was rated signicantly better for ease-of-use and accuracy with (χ2 = 6.55, p < 0.0105) and (χ2 = 15.696, p < 0.0000744) respectively. In terms of speed, no dierence was conrmable. Analyzing the subjective evaluation of ease-of-use, speed and accuracy, grouped by the user's experience, revealed signicantly better ratings of 3DTouch in ease-of-use only for experienced users of a) and b). 
3DTouch's better rating for accuracy was indepen- dent of the users experience in a) und b) except for inexperienced users in b) where no signicant dierence occurred. Users' ranking of the two interaction techniques indicated no signicant preference (Q8) (χ2 = 0.57, p = 0.45). Figure 4.12: Users' preferences given Q7. A closer inspection of the users' preferences grouped by individual manipulations revealed that for 2D- and 3D translation as well as rotation alone no signicant dier- ence in preferences could be found, as shown in Figure 4.12. For the integral 6DOF manipulation of task 3 (Positioning & Rotation), HOMER-S is signicantly preferred with (χ2 = 10.67, p < 0.0011). For scaling, 3DTouch is signicantly preferred with (χ2 = 12.57, p < 0.00039). This subjective evaluation reects the results of the quanti- tative evaluation in terms on completion time. No deviant results for 2D- and 3D-translation as well as rotation alone were revealed, when analyzing the ranking of each manipulation, grouped by the users' experience. The users' preference of both groups for the integral rotation and translation task 3 (Position- ing & Rotation) revealed that HOMER-S was signicantly preferred by the experienced users. Also the inexperienced users preferred HOMER-S, but not signicantly. 3DTouch's preference for scaling remains independent of user's experience in both groups. Question Q9 inquiring the users' inuence on their decision for questions Q3 and Q4 163 4. 3D MANIPULATION IN HANDHELD MIXED REALITY yields with (χ2 = 3.89, p < 0.143) no signicant dierence for the three options ease- of-use, speed and accuracy. Users stated all aspects of Q9 similarly important, ranging from µ = 5.5 (slightly important) to µ = 6.18 (important). 4.4 Discussion We designed the experiment to compare two dierent techniques for performing 3D ma- nipulation tasks with a multi-touch handheld device. While 3DTouch separates the DOFs of the task to improve performance as shown in previous work [126], HOMER-S controls 6DOF in an integral way and takes advantage of simulating real-world metaphors. Results show that for both techniques, no signicant dierence was found for over- all mean task completion time, completion time for the positioning tasks, overall user preference or user preferences regarding the positioning tasks that support hypothesis H1. Inspecting performance and user's preference for compound canonical tasks, two ndings can be stated. First, for 6DOF manipulation tasks, as simulated by task 3 (Po- sitioning & Rotation), HOMER-S performed signicantly faster than 3DTouch. This quantitative evaluation is supported by the user's subjective feedback. HOMER-S is signicantly preferred for translation and rotation tasks by users as expressed in Q7. These ndings support H2 and indicate the strength of the integral design of HOMER-S for compound canonical 6DOF tasks. This is also reected by users' comments who described HOMER-S to be natural, of "more direct contact" and fun. Thus, these real world metaphors tend to be very intuitive and straightforward. The second nding when inspecting performance and user's preference for composite manipulation tasks reveals the strength of 3DTouch for scaling tasks. It took considerably less time to complete task 4 (Scaling & Positioning) using 3DTouch than with HOMER-S. Furthermore, users signicantly preferred 3DTouch for scaling. 
Since no signicant dierence was found regarding the positioning tasks in completion time or user preferences, positioning can be neglected when evaluating task 4. This nding supports H3, since the scaling tasks required very ne manipulation in all three dimensions. H3 can further be backed up by the signicant fewer number of interaction steps 3DTouch needed in task 2 (Positioning in 3D Space) and task 4 (Scaling & Positioning). Furthermore, the users' rating in Q3 & Q4 attested it a better accuracy. Besides the assumption, that humans are able to control their ngers more precisely, the underlying metaphor can be another conceivable reason to further explain the un- derperformance of HOMER-S in scaling tasks. In the real world, usually two hands are involved to expand or shrink an object. Since HOMER-S only provides one virtual hand to simulate one real hand, this metaphor could not be adapted in a direct way. Thereby, a direct mapping could not be provided that limits HOMER-S straightforward usage for scaling. However, the pinch-like gesture to scale an object using 3DTouch is also not com- pletely intuitive and straightforward. Since, more than half of our test group classied themselves as experienced mobile 3D gamers, they are familiar with using multi-touch for interaction; standard touch gestures such as the pinch-out and in are known and well trained. This is also backed up by the results including user experience. There, the 164 4.4 Discussion results of 3DTouch for scaling are only signicantly better for users who are experienced with smartphones or mobile 3D gaming. Studying further details regarding user experience leads to H4. We proposed that prior touch knowledge would result in equal or better performance of 3DTouch com- pared to HOMER-S, while inexperienced users would perform better with HOMER-S due to its integral 6DOF design and adaption of real-world metaphors. For many re- sults of the study, this is true. Regarding completion time, no signicant dierences between 3DTouch and HOMER-S could be found for positioning when analyzing expe- rienced users. For 3D positioning, experienced users needed signicantly less interaction steps when using 3DTouch. For integral positioning and rotation, experienced users of both groups performed faster with HOMER-S, but not signicantly. Experienced users performed signicantly faster for scaling in terms of completion time and number of in- teraction steps when using 3DTouch. They rated 3DTouch signicantly better in terms of ease-of-use, but signicantly preferred HOMER-S for 6DOF manipulation. Regarding inexperienced users, H4 can be further backed up by the signicant better performance in terms of completion time and number of interaction steps for task 3 (Positioning & Rotation) using HOMER-S. Users' comments reect the quantitative results. Most users, especially the inexperienced, reported to have quickly familiarized with HOMER-S for any translations and rotations. However, exceptions when evaluating H4 could be found, too. The quantitative results do not indicate a better performance of inexperienced users using HOMER-S for positioning tasks. For scaling, HOMER-S did not result in better performance of the inexperienced users. However, despite of the good results of 3DTouch for scaling, inexperienced users did not signicantly perform better using 3DTouch for scaling. The underlying two- ngers pinch gesture requires prior knowledge and thus, is not as straightforward and direct than the one-nger inputs for translate and rotate. 
But users' preference of 3DTouch's for scaling is independent of the users' experience. This is also reected by users' comments. Some users experienced HOMER-S as being "too direct", since even small movements of the mobile device result in a transformation. Most users complained about HOMER-S being unintuitive to use for scaling. Based on these observations, we cannot draw a clear conclusion to support H4. Further research needs to be performed for a detailed evaluation of this hypothesis. Based on these results and ndings, we come to the following ultimate conclusions that can further act as basic design guidelines: ˆ Both methods provide intuitive manipulation with similar performance when the canonical tasks Positioning and Rotation are performed. ˆ HOMER-S outpaces 3DTouch in performance and ease-of-use when performing a compound, full 6DOF positioning and rotation tasks. ˆ 3DTouch is the better choice, if scaling is involved in the manipulation task. 165 Chapter 5 Summary In this part, three novel 3D interaction techniques were introduced for selection and ma- nipulation of 3D objects, all aiming on intuitive and straightforward 3D interaction in one-handed handheld mixed reality environments. With these results for object selection and manipulation, our research objectives from Section 1.2 are achieved. Using the imprecise nger touch input for object selection yields the inaccurate extrac- tion of small objects, especially when they are partly or fully occluded or surrounded by highly similar virtual scene objects. State-of-the-art approaches mostly propose two- handed techniques to increase selection accuracy, which is not applicable in the given interaction scenario. Furthermore, existing approaches do not provide sucient con- textual information upon object indication to precisely select a desired object amongst visually similar ones. To overcome the limitations, the novel technique DrillSample was developed with a major design focus on precise selection of objects in dense virtual scenes while reducing necessary 2D multi-touch input. DrillSample only requires one-nger tap gestures as input and splits the selection procedure into two steps. For object indication, Raycasting is employed that indicates all casted scene objects for later selection. In case of multi-object indication, their full 3D spatial context is preserved upon object indica- tion allowing for disambiguation and precise selection of occluded objects or objects with high similarity in visual appearance. By employing a one-nger tap gesture, the desired object is selected within this renement step. The possibly imprecise object indication is thereby compensated by the optional second renement step. For a comprehensive evaluation of the DrillSample selection technique, a quantitative and qualitative evalua- tion was conducted by comparing DrillSample with the two baseline techniques Mobile Raycasting and Expand across three dierent selection scenarios based on variations of object density and visibility. The study clearly revealed the strengths of DrillSample in precise selection of objects within close range in dense virtual scenes. To select small and distant objects, Expand was found more sucient as it applies a volumetric object casting. While Raycasting remains a good alternative for selecting visible objects in a sparse scene, DrillSample was found the best general purpose method for visible as well as partly and fully occluded objects, independent of their visual appearance. 167 5. 
SUMMARY To provide 3D manipulations using 2D multi-touch, existing approaches usually use com- plex nger and hand gestures that are dicult or impossible to apply in a one-handed handheld interaction scenario. Furthermore, their application lowers intuitive handling since the complex gestures require prior knowledge. To address the limitations of exist- ing 3D manipulation techniques for handheld mixed reality environments, the two novel methods 3DTouch and HOMER-S are presented which both support translation, rotation and scaling as 3DOFs manipulation tasks. 3DTouch provides 3D translation and rotation as well as non-uniform scaling by fusing simple one- or two-nger touch input with the handheld's current 6DOF pose. The integral 6DOF manipulation is decomposed into two separate tasks, enabling one nger to be sucient to access all three DOFs during translation and rotation. Scaling requires a two-nger pinch gesture while providing non- uniform transformation in all three dimensions. HOMER-S provides interaction beyond the (limited) screen dimensions by decoupling the manipulation process from any touch input. It aims at DOF-integration and maps the 6DOF device pose onto the object upon selection. Thereby, full 6DOF manipulation as well as non-uniform scaling is performed by employing real-world metaphors that are intuitive to use. In a comprehensive user study, performance, accuracy and ease of use for both techniques were assessed across four dierent test scenarios with varying manipulation tasks. The results reveal both techniques to be intuitive to translate and rotate objects. HOMER-S lacks accuracy compared to 3DTouch but achieves a signicant performance increase in terms of speed for full 6DOF manipulation. 168 PART IV Creating Mixed Reality Environments 1 Introduction 171 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 2 Background & Related Work 173 2.1 Key Elements of a Mixed Reality Framework . . . . . . . . . . . . . . . 173 2.2 Application Development & Scene Management . . . . . . . . . . . . . . 174 3 Framework Architecture 177 3.1 Base Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 3.2 Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 3.3 Application Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 3.4 Workow for Application Development . . . . . . . . . . . . . . . . . . . 187 4 Developed Mixed Reality Environments 189 4.1 Test Setups & Environment . . . . . . . . . . . . . . . . . . . . . . . . . 189 4.2 Non-Immersive Mixed Reality . . . . . . . . . . . . . . . . . . . . . . . . 190 4.3 Combined Non- & Semi-Immersive Mixed Reality . . . . . . . . . . . . . 192 4.4 Combined Semi- & Full Immersive Mixed Reality . . . . . . . . . . . . . 192 5 Summary 195 169 Chapter 1 Introduction To create a compelling mixed reality environment, tracking and interaction are two key components, as it was extensively described and studied within the previous chapters of this thesis. A crucial factor to enable mixed reality for broad (everyday) usage is quick application prototyping and development. Figure 1.1: Key components of mixed reality, with the contributions marked bold. 171 1. INTRODUCTION Application development, however, requires knowledge in all involved sub-domains, as depicted in Figure 1.1. 
This comprises tracking, interaction, scene authoring, 3D visu- alization and, optionally, network handling for distribution. For each component, a large variety of technologies, methods and algorithms exists, as for tracking and interaction described in the Chapters II.1 and III.1. This necessary knowledge results in a high entry threshold to create mixed reality applications, even for quick prototyping. 1.1 Motivation To lower the entry threshold of application development and thereby, to leverage mixed reality technology for a broader everyday usage, an inexpensive toolkit is required that serves a powerful graphical interface to easy access and to author the modules visual- ization, tracking, interaction and distribution. Furthermore, to be able to employ such a framework for our performed research to develop test applications, it must provide interfaces to extend the framework with novel software techniques and it has to support state-of-the-art mobile devices running Android for handheld mixed reality application development. However, at the moment of investigating mixed reality frameworks, there were no inexpensive toolkits available that served the describes features and properties. This technological gap fostered the development of a cost-ecient software framework that enables quick prototyping of collaborative and distributed mixed reality environ- ments. As existing toolkits and approaches have drawbacks regarding costs, usability, exibility and extensibility, the implemented framework can act as foundation to further foster the simplication of application development and thereby the pervasiveness of mixed reality in general. Therefore, the proposed framework concludes the contributions of this thesis. 1.2 Organization This part is organized as follows. After an overview over related frameworks is given in Chapter IV.2, the design approach of the proposed framework is described in Chap- ter IV.3. In Chapter IV.4, examples of applications that have been developed with the proposed framework are presented and a summary is given in Chapter IV.5. 172 Chapter 2 Background & Related Work Developing and authoring mixed reality applications requires a lightweight and exible but still powerful hard- and software framework, which is expendable to easily integrate new devices and technologies. Ideally, it supports diverse input and output devices, high quality real-time rendering, physics support, networking and scene management to build rich 3D applications. 2.1 Key Elements of a Mixed Reality Framework A wide variety of hardware and software setups has been built in the past and all share a common general system architecture [53] that is illustrated by the modules depicted in Figure 2.1. Figure 2.1: Mixed Reality system architecture. The depicted general architecture can be applied to create non-immersive to fully 173 2. BACKGROUND & RELATED WORK immersive mixed reality applications. Non immersive systems include 2D (multi)-screen setups, such as a desktop environment, where the user usually sits or stands in front of the screen interacting with a stationary input device, i.e. a joystick or 3D mouse. Semi- immersive scenarios employ a stereo projection with shutter glasses while the user's head and interaction device is tracked in space. Fully immersive setups are provided by using a multi-screen CAVE projection setup [16] with shutter glasses or by using head mounted displays for visualization. 
Again, the user's head and its interaction device (or the entire body) are tracked for visualization and interaction [83]. The hardware components (gray) of a mixed reality framework comprise input and output devices and a computing platform (e.g. workstation, mobile device) for device communication with a powerful graphics processor for 3D scene rendering. The software modules (green) of the middleware handle the tracking data, perform the 3D visual- ization and provide networking to allow a client-server based framework for single or multi-users. The middleware components communicate with the application layer that provides 2D and 3D graphical user interfaces (GUI), 3D interaction techniques (3DIT), 3D scene elements and layout as well as application specic behavior. The spatial posi- tion and orientation of the input and output devices might be tracked to apply 6DOF pose estimation. Tracking data of these devices is received by the computing platform and handed over to the framework's tracking middleware. The middleware processes, merges and transforms the input data to provide it in a consistent data format for sub- sequent usage within the application. Using this input data, 3D interaction techniques can be provided to the user by employing an event handling mechanism. Subsequently, the virtual scene is visualized to the user on its output device using the rendering en- gine. As visualization, tracking and interaction are fundamental components of a mixed reality application, multi-user support as well as 3D scene distribution are optional as- sets to allow for collaborative and distributed mixed reality setups. In such a case, the framework's networking and session module handles the connections of all users within the network and controls the communication amongst them to ensure correct event and scene synchronization. 2.2 Application Development & Scene Management Since the mid-1990s, a number of mixed reality frameworks have been developed and a variety of systems supporting distributed applications emerged [43]. They mostly pro- vide the integral components of a mixed reality application in a integrated development environment (IDE) to simplify application development and presume programming know- how. To further ease application prototyping and to provide a clear representation of the rendered virtual scene, 3D object management and scene authoring is advisable using a graphical user interface. Most of the high level programming toolkits are based on scene graph libraries, for example open source toolkits such as Studierstube [52], VR Juggler [42], Avango [92] or commercial ones like 3DVIA Virtools [168] and provide a complete framework for developing mixed reality applications. Studierstube is an ap- plication framework for collaborative augmented reality and incorporates all necessary 174 2.2 Application Development & Scene Management functionality such as scene graph rendering, networking, window management and sup- port for input devices. It oers tracking of multiple input devices that are congured using XML les and allows multiple users that are embedded as nodes in the scene graph. While this C++ based framework is very powerful, it has several drawbacks re- garding ease-of-use for application prototyping and cross-platform compatibility. While the open-source components allow deployment for Windows and Linux platforms, mo- bile devices are not supported. 
Furthermore, it lacks a state-of-the-art rendering engine that provides physics support and does not oer a graphical user interface for 3D scene management and authoring. Commercially available systems, i.e InstantRealiy [149] and MiddleVR [150], enable rapid application development with a comprehensive graphical user interface and support a wide variety of tracking and output devices. As drawback, only simple point and click metaphors [150] are provided as 3D user interface. 3DVIA Virtools [168] is a commercial development and deployment platform for interactive 3D content creation. It supports multiple users and physics behavior to create immersive and distributed applications using industry standard mixed reality peripherals. It oers a comprehensive graphical development environment and can deploy to a wide range of output devices. However, all three frameworks are cost intensive or just free to use in a private context. Frameworks such as BuildAR [152] and DART [59] focus on enabling mixed reality application development by non-programmers. Using BuildAR, the programmer can as- sociate virtual models with visually tracked planar markers. However, it does not provide more complex tracking behaviors, object interaction or a broader choice of tracking de- vices. One of the rst AR frameworks using o-the-shelf software to design and develop mixed reality applications was the Designers Augmented Reality Toolkit (DART) [59]. DART is a plug-in for the popular Macromedia Director multimedia programming en- vironment. It uses the familiar Director paradigms of a score, sprites and behaviors to allow a user to visually create complex mixed reality applications. DART also provides low-level support for the management of trackers, sensors, and cameras via a Director plug-in Xtra. However, DART is expensive due to licensing costs for Director. In addi- tion, the time line based scene management is rather made for story telling environments than for non-linear mixed reality applications. Although there are several frameworks for building mixed reality systems on a stationary workstation, there is little support for handheld mixed reality [111]. Furthermore, none features straightforward integration of novel hardware devices and techniques while being cost ecient and providing an intu- itive scene management to create collaborative distributed mixed reality applications. Similar to Virtools, Unity3D [167] provides an editor for authoring 2D and 3D con- tent and compromises a game engine for executing and rendering the 3D application. Nevertheless, Unity3D by itself is not a mixed reality framework since it lacks support for tracking and interaction. It is rather designed for creating 3D video games and other interactive content. It oers a powerful render engine providing lighting, physics, network communication for collaboration and content distribution. Furthermore, it provides an integrated programming environment using C#, JavaScript or Boo while development can be done under Windows as well as Mac OS X. The nal application can be built 175 2. BACKGROUND & RELATED WORK  generally without changes  for various platforms such as Windows, Mac, iOS, An- droid, all major game consoles, Flash and web clients. For private and research purpose, Unity3D is available for free and applications can be deployed at no charge to Windows, Mac, iOS and Android. This makes this software a compelling component for scene management, rendering and distribution in a mixed reality framework. 
Chapter 3
Framework Architecture

Regarding our motivation from Section 1.1, the aim was to develop a loosely coupled, modular mixed reality framework which can easily be adapted to support emerging devices and interaction techniques. Furthermore, multiple users in a distributed environment shall be supported, providing non-immersive to fully immersive mixed reality setups as well as handheld scenarios. The proposed software architecture borrows from best design practices, as illustrated in Figure 2.1.

Figure 3.1: ARTiFICe framework components and data flow.

An overview of the developed Augmented Reality Framework for Distributed Collaboration (ARTiFICe) with its components and the data flow is illustrated in Figure 3.1. Tracking data from the workstation-based input devices as well as from handheld devices is fed into ARTiFICe using the middleware layer, which transforms all input data in a consistent way and delivers it to the application layer. The application layer is built on top of the external game engine Unity3D [167]. Within the application layer, the ARTiFICe core handles the tracking input data, provides interaction techniques and distribution support and delivers the data to the game engine's scene management. The virtual scene with real-time interaction is then visualized on different output devices using the game engine's rendering module. The ARTiFICe core defines a unified tracker object to provide the input data from the middleware that can be accessed for visualization and interaction. Furthermore, the ARTiFICe core comprises an interaction module with well-defined interfaces to integrate selection and manipulation techniques. Besides single-user 3D interaction, the co-presence of multiple users interacting with the same content at the same point in time opens up great possibilities for collaborative work. Therefore, a distribution module was integrated into the ARTiFICe core to enable real-time user-managed collaboration for various hardware setups of two or more users over the network. It distributes the scene as well as user interaction in real time and was built upon the networking layer of Unity3D.

3.1 Base Infrastructure

ARTiFICe uses Unity3D, an "integrated authoring tool for creation of 3D videogames" [167], as base infrastructure for scene authoring, rendering and for its application layer.

3.1.1 Functionalities of Unity

The free-to-use license of Unity3D offers a powerful Application Programming Interface (API) to create projects in JavaScript, Boo and C#. These projects can be deployed without any further changes to multiple platforms, including Windows, OSX and Linux, iOS and Android, various game consoles and a special web player for online deployment. Unity's 3D rendering engine supports both DirectX and OpenGL. Furthermore, the Nvidia (previously Ageia) PhysX engine is included and supports real-time physics simulation such as object collisions and casts, forces and multiple joints.
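To make this physics support tangible, the following minimal C# sketch shows how a Unity script can give a scene object physics behavior and apply a force to it. The class name and the force value are illustrative assumptions, not part of ARTiFICe.

using UnityEngine;

// Minimal illustration of Unity's built-in PhysX support: attaching this
// script to a GameObject gives it a collider and a rigid body, after which
// the engine simulates gravity, collisions and applied forces automatically.
// The class name and the force value are illustrative only.
public class PhysicsDemo : MonoBehaviour
{
    void Start()
    {
        // Ensure the object participates in the physics simulation.
        if (GetComponent<Collider>() == null)
            gameObject.AddComponent<BoxCollider>();

        Rigidbody body = gameObject.AddComponent<Rigidbody>();

        // Give the object an initial push; PhysX handles the rest.
        body.AddForce(Vector3.up * 5.0f, ForceMode.Impulse);
    }
}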
For 3D content, Unity natively provides only creation of very simple shapes, such as cubes, spheres and cylinders. More sophisticated 3D meshes can be imported using common formats such as .FBX, .OBJ, COLLADA, as well as models created in 3D Studio Max, Blender and Maya. 3.1.2 Core Concepts of Unity The Unity scene management oers a rich GUI to place and arrange 3D objects, such as geometry, virtual cameras in space. All objects of a scene are organized in a hierar- chical order that follows the basic principles of a 3D scene graph. Each object in the 178 3.2 Middleware hierarchy is represented by the Unity basic class GameObject that acts as a container for all kind of objects. Each GameObject can be enhanced by so called Components to control the GameObject's transformation including position, rotation and scale, its appearance, rendering and physics behavior. Therefore, a Component is "attached" to a GameObject. While there is a magnitude of pre-dened Components, it is also possi- ble to create specic behavior by implementing them in custom scripts using the Unity API and attaching those scripts to the GameObject. These two core concepts form the foundation for the development of ARTiFICe's core that resides as a script hierarchy within Unity's Integrated Developing Environment (IDE). For an in-depth description of Unity's functionality, the reader is referred to [167]. 3.2 Middleware To process the tracking input and to provide it to the application layer, the Artice's middleware layer uses OpenTracker [46] to gather tracking data of various workstation- based input devices and Vuforia [163] for 6DOF estimation of a handheld device. 3.2.1 OpenTracker OpenTracker [46] is an open-source software framework that serves as connection be- tween the input devices and the application layer and communicates with the ARTi- FICe core. It reads out tracking data from the input devices using appropriate drivers, transforms the data in a consistent format, fuses multiple tracking sources and nally delivers the data via a transport mechanism. To fetch tracking data from remote input devices, OpenTracker supports the Virtual-Reality Private Network [47] (VRPN) that is a device-independent and network-transparent framework for devices used in mixed reality systems. Thereby, it provides a hardware abstraction layer and eases the devel- opment and maintenance of hardware setups in a exible manner. This is achieved by using an object-oriented design based on XML and utilizing standard XML tools for development, conguration and documentation. To describe the employed tracking con- guration, a data ow graph is dened via a XML le complying to a predened DTD. A multi-threaded execution model takes care of lters and transformations that are applied to the tracking data. The underlying data ow graph can be described by the following three XML node types: Sources: This is the entry point for all tracking data. Typically, a source node is a wrapper of a specic device driver. Filters: A lter node performs the actual work of processing the input data to be able to deliver it in a consistent way to the application layer. There is a great number of available lter nodes, such as geometric transformations, conversions to translate one data type into another or lters to merge tracking data from multiple inputs by combining them into a new data format. 179 3. FRAMEWORK ARCHITECTURE Sinks: The sink node is mostly responsible for distributing the ltered data to the application that communicates with OpenTracker. 
Extending OpenTracker

On start-up, the XML configuration file is loaded and parsed to generate the data flow graph by dynamically instantiating the defined nodes. However, this convenient way to configure the interaction devices and to connect them to the application layer is only given if both hard- and software are fully integrated in OpenTracker. The native OpenTracker implementation does not provide support for e.g. the Razer Hydra [142] and the 3D Connexion SpaceNavigator [147]. Since these devices have great potential to enable intuitive 3D interaction in a desktop mixed reality scenario, two novel source nodes were implemented, as further described in Section 3.2.3.1. To further support ARToolkit markers [35, 87] as well as optical tracking and full body motion capturing, as outlined in Section 3.2.3, existing OpenTracker source nodes were used.

Figure 3.2: OpenTracker nodes with new ones marked in blue.

Furthermore, OpenTracker did not provide a sink node to communicate with Unity3D by default. Therefore, a new OpenTracker sink node UnitySink was implemented to provide a single sink for all tracking devices to link them with Unity. The UnitySink node is referenced during run-time by the ARTiFICe core to fetch tracking data and provide it within the application. An overview of the extended OpenTracker architecture that is employed within the middleware layer of ARTiFICe is depicted in Figure 3.2. An example XML configuration file is given in Listing 3.1.

3.2.2 Vuforia

Vuforia [163] is a software development kit (SDK) to create augmented reality applications for handheld devices. It uses natural features (see Chapter II.2) of planar or volumetric objects to determine, in a frame-wise manner, the 6DOF pose of the handheld device's camera relative to the object. The object has to be registered using the Vuforia Target Management System in an offline process before it can be tracked by the online Vuforia processing pipeline. Vuforia provides native SDKs for Android with an Application Programming Interface (API) in Java and Java/C++ as well as for iOS in Objective-C. The Vuforia AR Extension for Unity furthermore provides the pose tracking functionality within the Unity IDE. Currently, Vuforia is compatible with a broad range of mobile devices, such as the iPhone (4/4S), the iPad, and Android phones and tablets running Android OS version 2.2 or higher.

3.2.3 Supported Setups & Hardware

Using OpenTracker and Vuforia, a wide range of tracking input devices is linked to Unity3D to enable the further mixed reality specific behavior provided by the ARTiFICe core. For workstation-based devices, either existing OpenTracker source nodes were used or novel ones were implemented. To enable mobile devices, Vuforia was used by ARTiFICe. A comprehensive overview of all tracking devices supported by ARTiFICe is given in Table 3.1. Beyond this table, all devices that are natively supported by OpenTracker and by VRPN can be used within ARTiFICe as well.

Device Name                   SDK       Existing Node         New Node
ARToolkit markers             OT        ARToolKitPlusSource   -
3D Connexion SpaceNavigator   OT        -                     SpaceDeviceSource
Razer Hydra                   OT        -                     HydraSource
MS Kinect                     OT        VRPNSource            -
Optical tracking              OT        VRPNSource            -
Handheld device               Vuforia   -                     -

Table 3.1: Interaction devices supported by ARTiFICe.

The flexible middleware concept allows configuration of all these devices in various combinations using a single OpenTracker XML configuration file. Configuration of mobile devices is treated separately using Vuforia.
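Since the body of Listing 3.1 is not reproduced at this point, the following sketch merely illustrates the general shape of such a configuration: source nodes are wrapped in transformation filters and handed to UnitySink nodes. The element and attribute names (UnitySink, EventTransform, ARToolKitPlusSource, SpaceDeviceSource and their attributes) are assumptions derived from the node names above and may differ from the actual ARTiFICe configuration file.

<!-- Illustrative OpenTracker configuration sketch; element and attribute
     names are assumed, not copied from the actual ARTiFICe listing. -->
<OpenTracker>
  <configuration>
    <!-- module configuration, e.g. for ARToolKitPlus, would go here -->
  </configuration>

  <!-- One ARToolkit+ marker, re-oriented and handed over to Unity3D. -->
  <UnitySink name="marker0">
    <EventTransform rotationtype="euler" rotation="90 0 0">
      <ARToolKitPlusSource tag-id="0"/>
    </EventTransform>
  </UnitySink>

  <!-- The 3D Connexion SpaceNavigator via the new source node. -->
  <UnitySink name="spacenav">
    <EventTransform rotationtype="euler" rotation="90 0 0">
      <SpaceDeviceSource/>
    </EventTransform>
  </UnitySink>
</OpenTracker>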
With the supported tracking input, ARTiFICe enables the creation of desktop-based, semi-immersive, full immersive as well as handheld mixed reality environments that are described in the following. 3.2.3.1 Desktop Mixed Reality For desktop setups, ARToolKit [35] as well as ARToolkit+ [87] are tracking libraries providing projective-invariant planar bitmap patterns for 6DOF pose estimation that encode a unique number for distinguishing multiple markers (see Chapter II.2). AR- Toolkit is usually employed in desktop based mixed reality environments while AR- Toolkit+ enhances the original ARToolkit library and is optimized for usage on hand- held devices. ARToolkit+ is used in ARTiFICe framework, which has been previously integrated into OpenTracker. To enable live video view within a deployed Unity project, OpenVideo [161], a data integration- and processing toolkit, is used. It acquires video 181 3. FRAMEWORK ARCHITECTURE frames from the connected webcam that are subsequently processed by ARToolkit+. The video is then streamed into Unity3D to provide a view of the real world scene while inter- acting with the planar bitmap pattern. For more enhanced 3D interaction, the 3D mouse SpaceNavigator from 3D Connexion [147] was integrated by wrapping its native driver into a new OpenTracker source node. Furthermore, the two-handed interaction controller Razer Hydra [142] was integrated into OpenTracker, as described in detail in [122]. 3.2.4 (Semi) Immersive Mixed Reality Model based optical tracking, as described in detail in Chapter II.4 can be employed to track the user's head and interaction device in a semi or fully immersive mixed reality environment. For room-sized environments, the passive optical tracking system [84] was integrated into ARTiFICe using VRPN [134]. The 6DOF pose tracking data is read by the existing OpenTracker VRPNSource node, transformed and provided to the ARTiFICe core using the UnitySink node. Figure 3.3: ARTiFICe's processing pipeline of depth data for full-body motion tracking. With emerging depth sensing technology, such as the Microsoft Kinect [127], mark- erless full-body motion tracking becomes more and more popular for user tracking and device-less 3D interaction in a mixed reality environment. Therefore, the Kinect was in- tegrated using OpenNI/NITE [160, 162] and FAAST [165, 128]. OpenNI/NITE provides an API to access raw depth data as well skeleton data, which are calculated based on the depth data. FAAST runs as self-contained application and reads this data. It provides gesture recognition and supports streaming of the full body tracking data over VRPN. Using the VRPNSource node and the UnitySink node, this data is read and fed into the ARTiFICe core. The entire pipeline is depicted in Figure 3.3. 3.2.4.1 Handheld Mixed Reality A modern mixed reality framework should support handheld devices to allow for mobile augmented or virtual reality setups. Due to its powerful properties and its ne-tuned integration into Unity3D, Vuforia [163] is integrated into the middleware layer of ARTi- FICe. Over the ARTiFICe framework, it is interfaced to the ARTiFICe's core to process the mobile tracking data, as described in Section 3.3. 182 3.3 Application Layer 3.3 Application Layer The middleware components communicate with the application layer that comprises Unity3D and the embedded ARTiFICe core. Unity's graphical user interface as well as its IDE are used for 3D scene authoring and application prototyping and its rendering engine is employed for 3D visualization. 
The ARTiFICe core comprises a Manager as well as a tracking, an interaction and a distribution module, and is embedded into the Unity3D IDE. In Figure 3.4, a detailed view of the framework with its data flow and components is given.

Figure 3.4: Detailed framework components.

3.3.1 The ARTiFICe Manager

The ARTiFICe Manager controls the data flow between middleware and application layer. Upon application start-up, it reads the OpenVideo and OpenTracker configuration files and loads the dependent tracking libraries. It starts an OpenTracker instance and an OpenVideo handler for ARToolkit+ marker tracking. It also closes OpenVideo and stops OpenTracker at application shutdown.

3.3.1.1 Tracking Module

The Tracking Module reads the tracking data of the connected input devices and feeds it into the transformation component of a Unity3D GameObject. The overall design of the tracking module is shown in Figure 3.5. Its classes derive from TrackBase for workstation-based devices and from Vuforia.TrackerBehaviour for handheld devices. Since these two classes inherit from the Unity3D base class MonoBehaviour, the deriving classes can be attached to any scene object within the Unity3D hierarchy.

Figure 3.5: Tracking class hierarchy.

For each of the supported workstation-based input devices, a subclass was implemented to provide the specific tracking data depending on the attached devices. Upon application start, TrackProvider creates ARTiFICe Trackers through the ARTiFICe Manager, which is implemented as a singleton. Each ARTiFICe Tracker is interfaced to the corresponding OpenTracker Unity node. For planar bitmap marker tracking, multi-marker tracking support was implemented to be able to track cuboid-shaped 3D objects and determine their absolute physical 6DOF pose. To access the handheld device, TrackMobile reads from Vuforia.TrackerBehaviour, which interfaces the Vuforia tracking core in Unity3D's IDE. All tracking subclasses provide Tracker Objects that form a consistent tracking data layer and can be accessed by ARTiFICe's interaction and distribution modules for further processing.

3.3.1.2 Interaction Module

The raw tracking data of a connected input device can be accessed using a Tracker Object, as described in Section 3.3.1.1. It can subsequently be used for 3D object selection and manipulation, as depicted in Figure 3.6.

Figure 3.6: Interaction class hierarchy.

The data of the tracker object is processed by the specific interaction technique, which can be attached to any scene object, e.g. to visually represent a virtual hand.
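As a concrete illustration of this pattern, the following minimal C# sketch shows a script attached to a scene object (e.g. a virtual hand) that reads a 6DOF pose every frame and applies it to the object's transform. The ITrackerObject interface and its members are assumptions introduced only for illustration; ARTiFICe's actual Tracker Object and tracking class API may differ.

using UnityEngine;

// Hypothetical interface standing in for ARTiFICe's consistent tracking
// data layer; the real Tracker Object API is not reproduced here.
public interface ITrackerObject
{
    Vector3 Position { get; }        // tracked 3D position
    Quaternion Orientation { get; }  // tracked 3D orientation
    bool IsValid { get; }            // whether a pose is currently available
}

// Example component in the spirit of the tracking module: attached to a
// scene object, it updates the object's transform from the tracked 6DOF
// pose once per rendered frame.
public class VirtualHandFollower : MonoBehaviour
{
    public ITrackerObject tracker; // assumed to be assigned by the framework

    void Update()
    {
        if (tracker == null || !tracker.IsValid)
            return;

        // Map the tracked pose directly onto the scene object.
        transform.position = tracker.Position;
        transform.rotation = tracker.Orientation;
    }
}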
Each concrete interaction technique inherits from the abstraction layer ObjectSelectionBase that provides a clean interface of data handling for workstation as well as handheld devices and oers a transparent layer to integrate new techniques into the framework. At run-time, the concrete interaction technique determines the currently selected scene objects as well as calculates its absolute 6DOF pose. This data is then handed over in a uniform format to ObjectSelectionBase which is further processed by the Inter- actionBase class and delivered to all selected virtual scene objects. Virtual scene objects that are selectable must have the ObjectController class attached. By reading the data from InteractionBase, the ObjectController checks if the scene object to which it is at- tached to is selected and if it is, it manipulates the position and orientation depending on the given pose. As concrete 3D interaction techniques, a number of state-of-the-art interaction tech- niques were implemented, such as a simple VirtualHand, GoGo [22], Aperture [21] and HOMER [24]. For 3D manipulation in a handheld mixed reality environment, the novel interaction techniques DrillSample, 3DTouch and HOMER-S, as described in Chap- ters III.3 and III.4, are integrated into the framework. As shown in Figure 3.6, the class MobileObjectSelectionBase acts as an interface for these interaction techniques. The class inherits from ObjectSelectionBase and provides a common layer to gain access to handheld specic hardware functionality, such as touch input. 3.3.1.3 Collaboration & Distribution To provide multi-user support for interaction using dierent interaction devices and re- mote collaboration of one virtual scene, a collaboration and distribution module was 185 3. FRAMEWORK ARCHITECTURE furthermore implemented. It is loosely coupled with the interaction module and enables distribution of both mobile and all workstation setups. The networking functions are based on the Unity3D network layer using the User Datagram Protocol (UDP) for com- munication. A client-server architecture is applied with a direct connection between the server and all clients, resulting in a Star Topology. For data exchange, remote procedure calls (RPC) and state synchronization are employed. To prevent data loss, the state synchronization is buered. An overview of the distribution module and its connection to the interaction module is given in Figure 3.7. The NetworkBase class provides functions to initialize the server and to connect a client to the server. All connected clients are managed by the UserManager class, implemented as singleton. To reduce necessary hardware for realizing a client- server application and to improve overall usability, one device can act simultaneously as server and client. Figure 3.7: Distribution class hierarchy. To enable multi-user collaboration of a virtual scene, all user-specic interaction must be distributed as well. Therefore, each selectable scene object must have a NetworkObject- Controller component attached that distributes selection and manipulation functionality over the network. To enable exclusive access to a scene object, ExclusiveAccessObject- Controller prevents simultaneous usage by multiple users. As long as a user selects and manipulates the scene object, it is locked for other users. To provide exclusive object access to a specic user, the UserManagmentObjectController is used. 
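To sketch how the NetworkBase-style server and client start-up described above could look on top of Unity's legacy networking layer (the UDP-based Network class of that Unity generation), consider the following minimal C# example. The class name, port, address and connection limit are illustrative assumptions and do not reproduce ARTiFICe's actual implementation.

using UnityEngine;

// Minimal sketch of server/client start-up with Unity's legacy networking
// layer. Port, connection count, address and class name are illustrative;
// ARTiFICe's NetworkBase and UserManager are not reproduced here.
public class NetworkBootstrap : MonoBehaviour
{
    public bool isServer = true;
    public string serverAddress = "192.168.0.10"; // example address
    public int port = 25000;                       // example port

    void Start()
    {
        if (isServer)
        {
            // The server process also participates in the scene, so one
            // device effectively acts as server and client at the same time.
            Network.InitializeServer(8, port, false);
        }
        else
        {
            Network.Connect(serverAddress, port);
        }
    }

    // Callback invoked on the server when a new participant joins.
    void OnPlayerConnected(NetworkPlayer player)
    {
        Debug.Log("Client connected: " + player.ipAddress);
    }
}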
3.4 Workflow for Application Development

With the proposed middleware and application layer components, a new mixed reality application can be developed using the following steps.

1. A new Unity3D project is created and the ARTiFICe framework is added to the project by copying the sources into the project's folder hierarchy under Assets.

2. The desired workstation-based interaction devices are then configured using the single OpenTracker XML file. An example is given in Listing 3.1, configuring one ARToolkit+ marker as well as the SpaceNavigator as input devices. Both are filtered in terms of transformation to ensure a common orientation of the tracking input.

3. If the application is deployed as a handheld mixed reality setup, Vuforia is integrated into the Unity project, as described on the Vuforia developers page [163].

4. Virtual cameras, lights, interaction and selectable scene objects are created and added to the 3D environment using the Unity3D graphical scene management. They are encapsulated as Unity3D GameObjects and can subsequently be connected to the corresponding classes of the ARTiFICe core modules.

5. Finally, the project is built and deployed to the desired platform using Unity's built-in deployment tool.

Listing 3.1: An OpenTracker example configuration.

Chapter 4
Developed Mixed Reality Environments

ARTiFICe was intensively tested and used within research projects as well as for teaching.

• The framework was applied as the technological foundation for the Virtual and Augmented Reality laboratory exercise in the graduate program of Vienna University of Technology from the winter term 2011/12 onwards. In total, more than 150 students developed distributed and collaborative mixed reality applications with ARTiFICe, using several interaction techniques in combination with ARToolkit markers, the 3D Connexion SpaceNavigator and Microsoft Kinect for Windows.

• ARTiFICe was employed for the laboratory exercise Augmented Reality as part of the graduate program Mobile Computing at the University of Applied Sciences Upper Austria during the winter terms 2011/12 and 2012/13. With the help of the framework, more than 30 students developed a distributed and collaborative application for handheld mixed reality within just four weeks, using HOMER-S and 3DTouch.

• The framework is an integral component of research projects in the field of interaction and tracking at the Interactive Media Systems Group at Vienna University of Technology to enable rapid prototyping. Within these projects, ARTiFICe is subject to continuous development.

4.1 Test Setups & Environment

In the following sections, we demonstrate an excerpt of the setups that have been developed with ARTiFICe. The presented mixed reality environments feature different combinations of processing platforms and hardware for input and output, and provide varying levels of immersion (see Chapter I.1). The framework was tested on various workstations running Windows 7 (32/64 bit). All parts of the framework, except Kinect and ARToolkit, can also be deployed on Mac OS X/iOS. The handheld mixed reality setup was tested on multiple Android devices, all running at least Android v2.2 and featuring an ARMv7 architecture or higher.

4.2 Non-Immersive Mixed Reality

A non-immersive mixed reality environment usually consists of a non-stereoscopic screen through which the user observes the virtual scene, making the screen a window into the virtual world.
In such a setup, the user is fully aware of the reality that surrounds him or her, resulting in a feeling of non-immersion. In the following, two typical non-immersive scenarios, a desktop as well as a handheld setup are presented. 4.2.1 Single & Multi-User Desktop Mixed Reality Two mixed reality desktop applications were realized. In the rst, as depicted in Fig- ure 4.1a, a multi-user collaborative and distributed augmented reality simulation was developed using multiple ARToolkit+ markers as input and interaction devices. (a) ARToolkit interaction. (b) Interaction with Razer Hydra. Figure 4.1: Two examples of desktop mixed reality setups. A portion of the markers form a MagicBook [41] that was used for interactive story telling. The other portion of the markers acts as a cube that was employed as a multi- purpose interaction device, using the multi-marker tracking capabilities of the framework (see Section 3.3.1.1). All markers in the scene are centrally organized in one OpenTracker XML conguration le and were tracked by a low-cost o-the-shelf camera (Logitech Webcam C905 ). The virtual scene as well as any user interactions are distributed to all 190 4.2 Non-Immersive Mixed Reality connected clients using the ARTiFICe distribution module while the workstation of one user acts simultaneously as server and client. The second desktop-based setup employs a Razer Hydra [142] as a high-precision 6DOF interaction device to realize a single-user virtual reality training environment. In an application for geometry education [122], virtual scene objects can be created, controlled and manipulated using the Hydra, as illustrated in Figure 4.1b. Thereby, spatial abilities as well as a deeper understanding of 3D geometry can be trained by using a low-cost setup that allows for seamless 3D manipulation. 4.2.2 Multi-User Handheld Mixed Reality As an example for a non-immersive handheld mixed reality environment, a collaborative and distributed application was developed. It provides a multi-user augmented reality game in which users can interact with the physically driven virtual scene objects using HOMER-S. Again, the virtual scene as well as any user interactions are distributed to all connected clients using the ARTiFICe distribution module while the mobile device of one user acts simultaneously as server and client. Figure 4.2: Multi-user collaborative and distributed handheld mixed reality. As shown in Figure 4.2, the user on the left hand side currently translates a virtual brick in space while the user on the right observes this interaction. To enable 6DOF pose tracking, an arbitrary image is registered in an o-line process with the natural feature tracking toolkit [163]. At runtime, the image is used as playground and is augmented with the virtual scene that can be observed through the handheld's device screen. Multiple users can collaborate and interactively play together, either by pointing their phones on the same physical image or at dierent images at distributed locations that show the same motive. 191 4. DEVELOPED MIXED REALITY ENVIRONMENTS 4.3 Combined Non- & Semi-Immersive Mixed Reality Furthermore, ARTiFICe can be employed to create collaborative and distribued mixed reality setups that oer dierent levels of immersion. In Figure 4.3, a collaborative and distributed multi-user setup is shown providing a non-immersive environment for User 1 and a semi-immersive setup for User 2. 
Semi-immersive environments provide an increased amount of immersion by enabling stereoscopic viewing through shutter glasses and 3D interaction using mobile 6DOF devices, such as 3D pens (see Figure III.2.1c) or motion capturing.

(a) Non-immersive setup using a stationary 6DOF interaction device. (b) Semi-immersive stereo projection setup with full body motion capture.
Figure 4.3: A distributed multi-user non- & semi-immersive mixed reality setup.

The combined non- and semi-immersive distributed setup is achieved by supporting a different set of input and output devices for each user. A game was developed as a test application in which two users have to collaboratively control a flying bird through a virtual environment. While the first user (Figure 4.3a) views the scene on a screen and interacts with the 3D Connexion SpaceNavigator to control the bird's attitude as well as to clear its flight path using the GoGo interaction technique [22], the second user (Figure 4.3b) is provided with a stereoscopic scene view and controls the speed and direction of the virtual character by full body motion capturing and gesture recognition, using the Microsoft Kinect [127] as input. Both users interact in different physical locations and are connected over the ARTiFICe distribution module.

4.4 Combined Semi- & Full Immersive Mixed Reality

Furthermore, ARTiFICe has, amongst others, also been employed for serious game development. A virtual reality training application was created based on ARTiFICe to support upper limb prosthesis patients in learning to control their myoelectric prostheses, even before they have access to the physical ones [139, 134]. The software consists of a server application to control the training parameters, and a client module to visualize the virtual environment to the user in a head mounted display (HMD). In Figure 4.4, a test setup of this fully immersive application is shown. Both the HMD and the user's upper arm are tracked using optical tracking. Thereby, the user is provided with an egocentric scene view and can control the position and orientation of the virtual prosthesis.

(a) A detailed view of the immersive virtual reality setup. (b) The combined immersive and semi-immersive virtual reality setup.
Figure 4.4: The combined semi- & fully immersive mixed reality setup for prosthesis training.

The tracking data is sent to ARTiFICe through the OpenTracker VRPN node. An electromyographic (EMG) tracking device was integrated into the optical tracking target to detect muscle contraction for controlling grasping of the prosthesis, as shown in Figure 4.4a. The EMG data is sent via the wireless Bluetooth protocol to the workstation. As depicted in Figure 4.4b, the egocentric scene view can be displayed on a stereo projection wall for demonstration purposes to share the user's HMD experience for discussion and explanations.

Chapter 5
Summary

In this part, a flexible software framework named ARTiFICe is introduced to develop collaborative and distributed mixed reality applications. The framework follows a modular software architecture and features loosely coupled, extendable modules for tracking, interaction and distribution. Built upon the state-of-the-art game engine Unity3D [167], the framework further provides high quality 3D rendering, physics support, a powerful graphical user interface for scene authoring and an integrated build tool to deploy the project for various hardware platforms.
ARTIFICe's middleware is using Vuforia [163] and extends OpenTracker [46] to support tracking of various input sources, such as planar bitmap patterns, 3D mice, rigid body optical tracking targets as well as recently emerged, popular o-the-shelf devices, such as Microsoft Kinect, Razer Hydra and mobile devices running Android and iOS. The design of the middleware as well as the tracking mod- ule in ARTIFICe's application layer allow for a straightforward integration of new input devices. ARTIFICe's interaction module provides well-dened interfaces to integrate cus- tom methods and oers a number of built-in techniques, including the proposed methods of Part III. Finally, ARTIFICe supports the distribution of scene content and user inter- action to create remote mixed reality environments that can be shown on a wide range of devices, such as smartphones, stereo projectors and head mounted displays. Based on these functionalities, ARTiFICe provides the development of versatile mixed reality environments, ranging from non- to fully-immersive setups, that can run on dierent operating systems and platforms, including Windows and Android. ARTiFICe was employed to create mixed reality environments for a number of sci- entic projects, including application development for the techniques that are presented in this thesis. Furthermore, the framework was used by more than 150 students during their university graduate program who were not familiar with mixed reality technology before. It allowed them to develop distributed applications within just a couple of weeks that incorporated dierent tracking devices and as well as interaction techniques. These results demonstrate the framework's applicability and usability for users, which are tech- nically versed but do not have in depth knowledge in mixed reality. Thereby, it can support these non-experts to overcome the initial hurdles of creating advanced applica- tions to create embodied mixed reality experiences. As existing toolkits and approaches 195 5. SUMMARY have drawbacks regarding costs, usability, exibility and extensibility, the results indicate that the implemented framework can act as foundation to further foster the simplication of application development and thereby the pervasiveness of mixed reality applications in everyday scenarios. 196 PART V Conclusion 1 Findings & Outlook 199 1.1 Wide-Area Optical Tracking . . . . . . . . . . . . . . . . . . . . . . . . . 200 1.1.1 Open Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 1.2 3D Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 1.2.1 Open Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 1.3 Creating Mixed Reality Environments . . . . . . . . . . . . . . . . . . . 204 1.3.1 Open Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 197 Chapter 1 Findings & Outlook This thesis has focused on novel concepts and systems to leverage the applicability of mixed reality into unconstrained everyday environments. Therefore, we investigated concepts in the area of tracking, interaction and mixed reality frameworks, that resulted in the presented contributions, as depicted in Figure 1.1. Figure 1.1: Investigated concepts, their relationship and the presented contribution. For each of the investigated areas, we recapitulate our ndings and give an outlook on open topics that are worthwhile to investigate and that we plan to conduct in the future. 199 1. 
FINDINGS & OUTLOOK 1.1 Wide-Area Optical Tracking The rst part of this thesis has focused on optical tracking in large, unconstrained indoor environments. There, environmental conditions pose challenges in tracking volume cov- erage, tracking accuracy and disturbing interferences, such as static and moving lights, poor visibility and occlusions. Our literature review revealed that state-of-the-art optical tracking systems are not capable to cope with the intended environments with a minimal vision hardware setup. Existing systems usually require a large amount of vision hardware to cover larger vol- umes and are sensitive to interferences, especially during target training and camera calibration. Therefore, they cannot provide accurate tracking without pre-conditioning the environment. To overcome the limitations of current approaches, we presented a ro- bust and cost ecient wide area optical tracking system that estimates the 3D position of model-based targets up to 100m while requiring a minimal amount of two cameras. We extend the state-of-the-art in optical tracking systems by proposing a robust extrinsic stereo camera calibration, by introducing a highly re-congurable target design and by providing a software-based processing pipeline that enables the system to cope with large tracking distances, static and moving interfering lights, partly occluded targets as well as disturbances such as fog and dust during calibration as well as tracking. To evaluate the developed hard- and software system, we conducted experiments in three dierent tracking scenarios that all feature large distances and unconstrained indoor environments. During the tests, we observed our system to robustly identify the target's model during stereo camera calibration and tracking in the presence of strong interfering lights, temporary occlusions as well as poor visibility, such as fog. The measurements of accuracy and stability up to 100m indicate that the proposed system outperforms com- peting optical tracking systems in terms of volume coverage, relative point accuracy and robustness. Furthermore, only a minimum of two cameras is required, leading to a sig- nicant reduction in system's cost and setup complexity. In addition, we demonstrated the system's abilities to act as a wide area tracking system for underground surveying tasks. This pushes the boarders of optical tracking to a new application domain since state-of-the-art optical tracking approaches are exclusively designed and thus solely ap- plicable for mixed reality applications. Our results indicate that our proposed system cannot compete with existing surveying measurement technologies in terms of relative point accuracy but outperforms existing systems in the following aspects. No manual sighting of a target is required, tracking of fast movements as well as of multiple targets at the same time can be provided and targets can be easily recongured to track static and portable objects as well as machines. Thus, our system acts as a rst foundation for automated guidance for underground machine control. We hope that our contributions help engineers and developers to foster the further emerging of mixed reality into everyday work and to improve automated surveying. 
For both application areas, a broad range of wide area tracking scenarios can be envisioned that are currently impeded by the limitations of state-of-the-art systems, such as user tracking at entertainment stages or in manufacturing workshops as well as for survey- 200 1.1 Wide-Area Optical Tracking ing tasks such setting out, prole control, deformation monitoring, automated machine guidance. 1.1.1 Open Topics Our evaluation revealed several open topics that we plan to address in future research. ˆ We plan to evaluate the relative point accuracy with dierent hardware setups using higher resolution cameras and lenses with smaller focal length to extend the eld of view and thereby, the horizontal and vertical tracking coverage. Additionally, we will examine infrared LEDs with less radiant intensity to reduce the tracking target length. Both aspects can be benecial especially for tracking at smaller distances up to 30m. ˆ We will address the improvement of feature distribution in the camera image to enhance the estimation of external camera parameters in terms of robustness and accuracy. We found an unbalanced blob coverage of the articially generated point features especially in the vertical dimension that is caused by limited human size and the length of the calibration target as well as by the natural boundaries of the physical environment, such as the ceiling and the ground. Therefore, we will inves- tigate concepts to extract natural features from distinct environmental structures and fuse them with the blob features to increase the distribution along the edges and in the corner of the images. This approach requires a well illuminated envi- ronment with a sucient amount of prominent geometrical structure that might be given in a standard indoor environment. In an underground scenario, where illumination is poor and geometric structures are mostly found around the front face, natural feature extraction would not signicantly enhance the feature distri- bution in the camera images. Here, additional single IR-LED markers that are installed throughout the volume would be an adequate solution to improve the feature distribution. These single blob features could be autonomously detected and extracted using the proposed hardware interference ltering approaches from Chapter II.4. With these methods, we hope to achieve a more accurate calibration for stereo rigs with large baseline in both illuminated as well as poorly illuminated and non-cluttered environments. ˆ To obtain absolute 3D coordinates for surveying measurement tasks, linking the camera's coordinate system to the geo-reference coordinate system is required. The geo-reference coordinate system is obtained by geodesic measurements using a total station/theodolite. To determine the transformation matrix between the two coor- dinate systems, we plan to equip the tracking targets as well as additional stationary single point targets with geodesic prisms that are measured with a theodolite to obtain highly accurate geo-referenced 3D measurements. 201 1. FINDINGS & OUTLOOK 1.2 3D Interaction The second part of this thesis has focused on 3D interaction techniques in one-handed handheld mixed reality. We specically investigated concepts for selection and manipu- lation of objects in dense mixed reality scenes. As tracking is the crucial foundation to enable interaction, Inside-Looking-Out optical 6DOF pose tracking is used as technolog- ical prerequisite for the presented interaction techniques. 
To enable precise 3D object selection and manipulation (translation, rotation, scaling) on a handheld device, our literature research indicated that state-of-the-art interaction techniques usually use the multi-touch capabilities of the device in combination with complex multi-nger or -hand gestures. However, in a handheld mixed reality scenario where the user has usually only one hand available for interaction while the other one is holding the device, these approaches are not applicable and impede the intuitive usage as they require prior knowledge about the supported gestures. To overcome these limi- tations, we proposed three novel techniques for 3D interaction that employ the tracked device pose to highly reduce and thus simplify the user touch input. For 3D object selection, we presented DrillSample that only requires one-nger tap gestures as input and splits the selection procedure into two steps. For object indication, Raycasting is employed that indicates the scene object(s) for later selection. In case of casting multiple objects, their full original 3D spatial context is preserved upon object indication. Thereby, the user is enabled to disambiguate and precisely select occluded objects or objects with high similarity in visual appearance. Finally, the desired object is selected within this renement step by employing an one-nger tap gesture. The imprecise touch input of a nger that might yield ambiguous object indication is thereby compensated by the optional second renement step. In comparison to state-of-the-art techniques, DrillSample provides precise selection of party or fully occluded objects and the non-ambiguous identication of a desired object amongst visually similar ones by only requiring one-nger touch input. The conducted quantitative and qualitative evaluation revealed the strengths of DrillSample that outperformed the baseline techniques as it was found the best general purpose method for visible as well as partly and fully occluded objects, independent of their visual appearance. For 3D object manipulation, the two novel methods 3DTouch and HOMER-S were presented which both support translation, rotation and non-uniform scaling. 3DTouch is based on multi-nger touch input and employs DOF-decomposition. Thereby, the integral 6DOF manipulation is split into the two tasks translation and rotation, enabling one nger to be sucient to access all three DOFs of both tasks. Scaling is designed as another separate 3DOF task and requires a two-nger pinch gesture to allow for non- uniform transformations. HOMER-S decouples the manipulation process from any touch input and thus provides interaction beyond the (limited) screen dimensions. Therefore, it maps the estimated 6DOF device pose onto the object upon selection and employs real-world metaphors to enhance ease of use. HOMER-S applies DOF-integration for the 6DOF task translation and rotation and uses the 6DOF device pose to provide non- uniform scaling in a separate manipulation task. A comprehensive user study indicated 202 1.2 3D Interaction the strength of both techniques to intuitively translate and rotate objects. HOMER-S was found to be less accurate for 3D manipulation compared to 3DTouch but performed signicantly faster for integral 6DOF manipulation tasks. 1.2.1 Open Topics While investigating and developing the presented techniques, we have identied the fol- lowing open topics in the context of 3D interaction. ˆ DrillSample was tested and evaluated in handheld mixed reality setups. 
However, the underlying algorithm can be applied to semi- as well as fully immersive envi- ronments. Thus, we plan to use and evaluate DrillSample in various mixed reality setups, using 6DOF input devices for object indication and selection in combination with stereoscopic viewing through shutter glasses or head mounted displays. Since the DrillSample visualization does not depend on display size but on the eld of view of the user's output device, concepts such as the Image-Plane technique [62] can be employed to show the indicated objects in front of the user in space. ˆ We plan to further examine performance and usability of DrillSample for selecting objects in scenarios with various combinations of object density, size and distance. Therefore, we also consider to investigate using DrillSample with Cone-Casting to provide accurate selection of smaller objects at a larger distance. ˆ Our ndings and the promising results of 3DTouch and HOMER-S motivate us to further evaluate the capabilities of both techniques. Therefore, we will investigate concepts to combine both techniques to enable context-aware manipulation to ben- et from HOMER-S capabilities for rather coarse 3D manipulations and to exploit 3DTouch for ne-grained interactions. ˆ We plan to optimize the overall usability of HOMER-S to further exploit its po- tential. Therefore, we focus on improving the stability of the 6DOF device's pose during manipulation by applying ltering techniques to further reduce the intrinsic optical tracking jitter. This would yield an increased accuracy and might enhance the technique's potential to successfully perform ne manipulations as well. Given the direct mapping of the device's pose onto the selected object, rotations around the pitch-axis are limited. To solve for this issue, a non-direct mapping between the device's and object's orientation will be examined. Furthermore, we plan to provide more robust and view-independent pose tracking by incorporating natural feature tracking based on the surrounding scene geometry. Additionally, a temporal loss of the tracking pose might be compensated by fusing the inertial measurement data of the handheld device with the optical inside-out tracking data. 203 1. FINDINGS & OUTLOOK 1.3 Creating Mixed Reality Environments The third and last part of this thesis has focused on providing a framework to facilitate the development of compelling mixed reality environments. As this requires knowledge in all involved sub-domains, comprising tracking, interaction, scene authoring, 3D visu- alization and, optionally, network handling for distribution, the resulting entry threshold for application development is high. To minimize these initial hurdles and thereby, to leverage mixed reality technology for a broader everyday usage, an inexpensive novel toolkit ARTIFICe was presented that provides a powerful graphical interface to easy access and author the previously mentioned ve modules. ARTIFICe's framework design follows a modular software architecture and features loosely-coupled, extendable modules for tracking, interaction and distribution. Built upon a state-of-the-art game engine, the framework further provides high quality 3D rendering, physics support and an integrated build tool to deploy the project for vari- ous hardware platforms, including Windows and Android. To support a wide range of tracking input, we integrated and extended two middleware frameworks for workstation and mobile device support. 
Thereby, ARTIFICe is capable to integrate tracking input from planar bitmap patterns, 3D mice, rigid body optical tracking targets, Microsoft Kinect, Razer Hydra and mobile devices running Android and iOS. The frameworks in- teraction module provides well-dened interfaces to integrate custom methods and oers a number of built-in techniques, including the proposed methods from Part III. Finally, the developed distribution module supports the creation of collaborative and distributed mixed reality environments that can be visualized on a wide range of devices, such as smartphones, stereo projectors and head mounted displays. We demonstrated the framework's capabilities of creating versatile mixed reality envi- ronments by presenting a number of examples of non-, semi- and fully-immersive setups. Finally, the framework was tested by more than 150 users who were technically versed but did not have in depth knowledge in mixed reality. Their results indicated that the framework is able to lower the initial hurdles of creating advanced applications and to develop embodied mixed reality experiences. We hope that our contributions can support mixed reality developers in creating high quality, compelling virtual environments to further foster the pervasiveness of mixed reality applications in everyday scenarios. 1.3.1 Open Topics As developing a software framework is a constant and on-going process, there are a number of open topics that are worthwhile to investigate in the future. ˆ We focus on improving mobile support and interaction. Therefore, we plan to assess, test and integrate dierent mobile middleware frameworks to provide mixed reality also on devices running iOS. Furthermore, we will examine concepts to enable distributed mixed reality across stationary and handheld devices. Here, we 204 1.3 Creating Mixed Reality Environments aim at the exible management of the employed 3D user interaction depending on the mixed reality setup and interaction device of each user. ˆ We aim on providing the novel framework as open-source project to developers and the research community. 205 PART VI Appendix Bibliography 209 List of Figures 223 List of Tables 227 A User Studies 229 207 Bibliography [1] Yehezkel Lamdan and Haim Wolfson. Geometric Hashing: A general and ecient Model-Based Recognition Scheme. In: ICCV 88 (1088), pp. 238249. [2] Rensis Likert. A Technique for the Measurement of Attitudes. In: Archives of Psychology 140 (132), pp. 155. [3] R.E. Kalman. A new Approach to Linear Filtering and Prediction Problems. In: Journal of Basic Engerneering 82 (1960), pp. 3545. [4] Merrill I. Skolnik. Introduction to Radar Systems. In: Radar Handbook. 1962, p. 2. [5] Wendell R Garner. The Processing of Information and Structure. L. Erlbaum Assoc., 1974. [6] M.E. Mündel. Motion and Time Study: Improving Productivity. Englewood Clis, New Jersey: Prentice-Hall, Inc, 1978. [7] S. Holm. A simple sequentially rejective multiple test procedure. In: Scandina- vian Journal of Statistics 6.2 (1979), pp. 6570. [8] Richard A. Bolt. Put-that-there. In: ACM Voice and Gesture at the Graphics Interface 14 (1980). [9] H.C. Longuet-Higgins. A Computer Alorithm for Reconstructing a Scene from Two Projections. In: Nature 293 (1981), pp. 133135. [10] KS Arun, TS Huang, and SD Blostein. Least-squares tting of two 3-D point sets. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 9.5 (1987), pp. 698700. [11] Berthold K P Horn. Closed-form solution of absolute orientation using unit quaternions. 
In: JOSA A 4.4 (1987), pp. 629642. [12] C. Harris and M. Stephens. A combined corner and edge detector. In: Proceedings of the 4th Alvey Vision Conference. 1988, pp. 147151. [13] Mark R. Shortis and Clive S. Fraser. A review of close range optical 3D mea- surement. In: Proceedings of 16th National Surveying Conference. Barossa Valley, Australia, 1990. [14] K. Kanatani. Computational Projective Geometry. In: CVGIP 54.3 (1991), pp. 333348. 209 BIBLIOGRAPHY [15] P. J. Besl and N. D. McKay. A Method for Registration of 3-D Shapes. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 14.2 (1992), pp. 239 256. [16] C Cruz-Neira, DJ Sandin, and TA DeFanti. Surround-Screen Projection-Based Virtual Reality: the Design and Implementation of the CAVE. In: 20th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press, New York, NY, USA, 1993, pp. 135142. isbn: 0897916018. [17] Oliver Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993. isbn: 0-262-06158-9. [18] J. Liang and M Green. JDCAD: a Highly Interactive 3D Modeling System. In: Proceedings of Third International Conference on CAD and Computer Graphics. 1994, pp. 217222. [19] Paul Milgram, H. Takemura, A. Utsumi, and F. Kishino. Augmented Reality: A class of displays on the reality-virtuality continuum. In: Proceedings of Telema- nipulator and Telepresence Technologies. 1994, pp. 235134. [20] Greg Welch and Gary Bishop. An Introduction to the Kalman Filter. Tech. rep. Chapel Hill, USA: University of North Carolina, 1995, pp. 116. [21] Andrew Forsberg, Kenneth Herndon, and Robert Zeleznik. Aperture based Selec- tion for Immersive Virtual Environments. In: Proceedings of the 9th ACM Sympo- sium on User Interface Software & Technology. 1996, pp. 9596. isbn: 0897917987. [22] Ivan Poupyrev and Mark Billinghurst. The Go-Go Interaction Technique: non- linear Mapping for direct Manipulation in VR. In: Proceedings of the 9th annual ACM symposium on User interface software and technology. 1996, pp. 7980. [23] J Rekimoto. Tilting Operations for Small Screen Interfaces. In: Proceedings of the 9th annual ACM symposium on User interface software and technology. 1996, pp. 167168. [24] Doug A Bowman and Larry F Hodges. An Evaluation of Techniques for Grabbing and Manipulating Objects in Immersive Virtual Environments Arm-Extension Ray-Casting. In: Proceedings of the 1997 Symposium on Interactive 3D Graphics. 1997, pp. 3538. [25] D.W. Eggert, A. Lorusso, and R.B. Fisher. Estimating 3-D Rigid Body Trans- formations: a Comparison of Four Major Algorithms. In: Machine Vision and Applications 9.5-6 (Mar. 1997), pp. 272290. issn: 0932-8092. [26] Richard Hartley and Peter Sturm. Triangulation. In: Computer Vision and Im- age Understanding 68.2 (Nov. 1997), pp. 146157. issn: 10773142. [27] Janne Heikkila and Olli Silven. A four-step camera calibration procedure with implicit image correction. In: IEEE Conference on Computer Vision and Pattern Recognition. San Juan, 1997, pp. 11061112. 210 BIBLIOGRAPHY [28] J S Pierce, A Forsberg, M J Conway, S Hong, R Zeleznik, and M Mine. Image Plane Interaction Techniques in 3D Immersive Environments. In: Proceedings of the Symposium on Interactive 3D Graphics (I3D `97). 1997, pp. 3943. [29] Hans-Jürg Fuchser. Determining Convergences by photogrammetric Means. In: TUNNEL 17.7 (1998), pp. 3842. [30] P. Meer, R. Lenz, and S. Ramakrishna. Ecient invariant representations. In: IJCV 26.2 (1998), pp. 137152. [31] Ivan Poupyrev, T Ichikawa, S Weghorst, and Mark Billinghurst. 
Egocentric object manipulation in virtual environments: empirical evaluation of interaction tech- niques. In: Computer Graphics Forum (Wiley Online Library) 17 (1998), pp. 41 52. [32] Doug a. Bowman and Larry F. Hodges. Formalizing the Design, Evaluation, and Application of Interaction Techniques for Immersive Virtual Environments. In: Journal of Visual Languages & Computing 10.1 (Feb. 1999), pp. 3753. issn: 1045926X. [33] Klaus Dorfmüller. Robust Tracking for Augmented Reality using Retroreective Markers. In: Computers and Graphics 23.6 (1999), pp. 795800. [34] Klaus Finkenzeller. RFID handbook: Radio-frequency identication fundamentals and applications. New York, USA: John Wiley, 1999. isbn: ISBN 0471988510. [35] Hirokazu Kato and Mark Billinghurst. Marker Tracking and HMD Calibration for a Video-based Augmented Reality Conferencing System. In: Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality (IWAR). IEEE, 1999, pp. 8594. [36] David G. Lowe. Object recognition from local scale-invariant features. In: Pro- ceedings of the International Conference on Computer Vision (ICCV). 1999, pp. 1150 1157. [37] Ivan Poupyrev and Tadao Ichikawa. Manipulating Objects in Virtual Worlds: Categorization and Empirical Evaluation of Interaction Techniques. In: Journal of Visual Languages & Computing 10.1 (Feb. 1999), pp. 1935. issn: 1045926X. [38] P. Sturm and S. Maybank. On Plane-based Camera Calibration: A general Al- gorithm, Singularities, Applications. In: In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Fort Collins, Colorado, USA: IEEE Computer Society Press, 1999, pp. 432437. [39] Bill Triggs, Philip F. McLauchlan, Richard Hartley, and Andrew Fitzgibbon. Bundle adjustment - A modern synthesis. In: Vision Algorithms: Theory and Practise. Ed. by W. Triggs, A. Zisserman, and R. Szeliski. Vol. 34099. Springer, 2000, pp. 298372. [40] Zhengyou Zhang. A Flexible new Technique for Camera Calibration. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 22.11 (2000), pp. 1330 1334. 211 BIBLIOGRAPHY [41] Mark Billinghurst, Hirokazu Kato, and Ivan Poupyrev. The MagicBook: a Tran- sitional AR Interface. In: Computers & Graphics 25.5 (2001), pp. 745753. [42] Carolina Cruz-Neira, Allen Bierbaum, Patrick Hartling, Christopher Just, and Kevin Meinert. VR Juggler  An Open Source Platform for Virtual Reality Ap- plications. In: Proceedings of IEEE Virtual Reality. Reno, Nevada, USA: IEEE, 2001, pp. 8996. [43] Gerd Hesina. Distributed Collaborative Augmented Reality. PhD thesis. Vienna University of Technology, 2001. [44] Jerey Hightower and Gaetano Borriello. Location Systems for Ubiquitous Com- puting. In: IEEE Computer 34(8).August (2001), pp. 5766. [45] J. Lasenby and A. Stevenson. Using Geometric Algebra for Optical Motion Cap- ture. In: Geometric Algebra: A Geometric Approach to Computer Vision, Neural and Quantum Computing, Robotics and Engineering pages. 2001, pp. 147169. [46] Gerhard Reitmayr and Dieter Schmalstieg. An Open Software Architecture for Virtual Reality Interaction. In: Proceedings of ACM Symposium on Virtual Re- ality Software & Technology (VRST). Ban, Canada, 2001, pp. 4754. [47] Russel Taylor, Thomas C Hudson, Adam Seeger, Hans Weber, Jerey Juliano, and Aron T Helser. VRPN: A Device-Independent, Network-Transparent VR Periph- eral System. In: Proceedings of ACM Symposium on Virtual Reality Software & Technology (VRST). Ban, Canada, 2001. [48] Klaus Dorfmüller-Ulhaas. 
Optical Tracking: From User Motion To 3D Interac- tion. PhD Thesis. Vienna University of Technology, 2002. [49] Rafael C. Gonzales and Richard E. Woods. Digital Image Processing. Prentice Hall, New Jersey, USA, 2002, 587. isbn: 0-201-18075-8. [50] Mike Hazas and Andy Ward. A novel broadband ultrasonic location system. In: Ubiquitous Computing 2498.September (2002), pp. 264280. [51] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In: International Journal of Computer Vision 47.1/2/3 (2002), pp. 742. [52] Dieter Schmalstieg, Anton Fuhrmann, Gerd Hesina, Zsolt Szalavári, Miguel En- carnacao, Michael Gervautz, and Werner Purgathofer. The Studierstube aug- mented reality project. In: Presence - Teleoperators and Virtual Environments 11.1 (2002), pp. 3354. [53] Grigore C Burdea and Philippe Coiet. Virtual Reality Technology. 2nd. Wiley- IEEE, 2003. isbn: 0471360899. [54] Robert van Liere and Jurriaan D. Mulder. Optical tracking using projective in- variant marker pattern properties. In: Proceedings of IEEE Virtual Reality. IEEE Comput. Soc, 2003, pp. 191198. isbn: 0-7695-1882-6. 212 BIBLIOGRAPHY [55] Ralitza Gueorguieva and John H. Krysta. Move over anova: Progress in analyzing repeated-measures data andits reection in papers published in the archives of general psychiatry. In: Archives of General Psychiatry 61.3 (2004), pp. 310317. [56] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004. isbn: 0521540518. [57] Robert van Liere and Arjen van Rhijn. An experimental comparison of three optical trackers for model based pose determination in virtual reality. In: Pro- ceedings of 10th Eurographics Conference on Virtual Environments (EGVE'04). Aire-la-Ville, Switzerland, 2004, pp. 2534. [58] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. In: International Journal of Computer Vision 60.2 (Nov. 2004), pp. 91110. issn: 0920-5691. [59] Blair MacIntyre, Maribeth Gandy, Steven Dow, and Jay David Bolter. DART: a Toolkit for Rapid Design Exploration of Augmented Reality Experiences. In: Pro- ceedings of the 17th ACM Symposium on User Interface Software and Technology. ACM Publications, 2004, pp. 197206. [60] Gerard Medioni and Sing Bing Kang. Emerging Topics in Computer Vision. Ed. by Prentice Hall Professional Technical Reference. Upper Saddle River, NJ, USA, 2004. Chap. 2. isbn: 0131013661. [61] A. Stelzer, K. Pourvoyeur, and A. Fischer. Concept and application of LPMA novel 3-D local position measurement system. In: IEEE Transactions on Mi- crowave Theory and Techniques 42 (2004), pp. 26642669. [62] Doug Bowman, Ernst Kruij, Joseph J LaViola Jr., and Ivan Poupyrev. 3D User Interfaces: Theory and Practice. Addison-Wesley, 2005. [63] M. Fiala. ARTag, a ducial marker system using digital techniques. In: Pro- ceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Washington, DC, USA: IEEE, 2005, pp. 590 596. [64] Rapahel Grasset, Julian Looser, and Mark Billinghurst. A Step Towards a Mul- timodal AR Interface : A New Handheld Device for 3D Interaction. In: IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2005, pp. 206207. [65] Anders Henrysson, Mark Billinghurst, and Mark Ollila. Augmented Reality on Mobile Phones: Experiments and Applications. In: The Annual SIGRAD Con- ference Special Theme  Mobile Graphics. 
Linköping University Electronic Press, Linköpings Universitet, 2005, pp. 3540. [66] Anders Henrysson, Mark Billinghurst, and Mark Ollila. Virtual object manipula- tion using a mobile phone. In: Proceedings of the 2005 International Conference on Augmented Teleexistence (ICAT). ACM, 2005, p. 164. isbn: 0473106574. 213 BIBLIOGRAPHY [67] Bodhi P. Nissanka. The cricket indoor location system. PhD Thesis. Massachusetts Institute of Technology, USA, 2005. [68] Jesús Rodríguez. Vision 2030. Tech. rep. Maastricht, Netherlands: European Con- struction Technology Platform (ECTP), www.ectp.org, 2005. [69] Tomas Svoboda, Daniel Martinec, and Tomas Pajdla. A convenient multi-camera self-calibration for virtual environments. In: Presence: Teleoperators & Virtual Environments 14.4 (2005), pp. 407422. [70] Bill Glover and Himanshu Bhatt. RFID Essentials. O'Reilly Media, 2006. isbn: 0-596-00944-5. [71] Sinem Guven, Steven Feiner, and Ohan Oda. Mobile Augmented Reality Interac- tion Techniques for Authoring Situated Media On-Site. In: IEEE/ACM Interna- tional Symposium on Mixed and Augmented Reality (ISMAR). IEEE, Oct. 2006, pp. 235236. isbn: 1-4244-0650-1. [72] M.S. Hancock, S. Carpendale, F.D. Vernier, and D. Wigdor. Rotation and Trans- lation Mechanisms for Tabletop Interaction. In: International Workshop on Hor- izontal Interactive Human-Computer Systems (TABLETOP '06). IEEE, 2006, pp. 7988. isbn: 0-7695-2494-X. [73] Bing Jiang, Kenneth P. Fishkin, Sumit Roy, and Matthai Philipose. Unobtrusive long-range detection of passive RFID tag motion. In: IIEEE Transactions on Instrumentation and Measurement 55.1 (2006), pp. 187196. [74] Edward Rosten and Tom Drummond. Machine Learning for High Speed Corner Detection. In: 9th European Conference on Computer Vision. 2006, pp. 430443. [75] Pedro Santos and Andre Stork. Ptrack: introducing a novel iterative geometric pose estimation for a marker-based single camera tracking system. In: Proceedings of IEEE Virtual Reality. USA, 2006, pp. 149156. isbn: 1424402247. [76] Ferdi Alexander Smit, Arjen van Rhijn, and Robert van Liere. GraphTracker: A Topology Projection Invariant Optical Tracker. In: Proceedings of the 12th Eurographics Conference on Virtual Environments. 2006, pp. 6370. [77] Klaus Chmelina. Laserscanning in Underground Construction: State and Future of a Multi-Purpose Surveying Technology. In: Austria - China - International Symposium on Challenging Tunnel Construction. Vienna, Austria: Institut fuer Interdisziplinaeres Bauprozessmanagement, 2007. [78] Raphael Grasset, Andreas Dünser, and Mark Billinghurst. Human-Centered De- velopment of an AR Handheld Display. In: IEEE and ACM International Sym- posium on Mixed and Augmented Reality (ISMAR). Nara, Japan: IEEE, 2007, pp. 177180. isbn: 9781424417506. [79] Mark Hancock, Sheelagh Carpendale, and Andy Cockburn. Shallow-Depth 3D Interaction : Design and Evaluation of One-, Two- and Three-Touch Techniques. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Sys- tems. ACM, 2007, pp. 11471156. isbn: 9781595935939. 214 BIBLIOGRAPHY [80] Taehee Lee and Tobias Hollerer. Handy AR: Markerless Inspection of Augmented Reality Objects Using Fingertip Tracking. In: Proceedings of the International Symposium on Wearable Computers. IEEE, Oct. 2007, pp. 18. isbn: 978-1-4244- 1452-9. [81] Manuel Loaiza, Alberto Raposo, and Marcelo Gattass. A novel optical tracking algorithm for point-based projective invariant marker patterns. In: Advances in Visual Computing 4841 (2007), pp. 160169. 
[82] Fangfang Lu and Richard Hartley. A fast optimal algorithm for L 2 triangulation. In: Computer VisionACCV 2007. Springer, 2007, pp. 279288. [83] Karen McMenemy and Stuart Ferguson. A Hitchhiker's Guide to Virtual Reality. 1st. A.K. Peters, LtD, Wellesley, MA, USA, 2007. isbn: 13:978-1-56881-303-5. [84] Thomas Pintaric and Hannes Kaufmann. Aordable Infrared-Optical Pose-Tracking for Virtual and Augmented Reality. In: Proceedings of Trends and Issues in Track- ing for Virtual Environments Workshop, IEEE VR 2007. 2007, pp. 4451. [85] Arjen van Rhijn. Congurable Input Devices for 3D Interaction using Optical Tracking. PhD Thesis. Technische Universiteit Eindhoven, Netherlands, 2007. isbn: 9789038608341. [86] F.a. Smit, A. van Rhijn, and R. van Liere. Graphtracker: A Topology Projection Invariant Optical Tracker. In: Computers & Graphics 31.1 (Jan. 2007), pp. 2638. issn: 00978493. [87] Daniel Wagner and Dieter Schmalstieg. ARToolKitPlus for Pose Tracking on Mo- bile Devices. In: Proceedings of 12th Computer Vision Winter Workshop (CVWW'07). Ed. by Michael Grabner and Helmut Grabner. 2007, pp. 139146. [88] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In: Computer Vision and Image Understanding (CVIU) 110.3 (2008), pp. 346359. [89] Klaus Chmelina. Tunnel Laser Scanning, Current Systems, Applications and Re- search Activities. In: Proceedings of the ITA - AITES World Tunnel Congress. Agra, India, 2008, pp. 8692. [90] Klaus Chmelina and Klaus Rabensteiner. Laser Scanning Technology in Under- ground Construction. In: Proceedings of the Jubilee International Scientic and Technical Conference on Tunnel and Metro Constructions. Soa, Bulgaria, 2008. [91] Barry Kavanagh. Surveying principles and applications. 8th. Prentice Hall Inc., 2008. isbn: 978-0132365123. [92] Roland Kuck, Jürgen Wind, Kai Riege, and Manfred Bogen. Improving the AVANGO VR/AR Framework - Lessons Learned. In: 5th Workshop of the GI- VR/AR Group. Magdeburg, Germany: VDTC, 2008. [93] Leica Geosystems. The Leica Absolute Interferometer: A New Approach to Laser Tracker Absolute Distance Meters. Tech. rep. Unterentfelden, Switzerland, 2008, p. 11. 215 BIBLIOGRAPHY [94] Annette Mossel, Thomas Pintaric, and Hannes Kaufmann. Analyse der Mach- barkeit und des Innovationspotentials der Anwendung der Technologie des Optical Real-Time Trackings für Aufgaben der Tunnelvortriebsvermessung. Tech. rep. Aus- tria: Institute of Software Technology and Interactive Systems, Vienna University of Technology, 2008. [95] Thomas Pintaric and Hannes Kaufmann. A Rigid-Body Target Design Method- ology for Optical Pose-Tracking Systems. In: Proceedings of the 2008 ACM Sym- posium on Virtual Reality Software and Technology (VRST). ACM, 2008, pp. 73 76. isbn: 978-1-59593-951-7. [96] WolfWings. Barrel Dirtortion. [Online Image]. 2008. url: https://en.wi kipedia.org/wiki/File:Barrel%5C_distortion.svg (visited on 03/01/2014). [97] WolfWings. Pincushion Distortion. [Online Image]. 2008. url: https://en. wikipedia.org/wiki/File:Pincushion%5C_distortion.svg (visited on 03/02/2014). [98] Alan B. Craig, William R. Sherman, and Jerey D. Will. Developing Virtual Real- ity Applications: Foundations of Eective Design. Morgan Kaufmann Publishers Inc, 2009. [99] Thao Dang, Christian Homann, and Christoph Stiller. Continuous stereo self- calibration by camera parameter tracking. In: IEEE Transactions on Image Pro- cessing 18.7 (July 2009), pp. 153650. issn: 1057-7149. [100] Barry Kavanagh. 
Surveying with Construction Applications. 7th. Prentice Hall Inc., 2009. isbn: 978-0135000519. [101] GA Lee, Ungyeon Yang, Y Kim, D Jo, and KH Kim. Freeze-Set-Go Interaction Method for Handheld Mobile Augmented Reality Environments. In: Proceedings of the 16th ACM Symposium on Virtual Reality Software and Technology (VRST ). ACM, 2009, pp. 143146. isbn: 9781605588698. [102] Ran Liu, Hua Zhang, Manlu Liu, Xianfeng Xia, and Tianlian Hu. Stereo Cam- eras Self-Calibration Based on SIFT. In: International Conference on Measuring Technology and Mechatronics Automation. Ieee, 2009, pp. 352355. isbn: 978-0- 7695-3583-8. [103] Jason L. Reisman, Philip L. Davidson, and Jeerson Y. Han. A Screen-Space Formulation for 2D and 3D Direct Manipulation. In: Proceedings of the Sympo- sium on User interface software and Technology (UIST ). ACM, 2009, p. 69. isbn: 9781605587455. [104] Manuel Veit. Inuence of Degrees of Freedom's Manipulation on Performances During Orientation Tasks in Virtual Reality Environments. In: Proceedings of the 16th ACM Symposium on Virtual Reality Software and Technology (VRST). Vol. 1. 212. 2009, pp. 5158. isbn: 9781605588698. 216 BIBLIOGRAPHY [105] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary Robust Independent Elementary Features. In: 11th European Conference on Computer Vision (ECCV). Heraklion, Greece: LNCS Springer, 2010. [106] Jürgen Janssen and Wilfried Laatz. Statistische Datenanalyse mit SPSS. In: Eine anwendungsorientierte Einführung in das Basissystem und das Modul exakte Tests 7 (2010). [107] Sven Kratz and Michael Rohs. Extending the Virtual Trackball Metaphor to Rear Touch Input. In: Proceedings of the IEEE Symposium on 3D User Interfaces (3DUI). Ieee, Mar. 2010, pp. 111114. isbn: 978-1-4244-6846-1. [108] Anthony Martinet, Gery Casiez, and Laurent Grisoni. The design and evaluation of 3D positioning techniques for multi-touch displays. In: Proceedings of the EEE Symposium on 3D User Interfaces (3DUI). IEEE, Mar. 2010, pp. 115118. isbn: 978-1-4244-6846-1. [109] Takehiro Niikura, Yuki Hirobe, Alvaro Cassinelli, Yoshihiro Watanabe, Takashi Komuro, and Masatoshi Ishikawa. In-Air Typing Interface for Mobile Devices with Vibration Feedback. ACM, 2010, pp. 115. isbn: 9781450303927. [110] Edward Rosten, Reid Porter, and Tom Drummond. Faster and better: a machine learning approach to corner detection. In: IEEE Trans. Pattern Analysis and Machine Intelligence 32 (2010), pp. 105119. [111] Dieter Schmalstieg, Tobias Langlotz, and Mark Billinghurst. Augmented Reality 2 . 0. Ed. by Sabine Coquillart, Guido Brunnett, and Greg Welch. Dagstuhl S. Springer, 2010. [112] Amal Benzina, M Toennis, Gudrun Klinker, and Mohamed Ashry. Phone-based Motion Control in VR: Analysis of Degrees of Freedom. In: Proceedings of the 2011 Conference on Human Factors in Computing Systems. 2011, pp. 15191524. isbn: 9781450302685. [113] A. Cohé, D Fabrice, and Martin Hachet. tBox : A 3D Transformation Widget designed for Touch-Screens. In: Proceedings of the 2011 annual conference on Hu- man Factors in Computing Systems. 2011, pp. 30053008. isbn: 9781450302678. [114] Wolfgang Hürst and Casper Van Wezel. Multimodal Interaction Concepts for Mo- bile Augmented Reality Applications. Springer, 2011, pp. 157167. [115] Regis Kopper, Felipe Bacim, and Doug a. Bowman. Rapid and accurate 3D Selection by Progressive Renement. In: 2011 IEEE Symposium on 3D User Interfaces (3DUI). IEEE, Mar. 2011, pp. 6774. isbn: 978-1-4577-0063-7. [116] Jens Puwein and Remo Ziegler. 
Robust multi-view camera calibration for wide- baseline camera networks. In: IEEE Workshop on Applications of Computer Vi- sion (WACV). 2011, pp. 321328. isbn: 9781424494972. [117] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R. Bradsk. ORB: An ecient alternative to SIFT or SURF. In: International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 25642571. 217 BIBLIOGRAPHY [118] R. Ortiz Alahi and P. Vandergheynst. FREAK: Fast Retina Keypoint. In: Con- ference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012. [119] Jerey Cashion, Chadwick Wingrave, and Joseph J LaViola. Dense and Dynamic 3D Selection for Game-based Virtual Environments. In: IEEE Virtual Reality. Vol. 18. 4. Costa Mesa, USA: IEEE, Apr. 2012, pp. 63442. [120] Klaus Chmelina, Josef Jansa, Gerd Hesina, and Christoph Traxler. A 3-D Laser- scanning System and Scan Data Processing Method for the Monitoring of Tunnel Deformations. In: Journal of Applied Geodesy 6.3-4 (Jan. 2012), pp. 177185. issn: 1862-9016. [121] Florian Daiber, Lianchao Li, and Antonio Krüger. Designing Gestures for Mobile 3D Gaming. In: Proceedings of the 11th International Conference on Mobile and Ubiquitous Multimedia. ACM, 2012, p. 3. isbn: 9781450318150. [122] Zeller. David. Physics driven 3D Dynamic Geometry Software for Elementary Education. Master Thesis. Vienna University of Technology, 2012, p. 96. [123] Andy Field, Jeremy Miles, and Zoe Field. Discovering Statistics Using R. SAGE Publications, 2012. isbn: 9781446200469. [124] Wolfgang Hürst and Casper Wezel. Gesture-based Interaction via Finger Tracking for Mobile Augmented Reality. In: Multimedia Tools and Applications 62.1 (Jan. 2012), pp. 233258. issn: 1380-7501. [125] Hanno Jaspers, Boris Schauerte, and GA Fink. Sift-based Camera Localiza- tion using Reference Objects for Application in Multi-camera Environments and Robotics. In: ICPRAM (2). 2012, pp. 330336. [126] Anthony Martinet, Géry Casiez, and Laurent Grisoni. Integrality and Separa- bility of Multitouch Interaction Techniques in 3D Manipulation Tasks. In: IEEE Transactions on Visualization and Computer Graphics 18.3 (Mar. 2012), pp. 369 80. issn: 1941-0506. [127] Microsoft. Kinect full body interaction. 2012. url: http://www.xbox.com/ en-US/kinect. [128] Evan Suma, David Krum, Belinda Lange, Skip Rizzo, and Marc Bolas. FAAST: The Flexible Action and Articulated Skeleton Toolkit. In: Proceeding of IEEE Virtual Reality. Costa Mesa, USA: IEEE, 2012, pp. 247248. [129] Accurex. AICON DPA-Pro System. [Online]. 2013. url: http://www.accure xmeasure.com/dpapro.htm (visited on 29/11/2013). [130] AIA. GigE Vision. [Online]. 2013. url: http://www.visiononline.org (visited on 08/01/2013). [131] Arduino. Arduino IDE. [Online]. 2013. url: http://www.arduino.cc/ (vis- ited on 12/05/2013). [132] ART. Advanced Real Time Tracking. [Online]. 2013. url: http://www.ar- tracking.de (visited on 01/08/2013). 218 BIBLIOGRAPHY [133] Jean-Yves Bouguet. Camera Calibration Toolbox for Matlab. [Software]. 2013. url: http://www.vision.caltech.edu/bouguetj/calib%5C_doc (visited on 09/01/2013). [134] Michael Bressler. A Virtual Reality Training Tool for Upper Limb Prostheses. Master Thesis. Vienna University of Technology, 2013, p. 115. [135] Geodata Group. Gripper Camera - System Description and Data Sheet. Tech. rep. Austria, 2013. [136] Faheem Ijaz, Hee Kwon Yang, Arbab Waheed Ahmad, and Chankil Lee. Indoor Positioning: A Review of Indoor Ultrasonic Positioning systems. 
In: Proceedings of 15th International Conference on Advanced Communication Technology (ICACT). 2013, pp. 11461150. [137] MathWorks. MATLAB ImageProcessingToolbox. [Software]. 2013. url: http: //www.mathworks.com/help/toolbox/images/ (visited on 01/12/2013). [138] NaturalPoint Inc. OptiTrack. [Online]. 2013. url: http://www.naturalpoin t.com/optitrack/ (visited on 01/12/2013). [139] Andrei Ninu. Prosthesis Embodiment : Sensory-Motor Integration of Prosthetic Devices into the Amputee's Body Image. PhD Thesis. Vienna University of Tech- nology, 2013. [140] OpenCV. Open Computer Vision Library. [Software]. 2013. url: http://open cv.org/ (visited on 12/01/2013). [141] Thomas Pintaric and Hannes Kaufmann. iotracker. [Online]. 2013. url: http: //www.iotracker.com (visited on 01/12/2013). [142] Razer Inc. Hydra. [Online] http://www.razerzone.com/gaming-controllers/razer- hydra. 2013. url: http://www.razerzone.com/gaming-controllers/ razer-hydra (visited on 2013). [143] Can Telkenaroglu and Tolga Capin. Dual-Finger 3D Interaction Techniques for Mobile Devices. In: Personal and Ubiquitous Computing 17.7 (Sept. 2013), pp. 1551 1572. issn: 1617-4909. [144] Khrystyna Vasylevska, Hannes Kaufmann, Mark Bolas, and Evan A. Suma. Flex- ible Spaces : Dynamic Layout Generation for Innite Walking in Virtual Environ- ments. In: IEEE Symposium on 3D User Interfaces (3DUI). Orlando: IEEE, 2013, pp. 14. [145] Vicon.Motion Capture. [Online]. 2013. url: http://www.vicon.com/ (visited on 12/01/2013). [146] WorldViz. PPT E Motion Tracking. [Online]. 2013. url: http://www.worldv iz.com/products/ppt/ (visited on 12/01/2013). [147] 3D Connexion. SpaceNavigator. [Online]. 2014. url: http://www.3dconnexi on.de/ (visited on 05/06/2014). 219 BIBLIOGRAPHY [148] DirectIndustry. Multi-sided 3D touch probe (MSP) for optical tracker, Leica T- Probe. [Online Image]. 2014. url: http://www.directindustry.com/ prod/hexagon-metrology/multi-sided-3d-touch-probes-msp- optical-trackers-5623-1132257.html (visited on 12/04/2014). [149] Fraunhofer IGD. InstantRealiy. [Software]. 2014. url: http://www.instantr eality.org (visited on 10/05/2014). [150] ImInVR. MiddleVR. [Software]. 2014. url: http://www.imin- vr.com/ middlevr/ (visited on 10/05/2014). [151] InserSense. IS-1200 System. 2014. url: http://www.intersense.com/ pages/21/13 (visited on 20/08/2014). [152] MOB Labs. BuildAR. [Software]. 2014. url: https://buildar.com (visited on 20/05/2014). [153] Thorlabs. Motorized Fast-Change Filter Wheel. 2014. url: http://www.thor labs.com/newgrouppage9.cfm?objectgroup%5C_id=2945 (visited on 07/20/2014). [154] K. Vasylevska and H. Kaufmann. Inuence of Vertical Navigation Metaphors on Presence. In: Challenging Presence - Proceedings of 15th International Conference on Presence (ISPR 2014). Vienna, Austria, 2014, pp. 205212. [155] ZigBeeAlliance. ZigBee. 2014. url: http://zigbee.org/ (visited on 20/08/2014). [156] Google. Indoor Maps. [Online]. url: https://www.google.com/intl/en/ maps/about/explore/mobile/ (visited on 01/12/2013). [157] IndooRs. Location Tracking. [Online]. url: http://indoo.rs/ (visited on 12/01/2013). [158] Leica Geosystems. Absolute Tracker AT901. [Online]. url: http://www.leic a-geosystems.com/en/Leica-Absolute-Tracker-AT901%5C_69047. htm (visited on 02/12/2013). [159] Leica Geosystems. T-Probe. [Online]. url: http://www.leica-geosystems. com (visited on 02/12/2013). [160] OpenNI. [Software] (Version 1.3.2.3). url: http://openni.org (visited on 11/01/2011). [161] OpenVideo. [Software] (Version 1.0.0). 
url: http://rpm.icg.tugraz.at/ (visited on 01/08/2011). [162] PrimeSense. NITE. [Software] (Version 1.4.1.2). url: http://www.primesen se.com/ (visited on 11/01/2011). [163] Qualcomm Inc. Vuforia SDK. [Software] (Version 2.8). url: https://develo per.vuforia.com/resources/sdk/android/ (visited on 05/12/2013). [164] SensionLab. Indoor Positioning and Navigation. [Online]. url: http://www. senionlab.com (visited on 12/01/2013). 220 BIBLIOGRAPHY [165] Evan A. Suma, Belinda Lange, Skip Rizzo, David Krum, and Mark Bolas. Flexible Action and Articulated Skeleton Toolkit (FAAST). [Software] (Version 0.08). url: http://projects.ict.usc.edu/mxr/faast/ (visited on 01/11/2011). [166] Ubisense. Real-Time Localization Systems. [Online]. url: http://www.ubise nse.net (visited on 12/01/2013). [167] Unity Technologies. Unity3D. [Software] (Version 4.3.4). url: http://www. unity3d.com/ (visited on 01/01/2014). [168] Virtools. Virtools Dev User Guide. [Online]. url: http://www.virtools.com (visited on 02/01/2014). 221 List of Figures I Introduction 1.1 The Milgram continuum describing the variations of mixed reality. . . . . . 3 1.2 Components of a mixed reality system. . . . . . . . . . . . . . . . . . . . . . 4 2.1 Investigated concepts, their relationship and the presented contribution. . . 5 II Wide-Area Optical Tracking 1.1 Tracking approaches, with the eld of contribution marked bold. . . . . . . 15 2.1 The optical tracking pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2 Types of optical markers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3 Taxonomy of model tting depending on domain and property. . . . . . . . 24 2.4 After perspective projection of the four points, the projective invariant prop- erties of the cross ratio are expressed by λ(A ,B ,C ,D) =̂λ(A ′ , B ′ , C ′ , D ′ ). The points' collinearity is preserved as well, as l =̂ l ′ . . . . . . . . . . . . . . 26 2.5 An example of a passive 3D rigid body target. . . . . . . . . . . . . . . . . . 28 2.6 Taxonomy of pose estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.7 The pinhole camera geometry with camera center C coincides with the coor- dinate system's origin. The image plane is placed with distance f in front of C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.8 The principal point oset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.9 Two common types of radial distortion. . . . . . . . . . . . . . . . . . . . . 34 2.10 The Euclidean transformation between the world and the camera coordinate system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.11 The epipolar geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.12 The four possible solutions for P ′, as combinations of rotations and translations. 39 2.13 A calibration taxonomy by dimension of the applied apparatus. . . . . . . . 40 2.14 Reference targets for intrinsic and extrinsic camera calibration. . . . . . . . 41 3.1 Tracking of a smartphone using Google Indoor Maps [156]. . . . . . . . . . 44 223 List of Figures 3.2 A simple four sensor Ubisense system [166]. . . . . . . . . . . . . . . . . . . 45 3.3 Multiple target tracking using iotracker with 4 cameras, [84]. . . . . . . . . 46 3.4 A tracking setup using the Prime41 system, [138]. . . . . . . . . . . . . . . . 46 3.5 The AICON DPA-Pro System, [129]. . . . . . . . . . . . . . . . . . . . . . . 47 3.6 Leica Absolute Tracker AT901 with T-Probe, [148]. . . . . . . 
. . . . . . . . 48 4.1 Key properties of the proposed optical tracking system. . . . . . . . . . . . 51 4.2 Blobs at 50m distance with minimal/maximal focal length of f = 12 / 36mm. 53 4.3 Overview over the system's workow. . . . . . . . . . . . . . . . . . . . . . . 54 4.4 Coverage of stereo cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.5 The 2D model design features projective invariant properties. . . . . . . . . 55 4.6 LED is coated with a translucent diuse plastic sphere. . . . . . . . . . . . 56 4.7 Intrinsic camera calibration with a retro-reective pattern. . . . . . . . . . . 57 4.8 Trained background (left) and manual masking (right), [141]. . . . . . . . . 58 4.9 Extrinsic calibration pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.10 Resulting camera coordinate system for tracking. . . . . . . . . . . . . . . . 61 4.11 Wavelengths of various light sources. . . . . . . . . . . . . . . . . . . . . . . 61 4.12 Radio module for target communication for luminance-based ltering. . . . 62 4.13 Using a motorized lter wheel for wavelength-based ltering. . . . . . . . . 63 4.14 Pipeline to detect target features using hardware-based ltering. . . . . . . 64 4.15 Pipeline to obtain the target's model. . . . . . . . . . . . . . . . . . . . . . 65 4.16 Pipeline for model identication. . . . . . . . . . . . . . . . . . . . . . . . . 66 4.17 Tracking pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.18 The cabling of the hardware prototype. . . . . . . . . . . . . . . . . . . . . 69 4.19 Software architecture and modules. . . . . . . . . . . . . . . . . . . . . . . . 71 4.20 User interface of semi-autonomous Model Trainer. . . . . . . . . . . . . . . 71 4.21 Examples of incorrect model recognition during training. . . . . . . . . . . . 72 4.22 User interface of Controller to analyze data during calibration and tracking. 73 5.1 Wide area user tracking in a mixed reality setup. . . . . . . . . . . . . . . . 77 5.2 Wide area user tracking in a mixed reality setup. . . . . . . . . . . . . . . . 78 5.3 Target design for head tracking. . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.4 Target prototype attached on a HMD. . . . . . . . . . . . . . . . . . . . . . 79 5.5 Corresponding blob traces used for extrinsic calibration. . . . . . . . . . . . 80 5.6 Mean of relative accuracy xRMS(P ) over all three calibrations. . . . . . . . 81 5.7 3D position tracking from 5 − 30m. . . . . . . . . . . . . . . . . . . . . . . 82 5.8 Tracking situation in an underground environment. . . . . . . . . . . . . . . 83 5.9 Multiple unique target constellations. . . . . . . . . . . . . . . . . . . . . . . 83 5.10 3D position estimation of visible or invisible static and moving target's tips. 84 5.11 Developed target prototype. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.12 Details of the developed target prototype. . . . . . . . . . . . . . . . . . . . 85 5.13 Robust and dampness proof encasement of cameras and base station. . . . . 85 5.14 Test environment in a metro underground station. . . . . . . . . . . . . . . 86 5.15 Calibration with dbase ≈ 6m. . . . . . . . . . . . . . . . . . . . . . . . . . . 87 224 List of Figures 5.16 Target movement during accuracy and stability measurements. . . . . . . . 88 5.17 |ε̂bar| for all dbase and dtrack. . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.18 3D position tracking of a moving target through the entire volume. . . . . . 
91 5.19 Examples of modern underground machinery. . . . . . . . . . . . . . . . . . 91 5.20 Details of the test environment. . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.21 Comparison of blob quality at 110m with an inter LED distance of 34cm. . 93 5.22 A single optical target comprising the encased IR-LED attached to a reective geodesic foiled target. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.23 The IR-LED line target prototype for machine tracking. . . . . . . . . . . . 94 5.24 Kinematic tracking of the horizontal target from 20− 110m with dbase ≈ 3m. 98 5.25 The vertical target is partly occluded by an interfering light but can still be successfully identied, as indicated by the yellow crosses. . . . . . . . . . . . 99 5.26 Both targets 'models are fully identied and tracked despite heavy interfering light. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.27 Both targets' models are fully identied and tracked during fog tests. . . . . 100 III User Interfaces for 3D Interaction 1.1 Interaction categories, with the elds of contribution marked bold. . . . . . 109 2.1 An excerpt of 3D interaction devices. . . . . . . . . . . . . . . . . . . . . . . 113 2.2 A mobile phone acting as a window into the virtual world. . . . . . . . . . . 116 2.3 Taxonomy for egocentric object interaction in handheld mixed reality. . . . 116 2.4 Taxonomy of immersive selection techniques classied by metaphor. . . . . 117 2.5 The Expand renement view, courtesy of [119]. . . . . . . . . . . . . . . . . 121 3.1 The two-step DrillSample technique. . . . . . . . . . . . . . . . . . . . . . . 125 3.2 DrillSample's two-step selection process. . . . . . . . . . . . . . . . . . . . . 127 3.3 State diagram for DrillSample selection. . . . . . . . . . . . . . . . . . . . . 129 3.4 Ray-Casting adapted to use it in a handheld mixed reality. . . . . . . . . . 129 3.5 Sphere approximation of clones' size to calculate the optimal ray length. . . 132 3.6 User study procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 3.7 The three test scenarios of the performance user study. . . . . . . . . . . . . 138 3.8 Mean completion time per task and on average. . . . . . . . . . . . . . . . . 139 3.9 Mean selection steps per task and on average. . . . . . . . . . . . . . . . . . 140 3.10 Users' average rating of Q3, Q4 and Q5. . . . . . . . . . . . . . . . . . . . . 141 3.11 Users' rating of Q6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 4.1 Touchless full 6DOF object manipulation using HOMER-S. . . . . . . . . . 147 4.2 Examples of translations using 3DTouch. . . . . . . . . . . . . . . . . . . . . 150 4.3 Examples of rotations using 3DTouch. . . . . . . . . . . . . . . . . . . . . . 150 4.4 6DOF translation and rotation using HOMER-S. . . . . . . . . . . . . . . . 152 4.5 Floating GUIs of both techniques upon selection. . . . . . . . . . . . . . . . 154 225 List of Figures 4.6 Supporting visualization depending on manipulation task and current acces- sible interaction axes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 4.7 The three test scenarios of the performance user study. . . . . . . . . . . . . 159 4.8 Mean completion time and mean number of interaction steps. . . . . . . . . 161 4.9 Mean completion time per task. . . . . . . . . . . . . . . . . . . . . . . . . . 161 4.10 Mean number of interaction steps per task. . . . . . . . . . . . . . . . . . . 162 4.11 Users' average rating of Q3 & Q4. . . . . 
. . . . . . . . . . . . . . . . . . . 162 4.12 Users' preferences given Q7. . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 IV Creating Mixed Reality Environments 1.1 Key components of mixed reality, with the contributions marked bold. . . . 171 2.1 Mixed Reality system architecture. . . . . . . . . . . . . . . . . . . . . . . . 173 3.1 ARTiFICe framework components and data ow. . . . . . . . . . . . . . . . 177 3.2 OpenTracker nodes with new ones marked in blue. . . . . . . . . . . . . . . 180 3.3 ARTiFICe's processing pipeline of depth data for full-body motion tracking. 182 3.4 Detailed framework components. . . . . . . . . . . . . . . . . . . . . . . . . 183 3.5 Tracking class hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 3.6 Interaction class hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 3.7 Distribution class hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 4.1 Two examples of desktop mixed reality setups. . . . . . . . . . . . . . . . . 190 4.2 Multi-user collaborative and distributed handheld mixed reality. . . . . . . . 191 4.3 A distributed multi-user non & semi-immersive mixed reality setup. . . . . 192 4.4 A distributed multi-user non & semi-immersive mixed reality setup. . . . . 193 V Conclusion 1.1 Investigated concepts, their relationship and the presented contribution. . . 199 226 List of Tables II Wide-Area Optical Tracking 2.1 Projective invariant features in the 2D domain. . . . . . . . . . . . . . . . . 25 5.1 Relative accuracy xRMS(P ) of three independent calibrations. . . . . . . . . 81 5.2 Deviations and error of dbar. . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3 Standard deviations σ̂ (C) at dierent tracking distances dtrack. . . . . . . . 90 5.4 Relative point accuracy and standard deviation σ̂C for dbase ≈ 9m. . . . . . 96 5.5 Empirical standard deviation σ̂C for dbase ≈ 3m. . . . . . . . . . . . . . . . 97 5.6 Comparison of relative point accuracy xRMS(P ) and standard deviation σ̂ (C) without (motor shut o) and under heavy vibrations (motor running). . . . 97 III User Interfaces for 3D Interaction 3.1 Pre-Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 3.2 Post-Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 3.3 Evaluation of selection techniques in handheld mixed reality. . . . . . . . . 143 4.1 Post-Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 4.2 Users grouped by prior experience . . . . . . . . . . . . . . . . . . . . . . . 158 IV Creating Mixed Reality Environments 3.1 Interaction devices supported by ARTiFICe. . . . . . . . . . . . . . . . . . . 181 227 Appendix A User Studies 229 Selection in Handheld Mixed Reality 231 A. USER STUDIES 232 233 A. USER STUDIES 234 235 Manipulation in Handheld Mixed Reality 237 A. USER STUDIES 238 239 A. USER STUDIES 240 241 A. USER STUDIES 242