Streaming und Exploration von sich Dynamisch Ändernden Oberflächenrekonstruktionen in Immersive Virtual Reality

DIPLOMARBEIT zur Erlangung des akademischen Grades Diplom-Ingenieur im Rahmen des Studiums Visual Computing

eingereicht von Manuel Kröter, Matrikelnummer 0820478, an der Fakultät für Informatik der Technischen Universität Wien

Betreuung: Priv.-Doz. Mag. Dr. Hannes Kaufmann
Mitwirkung: Dipl.-Ing. Dr. Annette Mossel

Wien, 7. April 2016
Manuel Kröter, Hannes Kaufmann

Streaming and Exploration of Dynamically Changing Dense Surface Reconstructions in Immersive Virtual Reality

DIPLOMA THESIS submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Visual Computing

by Manuel Kröter, Registration Number 0820478, to the Faculty of Informatics at the Vienna University of Technology

Advisor: Priv.-Doz. Mag. Dr. Hannes Kaufmann
Assistance: Dipl.-Ing. Dr. Annette Mossel

Vienna, 7th April, 2016
Manuel Kröter, Hannes Kaufmann

Erklärung zur Verfassung der Arbeit

Manuel Kröter, Wasserburgergasse 5/14, 1090 Wien

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

Wien, 7. April 2016
Manuel Kröter

Acknowledgements

I am thankful to my advisors Priv.-Doz. Hannes Kaufmann and Dr. Annette Mossel for giving me the opportunity to write this diploma thesis at the Interactive Media Systems group. Besides guiding me through the thesis and providing me with valuable feedback, I want to thank Annette for helping me in conducting the user study. I also want to thank all the volunteers of this study for their time and participation. My special thanks go to my parents Irene and Werner for always supporting me and making my studies possible. Furthermore, I want to thank my brother David for his help in all situations and especially the time in Vienna. I want to thank Stefan for the great moments in Copenhagen and Vienna and also for providing his laptop for the time of my thesis. Lastly, I am extremely thankful to Stephi for all her support and the last six wonderful years.

Kurzfassung

Günstige Tiefenbildkameras wie die Microsoft Kinect ermöglichen es, die Struktur der Umwelt zu erfassen. Mit Hilfe von 3D-Rekonstruktionsverfahren kann ein detailliertes 3D-Modell in Echtzeit aus den Kameradaten berechnet werden. Autonome Roboter können diese Techniken anwenden, um eine Karte der Szene zu erstellen, während sie diese erforschen. Dies ermöglicht es dem Roboter, sich selbst in unbekannten Umgebungen zu orten und zu navigieren. Das rekonstruierte Modell kann zudem auch für andere Parteien interessant sein, um sich ein Bild dieser Umgebung zu machen.
Besonders weit entfernte oder gefährliche Bereiche können durch Roboter gescannt werden, während entfernte Beobachter in der Lage sind, sicher einen Überblick über die Szene zu bekommen. Um diese Fernerkundung zu erlauben, während die Szene noch gescannt wird, müssen die rekonstruierten Daten inkrementell über ein drahtloses Netzwerk gesendet werden. Da derzeit keine Lösungen mit inkrementeller Netzwerkübertragung vorhanden sind, wird das existierende Rekonstruktionsframework InfiniTAM erweitert, um genau dies für große, dynamische Modelle zu unterstützen. Die Visualisierung und Exploration des Modells wird mit Hilfe der Unreal Engine 4 durchgeführt, einer State-of-the-Art 3D-Engine. Zu diesem Zweck wird eine Darstellung des Modells als Dreiecksnetz bevorzugt, während dichte Rekonstruktionsverfahren meist mit einer volumetrischen Darstellung arbeiten. In aktuellen Ansätzen wird das Dreiecksnetz in einem Nachbearbeitungsschritt extrahiert. Dies ist jedoch nicht anwendbar, wenn die Szene noch während des Scannens betrachtet und erforscht werden soll. Das verwendete Rekonstruktionsframework wird daher so angepasst, dass ein aktuelles Dreiecksnetz in Echtzeit erhalten wird. Das rekonstruierte Dreiecksnetz wird schließlich in einem Virtual Reality Setup mittels Head-Mounted Display und einem omnidirektionalen Laufband erforscht. Der Einsatz von Virtual Reality Hardware ermöglicht es, auf natürliche Art und Weise zu navigieren. Das entwickelte System wird in Bezug auf die Speicheranforderungen und Datenübertragungsraten evaluiert. Zudem wird der Erwerb des Raumverständnisses im Rahmen einer Nutzerstudie analysiert.

Abstract

Low-cost commodity depth cameras like the Microsoft Kinect make it possible to sense the structure of the environment. With the aid of dense surface reconstruction methods, a detailed 3D model can be computed in real-time from the acquired camera data. Autonomous robots can apply these techniques in order to build a map of the scene while they are exploring it. This allows the robot to locate itself and to navigate in unknown environments. Besides that, the reconstructed model can be of interest to other parties who want to explore these environments as well. In particular, distant or dangerous areas can be scanned by robots while remote observers are able to safely get an overview of the scene. In order to support remote exploration while the scene is still being scanned, the reconstructed information has to be streamed incrementally over a wireless network. Since currently no solution with this feature exists, the existing reconstruction framework InfiniTAM is extended to support the transmission of a large-scale, dynamically changing model. The visualization and exploration of the model is performed with the aid of Unreal Engine 4, a state-of-the-art 3D engine. For this purpose, a triangular mesh representation is favored, while dense reconstruction methods mostly operate on a volumetric representation. In current approaches, the mesh is extracted in a post-processing step, which is not applicable when the scene should be explored while being scanned and updated. The used reconstruction framework is therefore adapted to maintain an up-to-date mesh in real-time. The reconstructed mesh is finally explored in a virtual reality setup using a head-mounted display and an omnidirectional treadmill. The usage of virtual reality hardware enhances the ease of use and makes it possible to navigate in a natural way.
The developed system is evaluated in terms of memory requirements and data rates as well as in a user study that analyzes the effect of the incremental streaming and the virtual reality exploration on spatial knowledge acquisition.

Contents

Kurzfassung
Abstract
Contents
1 Introduction
  1.1 Incremental Network Transmission
  1.2 Live Exploration
  1.3 Contribution
2 Fundamentals of Real-Time Surface Reconstruction
  2.1 Depth Sensors
  2.2 Scene Representation
  2.3 Reconstruction
3 State-of-the-Art
  3.1 Dense Real-Time Surface Reconstruction
  3.2 Network Streaming
  3.3 Surface Extraction
  3.4 Visualization and Exploration
4 Methodology
  4.1 Model Representation
  4.2 Camera Pose Estimation
  4.3 Data Integration
  4.4 GPU - CPU Swapping
  4.5 Raycasting
  4.6 Network Transmission
  4.7 Dynamic Scene Update
  4.8 Visualization and Exploration
  4.9 Current Limitations
5 Implementation
  5.1 Implementation Overview
  5.2 Usage Instructions
  5.3 Camera Pose Estimation
  5.4 Network Transmission
  5.5 Dynamic Scene Update
  5.6 Visualization and Exploration
6 Experimental Results
  6.1 Test Data
  6.2 System Performance
  6.3 User Study
7 Conclusion
  7.1 Contribution
  7.2 Future Work
A User Study Pre-Questionnaire
B User Study Post-Questionnaire
List of Figures
List of Tables
Bibliography

CHAPTER 1 Introduction

The reconstruction of environmental geometry is widely researched in computer vision and computer graphics. The goal of this process is to capture the 3D structure of the scene using information from an RGB or depth camera and to generate a computer model from it [NIH+11].
Especially in the area of robotics, these methods are widely applied when the robot has to navigate autonomously in unknown environments. In order to avoid obstacles and to explore new areas, the robot simultaneously needs to build a model of the environment and estimate its position within the generated model. This principle is called Simultaneous Localization and Mapping (SLAM) [TL08]. Another application area of this method is architecture, where a 3D model of a building can be generated by walking through it with a handheld scanning device [FCSS09]. Other persons can then experience this building without actually being there.

The capturing of an environment and its exploration is often performed by two remotely connected parties, such as in a typical setting in robotics: a human observer virtually explores the scene while an autonomous robot is actually scanning it. Using a robot might be cheaper, safer or more convenient than going everywhere in person. For example, a quadcopter can scan large areas without effort [WMG+12] or firefighters can send a robot into dangerous buildings to safely get an overview. In all these cases, data has to be exchanged between the two parties in some way. The aim of this thesis is to create a 3D reconstruction pipeline which integrates such data transmission over a wireless network and allows for live exploration by a remote observer, as illustrated in Figure 1.1.

Figure 1.1: A room is scanned using an RGB-D camera and a 3D model is reconstructed from the camera data. The model of the room is transmitted incrementally over network to a remote client. On the client side, the volumetric model is triangulated and can be virtually explored while it is gradually expanding.

1.1 Incremental Network Transmission

A trivial solution is to transmit the final model after the reconstruction process has finished. As a consequence, one has to wait a long time before the environment can be analyzed, especially for large scenes. If the data is streamed incrementally while scanning, the scene can be explored from the start and findings are obtained sooner. This can be vital in scenarios like the firefighter example. Moreover, it is possible to influence the way the scene is scanned. For instance, one can give directions to the scanning robot if something interesting is discovered during exploration.

Current reconstruction systems do not incrementally stream the reconstructed model but only the sensor data. The 3D reconstruction process is completely [NSS14], or at least partially [WMG+12], performed on a remote server. This is advantageous since, generally, the server is more powerful than the mobile scanning computer (e.g. the robot). Moreover, the complete 3D model of the reconstruction is available on the server and thus the scene can be remotely explored, independently of the current position of the scanning camera. The major drawback is that this principle is prone to network failures or bottlenecks. Once the network connection is lost, the reconstruction cannot continue and might fail due to camera tracking errors. Moreover, the robot does not get instructions from the server anymore, e.g. for navigation purposes.
Autonomous decisions cannot be made because the robot does not store the 3D model of the environment itself. In the reconstruction pipeline developed within this thesis, not the camera data but the model itself is streamed. This way, the reconstruction is available to both the scanning computer and the remote server, and one is less dependent on a stable network connection. If the network connection is lost, the reconstruction still continues on the client side, and once the connection is established again, the model on the server is updated. The latency of the model update is not critical compared to the case when streaming camera images. The major challenge with the incremental transmission of the model is to keep the needed bandwidth as low as possible. Furthermore, the reconstruction process needs to be very efficient in order to minimize the required computing power on the mobile computer.

1.2 Live Exploration

As already mentioned, a reason for the need of incremental model transmission is the desire to explore the scene while still scanning. The observer should be able to examine the reconstruction as quickly as possible and to get to know the spatial layout of the environment.

1.2.1 Exploration in Virtual Reality

Exploring the scene by navigating through the model using standard input and output devices such as a keyboard and a standard computer monitor does not facilitate a natural user experience due to the limitations of both input and output devices. Studies have shown that spatial knowledge acquisition can be enhanced by exploring a virtual scene as naturally as possible [RL09]. Learning the spatial arrangement of a scene requires mental resources, and the less one has to concentrate on the human-computer interface, the more one can concentrate on the environment [BKJP04]. In addition, non-visual sensory information such as vestibular cues or muscle usage supports the building of a mental map, as one can more easily keep track of the orientation and position within the scene [RB04].

In order to improve the exploration in terms of ease of use and spatial knowledge acquisition, one can use immersive Virtual Reality (VR) technologies. As part of this thesis, the implemented reconstruction pipeline integrates a head-mounted display (HMD) and an omnidirectional treadmill (ODT) for scene exploration. An HMD allows the user to rotate and look around in the virtual scene like in reality, and an ODT enables the user to walk infinitely in the virtual world while actually staying at the same place in the real world. While it was previously shown that natural walking significantly contributes to spatial learning [RB04], there are no studies yet that analyze the effect of a low-friction ODT. In this thesis, an experimental user study evaluates whether walking with an ODT is to be favored over pure virtual walking with keyboard input when considering spatial knowledge acquisition. Apart from that, the study also examines whether the fact that the model is not available in its entirety from the beginning but incrementally expands during exploration increases or decreases the spatial knowledge.

1.2.2 Visualization

To be able to explore the reconstructed scene, the model needs to be visualized first. Generally, there are two options: raycasting or standard forward rendering [FVDF+94]. The process of raycasting is an image-based approach. It generates an image by shooting a ray into the scene for every pixel and gathering information at the intersection with the model.
The computational effort therefore depends on the resolution of the final image. Forward rendering is an object-based approach and constitutes the standard approach used by most existing 3D engines. The image is computed by projecting the model onto the image plane. The complexity of the model determines the runtime. While raycasting is fine for live feedback during scanning, forward rendering is favored when exploration of and interaction with the model are desired. In the latter case, high image resolutions and high frame rates are necessary, especially in a VR setup [Ocu16b].

Dynamic Mesh Update

In order to apply forward rendering, a triangular mesh needs to be extracted from the representation used for reconstruction. Most of the currently available reconstruction solutions perform this mesh computation in a post-processing step. In this thesis, a real-time extraction is implemented since it is necessary to support live exploration. Note that the underlying representation not only grows during reconstruction but can also change in already scanned areas. Thus, the mesh does not only have to be expanded but also updated in existing areas, which poses an additional challenge. Lastly, the final rendering step is adjusted to support this dynamically changing geometry. In standard rendering approaches, the geometry to be rendered is loaded before visualization and is then not changed anymore [FVDF+94].

1.3 Contribution

To sum up, the aim of the thesis is to create a 3D reconstruction pipeline which allows a 3D computer model to be generated from a real-world environment and streamed incrementally over a wireless network to another PC. The model can then be remotely explored in a VR setup, which supports the spatial knowledge acquisition. All steps happen in real-time while the environment is still scanned and the reconstructed model grows. For this purpose, the existing state-of-the-art reconstruction framework InfiniTAM, developed by Kähler et al. [KPR+15], represents the base implementation. InfiniTAM is extended with the following parts:

• Incremental transmission of the reconstructed model over the network
• On-the-fly mesh generation (and dynamic update) from the underlying model
• Communication with UE4 to pass the extracted meshes

In order to visualize and explore the gradually expanding reconstruction in an immersive VR setup, the game engine Unreal Engine 4 (UE4) is applied [Epi16]. In addition to the standard UE4 capabilities, it contains the following features:

• Communication with InfiniTAM
• Dynamic update of the scene representation
• Integration of the VR hardware (Oculus Rift HMD and the Virtualizer, an ODT from Cyberith [CH14])

The implemented system is evaluated in terms of frame rate, required memory and network bandwidth. Besides that, a user study is performed which examines the effectiveness of the VR exploration regarding spatial knowledge acquisition in unknown environments. A paper summarizing the work of this thesis has also been submitted to the International Symposium on Mixed and Augmented Reality (ISMAR) 2016. At the time of writing, the submission is under review.

The rest of the thesis is structured in the following way: Chapter 2 introduces the main concepts of 3D reconstruction methods. The state-of-the-art in real-time dense 3D reconstruction, network transmission, mesh computation, and visualization and exploration is reviewed in Chapter 3.
Afterwards, Chapter 4 covers the methodology, where the concepts of the developed reconstruction pipeline are explained along with the used methods. Implementation details and instructions on how to use the system are presented in Chapter 5. Chapter 6 describes both the system performance and an experimental user study targeting spatial knowledge acquisition during live exploration. Finally, the thesis ends with a conclusion in Chapter 7.

CHAPTER 2 Fundamentals of Real-Time Surface Reconstruction

Before reviewing related work and introducing the methodology, the basic concept of real-time surface reconstruction is explained. Reconstruction methods work either in real-time or offline. While offline approaches compute the model from a captured camera stream in a post-processing step, real-time approaches can reconstruct the scene while it is scanned [NIH+11]. This enables live feedback and makes it possible to see which regions of the scene still need to be scanned and whether the reconstruction process is working without failure. Furthermore, these methods can be used in robotics as mentioned in Chapter 1. This thesis focuses on real-time techniques.

2.1 Depth Sensors

To be able to build a 3D model of the environment, one needs to sense not only the color but also the depth of the scene, i.e. the distance to the nearest object. With that information, it is possible to reconstruct a 3D point in space. Standard RGB cameras only provide color images but no depth information. There exist approaches to estimate this information from the available color data, which can, however, lead to wrong values and require additional computational resources. RGB-D cameras spare this step since they are able to measure the depth directly.

An example of such an RGB-D camera is the Xtion Pro Live from Asus [Asu16], which works the same way as the first version of the Microsoft Kinect. It projects a known infrared dot pattern into the scene, which is then captured with a built-in infrared sensor. The depth of the scene is estimated by analyzing the distortion of this pattern. This principle is called structured light. Besides the scene depth, the camera measures the color with a separate RGB sensor. Figure 2.1 shows an image of the Asus Xtion. Another type of RGB-D sensor is the Time-of-Flight (TOF) camera, such as the second version of the Microsoft Kinect. These cameras emit infrared light pulses and measure the time until the light is reflected back. The larger the distance to an object, the longer the measured time is. The drawback of both types of RGB-D cameras is that they do not work in direct sunlight, since the infrared part of the sunlight interferes with the infrared dot pattern or light pulses. As a consequence, the cameras can only be used indoors in a reliable way. A more robust and precise depth estimation can be performed with laser scanners; however, they are also more costly. Within the thesis, the Xtion Pro Live camera is used since only indoor scenes are scanned.

Figure 2.1: RGB-D camera Xtion Pro Live from Asus [Asu16].

2.2 Scene Representation

Independently of the real-time property and the camera type, a further characteristic of a reconstruction system is the used scene representation. It defines how the reconstructed environment is stored in memory and strongly affects the reconstruction process. Generally, the model can be represented either in a sparse or dense way [NIH+11].
In contrast to sparse models, a dense representation allows fine details of the geometry to be maintained and is thus well suited for exploration and interaction with the captured scene. A Signed Distance Function (SDF) [CL96] is a common type of such a dense representation and is also applied in InfiniTAM, the reconstruction framework used in this thesis [KPR+15]. An SDF stores the distance to the closest surface for each point in space, where all values on one side of the surface are negative and all values on the other side of the surface are positive. The scene is therefore represented implicitly as the zero crossing of this function. A drawback of dense representations is the high amount of required memory and, especially for large scenes, an efficient data structure is necessary. An overview of dense data structures suited for large scales can be found in Chapter 3.

2.3 Reconstruction

The actual reconstruction process of real-time methods mainly consists of two steps: camera pose estimation (or camera tracking) and integration of the current camera information into the model [NIH+11] (mapping).

Pose Estimation

The goal of camera pose estimation is to track the position and orientation of the scanning camera by using its images. This process is also referred to as visual odometry [SSC11]. By comparing the information of two camera frames, it is possible to estimate how the camera moved and rotated in between. As a consequence, the complete camera trajectory can be reconstructed. Like the scene representation, the method for camera pose estimation can be classified as either dense or sparse.

Sparse camera tracking approaches use only parts of the available image information. They try to compute and find corresponding points in both images. Then a transformation is computed which best maps the points from frame one onto the points of frame two. This transformation constitutes the camera transformation between the two frames. Corresponding points can be found using RGB image features like SIFT or SURF [PPS13], and the transformation can be estimated using a procedure like RANSAC [FB81]. RANSAC tries to find the best transformation by eliminating wrong point correspondences.

Instead of only using a set of points, dense methods estimate the transformation by using the information of every pixel. In order to still achieve real-time performance, dense methods often utilize the GPU, as the computations for each pixel can be performed very well in parallel. A widely used method within dense camera tracking is Iterative Closest Points (ICP) [BM92]. It tries to find the camera transformation matrix by iteratively aligning two 3D point clouds. The point clouds can be generated from the depth images and the internal camera parameters. ICP is also used in this thesis and is therefore explained in more detail in Chapter 4.

Camera Drift

In order to achieve well-aligned reconstructions, it is fundamental to keep the error in the camera pose estimation step as small as possible. Since only the relative camera transformations between two frames are computed, the error sums up over time. This problem is known as camera drift [SSC11]. Especially for large scenes, it leads to unpleasant results: walls or floors are not straight anymore and reconstructed objects can overlap. Figure 2.2 shows a reconstructed scene with notable camera drift, where it can be seen that corresponding parts are strongly misaligned.

Figure 2.2: Distorted reconstruction due to camera drift.
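The drift arises because the global camera pose is the product of many estimated relative transforms, so every small per-frame error is carried into all later poses. The following minimal C++ sketch (using Eigen for the 4x4 matrices; purely illustrative, not code from InfiniTAM) makes this chaining explicit.

```cpp
#include <Eigen/Dense>
#include <vector>

// Chain per-frame relative transforms into a global camera pose. Any small error in a
// single T_rel is multiplied into every subsequent world pose, which is the root cause
// of the accumulating camera drift described above.
Eigen::Matrix4f accumulatePose(const std::vector<Eigen::Matrix4f>& relativeTransforms)
{
    Eigen::Matrix4f worldFromCurrent = Eigen::Matrix4f::Identity();
    for (const Eigen::Matrix4f& T_rel : relativeTransforms)
        worldFromCurrent = worldFromCurrent * T_rel;  // errors compound at every step
    return worldFromCurrent;
}
```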
To mitigate this problem, instead of a frame-to-frame tracking, a frame-to-model camera tracking can be applied [NIH+11]. This means that the camera transformation is not estimated directly between two successive camera frames but between the current frame and a frame which is synthesized from the current model. As a consequence, the resulting reconstruction is much better aligned, but the general problem of camera drift still remains. The impact of drift can be perceived especially when scanning regions that have already been scanned and reconstructed before. By detecting these loops, i.e. regions which have already been scanned, the accuracy of the reconstruction can be improved. Once a loop is detected, the model has to be transformed in such a way that the corresponding parts align. While there exist solutions that perform this so-called loop closure [AFDM08] for point clouds, at the time of writing this thesis, there does not exist any publicly available method to compute it in real-time for large-scale scenes stored as a dense SDF. Thus, the implementation in this thesis does not integrate loop closure.

Data Integration

Besides camera pose estimation, the second step of a real-time surface reconstruction method is the integration of the current camera information into the 3D model. By using the inverse of the estimated camera transformation matrices, all images of the camera stream can be mapped to a common world coordinate system. The 3D world position of the individual pixels can then be inferred with the depth information of the individual images, resulting in a point cloud. Finally, the individual 3D points of this cloud need to be integrated and combined with the existing model. Since this step depends entirely on the used scene representation, it is not described in more detail here.

CHAPTER 3 State-of-the-Art

At the time of writing, there is no prior art that combines 3D reconstruction, network streaming and a mesh extraction component into a single application. However, each of these areas is well studied in itself, and the most relevant approaches are presented in the next four subsections.

3.1 Dense Real-Time Surface Reconstruction

3D surface reconstruction methods, which incrementally fuse camera data into a single representation of the scene, have become very popular in recent years, especially since the release of the Microsoft Kinect RGB-D camera [Cor16a].

3.1.1 KinectFusion

One of the first algorithms able to reconstruct scenes in a dense way in real-time is KinectFusion [NIH+11]. It estimates the camera pose with ICP, where it uses a synthetic depth map from the current model to limit the camera drift. As a model representation, it uses a dense volumetric SDF. KinectFusion achieves real-time rates because both tracking and integration are performed on the GPU. The major limitation of KinectFusion is that it can only scan desktop-sized scenes and fails to reconstruct larger environments. This is due to the fact that the SDF is stored in a regular 3D voxel grid on the GPU. Therefore, both space which is occupied by objects and empty space are stored the same way, which leads to a large memory footprint. By decreasing the size of the individual voxels in order to store finer geometric details or by increasing the total scene size, the amount of required memory increases cubically. On a GPU with two GB of memory, a volume with 512³ voxels can be stored.
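To make the cubic growth concrete, consider a rough back-of-the-envelope calculation; the figure of 8 bytes per voxel (distance, weight and color) is an assumption for illustration, not a number taken from the thesis:

\[
512^3 \cdot 8\,\mathrm{B} \approx 1.1\,\mathrm{GB}, \qquad 1024^3 \cdot 8\,\mathrm{B} \approx 8.6\,\mathrm{GB},
\]

so halving the voxel size (or doubling the edge length of the covered volume) multiplies the memory footprint by a factor of eight.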
Such a 512³ volume corresponds to a cube with around 2.5 m side length when using a voxel resolution of 0.5 cm. There are now several methods which extend the KinectFusion algorithm to overcome the space limitation.

3.1.2 Extending the Scale

Moving Volume Approaches

Moving Volume approaches reconstruct the scene by dividing it into an active and an inactive volume, where only the active volume is stored on the GPU. This volume is translated according to the estimated camera motion. It always holds those parts of the scene which are in front of the camera. The scene is still represented as a regular grid, but the parts of the scene that fall out of the current active volume are streamed to the CPU to make space for new data on the GPU. The principle is illustrated in Figure 3.1. That way, the supported scene size is solely limited by the available CPU memory. However, the size of the active volume is still restricted. This has the effect that either fine details cannot be reconstructed or that only close-by structures can be scanned (even if the sensor supports a larger depth range).

Figure 3.1: Principle of a Moving Volume approach in 2D. Only a small part of the scene is stored on the GPU (black), the rest is maintained on the CPU (gray). When the camera moves to the right, the red blocks are streamed from GPU to CPU to make space for the new area (green).

Point Cloud Library's LargeKinfu [HF16], Moving Volume KinectFusion [RM12] and the most recent Kintinuous project by Whelan et al. [WKJ+14] implement this moving volume idea. In Kintinuous, the parts which are streamed out of the active volume to the CPU are converted to point clouds first. This further reduces the required memory on the CPU, but existing point information on the CPU is not considered when revisiting an already scanned area. If the information were kept as an SDF, it could be uploaded to the GPU again and fused with new data. Implementations of LargeKinfu and Moving Volume KinectFusion are available for public use, and at the time of writing this thesis, the authors of Kintinuous have also released a public version of their code.

Hierarchical Approaches

Besides the moving volume approaches, there are several methods which try to increase the supported scale by ignoring empty space. This can be achieved by using octrees or similar hierarchical data structures. In an SDF scene representation, all data that is further away from the closest surface than a certain threshold can be considered as empty space. Zeng et al. use a ten-level octree on the GPU and are able to reconstruct a space about ten times larger (8 m x 8 m x 2 m) than the original KinectFusion algorithm at the same resolution [ZZZL13]. Steinbrücker et al. further extend the scale and can scan a corridor consisting of nine rooms (45 m x 12 m x 3.4 m) while still fitting in their 2.5 GB GPU memory [SKCS13]. In contrast to Zeng et al., even the color information of the scene is integrated. They apply a multi-resolution strategy, where not only leaves but also octree branches can hold scene data. Measurements with a high distance to the sensor (including more noise) are stored with a lower resolution at higher levels of the octree, and close surfaces are represented at lower levels with a finer resolution. A voxel size as fine as 0.5 cm is used. A CPU version of this approach is available for public use; however, it does not contain a camera tracking component.
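As a purely illustrative sketch of what such a hierarchical TSDF layout can look like in C++ (not the actual data structure of Zeng et al. or Steinbrücker et al.), each octree node carries eight child pointers and only allocates voxel data near surfaces:

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <vector>

// Illustrative node of an octree-based TSDF volume. Inner nodes subdivide space via
// eight child pointers; nodes close to a surface allocate a small brick of TSDF
// samples, while empty space allocates nothing. Names and layout are hypothetical.
struct OctreeNode {
    std::array<std::unique_ptr<OctreeNode>, 8> children;  // pointer overhead per node
    std::vector<float> tsdfBrick;    // allocated only if the node is near a surface
    std::uint8_t level = 0;          // coarser levels can hold noisier, distant data
};
```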
Chen et al. avoid octrees and their pointer overhead and propose a regular spatial subdivision with a more shallow hierarchy [CBI13]. This results in a better performance and enables efficient GPU-CPU streaming of sub-parts of the scene to support scales only limited by the CPU memory.

Voxel Block Hashing

Even though Chen et al. improve the speed, the drawback of hierarchical data structures is still that the required traversal of the hierarchy generally introduces additional computational overhead, which leads to a reduced performance. Especially for mobile reconstruction systems as targeted in this thesis, the performance is critical. To avoid these performance issues, Nießner et al. introduce a technique called Voxel Block Hashing, which is both fast and space efficient [NZIS13]. The main idea is that they store data only near actual surfaces, but without resorting to a scene hierarchy. This is achieved by introducing a spatial hash function which maps a 3D voxel position from world space to a hash table entry. The hash table is maintained on the GPU and only data (SDF and color) for currently visible voxels is stored in GPU memory. All other voxels are streamed out to the CPU similar to the Moving Volume techniques. Voxel Block Hashing requires only about 10% of the memory in comparison to a regular 3D grid and is superior to other techniques regarding the frame rate. It is integrated in the InfiniTAM framework [KPR+15] and is explained in more detail in Section 4.1. Figure 3.2 shows a comparison of storing the scene in a regular grid (as in standard KinectFusion or Moving Volume approaches), a hierarchical grid or using Voxel Block Hashing.

Figure 3.2: Comparison of storing the scene in a regular grid (left), a hierarchical grid (middle) or using Voxel Block Hashing (right).

3.1.3 Improving the Camera Pose Estimation

Most of the mentioned surface reconstruction approaches integrate depth-based ICP for camera pose estimation. For environments with a lot of geometric features it works well; however, for some scenes, such as hallways, it is likely to fail. In these cases, the low amount of geometric information alone does not allow a unique pose to be computed. In order to get a good pose estimate, not only the depth but also the color information of the RGB-D sensor can be used. A common approach is to estimate sparse visual features and find matches between image pairs. For example, Henry et al. [HKH+12] use FAST features in combination with ICP. In order to better exploit all the available color information, Steinbrücker et al. propose a dense estimation where they minimize both the photometric error and the geometric error between two consecutive RGB-D image pairs [SSC11]. In the Kintinuous project, Whelan et al. integrate Steinbrücker's RGB-D tracking and combine it with frame-to-frame ICP [WJK+13].

Loop Closure

As mentioned in Chapter 2, another way to improve camera pose estimation is to integrate a loop closure procedure. Current state-of-the-art approaches mostly apply appearance-based methods using a pose graph to detect loops [WKJ+14, LM13]. In a pose graph, each node corresponds to a keyframe and stores a camera image with the corresponding pose estimate. A new node is inserted whenever the camera has moved more than a defined threshold or when some time has passed. The main idea is then to compare the current camera image with the previous keyframes using sparse visual features based on the color image.
Whenever a loop is found, a new edge is added to the pose graph, and the graph with its poses can be corrected by running a pose graph optimization algorithm. After adjusting the existing pose estimations, the current 3D model is deformed. Usually, a point-based representation is used in these approaches and in this case, the deformation can be done efficiently. A drawback of such appearance-based methods is that they do not work well in low-light environments or when the camera is moved too fast, due to motion blur. Apart from that, to be able to detect loops, previous regions have to be seen from a similar viewpoint, so that corresponding features can be found.

Recently, Fioraio et al. [FTF+15] introduced a method which does not rely on the color information and which works for large-scale volumetric models. They are able to perform online global optimization without explicitly performing loop closure. The scene is divided into a number of subvolumes, where each subvolume is assigned a pose, and these poses are optimized by performing geometric registration of the volumes. The subvolumes are fused together in order to get a global map. The authors state that their approach is able to produce results similar to state-of-the-art offline methods in only a fraction of the time. An implementation of this approach is not available to the public at the time of writing.

3.2 Network Streaming

The streaming of a 3D reconstruction depends very much on the representation of the data to be sent. In order to achieve a low bandwidth requirement, a proper compression method is necessary. This method can be either lossless, i.e. all the original information is preserved, or lossy, in which case information is sacrificed to achieve a higher compression rate. Since RGB-D sensors already introduce some noise, a loss of information due to compression can be acceptable if it is below the noise level.

3.2.1 Transmission of Images

Most of the related approaches send the current camera images or images of the current reconstruction as seen from the camera's viewpoint over the network. Sending raw images is not feasible, as a Kinect camera produces depth and RGB images at about 45 MB per second. In order to keep the data rate low, the images can be compressed using standard image compression methods or algorithms specifically designed for depth data. Coatsworth et al. suggest compressing RGB images using the lossy JPEG compression and depth images using the lossless DEFLATE algorithm [CTF14]. They showed in an experiment that this combination offers a good compromise between compression ratio and computing complexity. In their approach, each image is encoded separately. The redundancy usually observed in a camera stream is not exploited. Nenci et al. also consider this temporal relation between the depth images, using the H.264 video codec to achieve a higher compression rate [NSS14]. Since video codecs like H.264 usually only support 8 bits per pixel, they are not directly applicable to depth data with 16 bits per pixel. Moreover, they result in blurred edges, which leads to strong artifacts in the surface reconstruction. Nenci et al. try to solve this by dividing the depth image into multiple channels, where each channel contains a certain (normalized) depth interval. The channels are then compressed individually. Generally, this scheme is targeted at a lossy compression mode.
When choosing lossless compression, it even results in a higher bandwidth (0.31 bytes per pixel) than the approach by Coatsworth et al. (0.27 bytes per pixel). A further method for depth maps by Mehrotra et al. exploits the fact that the Kinect depth map accuracy decreases with distance. As a consequence, distant values can be compressed more strongly without introducing additional error. The authors claim that their near-lossless method achieves a compression rate between 7 and 16.

3.2.2 Transmission of the 3D Model

Besides compressing and streaming images, the second option is to transmit the reconstructed model, i.e. the point cloud, the mesh or the SDF, itself. In that case, an important requirement besides a low computational complexity and a low data rate is the support for incrementally acquired models. There are a number of methods which make use of octrees to compress the space.

Point Clouds

Kammerl et al. developed a real-time approach for point cloud streams [KBR+12]. They encode the structural differences between the octrees of two consecutive point clouds and therefore achieve both spatial and temporal compression. The compression is lossy and depends on the desired voxel resolution of the scene. With a resolution of 5 mm, a compression ratio of 1:27 is achieved (0.45 bytes per point) when compared to the uncompressed point cloud with 12 bytes per point. Kammerl's method is available as part of the Point Cloud Library. Recently, Golla and Klein introduced another point cloud compression algorithm for real-time streaming which is based on local 2D parameterizations [GK15]. They divide the scene into a regular 3D voxel grid. The point cloud within each voxel is compressed separately by parameterizing it as planar patches. The actual point positions are encoded with height maps and holes are represented by occupancy maps. For these two maps, standard image compression techniques are used, and the planar patches are compressed using the Lempel-Ziv-Markov chain algorithm. Similar to that, Morell-Gimenez et al. achieve a spatial compression by detecting planes [MOCGR14] in the point data and representing those in a more compact form. However, their method is not specifically targeted at the incremental transmission of a reconstructed model.

Meshes

Apart from streaming point clouds, another option is to compress and transmit the reconstruction data as meshes [MLDH15]. However, meshes are usually not the representation used during scene reconstruction. They have to be extracted first, which introduces additional computational cost at the mobile scanning computer. Moreover, two different representations of the same scene would have to be stored on the mobile platform, which increases the memory footprint. As a consequence, mesh transmission is not considered as an option in this thesis.

SDF

Streaming of volumetric reconstruction data stored as an SDF has not been done in related work, but one can apply compression methods targeted at arbitrary data. Note that the scene representation chosen for the 3D reconstruction not only determines the supported scene size but also strongly affects the network transmission. When the model is stored in a very memory-efficient way, the possible amount of data to be transmitted is already limited. In case a volumetric SDF model is used, it contains many zeroes or similar numbers. This data can be compressed effectively by using a lossless general-purpose algorithm like run-length encoding or more advanced methods such as DEFLATE or bzip2.
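As a concrete illustration of the run-length idea mentioned above (a hedged sketch, not the compression scheme actually chosen in the thesis), consecutive identical TSDF values can be collapsed into (length, value) pairs:

```cpp
#include <cstdint>
#include <vector>

// Minimal run-length encoding sketch for raw TSDF voxel data, which contains long runs
// of identical (truncated) values. Purely illustrative: the 16-bit quantization of the
// distance values is an assumption.
struct Run {
    std::uint16_t length;  // number of consecutive equal values
    std::int16_t  value;   // the repeated TSDF value
};

std::vector<Run> runLengthEncode(const std::vector<std::int16_t>& voxels)
{
    std::vector<Run> runs;
    for (std::int16_t v : voxels) {
        if (!runs.empty() && runs.back().value == v && runs.back().length < UINT16_MAX)
            ++runs.back().length;    // extend the current run
        else
            runs.push_back({1, v});  // start a new run of length one
    }
    return runs;
}
```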
DEFLATE and bzip2 achieve a higher compression but are also more computationally expensive [Sal04].

3.2.3 Network Protocols

Independently of the chosen compression scheme, the data needs to be sent over the network with an appropriate network protocol. Depending on the data to be streamed, the network protocol may need to satisfy different requirements. If, for example, images are transmitted and a real-time rate is required, the latency of the transmission is critical, whereas some packet losses might be acceptable. If the reconstructed model is streamed, the most important factor is the throughput. The Transmission Control Protocol (TCP) is connection-oriented, ensures ordered data delivery and is reliable, i.e. no data is lost [For02]. The User Datagram Protocol (UDP) on the other hand is connectionless, unreliable and does not guarantee ordered delivery, but comes with less overhead than TCP. The advantage of these two main protocols is that they are well known and widely supported. Besides those standard options, there exists the Stream Control Transmission Protocol (SCTP) [SM01], which is reliable, connection-oriented and allows both ordered and unordered data delivery. In contrast to TCP, it preserves message boundaries like UDP. Moreover, it supports multi-streaming, i.e. multiple streams between two endpoints. When comparing SCTP with UDP and TCP regarding their performance, SCTP achieves the highest throughput but with the highest delay [NM12]. Another state-of-the-art protocol is the UDP-based Data Transfer Protocol (UDT) [GG07], which is intended for transferring very large datasets over high-speed wide area networks. In contrast to standard UDP, it is reliable and connection-oriented.

3.3 Surface Extraction

To supply live feedback during scanning, most reconstruction frameworks apply raycasting to visualize the reconstructed surfaces. However, as already mentioned, a mesh representation is desired for exploration purposes and has to be extracted from the underlying model. Like the network streaming, the mesh extraction should work for dynamic data. Thus, it needs to be able to update in real-time upon model changes.

3.3.1 Mesh Extraction from Point Clouds

To compute a mesh from a dense point cloud, a popular choice is Poisson Surface Reconstruction [KBH06], which first computes a mathematical model before triangulating it. This method, however, does not perform well for reconstruction data since it is developed for continuous surfaces without holes. In order to extract a mesh from point clouds which are measured by a camera, the Greedy Projection Triangulation (GPT) can be used [MRB09]. It directly connects the points instead of computing a mathematical model first. The triangulation happens locally, where only the neighborhood of a point is taken into account. The method works in real-time for incremental noisy data sets. The major issue of this approach is that it keeps all the input points. As a result, a mesh with a huge amount of triangles is generated. Besides requiring a lot of memory, this hinders interactive exploration of the mesh when reconstructing large spaces. To tackle this issue, Ma et al. introduce an algorithm to detect planar regions and represent those with fewer points, i.e. larger triangles [MWB+13]. Planes are detected by using a region growing segmentation based on the point curvature. The segmented points of a plane are then converted to a triangular mesh using their new QuadTree-Based (QTB) algorithm.
Within this method, a quadtree is applied for point decimation. All boundary points of a planar region are kept and all interior points are replaced by a set of new ones. The number of new vertices should be as small as possible without causing skinny triangles, i.e. triangles with very small angles. All points not belonging to any planar region are triangulated using GPT. While this approach reduces over 80% of planar points, it is not designed for incrementally acquired point clouds. Whelan et al. improve the robustness and speed of the algorithm and, most importantly, extend it to allow incremental mesh growing [WMB+15]. Apart from that, they added the capability to maintain the color information of the reduced planar segments by automatically generating textures. This incremental meshing is integrated in the Kintinuous project; however, the publicly available version does not include it but uses basic GPT instead.

3.3.2 Mesh Extraction from SDF

To create a triangular mesh from a volumetric representation such as an SDF, the standard choice is to apply the Marching Cubes algorithm [LC87]. This algorithm steps through the volume and takes eight SDF values at a time, which would lie on the corners of an imaginary cube. Using these eight values, it is possible to find the triangles which represent the surface passing through that cube. The final mesh is then just the set of triangles of all cubes. This method is applied in the thesis and is explained in more detail in Section 4.7. A drawback of this method is that it produces a very high number of triangles if one wants to generate a mesh with a high level of detail. The tessellation factor, i.e. the size of the triangles, is determined by the resolution of the volumetric SDF, and it is the same all over the mesh, even for planar regions. Figure 3.3 shows the dense triangulation of a reconstructed couch.

Figure 3.3: Marching Cubes triangulation from a volumetric 3D reconstruction with a voxel resolution of 1 cm. The left image shows the triangulation and the right image shows the corresponding colored mesh.

There exist adapted algorithms based on the original Marching Cubes algorithm, like Marching Tetrahedra or Adaptive Marching Cubes, which try to solve this issue [NY06]. Steinbrücker et al. integrate Marching Cubes in their reconstruction pipeline to dynamically extract a mesh from the SDF model [SKCS13]. Since they store the model at different levels in an octree with different voxel resolutions, the corresponding mesh is also computed with different levels of detail. Areas that are scanned closely produce smaller triangles than far-away regions. Further improvements, such as point decimation in planar regions, are not integrated. The dynamic update of the mesh is achieved by recomputing the mesh of the corresponding parts of the octree.

3.4 Visualization and Exploration

3.4.1 Visualization using 3D Engines

When the triangular mesh representation is available, the standard rendering pipeline using rasterization can be applied. For that purpose, there exist a number of game engines which ease the development process. These engines already incorporate advanced techniques to produce realistic visual and physical effects, such as collision detection, while still providing a high performance. Among the current state-of-the-art engines are Unreal Engine 4 (UE4) [Epi16] and Unity 5 [Uni16]. Both have multi-platform support and are free for personal use.
UE4 is based on C++ and has Blueprints, a visual scripting system which allows fast prototyping without any coding. Unity is based on C# and its main advantage is the large asset store, a library which offers a lot of additional functionality. Since in this thesis the goal is to explore a reconstruction while still scanning, the main requirement is the ability to render a large, dynamically changing mesh. Both in UE4 and Unity this can be achieved using procedural meshes (a minimal sketch is given at the end of this chapter). UE4 has more advanced lighting capabilities; however, the lighting is less important since the reconstruction already contains the measured color. Besides that, another requirement is the support for VR devices, such as an HMD or an ODT, in order to allow the desired immersive exploration. As both support an HMD, but neither supports any ODT, they can be used equally well. The treadmill has to be integrated manually in both scenarios. In this thesis, Unreal Engine 4 is used, mainly because it builds on C++, which makes the integration with the rest of the reconstruction pipeline easier.

3.4.2 Exploration with VR Hardware

Head-Mounted Displays

HMDs (for VR) are displays worn on the head like glasses that provide a much higher field of view compared to a normal desktop screen. They enable a stereoscopic view by showing separate images for both eyes. Moreover, these devices track the position and orientation of the user's head, which allows the visualization to be adjusted according to the current head movement. This way, the user is able to explore the virtual surroundings in a natural way just by turning the head. State-of-the-art devices include the Oculus Rift [Ocu16a], the HTC Vive [Hig16] and PlayStation VR [Son16]. Besides a high display resolution and field of view, a high refresh rate and low latency are critical to fully immerse the user in the virtual world and to avoid symptoms like headache or nausea [MS92]. This so-called cybersickness can also be introduced by the VR content when different sensory cues are in conflict, e.g. when experiencing a virtual rollercoaster ride while actually sitting at a desk. To enhance the VR experience and to limit motion sickness, the user can move in the virtual environment by actually walking in the real world.

Omnidirectional Treadmills

While HMDs allow the user to look around freely, walking more than a few steps is usually not possible because of the limited available space in the real world. An ODT can be used to tackle this problem. Such a device enables a person to walk in virtual space while actually staying at the same position. In contrast to a typical treadmill, the person can move in any direction. Generally, an ODT can be either active or passive. Active systems track the user's movement and try to keep the user at a fixed position by moving the walking surface in the opposite direction. With such an approach, the user performs a natural walking movement like in the real world. In passive systems, the user stands on a stationary platform and is fixed to the ODT. The user moves by sliding the feet over a low-friction surface. These systems are lighter and more affordable than active systems, but enable a less natural experience. The Infinadeck [Inf16] is an example of an active system. The Cyberith Virtualizer [CH14] and the Virtuix Omni [Vir16] are two VR treadmills which are both passive systems. In this thesis, the Oculus Rift as well as the Cyberith Virtualizer are integrated in the immersive exploration setup.
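As referenced above, the following sketch illustrates the procedural-mesh mechanism for rendering a dynamically growing reconstruction in UE4. UProceduralMeshComponent and CreateMeshSection are part of UE4's ProceduralMeshComponent plugin, but the exact signature has varied between engine versions, so this is an assumption-laden illustration rather than the implementation of the thesis; the helper name AddMeshChunk is hypothetical.

```cpp
#include "ProceduralMeshComponent.h"

// Hand a newly extracted mesh chunk to UE4 as its own mesh section, so that later
// updates only need to touch the sections whose underlying voxel blocks changed.
// Parameter choices (empty UVs and tangents) are assumptions for this sketch.
void AddMeshChunk(UProceduralMeshComponent* ProcMesh,
                  int32 SectionIndex,
                  const TArray<FVector>& Vertices,
                  const TArray<int32>& Triangles,
                  const TArray<FVector>& Normals,
                  const TArray<FColor>& VertexColors)
{
    ProcMesh->CreateMeshSection(SectionIndex, Vertices, Triangles, Normals,
                                TArray<FVector2D>(),        // no UVs for vertex-colored geometry
                                VertexColors,
                                TArray<FProcMeshTangent>(), // no explicit tangents
                                /*bCreateCollision=*/true);
}
```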
CHAPTER 4 Methodology

In this chapter, the concepts of the implemented 3D reconstruction system are introduced and the applied methods are explained. Implementation details and instructions on how to use the system can be found in the following Chapter 5. The system is based on the InfiniTAM framework [KPR+15] from Oxford University, which uses a volumetric SDF as a scene representation. InfiniTAM integrates Voxel Block Hashing [NZIS13] to store the model efficiently in terms of both memory usage and computational performance. UE4 [Epi16] is used for visualization and exploration purposes.

The thesis implementation allows the geometry of large environments to be scanned and reconstructed using an off-the-shelf camera like the Microsoft Kinect, which delivers both color and depth information. The acquired reconstruction data is incrementally streamed over network, and the gradually expanding reconstruction can be explored in an immersive VR setup. Since network transmission is involved, the system is divided into a server part and a client part, as illustrated in Figure 4.1. Colored parts indicate novel features. Yellow modules modify or extend existing functionality, whereas green modules are newly added. The server performs the actual scene reconstruction, which consists of the camera pose estimation, the integration of the camera images into the model, the swapping of data between the GPU and the CPU memory and the raycasting procedure. This raycasting step creates a synthetic depth map from the volumetric model, used both for camera tracking and live feedback. The network transmission of the volumetric model is performed in a separate procedure after the GPU - CPU swapping stage. The client part of the framework allows the reconstruction to be explored. It listens for incoming voxel data and assembles it again, so that the client has an exact copy of the server-side model. Since a mesh representation is used for exploration, it is extracted from the volumetric representation and dynamically updated after integrating new voxel data. The computed mesh is passed to the visualization module without interrupting it. As already mentioned, this visualization and exploration module builds upon UE4 and runs in a separate process.

Figure 4.1: Overview of the reconstruction pipeline. The server performs the scene reconstruction and sends the model incrementally to the client. The client maintains an up-to-date mesh representation and enables live exploration of the reconstructed scene. Yellow color indicates that the corresponding parts are modified functionality.

Since main parts of the reconstruction system rely on properties of the Voxel Block Hashing data structure, it is introduced first. Afterwards, the individual parts of the implemented pipeline are explained.

4.1 Model Representation

4.1.1 Truncated Signed Distance Function

During the reconstruction, the model is represented as a three-dimensional truncated SDF (TSDF) [CL96]. An SDF stores, for each point in space, the distance to the closest object surface, where the two sides of a surface are distinguished by the sign of the distance. The zero crossing of the SDF then determines the surface location. In contrast to a standard SDF, a TSDF only considers values within a certain range from the surface. All values further away are set to the maximum distance. The considered range of distances is called the truncation band. Figure 4.2 shows the TSDF for a simple rectangular object in 2D, where the width of the truncation band is set to three.
While the Euclidean distance measure is used in the implementation, the Manhattan distance is applied in this example in order to show nicer numbers.

Figure 4.2: Example of a TSDF (with Manhattan distance) for a simple object. The surface is defined as the zero crossing and is illustrated in red.

4.1.2 Voxel Block Hashing

The basic data structure to store such a function is a regular voxel grid, where each voxel stores a distance value. The position of the voxel is encoded implicitly by its index within the grid. Since all values outside the truncation band are the same, they do not contain any valuable information and thus this representation is not space efficient. Voxel Block Hashing [NZIS13] overcomes this limitation and is applied in the developed system. In this approach, the world is divided into so-called voxel blocks, where each of these blocks holds n³ voxels. The position of a voxel within the block is determined by its index. The main difference to the basic representation is that information is not stored densely for every point in space, but only for regions which are close to a surface. In order to store a voxel block, at least one of its voxels has to be inside the truncation band of the TSDF. Note that in this case, there is no implicit connection between the voxel block position and its index, because the data is stored in an unstructured way. Instead, the blocks are addressed using a spatial hash function, which maps points in space to a hash table. Given a voxel block world position, one can find the corresponding hash table entry; but given the index of an entry, the world position cannot be retrieved. As a consequence, besides the information whether the corresponding block is actually allocated, each hash table entry also stores the block position. The hash function to compute an index h from a 3D voxel block position with integer world coordinates x, y and z can be seen in Equation 4.1,

h = ((x × p₁) ⊕ (y × p₂) ⊕ (z × p₃)) mod n    (4.1)

where n is the number of hash table entries and p₁, p₂ and p₃ are large prime numbers. In InfiniTAM, these numbers are set to 73856093, 19349669 and 83492791. The ⊕ operator denotes a bitwise XOR and mod stands for the modulo operator.

The hash table usually contains a lot of empty entries to minimize hash collisions. In order to avoid allocating a lot of empty memory, the actual voxel data of a block, i.e. TSDF values and colors, is not saved in the hash table directly, but in an additional voxel block array. Each hash table entry maintains a reference to this array. Figure 4.3 shows the principle of the data structure in two dimensions. For illustration purposes, one voxel block only holds 4² voxels; in the implementation, each block stores 8³ voxels.

Figure 4.3: The principle of the Voxel Block Hashing data structure. A hash function maps world positions to a hash table. Each hash table entry maintains a block world position and a reference to the voxel block array. The voxel block array stores the actual TSDF and color data.

Handling Hash Collisions Note that the hash function is not injective, which means that several world positions can be mapped to the same index.
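To make the addressing concrete, the following minimal C++ sketch evaluates the hash of Equation 4.1 for a voxel block position. The prime constants are the ones stated above; the table size and the sample positions are arbitrary values chosen for illustration and are not InfiniTAM defaults.

```cpp
#include <cstdint>
#include <cstdio>

// Hash index of a voxel block position (Equation 4.1).
std::uint32_t hashIndex(int x, int y, int z, std::uint32_t tableSize)
{
    const std::uint32_t p1 = 73856093u, p2 = 19349669u, p3 = 83492791u;
    return ((static_cast<std::uint32_t>(x) * p1) ^
            (static_cast<std::uint32_t>(y) * p2) ^
            (static_cast<std::uint32_t>(z) * p3)) % tableSize;
}

int main()
{
    const std::uint32_t tableSize = 1u << 20;  // assumed example table size
    std::printf("%u\n", (unsigned)hashIndex( 12,  3, -7, tableSize));
    std::printf("%u\n", (unsigned)hashIndex(-25, 14,  9, tableSize)); // may collide with the above
    return 0;
}
```

Because the mapping is not injective, two different block positions can produce the same index, which is exactly the collision case handled next.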
Such hash collisions are handled by extending the hash table with an additional array, which holds all collided entries. This array is called the unordered part and the standard hash table region is called the ordered part. Figure 4.4 illustrates this concept. Each hash table entry (either in the ordered or unordered part) stores a possible reference to a further entry in the unordered part. Therefore, all voxel blocks with the same hash index form a linked list. Voxel blocks stored in the ordered part can be retrieved in constant time, whereas a lookup in the unordered part involves a little overhead, since the linked list has to be traversed. However, by creating a large hash table, which ensures a low load factor, the number of collisions can be reduced and the performance remains high.

Figure 4.4: Division of the hash table into an ordered and an unordered part to resolve hash collisions. Three voxel blocks map to the same hash index in the ordered part. The green block, which is allocated first, is stored at that index. References are maintained to entries in the unordered part, where the other two blocks are stored.

Generally, Voxel Block Hashing allows for a very efficient implementation, even faster than the original KinectFusion. According to Kähler et al., InfiniTAM also runs on a tablet with more than 30 fps.

4.2 Camera Pose Estimation

The first step in the reconstruction pipeline is the camera pose estimation. InfiniTAM integrates the ICP approach used in the original KinectFusion implementation [NIH+11], which is solely based on the depth image. Apart from that, InfiniTAM also offers an option to compute the pose using the color image in a dense frame-to-frame manner, but it cannot be combined with the ICP tracker.

4.2.1 Iterative Closest Point

The idea of ICP is to find the transformation which best aligns two point clouds [BM92]. Basic ICP selects a number of points from the first cloud and finds, for each of these points, the closest point in the other cloud. As a result, a set of point pairs is generated, where pairs whose distance is too large are rejected. This so-called data association step is computationally expensive. In order to find the nearest neighbor faster, a kd-tree data structure can be used. After the data association, the translation and rotation which maps the first cloud onto the other one is computed. This is achieved by constructing and minimizing an error function such as the sum of squared distances between the associated points sᵢ and pᵢ:

Error = Σᵢ ‖R sᵢ + t − pᵢ‖²    (4.2)

R represents the rotation matrix and t the translation vector to be found. The function has a closed-form solution. After aligning the point clouds with the found transformation, the whole process of association and alignment is started again and repeated iteratively until the error converges. Since ICP is a local, iterative optimization method, it relies on a good initial estimate and can get stuck in a local minimum. Generally, basic ICP can be applied to depth maps by converting them to 3D point clouds first. However, since basic ICP is time-consuming, a more efficient variant is used in the field of real-time 3D reconstruction. It can be computed in parallel on the GPU and allows all the input data to be used instead of just a selection of points.
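As a small illustration of the point-to-point formulation, the sketch below evaluates the error of Equation 4.2 for a set of already associated point pairs, rejecting pairs whose distance exceeds a threshold. The structures and the row-major rotation format are assumptions made for this example, not code from the thesis or from InfiniTAM.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Apply a rotation R (row-major 3x3) and translation t to a point s.
static Vec3 transform(const float R[9], const Vec3& t, const Vec3& s)
{
    return { R[0]*s.x + R[1]*s.y + R[2]*s.z + t.x,
             R[3]*s.x + R[4]*s.y + R[5]*s.z + t.y,
             R[6]*s.x + R[7]*s.y + R[8]*s.z + t.z };
}

// Point-to-point ICP error (Equation 4.2) over associated pairs (src[i], dst[i]).
// Pairs farther apart than maxDist are rejected, as in basic ICP.
float pointToPointError(const float R[9], const Vec3& t,
                        const std::vector<Vec3>& src, const std::vector<Vec3>& dst,
                        float maxDist)
{
    float error = 0.0f;
    for (std::size_t i = 0; i < src.size() && i < dst.size(); ++i) {
        Vec3 q = transform(R, t, src[i]);
        float dx = q.x - dst[i].x, dy = q.y - dst[i].y, dz = q.z - dst[i].z;
        float d2 = dx*dx + dy*dy + dz*dz;
        if (d2 <= maxDist * maxDist)   // reject pairs that are too far apart
            error += d2;
    }
    return error;
}
```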
Projective Data Association One important feature is the usage of projective data association, which makes it possible to find corresponding points very fast without the need for a special data structure. The idea of this data association is to project the points from one depth image into the other one and to associate those points that fall on the same pixel. A point pair is rejected if the depth difference of the pixels is too big or if the surface normals are not similar enough. Since in live camera tracking the pose of the current frame is unknown, this projection cannot be performed directly. However, when scanning the scene at 30 frames per second, the camera motion can be assumed to be very small between consecutive frames. For this reason, the current camera pose estimate is initialized with the pose estimate of the previous frame in the first ICP iteration.

Point-to-Plane Error Metric Another difference to basic ICP is the use of a point-to-plane error metric. Instead of considering point-to-point distances, the goal is to minimize the distances between the points of the first frame and the tangent planes at the corresponding points of the second frame. According to Rusinkiewicz, this is slower per iteration but usually results in a better convergence rate [RL01]. Equation 4.3 shows the error function for this case.

Error = Σᵢ ((R sᵢ + t − pᵢ) · nᵢ)²    (4.3)

The difference to Equation 4.2 is the added scalar product with the normal vector nᵢ (at point pᵢ). This error function can be minimized using a general non-linear least squares algorithm like Levenberg-Marquardt or Gauss-Newton [Bjö96]; the latter is applied within InfiniTAM. Figure 4.5 illustrates the difference between the point-to-point and the point-to-plane distance.

Figure 4.5: Difference between point-to-point and point-to-plane distance. The blue curve represents the reconstructed model and the red curve represents the current camera information, which needs to be aligned to the model. The model point p and the source point s are associated with each other. The point-to-point distance is given as the direct distance between the two points. The point-to-plane distance is the distance between s and its orthogonal projection on the tangent plane of p.

The ICP procedure can be performed multiple times on different resolutions of the input images to further improve the convergence behavior. At first, the pose is computed on a coarse resolution; this pose then serves as the initial estimate for the finer levels, where it is iteratively refined. In the InfiniTAM framework, a five-level image pyramid is used, where in the first three coarse levels only the three degree-of-freedom (DOF) camera rotation is estimated. In the last two finer levels, the full six-DOF pose is computed.

4.2.2 Limiting the Drift

Frame-to-Model Tracking In order to limit the drift of the pose estimates, a frame-to-model approach is applied instead of frame-to-frame tracking [NIH+11]. For that purpose, a depth image is synthesized from the current volumetric model using raycasting. To compute the relative camera transformation between the current and the previous frame, the synthetic depth image is used in combination with the current camera depth image, instead of two consecutive camera images. Procedures to detect and close loops are not integrated in the reconstruction pipeline. As a result, even the frame-to-model tracking leads to notable drift in larger scenes.
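The following sketch combines the two ideas for a single point: a point of the current frame, already transformed by the current pose estimate, is projected into the maps raycast from the model to find its association, and the signed point-to-plane distance of Equation 4.3 is returned. Intrinsics, image layout and the rejection threshold are simplified assumptions; a normal-similarity test, which is also applied in practice, is omitted for brevity.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
struct Intrinsics { float fx, fy, cx, cy; };  // pinhole parameters of the depth sensor

static float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Projective data association plus point-to-plane residual for one point p.
// modelPoints/modelNormals are per-pixel maps raycast from the model.
// Returns false if the pair is rejected.
bool pointToPlaneResidual(const Vec3& p, const Intrinsics& K, int width, int height,
                          const std::vector<Vec3>& modelPoints,
                          const std::vector<Vec3>& modelNormals,
                          float maxDist, float& residual)
{
    if (p.z <= 0.0f) return false;
    int u = static_cast<int>(std::lround(K.fx * p.x / p.z + K.cx));
    int v = static_cast<int>(std::lround(K.fy * p.y / p.z + K.cy));
    if (u < 0 || u >= width || v < 0 || v >= height) return false;

    const Vec3& q = modelPoints[v * width + u];   // associated model point
    const Vec3& n = modelNormals[v * width + u];  // model surface normal at q

    Vec3 d = { p.x - q.x, p.y - q.y, p.z - q.z };
    if (dot(d, d) > maxDist * maxDist) return false;  // distance-based rejection

    residual = dot(d, n);  // signed point-to-plane distance, cf. (R s + t - p) · n
    return true;
}
```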
Integration of Inertial Measurement Unit Kähler et al. propose to integrate an Inertial Measurement Unit (IMU), which senses the orientation with a gyroscope, to improve the pose estimates [KPR+15]. They do not apply any form of sensor fusion, but simply replace the ICP rotation estimate with the relative IMU rotation. Since only the 3-DOF camera translation has to be estimated with ICP, Kähler et al. use a reduced image pyramid with two levels instead of five. Besides limiting the drift, the IMU integration reduces the required computational effort, which makes it possible to achieve real-time frame rates on tablets. The publicly available InfiniTAM version only allows loading prerecorded IMU data. The thesis implementation additionally enables the use of live data from an Android device by streaming its orientation via UDP to InfiniTAM.

4.3 Data Integration

After the pose of the camera is computed for the current frame, the depth and color information can be integrated into the volumetric model. Each voxel of the model stores a running average of the color and the TSDF information, where the information of the last n measurements is considered. InfiniTAM uses a value of 100 for n by default.

Find Visible Voxel Blocks The first step of the data integration is to determine which voxel blocks are actually visible and thus need to be updated. Since the traversal of all voxel blocks is too expensive, a list of visible voxel blocks is maintained in an incremental way. In each frame, it is examined whether the blocks visible in the previous frame are still visible. This is achieved by projecting all eight corners of a block to the current image plane (using the estimated camera pose and the intrinsic camera parameters). If any corner falls inside the image boundary, the block is visible. Blocks which are not visible anymore are removed from the list. Previously invisible blocks are found the inverse way: the pixels of the current depth image are projected into world space, and the corresponding blocks are found by evaluating the spatial hash function and performing a lookup in the hash table. If no voxel block is stored yet for a given position, a new one is allocated. Finally, the block is added to the list of visible blocks.

Update TSDF and Color The actual integration is performed in a similar way as in the original KinectFusion system. All visible blocks are transformed to the current camera coordinate system using the inverse of the estimated camera pose. The individual voxels of those blocks are then further projected into the current depth image using the known intrinsic depth sensor parameters. The depth of the projected voxels and the values of the depth image at the corresponding positions are compared. Whenever the difference (which represents an SDF value) is within the truncation band, or whenever the stored voxel is in front of the measured surface, the voxel has to be updated. The depth difference is truncated at the maximum TSDF value and integrated into the running average. To update the color information, the voxels are projected into the color image to retrieve the corresponding RGB data. Note that the used depth difference is not a true TSDF value, but a projected one, since the distance is only measured along the view direction; there could be a smaller distance in another direction. While this does not affect the quality of surface extraction (using raycasting), it can lead to problems in 3D gradient computations as performed for normal estimation [KPR+15].
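To illustrate the running-average update for a single voxel, a minimal sketch following the scheme described above is given below. The voxel structure, field names and normalization are assumptions for this example and do not correspond to the InfiniTAM voxel types.

```cpp
#include <algorithm>

// Illustrative voxel with a running-average TSDF (not an InfiniTAM type).
struct Voxel {
    float sdf    = 1.0f;  // normalized TSDF value in [-1, 1]
    int   weight = 0;     // number of integrated measurements (capped)
};

// Integrate one depth measurement into a voxel. eta is the signed difference between
// the measured surface depth and the voxel depth, mu the truncation band width,
// maxWeight the length of the running average (100 by default).
void integrateMeasurement(Voxel& v, float eta, float mu, int maxWeight = 100)
{
    if (eta < -mu) return;                       // voxel lies behind the surface: skip
    float tsdf = std::min(1.0f, eta / mu);       // truncate at the maximum TSDF value

    float w  = static_cast<float>(v.weight);
    v.sdf    = (v.sdf * w + tsdf) / (w + 1.0f);  // running average
    v.weight = std::min(v.weight + 1, maxWeight);
}
```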
4.4 GPU - CPU Swapping Generally, the camera pose estimation and data integration is performed on the GPU to ensure real-time performance. The direct use of data stored in the main memory from the GPU is not possible. In order to access it, it has to be copied to GPU memory. Since this is too expensive to perform for every data access, the model is stored directly on the GPU. The major limitation in this case is the low available amount of GPU memory, compared to the main memory. Although Voxel Block Hashing ensures a low memory footprint compared to other scene representations, truly large scenes, such as entire building complexes, are still not possible. This is especially true when requiring a fine voxel resolution. 4.4.1 Existing InfiniTAM Functionality For this reason, the idea of Moving Volume techniques is integrated in InfiniTAM: Only those parts of the model which are currently needed for processing, are stored inside the GPU memory, all other parts are moved to the main memory. When revisiting previously scanned areas, the corresponding voxel blocks are uploaded to the GPU again, so that this existing information can be combined with new data. The GPU - CPU swapping can be achieved very efficiently, when an unstructured scene representation, such as Voxel Block Hashing, is used. Each voxel block can be accessed and processed individually, without the need of maintaining any hierarchy information. Voxel blocks required for processing, are exactly those, which are currently within the camera’s view frustum. Voxel blocks, that turned from visible to invisible between the previous and the current frame, are transferred to the CPU memory. Previously hidden blocks, which are now inside the view frustum are copied to the GPU memory. 28 4.4. GPU - CPU Swapping As explained in Section 4.1, the actual voxel data is stored in the voxel block array. This array lies inside the GPU memory and when using GPU - CPU swapping, it only holds the visible part of the model. The CPU memory, which stores the complete model, is called the global memory. The GPU - CPU swapping takes place after data integration and thus, the required voxel block visibility information is already available. However because of this order, an enlarged camera view frustum has to be used for the computation of the visibility information. This way it is ensured, that for the processing of the next frame, all the required voxel blocks are inside the GPU memory. Figure 4.6 shows an example, where one can see which blocks are swapped, when the camera rotates left. The dotted line represents the enlarged frustum. Green voxel blocks (currently on CPU) are swapped in to GPU, whereas red blocks (currently on GPU) are swapped out to CPU. The white (on CPU) and gray (on GPU) blocks remain unchanged. Camera Camera (rotated left) Figure 4.6: Voxel block visibility check for GPU - CPU swapping. All blocks within the enlarged view frustum (dotted line) have to be inside the GPU memory. The red blocks are swapped out to CPU and the green blocks are swapped in to GPU. 4.4.2 Modifications to InfiniTAM Decreasing Global Memory Footprint In the original InfiniTAM framework, the global memory contains exactly the same amount of elements as the hash table. That means, that there is an implicit one to one mapping from a hash table entry on GPU to a voxel block entry in the global memory, which is advantageous for the swapping process. 
The drawback is that a lot of CPU memory is required for the global memory, since the hash table is usually large and contains many empty entries to reduce collisions. As a consequence, the potential advantage of GPU - CPU swapping is mitigated and large scenes still do not even fit into the CPU memory. In this thesis, the global memory is adapted to tackle this issue. An additional auxiliary array of indices is stored in the CPU memory with the same size as the hash table. One element of this array then forwards to an entry in the global memory. This strategy slightly reduces the performance because of the additionally required lookups in the auxiliary array. However, the size of the global memory is greatly reduced since it does not store any empty voxel blocks. Note that this strategy is similar to the mapping between the hash table and the voxel block array.

Swapping Out Voxel Block Positions In order to allow the transmission of voxel blocks over the network to a remote party, they need to be copied to the CPU first; sending them directly from the GPU is not possible. InfiniTAM's swapping procedure is therefore also beneficial for the network transmission feature, since it already performs the mandatory voxel block copying. To be able to integrate the voxel blocks on the remote side again, not only the voxel data itself but also the voxel block world position is required. In the original framework, however, only the voxel data is swapped out to the CPU. As a consequence, the swapping is extended in this thesis to include the voxel block world positions as well. The positions are only copied one-way to the CPU, since the other way is not necessary. More details on the network transmission module can be found in Section 4.6.

Figure 4.7 illustrates both swapping in (right image) and swapping out (left image) of voxel blocks. The modifications are highlighted in yellow. When swapping out a block, also the hash table index is transferred to the CPU. Given this index and the auxiliary array, the block within the global memory can be found and updated. After swapping the data to global memory, the data in the GPU voxel block array is deleted. The swapping in from CPU to GPU happens in two stages. First, the hash table indices of the required voxel blocks are transferred to the CPU. Given these indices, the CPU is then able to locate the corresponding blocks in the global memory and copy the data to the GPU. Finally, the GPU stores these blocks in newly allocated entries in the voxel block array.

Figure 4.7: GPU - CPU swapping procedure. The left figure illustrates the swapping out to CPU and the right figure shows the swapping in to GPU. Extensions to InfiniTAM are highlighted in yellow.

4.5 Raycasting

Raycasting is the basic method to create an image directly from a volumetric TSDF model [FVDF+94]. It creates the image by shooting rays into the scene, which start at the virtual camera position and run through the pixels of the image plane. For each pixel, a single ray is cast and the first intersection point of the ray with the model is computed. The color of the intersection point determines the color of the pixel. If a depth image is desired, the pixel value is set to the distance between the camera and the intersection point.
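As a small illustration of the per-pixel ray setup, the sketch below computes the direction of the ray through a given pixel under a pinhole camera model; the intrinsics structure and the coordinate conventions are generic assumptions, not code from the thesis.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
struct Intrinsics { float fx, fy, cx, cy; };  // pinhole parameters of the virtual camera

// Direction of the ray through pixel (u, v), expressed in camera coordinates.
// The known camera pose then rotates this direction into world space, and the ray
// starts at the virtual camera position.
Vec3 pixelRayDirection(int u, int v, const Intrinsics& K)
{
    Vec3 d = { (static_cast<float>(u) - K.cx) / K.fx,
               (static_cast<float>(v) - K.cy) / K.fy,
               1.0f };
    float len = std::sqrt(d.x*d.x + d.y*d.y + d.z*d.z);
    return { d.x / len, d.y / len, d.z / len };
}
```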
The live image extraction from the current state of the reconstruction is crucial during scanning in order to known if the camera pose estimation is working correctly and to see which regions still need to be scanned. Besides the live user feedback, raycasting has another major use case. As stated in Section 4.2 a frame-to-model tracking approach is applied in order to minimize the camera drift. This requires the synthetic creation of a depth and normal map from the current model, which can be achieved with raycasting. In order to find the first object intersection, i.e. the zero crossing of the TSDF, one has to step along the ray and sample the model. The trivial solution is to check all voxels along the ray, which is however too expensive for real-time approaches. Since the model is represented as a TSDF, information about the distance to the surface is available and therefore, larger steps are possible. However because the SDF is truncated, the surface location cannot be inferred directly and the maximum step length is limited by the size of the truncation band, so that the surface is not missed. This is true for the original KinectFusion approach, where the TSDF is stored in a regular voxel grid. The use of Voxel Block Hashing enables to skip large empty portions of the scene since only data near actual surfaces is stored. The possible area of the ray intersection can be limited by computing the minimum and maximum depth of the scene along the ray direction. This greatly improves the overall performance of the reconstruction system in comparison to the standard KinectFusion. To find the bounds for each ray, all visible voxel blocks are projected into the image space of the virtual camera. Then for each block, the image space bounding box of the eight projected corners is computed along with the minimum and maximum depth of this projected block. Finally, all bounding boxes are rendered to a modified z-buffer. This buffer holds the minimum and maximum observed depth values of all visible voxel blocks. The minimum depth value now determines the start point of the ray. At first, an allocated voxel block has to be found. If no block is allocated at the current ray position, a step is made with the size of a voxel block. If an allocated block is found, but the TSDF value of the current voxel is outside the truncation band, a step is made with the size of the truncation band. When the sampled voxel is inside the truncation band, the step length is given by its TSDF value. The surface is found when the first TSDF value is negative. Note, that usually the sample points along the ray do not exactly correspond to voxel positions and therefore the TSDF values have to be interpolated. For performance reasons, a trilinear interpolation is only performed when being close 31 4. Methodology to the surface, otherwise a nearest neighbor lookup is performed. The corresponding color can be interpolated like the TSDF. A final depth map is generated by rendering the depths of the intersection points. For some applications, like camera pose estimation, additionally the normal vector is required. This vector can be computed either in image space from the computed depth map or in object space from the TSDF. Usually, the normal direction for a given point in a SDF corresponds to the direction with the strongest increase of the values, the gradient. 
However, as explained in section 4.3, the model does not contain true TSDF values since distances are only measured along the viewing direction of the scanning camera. Therefore, distances are not always maximal in orthogonal direction which can lead to wrong normals. Normal vectors in image space can be evaluated by computing cross products of vectors between neighboring points. This is faster than doing a gradient computation in object space, but also produces artifacts at depth discontinuities. For tracking and live visualization, the latter approach is used. For mesh extraction, a 3D gradient computation is performed since no depth map is available in this case. 4.6 Network Transmission The network transmission feature allows to send the reconstruction data incrementally over a wireless network to a remote client. The client receives this data and integrates it into its scene representation again, so that both server and client store exactly the same model. 4.6.1 Server Side Compression and Transmission The network transmission module runs in parallel in order to avoid blocking the recon- struction process. The networking feature is not part of the original InfiniTAM framework and is added as part of this thesis. The major requirement of the transmission is to keep the required bandwidth as low as possible, whereas the latency is secondary. It is not that important if the model update on the client side is delayed by couple of frames, as long as the transmission keeps up with server side model acquisition over time. Note, that this is a clear advantage over streaming live images, which require an interactive frame rate with low latency. Since Voxel Block Hashing is used, the space is already compressed efficiently. Moreover, individual voxel blocks can be addressed and processed independently of the rest of the scene, which is a big advantage for this streaming feature. In order to further save bandwidth, redundancy is removed by streaming only new or changed data. Generally, voxel blocks outside the view frustum cannot change and can be ignored. Only blocks inside the frustum integrate new information, but due to errors in camera pose estimation or sensor noise, they are likely to change every frame and would have to be retransmitted many times. However, voxel blocks falling out of the current view frustum store new 32 4.6. Network Transmission data and remain unchanged in the future if the scanning camera is not revisiting this area. As a consequence, exactly those blocks are transmitted to the client every frame. The GPU - CPU swapping module already performs the identification of voxel blocks which fall out of the view frustum. Moreover, they are moved from GPU to CPU. The swapping module marks those blocks, so that the network streaming module is informed. In InfiniTAM’s original swapping module, only the TSDF and color data of the voxels is exchanged between GPU and CPU memory. The world position of a corresponding voxel block is not included. Since this position is mandatory for the client in order to integrate the voxel block in its scene representation correctly, the swapping procedure is adapted, so that it also copies the block position one way from GPU to CPU. The network transmission module waits for voxel blocks to be marked by the GPU - CPU swapping module. In order to achieve a higher compression, it gathers multiple voxel blocks in larger chunks instead of compressing each block alone. 
Moreover, the TSDF, color and position of all blocks are grouped into three arrays, so that common data types are stored together. Finally, the arrays are compressed individually and transmitted to the client. The network transmission is designed in such a way, that it detects when the connection to the client is lost. Since the networking runs in parallel to the main processing, the scene reconstruction continues. The network module waits until the connection is reestablished and once the client is reachable again, all the data reconstructed in the meantime is transmitted. Voxel Block Compression For one voxel block with 83 voxels, 512 TSDF values, 512 color values and one position is transmitted. Generally, the voxels also contain weights for the integration of new color and TSDF values, however they can be omitted at transmission since they are not required at the client side. In an uncompressed format, the amount of data to be streamed for a single block would be equal to 1024 bytes for the TSDF, 1536 bytes for the color and 6 bytes for the position. Because the color is used only for visual appeal and makes up for most of the data, it may be deactivated to further decrease the bandwidth. For compression, the DEFLATE algorithm is used [Deu96]. This is a lossless general purpose method, which is also used for the PNG image format or the ZIP archive file format. DEFLATE uses concepts of both Huffmann coding and LZ77. It achieves the compression in two stages: At first, the data stream is analyzed to find duplicate strings, which are then replaced by pointers. Afterwards, individual symbols are exchanged with new ones according to their number of occurrence. Common symbols are replaced by shorter representations, whereas rarely used symbols are replaced by longer representations. Network Protocol Regarding the selection of a network protocol, the main goals are that no data is lost and that the throughput is maximized. The guaranteed order of the data, i.e. data is received in the same order in which it is sent, is not important as long as it is taken care that old data cannot overwrite newer data during the client side integration. Besides that, the latency is secondary as already mentioned above. In this 33 4. Methodology thesis, the network protocol is not of primary concern and thus, only the two standard protocols TCP and UDP are considered because they are widely supported. UDP is not an option in its standard form since it is unreliable, i.e. data packets can be lost during transmission. As a consequence, TCP is selected, even though it contains unrequired features such as ordered transmission, which results in a higher bandwidth. 4.6.2 Client Side integration The client stores the reconstructed scene as a lossless copy in the same format as the server. A hash table maintains voxel block positions and references to a voxel block array, which stores the actual voxel data. In contrast to the server, the representation is stored completely on the CPU memory and not (partly) on the GPU. The power of the GPU is not required at the client since it does not perform expensive camera tracking or raycasting. As a consequence, the voxel block array (now on the CPU) is used directly to store the whole scene instead of the additional global memory. The size of the array should be similar to the size of the server side global memory. GPU - CPU swapping is not required. 
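Returning briefly to the server-side compression described above: since zlib (listed among the used libraries in Chapter 5) provides the DEFLATE implementation, compressing one of the gathered arrays could look roughly like the sketch below. Buffer handling and error reporting are simplified assumptions, not the thesis code.

```cpp
#include <vector>
#include <zlib.h>

// Compress one gathered byte array (e.g. the concatenated TSDF values of a chunk of
// voxel blocks) with zlib's DEFLATE. Returns an empty vector on failure.
std::vector<unsigned char> deflateChunk(const std::vector<unsigned char>& input)
{
    uLongf destLen = compressBound(static_cast<uLong>(input.size()));
    std::vector<unsigned char> output(destLen);

    int ret = compress2(output.data(), &destLen,
                        input.data(), static_cast<uLong>(input.size()),
                        Z_DEFAULT_COMPRESSION);
    if (ret != Z_OK)
        return {};

    output.resize(destLen);  // shrink to the actual compressed size
    return output;
}
```

On the receiving side, the counterpart would call zlib's uncompress with the original, uncompressed size, which can be transmitted alongside each chunk.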
Note, that even though the same representation is used on both server and client, the data might be stored in a different way in the hash table. The order in which voxel blocks are encountered is usually not the same on client and server. It can happen, that on the server side, there already exists a voxel block with the same hash value and thus, the new block has to be stored in the unordered part of the hash table. On the client side however, there might be free space in the ordered part where the block must be stored. For this reason, the client cannot just inherit the server side hash table index of the voxel block but has to perform its own voxel block allocation. The client listens for incoming voxel data and whenever new data arrives, the individual arrays (TSDF, colors and positions) are decompressed and assembled to voxel blocks again. Given the position of a voxel block, the client then computes the corresponding hash value and checks if there already exists a block for that position. If so, the TSDF and color values are overwritten with the new values. Otherwise, a new voxel block entry is allocated in the hash table and filled with the new data. Furthermore, the integration procedure informs the meshing procedure that a new voxel block arrived at the corresponding part of the scene so that the mesh of this region is updated. 4.7 Dynamic Scene Update In order to visualize and subsequently explore the scene at the client side, a triangular mesh is computed from the underlying volumetric representation using Marching Cubes [LC87]. The mesh is updated on-the-fly upon model changes so that it can be explored while still scanning the scene. The dynamic update is achieved by partitioning the scene into several subvolumes, each holding an individual mesh. The meshes are recomputed whenever underlying voxel blocks changed. Every subvolume has a unique identifier ID which can be computed from the world 34 4.7. Dynamic Scene Update position of any underlying voxel block. The client’s Network Transmission module performs this computation for every newly arrived voxel block to identify the mesh block to be updated. The computed ID is checked for existence within a map of mesh block identifiers MID, which holds the identifiers of all mesh blocks to be updated. If the index does not exist, it is added to MID. The Scene Update module can also access MID. It regularly reads and removes an ID from the map. Given the ID, the world position of the mesh block is reconstructed and the corresponding volume is triangulated. The following subsections explain the scene partitioning, the mapping between the scene parts and the corresponding voxel blocks and the actual update of individual meshes in more detail. 4.7.1 Scene Partitioning Inspired by Steinbrücker et al. [SKCS13], the mesh representation of the scene is composed of a number of smaller individual meshes, where each mesh covers a certain region of the volumetric scene. In contrast to Steinbrücker’s approach, the volumetric scene is only stored at one resolution, which eases the mesh computation. The scene is divided into a regular 3D grid of mesh blocks, where each mesh block holds its own mesh and covers a region of n3 voxel blocks. An example is shown in Figure 4.8. Real-time updates are 80 cm Figure 4.8: Partitioning of a test scene into individual mesh blocks, where each block contains its own mesh. The mesh of a single mesh block is highlighted in green. 
achieved by recomputing an individual mesh whenever new data for any of its underlying voxel blocks arrives. The triangulation of the voxel blocks is performed using an extended InfiniTAM Marching Cubes implementation, as described in Subsection 4.7.3. The size of the mesh blocks is defined by the number of voxel blocks they should cover. Increasing the mesh block size leads to fewer but larger meshes that need to be updated even when only small parts contain new information. Decreasing the mesh block size speeds up the update of each single mesh, but increases the overhead for mesh maintenance and rendering since the total number of meshes becomes very large. Moreover, the number of vertices increases, since a vertex on the border of a mesh block is contained in each adjacent mesh block, too. For the thesis setup, sufficient update rates were found with ten voxel blocks per dimension for each mesh. Note that the metric size of a mesh block also depends on the chosen voxel resolution and the number of voxels per voxel block. If one uses voxel blocks with 8³ voxels at a resolution of 1 cm, one mesh block covers a volume of 80 x 80 x 80 cm.

The scene is divided into a regular 3D grid of mesh blocks, where one grid element contains multiple voxel blocks. In the thesis implementation, such a mesh block covers 10³ voxel blocks, which leads to a volume of 80 x 80 x 80 cm (at 1 cm voxel resolution). A maximum number of 1000 mesh blocks per dimension is used. As a consequence, the maximum extent of the mesh representation is 800 m in each dimension at a 1 cm resolution. Figure 4.8 shows the partitioning of a reconstructed test scene into mesh blocks with an edge length of 80 cm, where the mesh of a single mesh block is highlighted in green. In the right image, one can also see the corresponding mesh block. In total, the scene consists of 130 individual meshes.

4.7.2 Mapping between Voxel Blocks and Mesh Blocks

To establish a correct mapping between the voxel blocks and their mesh block, each mesh block is identified by a non-negative and unique integer ID. This ID can be computed – as described by Algorithm 4.1 – from a given 3D voxel block position vbPos ∈ Z³. First, vbPos is shifted to the positive octant of the coordinate system. Afterwards, the shifted position is divided by the number of voxel blocks per mesh block (integer division). This results in a three-dimensional mesh block ID3 ∈ Z³. Finally, ID3 is flattened to a one-dimensional integer ID. Note that, due to the division by the number of voxel blocks per mesh block, all voxel blocks within a mesh block produce the same mesh block ID.

Algorithm 4.1: Mesh Block ID from Voxel Block Position
Input: Voxel Block Position vbPos ∈ Z³, Number of Mesh Blocks per Dimension mB, Number of Voxel Blocks per Mesh Block (per Dimension) vB
Output: Mesh Block ID ∈ Z
1  ID ← −1;
2  vbPos_shifted ← vbPos + (mB/2) · vB;
3  ID3 ← vbPos_shifted / vB;    // 3D mesh block ID (integer division)
4  if ID3(x) ∈ [0, mB) and ID3(y) ∈ [0, mB) and ID3(z) ∈ [0, mB) then
5      ID ← ID3(x) · mB² + ID3(y) · mB + ID3(z);    // 1D mesh block ID
6  end
7  return ID;

In order to allow this ID computation, the maximum number of mesh blocks per dimension mB has to be predefined. Limiting it to 1000 keeps the generated index below one billion (four bytes), but still provides a sufficiently large address space. Furthermore, in this case the flattened integer ID is constructed in a simple way: the first three digits represent ID3(x), followed by three digits for ID3(y) and three for ID3(z).
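A direct C++ transcription of Algorithm 4.1 is sketched below for illustration; the function name and the explicit out-of-range checks are choices made for this example.

```cpp
#include <cstdint>

// Mesh block ID from a voxel block position (Algorithm 4.1).
// mB: mesh blocks per dimension (e.g. 1000), vB: voxel blocks per mesh block (e.g. 10).
// Returns -1 if the position lies outside the addressable grid.
std::int64_t meshBlockID(int x, int y, int z, int mB, int vB)
{
    // Shift to the positive octant of the coordinate system.
    int sx = x + (mB / 2) * vB;
    int sy = y + (mB / 2) * vB;
    int sz = z + (mB / 2) * vB;
    if (sx < 0 || sy < 0 || sz < 0) return -1;

    // Integer division by the mesh block extent yields the 3D mesh block ID.
    int ix = sx / vB, iy = sy / vB, iz = sz / vB;
    if (ix >= mB || iy >= mB || iz >= mB) return -1;

    // Flatten to a one-dimensional ID.
    return static_cast<std::int64_t>(ix) * mB * mB +
           static_cast<std::int64_t>(iy) * mB + iz;
}
```

With mB = 1000 and vB = 10, every voxel block position inside the same 80 cm mesh block yields the same ID.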
Given a mesh block ID, one can reconstruct the world position of that mesh block mbPos ∈ Z³ according to Algorithm 4.2. First, the three-dimensional ID3 is extracted from ID (lines 3 - 5). Next, the shifted mesh block position is computed (line 6) and finally the shifting is inverted (line 7). The resulting mesh block world position mbPos corresponds to the voxel block with the smallest position in the mesh block, i.e. each coordinate is smaller than or equal to the corresponding coordinate of all other voxel blocks.

Algorithm 4.2: Mesh Block Position from Mesh Block ID
Input: Mesh Block ID ∈ Z, Number of Mesh Blocks per Dimension mB, Number of Voxel Blocks per Mesh Block (per Dimension) vB
Output: Mesh Block Position mbPos ∈ Z³
1  mbPos ← (0, 0, 0);
2  if ID ≥ 0 then
3      ID3(x) ← ID / mB²;
4      ID3(y) ← (ID − mB² · ID3(x)) / mB;
5      ID3(z) ← ID − mB · (ID3(y) + mB · ID3(x));
6      mbPos_shifted ← ID3 · vB;
7      mbPos ← mbPos_shifted − (mB/2) · vB;
8  end
9  return mbPos;

4.7.3 Triangulation of a Mesh Block

The triangulation of individual scene parts is performed with Marching Cubes. The Marching Cubes algorithm can be executed efficiently in parallel on the GPU for large meshes. However, in this implementation the mesh generation is performed by the CPU only, because the client-side representation lies on the CPU. This avoids the need for the expensive data transfer between the CPU and the GPU memory. Moreover, since only small regions of the mesh have to be updated at a given time, the power of the CPU is enough to ensure real-time performance. The GPU of the client is utilized for rendering only.

The main idea of Marching Cubes is to walk through the volume and compute the triangles representing the isosurface at each step individually. Eight voxels representing the corners of an imaginary cube are processed at a time. By comparing the SDF values of these voxels, the isosurface passing through the cube can be inferred. If the SDF of one voxel is positive and the SDF of another voxel is negative, the surface has to pass somewhere in between. The information whether the eight SDF values lie in front of or behind the surface is enough to determine the general triangle configuration, i.e. the number of required triangles and their approximate position and orientation. The information on which side of the surface the voxels are is encoded by the signs of the SDF and can be stored as a single byte, where one bit is used for each voxel. This byte then serves as an index to perform a lookup in a table, which stores the precomputed triangle configurations for all 256 bit combinations. These 256 combinations can be reduced to 15 families, where combinations within one family represent triangle configurations which are symmetrical or rotations of each other. Figure 4.9 illustrates the principle in two dimensions (Marching Squares).

Figure 4.9: Principle of Marching Cubes in two dimensions (Marching Squares). The right image shows the lookup table storing the possible line configurations. The top left image represents the SDF data and the bottom left image shows the computed surface. The dot color indicates if the point is inside (white) or outside (black) the surface. Marked in red is one square, where the inside-outside combination corresponds to configuration 14. Figure adapted from Wikipedia¹.

¹ https://commons.wikimedia.org/wiki/File:Marching_squares_algorithm.svg
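To make the table lookup concrete, the sketch below builds the 8-bit case index from the signs of the eight corner SDF values; the bit-to-corner ordering is a convention that has to match the particular triangle table used, and the table itself is only hinted at here.

```cpp
#include <cstdint>

// Build the 8-bit Marching Cubes case index from the eight corner SDF values.
// Bit i is set if corner i lies behind the surface (negative SDF). The bit-to-corner
// ordering must match the triangle lookup table that is used.
std::uint8_t cubeCaseIndex(const float sdf[8])
{
    std::uint8_t index = 0;
    for (int i = 0; i < 8; ++i)
        if (sdf[i] < 0.0f)
            index |= static_cast<std::uint8_t>(1u << i);
    return index;
}

// The index then selects one of the 256 precomputed triangle configurations, e.g.:
// extern const int triangleTable[256][16];  // edge indices per case, -1 terminated
```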
Given the correct triangle configuration, the exact vertex positions of the triangles have to be determined. Note that the vertices always lie on the edges of the imaginary cube. The final position P of a vertex is computed by taking the two SDF values (SDF₁, SDF₂) and positions (p₁, p₂) of the corresponding edge and performing linear interpolation, as shown in Equation 4.4.

P = p₁ + (−SDF₁ / (SDF₂ − SDF₁)) · (p₂ − p₁)    (4.4)

The vertex colors are interpolated the same way as the positions. The vertex normals can be extracted by performing a 3D gradient computation at the computed vertex positions. Once all triangles have been computed for the current imaginary cube, the processing continues with the next cube. Note that, if the previous cube is adjacent, the current cube shares four of its voxels with it. Previously computed information could be reused in these cases, but this is currently not done in the thesis implementation for simplicity reasons. Once the whole volume is processed, the final mesh is given as the set of all computed triangles.

Extension to InfiniTAM's Marching Cubes InfiniTAM includes a basic Marching Cubes approach to compute an uncolored mesh in a post-processing step. Besides the ability to update the mesh in real-time, the implementation is extended to include the vertex normals and vertex colors. Moreover, duplicate vertices within one mesh block are avoided by utilizing a hash map to check if a given vertex already exists. This check requires additional resources, but it also makes it possible to skip the normal and color computation for existing vertices. Apart from that, it reduces the required amount of memory. Instead of delivering an array of triangle structures, the meshing procedure now generates separate position, color, normal and vertex index arrays. The latter contains indices referencing vertices in the other arrays, where three consecutive indices define one triangle. This is the standard representation for rendering and is also required by UE4, which is used for visualization.

4.7.4 Shared Memory Connection and Dynamic Mesh Update in UE4

As already stated, the mesh of the whole scene is divided into a regular grid of smaller meshes. All these small meshes are stored and maintained by the rendering and exploration procedure, which runs as a separate process and uses UE4. Within UE4, the scene is represented as a set of procedural meshes, which can be added and modified at runtime. The UE4 process communicates with the client's meshing procedure using shared memory. With shared memory, different processes on the same computer can directly access the same memory region and communicate with each other efficiently. Once the meshing module generates a new mesh of a mesh block, it writes the resulting arrays (positions, colors, ...) along with a unique mesh block index to the shared memory. The UE4 process regularly checks whether new data is available and, if so, reads the data and cleans the shared memory region to make space for a new mesh. Given the mesh block index, UE4 knows which procedural mesh requires an update. If the index is unseen, a new procedural mesh is added at runtime.

4.8 Visualization and Exploration

The reconstructed scene can be explored right from the beginning of the scanning process, so the scene can change and grow while it is being explored.
4.8.1 Rendering with Unreal Engine 4 As previously mentioned, the existing game engine UE4 is integrated, since it already incorporates advanced rendering and physical effects, which are both vital for exploration. When providing the color information of the scene, the exploration is enhanced because the depth perception is increased and one can recognize known objects easier. Furthermore, the reconstructed scene looks more like reality, which can lead to a higher degree of immersion during exploration in a VR setup. At the scanning process, the RGB-D camera already captures the color of the scene which contains the full lighting information including shadows. As a consequence, the camera RGB information can be used directly for rendering without any further shading calculations. Note, that the specular lighting component depends on the current view point of the person looking at the scene. This property is very notable on glossy surfaces like metal where one can perceive specular highlights. The highlight should change when moving the head, which is not true for the captured model. As a consequence, when scanning the scene, the measured color is only correct for the corresponding camera position. Since the objects are usually scanned from various different locations, the color information of multiple view points is averaged and the specular lighting problem is less notable. For rough surfaces, which only have a small specular lighting component, the measured color represents the true lighting information much better. In order to include dynamic specular highlights and to enhance the 3D perception during exploration, three artificial points lights are added to the reconstructed scene. Besides basic lighting calculation for those lights, no advanced rendering effects are applied in order to enhance the rendering frame rate. Keep in mind, that even without specular surfaces, the color quality of the reconstructed model is still degraded compared to a synthetic 3D model due to slightly wrong camera pose estimation, motion blur, auto exposure or errors in the depth and color image alignment. Whelan et al. propose to improve the color information by ignoring measurements near depth discontinuities [WJK+13] at color integration. Moreover, they apply a weighted color update, where colors measured at a low angle (between surface normal and camera view direction) are included more strongly. This improvement is currently not included in the InfiniTAM framework and is therefore neither contained in the thesis implementation. 4.8.2 Exploration with VR Hardware The thesis implementation supports the exploration with various input and output devices. Regarding the output device, one can use a standard desktop screen or a HMD. For input, one can use a keyboard/mouse setup, a gamepad or an ODT. In this thesis the Developer 40 4.8. Visualization and Exploration Kit 2 (DK2) of the Oculus Rift HMD is used, which is illustrated in Figure 4.10 [Ocu16a]. The DK2 has a single full HD display which provides a field of view of 100 degrees. It is shared by both eyes, so for each eye there are 960 x 1080 pixels available. The rendering can be updated with maximum rate of 75 Hz. Besides tracking the orientation at 1000 Hz using a built in inertial measurement unit (IMU), the DK2 supports optical positional tracking with an update rate of 60 Hz. However, the positional tracking only works when the Rift is facing the camera and therefore, it does not work for all orientations. 
Since in this thesis, the Rift is used in combination with an ODT, the positional tracking is turned off. Figure 4.10: Oculus Rift HMD (Developer Kit 2)[Ocu16a]. The Cyberith Virtualizer ODT [CH14] can be used instead of the gamepad/keyboard input when using a HMD for display. The Virtualizer allows to move in the virtual world while actually staying at the same position in the real world. It also enables to decouple the walking and viewing direction, i.e. the user can walk in one direction while looking in another direction. This is usually not possible with standard input devices. A mechanical construction keeps the user at the same place while he walks on a low friction surface. Walking is performed by leaning the body forward so that one foot slides backwards while the other one performs a step. Figure 4.11 illustrates the construction. The treadmill’s surface has built-in sensors and detect the speed and direction of motion. Depending on how strong the user leans into the fixed construction, the walking speed changes since the feet slide backwards faster or slower. The user wears textile overshoes in order to ensure a level of friction which allows effortless walking while not being too slippery. The user can rotate freely 360 degrees in the Virtualizer so that any walking direction is possible. Moreover, the device allows to crouch, jump or sit which distinguishes it from similar treadmills like the Virtuix Omni. While the Oculus Rift is natively supported by UE4, a plugin is developed in this thesis in order to integrate the Virtualizer. The UE4 plugin allows to retrieve the current body rotation, the walking direction and speed and the current height of the user’s hip. The returned information is then used to update the position and orientation of the virtual character. 41 4. Methodology Figure 4.11: The principle of the Virtualizer. The user is placed in the middle of the device and is fixed to the ring with a belt system. Walking is performed by leaning into the belt system so that the feet slide backwards. Optical sensors, integrated into the low friction surface, measure the motion. [CH14] 4.9 Current Limitations In order to be used in general real world applications, some parts of the implemented system have to be improved in future work. Camera Pose Estimation The used InfiniTAM framework enables computationally efficient 3D reconstruction and allows to large scale scenes with a low memory footprint. Capturing a large environment successfully is however challenging, due to the applied camera pose estimation. The reconstruction process cannot continue once a camera pose could not be estimated because the pose of the previous frame is always required for the current pose estimation. Such tracking failures can occur when the scanning camera is moved too fast or when the scanned environment does not contain enough features, as in hallway scenes. An option to find the camera pose again is to match the current depth image to the model, where only a certain region around last known camera position has to be considered. The developed reconstruction pipeline however does not detect tracking failures, nor does it integrate any form of pose recovery. Even when a large-scale environment is reconstructed successfully, the resulting model is likely to be misaligned due to the accumulated camera drift. Loop closure for large-scale dense SDF representations is still an open problem and is not integrated in the framework. 
Mesh Extraction A further issue with the developed system is the huge amount of triangles produced by the Marching Cubes triangulation. Currently, the extracted 42 4.9. Current Limitations mesh is only used for rendering without advanced effects. For this purpose, the high triangle count is still manageable with modern hardware. However, the mesh needs to be simplified when interaction with the model is desired. Furthermore, the mesh representation is currently stored completely inside the client’s GPU memory and large meshes could exceed the available space. Exploration Collision detection between the virtual character controlled by the user and the reconstructed model is currently not performed. Wrong depth measurements can lead to artifacts, which block the way and permit the user to pass. Because of the missing collision detection, the user does not follow the terrain of the scene but walks on a fixed invisible plane and may also pass through walls. Moreover, the position and orientation of the reconstructed model is determined by the initial pose of the scanning camera. In order to ensure, that the invisible plane fits to the floor of the reconstructed scene, the depth camera has to be horizontally aligned at a height of 150 cm above the floor at the beginning of the scan. In future work, the model has to be cleaned from blocking artifacts in order to perform collision against the extracted mesh. This would also allow to reconstruct and explore multi-storey buildings, which is currently impossible. 43 CHAPTER 5 Implementation In this chapter, the implementation of the 3D reconstruction pipeline is highlighted. The focus lies on those parts which are adapted or newly added to the reconstruction pipeline as part of this thesis. For all existing and unchanged parts, it is referred to the original publications of InfiniTAM[PKC+14][KPR+15]. Used Frameworks and Libraries Both server and client applications are imple- mented in C++ on Windows 7 and use following libraries. Server: • InfiniTAM 2 [KPR+15] • OpenGL/GLUT (freeglut 3.0) [Fre16] • OpenNI 2.2 [Occ16] • CUDA 7.0 [Cor16b] • Windows Sockets 2 (Winsock2) • zLib 1.2.8 [DG96] Client: • InfiniTAM 2 • Unreal Engine 4.9 [Epi16] • Ultimate Shared Memory (usm) [Cho16] 45 5. Implementation • Windows Sockets 2 (Winsock2) • zLib 1.2.8 • Optional: OpenGL/GLUT (freeglut 3.0) • Optional: CUDA 7.0 While the original InfiniTAM framework, as well as UE4, supports a wide range of platforms including Windows, Linux or Android, the developed reconstruction pipeline currently only runs on Windows due to the usage of Winsock and usm. The rest of this chapter is structured in the following way: first, an overview of the implementation is given in Section 5.1. Afterwards, it is explained in Section 5.2, how to run the 3D reconstruction system. The remaining Sections 5.3 to 5.6 describe the structure and usage of the changed and newly added features more detailed. 5.1 Implementation Overview This section explains the structure of the most important classes and shows how the individual modules and their threads communicate with each other. 5.1.1 Class Structure As it can be seen in Figure 5.1, the reconstruction pipeline consists of three different processes: The InfiniTAM Server, the InfiniTAM Client and the UE4 Visualization. The yellow color indicates, that corresponding classes are changed, compared to the original InfiniTAM implementation. Note, that this just reflects important changes. There can also be minor changes in unmarked classes. 
Green classes are newly added to the framework. InfiniTAM Server The UIEngine is the entry point to InfiniTAM and controls the application. It processes user input and displays an image of the reconstruction’s current state. The UIEngine maintains three other engines: The ImageSourceEngine, the IMU- SourceEngine and the ITMMainEngine. The ImageSourceEngine handles the connection to the camera and provides images for the reconstruction. By extending this class, one can add support for any image source. In this thesis, the OpenNIEngine is primarily used and allows to process live data from the Asus Xtion Pro Live camera as well as already captured OpenNI sequences. The IMUSourceEngine makes it possible to integrate IMU orientation data to improve the camera tracking. Since the default implementation allows to load only saved data from file, it is extended in this thesis with the IMUNetworkEngine to use live data. The UIEngine gathers new input data from these two source engines and forwards it to the ITMMainEngine, which performs the actual reconstruction process. The ITMMainEngine is divided into three main parts: The ITMVisualizationEngine, the 46 5.1. Implementation Overview Figure 5.1: Overview of the most important classes and their relations. The yellow color indicates changed parts compared to the original InfiniTAM implementation. The green color indicates newly added parts. ITMTrackingController and the ITMDenseMapper. The main task of the ITMVisualiza- tionEngine is to compute an image of the reconstructed model using raycasting, which is used both by the UIEngine as well as the ITMTrackingController. The ITMTrackingCon- troller maintains a specific camera tracker, which handles the camera pose estimation. By extending the ITMTracker class, one can add new tracking methods. As mentioned in the previous chapter, InfiniTAM uses a depth-based ICP tracker per default. Within the thesis, the ITMTrajectoryReader is added, which allows to read and use precalculated camera poses instead of performing live tracking. The ITMDenseMapper handles both the integration of camera data into the model as well as the GPU-CPU swapping. The integration is performed by the ITMSceneReconstructionEngine and the swapping is performed by the ITMSwappingEngine. Besides those two engines, the ITMDenseMapper 47 5. Implementation contains the newly added ITMStreamingEngine, which transmits the reconstruction data to the client. InfiniTAM Client and UE4 Visualization The client application is divided into an InfiniTAM process and an UE4 process. In contrast to the server, large portions of the original InfiniTAM framework are not required, since the client does not need to perform pose estimation, integration of camera images into the model or raycasting. What is left, is the general voxel block data structure as well as a trimmed ITMMainEngine. The ITMMainEngine contains the ITMStreamingEngine directly, which receives the data from the server side ITMStreamingEngine. Furthermore, the ITMMainEngine maintains an ITMMeshingEngine and an ITMSharedMemoryEngine. The ITMMeshingEngine gen- erates the mesh representation and the ITMSharedMemoryEngine passes the computed meshes to the UE4 process via shared memory. The UE4 process also contains a Shared- MemoryEngine, the counterpart of the ITMSharedMemoryEngine. Apart from that, it contains a DynamicMesh object, which allows to update the UE4 scene representation on-the-fly. Lastly, a UE4 plugin is implemented to integrate the Cyberith Virtualizer. 
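The DynamicMesh object mentioned above presumably builds on UE4's procedural mesh component. As a rough, hypothetical sketch (not the thesis code), pushing one received mesh block into the scene could look as follows; the CreateMeshSection call follows the UProceduralMeshComponent API of UE4's procedural mesh plugin, whose exact signature may differ between engine versions, and the surrounding function and parameter names are illustrative.

```cpp
#include "ProceduralMeshComponent.h"

// Illustrative only: write one mesh block, received as separate arrays, into a
// UProceduralMeshComponent. Using the mesh block ID as the section index would make
// replacing an updated block straightforward, since recreating a section with the
// same index replaces the previous geometry.
void UpdateMeshBlock(UProceduralMeshComponent* Mesh, int32 SectionIndex,
                     const TArray<FVector>& Vertices, const TArray<int32>& Indices,
                     const TArray<FVector>& Normals, const TArray<FColor>& Colors)
{
    TArray<FVector2D> UVs;             // no texture coordinates are needed here
    TArray<FProcMeshTangent> Tangents; // tangents are left empty as well

    Mesh->CreateMeshSection(SectionIndex, Vertices, Indices, Normals,
                            UVs, Colors, Tangents, /*bCreateCollision=*/false);
}
```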
5.1.2 Thread Communication The developed reconstruction pipeline uses several threads to keep the performance high and to avoid unnecessary blocking. Figure 5.2 shows all threads and how they communicate with each other. The threads are shown in gray. They communicate via commonly accessible data structures shown in white. The InfiniTAM server is composed of two threads. The main thread performs the actual scene reconstruction and the additional network transmission thread sends the reconstruction data via TCP to the client. The interface between these two threads is the StreamingSet. This set contains references of all those blocks which have to be transmitted to the client. The InfiniTAM client is composed of at least two threads. The network transmission thread receives data from the server and integrates it into the model. One or more meshing threads take care of the mesh generation. They communicate with the network transmission thread via the MeshIDMap. This map contains information which mesh blocks require an update. By using more than one meshing thread, multiple meshes can be recomputed at the same time. Apart from the InfiniTAM threads, there are additional threads within UE4. The shared memory thread receives the mesh data from InfiniTAM utilizing a shared memory region, where the meshing threads store their results. The shared memory thread converts and stores the mesh data in temporary arrays. The UE4 game thread finally reads these arrays and handles the dynamic scene update. 5.2 Usage Instructions As already mentioned, the reconstruction pipeline consists of three different processes: The InfiniTAM Server, the InfiniTAM Client and the UE4 Visualization. Each process has 48 5.2. Usage Instructions Figure 5.2: Overview of the individual threads, shown in gray. The threads communicate with each other via commonly accessible data structures shown in white. to be started separately, so on the client side, one has to run two processes. After starting one side (client or server), the system waits for the other side to be connected before the reconstruction process can be started. The system does not contain a configuration file, which can be loaded at runtime. All the settings are hardcoded and have to be adjusted at compile time. The InfiniTAM settings of both server and client can be found in two different files: ITMLibSettings.cpp and ITMLibDefines.h. 5.2.1 Server Server Settings The size of the scene memory does not grow dynamically during reconstruction, but has to be defined beforehand in the ITMLibDefines header. Here, one can set the maximum number of hash table entries, where SDF_BUCKET_NUM corresponds to the ordered part and SDF_EXCESS_LIST_SIZE to the unordered part. The SDF_GLOBAL_BLOCK_NUM definition represents the maximum number of voxel blocks to be stored in the global CPU memory. One block requires 4 KB if color is used and 2 KB if no color is used. The number of voxel blocks to be stored in GPU memory is defined by SDF_LOCAL_BLOCK_NUM. It needs to be large enough to store all voxel blocks, which are currently in the field of view of the scanning camera. This number depends on the chosen voxel resolution and the size of the view frustum. In most cases, the default value does not have to be changed. Beside the memory sizes, a further setting in ITMLibDefines.h is the voxel type to be used. It defines if the reconstruction should 49 5. Implementation contain color or not. 
A C++ typedef controls the type of ITMVoxel; exactly one of the following two lines is active:

typedef ITMVoxel_s_rgb ITMVoxel; // with color
typedef ITMVoxel_s ITMVoxel;     // without color

The voxel type on the server and the client needs to be set to the same type, otherwise the client's network module expects too much or too little data from the server. All other server settings are defined in the constructor of the ITMLibSettings class. Here, one can choose the size of the SDF truncation band, the voxel resolution or the minimum and maximum considered depth range for data integration. For instance, the line

sceneParams(0.02f, 100, 0.01f, 0.2f, 3.0f, false)

selects a 2 cm truncation band with a voxel size of 1 cm. Only depth measurements with a value between 0.2 m and 3 m are integrated into the model. The value 100 defines that the color and the TSDF value of a voxel are computed using a running average of 100 measurements. If this value is reached, older measurements are removed to integrate new ones. The last boolean value can be used to avoid replacing existing measurements: if the flag is set to true, new measurements are not integrated anymore. Besides these scene parameters, the ITMLibSettings class controls the camera tracking method and the network streaming. By setting the trackerType variable to TRACKER_ICP, a standard depth-based ICP tracker is applied. More details on the available trackers can be found in Section 5.3. The variable useNetworkStreaming enables network streaming and the network port to be used is defined by serverPort. One can also disable GPU-CPU swapping with useSwapping; however, it is mandatory for network streaming.

Server Execution

When executing the compiled server program with no command line parameters, the system tries to use live image data from a connected RGB-D camera (such as the Asus Xtion Pro Live) with a default camera calibration. To improve tracking and integration, one can provide a path to an existing calibration file as the first command line argument. The content of the default calibration file can be seen below:

640 480
525.0 525.0
319.5 239.5
640 480
525.0 525.0
319.5 239.5
1 0 0 0
0 1 0 0
0 0 1 0
0 0

The first three lines correspond to the RGB sensor and the following three lines correspond to the depth sensor. For each sensor, the size of the image (first line), the focal length (second line) and the principal point (third line) are given. All values are expected to be in pixels. Besides that, the file contains a 4x4 matrix, which maps points from the RGB image to the depth image. In the default calibration, the identity matrix is used. The values in the last line are required for the conversion of Kinect disparity values to depth values. For OpenNI devices, as used within the thesis, these values are ignored. When applying standard ICP tracking, only the depth sensor information is required for camera tracking. The RGB sensor information is only used to integrate the color into the model correctly. The OpenNICalibTool1 performs a calibration for OpenNI supported cameras like the Asus Xtion and generates a file in the InfiniTAM format. Apart from the camera calibration file, a path to an existing OpenNI file can be provided as the second command line argument to use a prerecorded scene instead of live camera data. After running the program, the user has to start the live reconstruction by pressing the b key.
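As a minimal illustration of the calibration format just described, the following sketch reads such a file into a plain struct. It follows the example above (two intrinsic blocks, an extrinsic matrix given as three rows of four values with the homogeneous last row implied, and two disparity-calibration values); the struct and function names are illustrative and not part of InfiniTAM.

#include <fstream>
#include <iostream>

struct Intrinsics { int width, height; double fx, fy, cx, cy; };

struct Calibration {
    Intrinsics rgb, depth;
    double rgbToDepth[3][4];        // affine extrinsic, last row implied
    double disparityA, disparityB;  // ignored for OpenNI devices
};

// Reads one intrinsic block: image size, focal length, principal point.
static bool readIntrinsics(std::ifstream& in, Intrinsics& s) {
    return static_cast<bool>(in >> s.width >> s.height
                                >> s.fx >> s.fy >> s.cx >> s.cy);
}

bool readCalibration(const char* path, Calibration& c) {
    std::ifstream in(path);
    if (!in) return false;
    if (!readIntrinsics(in, c.rgb) || !readIntrinsics(in, c.depth)) return false;
    for (int r = 0; r < 3; ++r)
        for (int col = 0; col < 4; ++col)
            if (!(in >> c.rgbToDepth[r][col])) return false;
    return static_cast<bool>(in >> c.disparityA >> c.disparityB);
}

int main(int argc, char** argv) {
    Calibration c;
    if (argc > 1 && readCalibration(argv[1], c))
        std::cout << "depth focal length: " << c.depth.fx << " " << c.depth.fy << "\n";
    return 0;
}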
Since the first camera pose determines the orientation and position of the whole reconstructed model, the scanning camera has to be aligned with the real floor plane before starting the process. Moreover, the camera should be positioned at a height of 1.5 meters above the floor in order to fit the size of the virtual character used for exploration at the client side. After the process has started, one can switch between different live visualization modes or pause the integration of new data. The corresponding keys are displayed at the bottom of the screen. The program is closed by pressing the Escape key. Figure 5.3 shows a screenshot of the server’s user interface. Figure 5.3: Screenshot of the InfiniTAM server application. 1https://github.com/carlren/OpenNICalibTool 51 5. Implementation 5.2.2 Client Client Settings Equally to the server, the ITMLibDefines header defines, if color should be used or not. Both server and client need to use the same voxel type. The hash table size is chosen using SDF_BUCKET_NUM and SDF_EXCESS_LIST_SIZE, which should be set as large as at the server side. The maximum number of supported voxel blocks, i.e. the scene size, is controlled by SDF_LOCAL_BLOCK_NUM. Note, that this is in contrast to the server, where this definition controls the number of blocks on the GPU. This is due to the fact, that the client does not use the GPU but stores everything on the CPU memory. Therefore, the SDF_LOCAL_BLOCK_NUM corresponds to the server’s SDF_GLOBAL_BLOCK_NUM and should be set equally large. Moreover, the ITMLibDefines header controls meshing characteristics such as the size of one mesh block (in voxels blocks) and the number of supported mesh blocks per dimension (MESH_CHUNK_SIZE and MESH_CHUNKS_PER_DIM ). The ITMLibSettings class contains network transmission parameters such as the name/IP address of the server (serverName) and the corresponding network port (serverPort). Client Execution In order to run the client, the UE4 process should be executed at first to avoid any problems. Afterwards, the InfiniTAM process has to be started. Both processes do not require any command line arguments. At startup of the UE4 process, it is checked which input and output devices are connected. If a HMD is detected, it is used automatically instead of the standard display. The same is true for the Virtualizer. If a Virtualizer device is found, navigation is only possible with the treadmill. Otherwise, the user can navigate with either the gamepad or the keyboard. When using a gamepad, one moves with the joystick or with the arrow buttons. When using a keyboard, movement is performed with WASD or the arrow keys. In this non-Virtualizer setup, the walking direction is defined by the viewing direction and the viewing direction is determined directly by the HMD or using the mouse (when using a standard display). Figure 5.4 shows a screenshot of the client’s user interface. The following sections describe the structure and usage of the changed and newly added features in more detail. At first, the camera tracking module is examined, followed by the network transmission. Afterwards, the dynamic scene update including the mesh computation, the shared memory connection between InfiniTAM and UE4, and the actual mesh update within UE4 is explained. Finally, features of the visualization and exploration are highlighted. 5.3 Camera Pose Estimation The camera tracking behaviour is defined by the used camera tracker object. 
During initialization of the ITMMainEngine, the camera tracker is created by the ITMTrackerFac- tory and then maintained by the ITMTrackingController. The selection of specific camera tracking method happens in the ITMLibSettings class by adjusting the trackerType variable. By default, depth-based ICP tracking is used. InfiniTAM supports standard 52 5.3. Camera Pose Estimation Figure 5.4: Screenshot of the InfiniTAM and UE4 client application. ICP tracking (TRACKER_ICP) as well as weighted ICP tracking (TRACKER_WICP). The latter uses an ICP weighting scheme which integrates a model of the sensor noise [NIL12]. When using one of the two ICP trackers, the tracking behavior can be further influenced with the ITMLibSettings variables depthTrackerICPThreshold and depthTrack- erTerminationThreshold. The ICP threshold defines the maximum depth difference for the ICP correspondence. Two points are only associated as pair, if the distance between them is below the threshold. The termination threshold represents the error value when to stop the ICP iterations. Once the error is reached, the ICP process stops early before the maximum amount of iterations. Besides depth-based ICP tracking, one can also choose color-based tracking (TRACKER_COLOR), however, it resulted in stronger drift and more frequent tracking failures during testing. IMU Integration In order to get an improved tracking performance, one can add IMU rotation estimates to ICP tracking by selecting the IMU tracker (TRACKER_IMU). In this case, the camera rotation is initialized in each frame with the IMU rotation followed by a limited ICP pose estimation. In contrast to standard ICP, a smaller image pyramid with less levels is used where the focus lies on the estimation of the camera translation. The rotation is only estimated on the finest resolution. Since the default implementation does not support the integration of live IMU data, it is added in this thesis. The IMUNetworkEngine is a subclass of IMUSourceEngine and listens for IMU data coming via a UDP network connection. After creating the IMUNetworkEngine, it waits for new IMU orientation data at port 5555. The expected format of an orientation are four numbers (separated by a comma), which correspond to a timestamp and a 3D 53 5. Implementation Android Rotation Vector 2. The most recently received orientation is converted to a rotation matrix and is used as the rotation estimate as described above. For this thesis, the integrated IMU of a Nexus 4 smartphone is used. The smartphone is attached to the RGB-D camera and its orientation is streamed to the scanning computer with the existing Android application HyperIMU [Cam16]. This application is able to read the Android Rotation Vector and send it via TCP or UDP. In order to minimize the latency, a tethered connection is used between the smartphone and the scanning computer. Due to the reduced ICP computations, the IMU Tracker makes it possible to run InfiniTAM directly on tablets with a Tegra K1 chip, such as the Google Nexus 9 or the NVIDIA Shield. Note, that the thesis implementation however only runs on Windows. Loading a precalculated trajectory Besides the support for live IMU data, the tracking module of InfiniTAM is extended to enable the usage of a precalculated camera trajectory instead of performing live tracking. Two different trajectory readers, which support different data formats, are implemented. The TRACKER_READER allows to load camera poses of the scenes from Choi et al. [CZK15] and Zhou et al. [ZK13]. 
For each frame, the corresponding camera pose is given as a 4x4 matrix 3. The following lines show an example pose for a single frame:

0 0 1
1.0000000000 0.0000000000 0.0000000000 2.0000000000
0.0000000000 1.0000000000 0.0000000000 2.0000000000
0.0000000000 0.0000000000 1.0000000000 -0.3000000000
0.0000000000 0.0000000000 0.0000000000 1.0000000000

One pose is encoded by five lines, where the first line holds the frame number (third number) and the other four lines store the 4x4 matrix. Parts of the code to load the trajectory are taken from Choi et al. [CZK15]. The second trajectory reader, TRACKER_READER_RTAB, makes it possible to read a trajectory as generated by RTAB-Map [LM13]. RTAB-Map stands for Real-Time Appearance-Based Mapping and is a graph-based SLAM application integrating loop closure. The loop detection is based on sparse features of the RGB image. The trajectory format is similar to the TUM RGB-D SLAM format 4. One pose is given as eight numbers, which correspond to a timestamp, a 3D translation vector (x, y, z) and a rotation quaternion (qw, qx, qy, qz):

timestamp x y z qw qx qy qz

Note that RTAB-Map enforces real-time processing. If the processing of a single frame takes longer than the update rate of the camera, frames are dropped. As a result, the generated trajectory does not store a pose for every input frame. One option is to ignore frames with missing poses, i.e. the data of those frames is not integrated into the model. The second option is to use the trajectory reader, similar to the IMU tracker, in combination with the ICP tracker. The precalculated RTAB-Map pose, if available, is only used as a seed point for a following ICP procedure. This second option is set as the default behaviour for TRACKER_READER_RTAB.

2 http://developer.android.com/guide/topics/sensors/sensors_motion.html#sensors-motion-rotate
3 http://redwood-data.org/indoor/fileformat.html
4 http://vision.in.tum.de/data/datasets/rgbd-dataset/file_formats

5.4 Network Transmission

The incremental transmission of reconstruction data over the network is enabled by the adapted ITMSwappingEngine at the server and the newly added streaming engines at both server and client. The network transmission can be turned on via the useNetworkStreaming variable in the ITMLibSettings class. The streaming engines of server and client communicate via TCP, where the default port is 5445. Note that both the client and the server need to use the same scene parameters. The voxel resolution is transmitted to the client directly after a new connection is established. However, since the information whether voxels contain color is hardcoded, it still has to be ensured at compile time that both server and client use the same voxel type.

5.4.1 Server Side Compression and Transmission

As mentioned in the methodology chapter, all voxel blocks that fall out of the view frustum (and are swapped to the CPU) are transmitted to the client, along with their world positions. For this purpose, the server's ITMSwappingEngine is adapted so that it informs the ITMStreamingEngine about newly swapped out blocks. This is achieved by adding the corresponding blocks to the commonly accessible StreamingSet. In order to save memory, each element of this set does not actually store the voxel data but only a reference to the global CPU memory, where the swapped out data is stored anyway. Moreover, each set entry contains the block world position, which is required for the client-side integration.
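The following sketch illustrates what such a StreamingSet can look like: entries are keyed on the index into the global CPU voxel memory, so a block that is swapped out again before being transmitted does not create a duplicate entry, and each entry carries the block position needed for the client-side integration. Names, types and the mutex-based locking are assumptions, not InfiniTAM code.

#include <map>
#include <mutex>
#include <utility>
#include <vector>

struct BlockPos { short x, y, z; };   // voxel block coordinates

class StreamingSet {
public:
    // Called by the swapping engine for every block swapped out to the CPU.
    void add(int globalMemoryIndex, BlockPos pos) {
        std::lock_guard<std::mutex> lock(mtx);
        entries[globalMemoryIndex] = pos;   // overwriting keeps the entry unique
    }
    // Called by the streaming thread: take everything accumulated so far.
    std::vector<std::pair<int, BlockPos>> takeAll() {
        std::lock_guard<std::mutex> lock(mtx);
        std::vector<std::pair<int, BlockPos>> out(entries.begin(), entries.end());
        entries.clear();
        return out;
    }
private:
    std::mutex mtx;
    std::map<int, BlockPos> entries;    // key = reference into global memory, not a data copy
};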
To make the block positions available on the CPU, the swapping-out function of the ITMSwappingEngine is changed so that the block positions are copied to the CPU as well. Unlike the voxel data, however, the positions are not stored permanently in the global memory. The ITMStreamingEngine runs in parallel to the reconstruction process in a separate thread and regularly checks whether the StreamingSet contains new entries. All blocks referred to in the set are gathered and removed from it. They are compressed and transmitted to the client via TCP. When the set is empty, the ITMStreamingEngine checks again after the next frame. If the connection is lost, the current blocks to be transmitted are added to the StreamingSet again and the ITMStreamingEngine waits for the same client to reconnect. During this time, the StreamingSet keeps growing and remembers which voxel blocks need to be transferred upon reconnection. The transmission is based on the Winsock2 library, which is contained in the Windows operating system. The main steps of the server side transmission process are illustrated in Figure 5.5.

Figure 5.5: Main steps of the ITMSwappingEngine and ITMStreamingEngine for network transmission at the server side.

Keeping the data rate low

By using a set data structure for the StreamingSet, each element is unique. If a block that is already in the set is swapped out to the CPU again (i.e. it was swapped out before but has not been transmitted since then), it is not added a second time. Even though the StreamingSet is not updated in this case, the most recently swapped out voxel data is still the data that gets transmitted, since references to the global memory are stored instead of a data copy. This behavior decreases the data rate because repeated retransmissions are reduced. In order to further minimize the required bandwidth, the voxel blocks are compressed in a lossless way using the zLib library (version 1.2.8) before transmission. All gathered voxel blocks and their positions are split up into three arrays, where the first array contains the 3D world positions of all blocks, the second array contains all TSDF values and the last array stores the color values. Grouping values of the same data type is done with the intention of achieving a higher compression. Besides that, any padded bytes (due to memory alignment) are avoided, which further decreases the memory footprint. The SDF and color arrays are then compressed individually. The positions are not compressed because their amount of data is negligible compared to the SDF and color data. One uncompressed block requires 6 bytes for the position, 1024 bytes for the SDF data and 1536 bytes for the color. By limiting the speed of the streaming engine thread, the data rate can be reduced even more at the cost of latency, for two reasons. If data is, for example, only transmitted once a second instead of every frame, more blocks have to be streamed and compressed at a time and a possibly higher compression can be achieved. Besides that, if more blocks accumulate in the StreamingSet, the chance is higher that a block is already contained. This limits repeated transmission of the same blocks. Currently, blocks are transmitted to the client even if they did not change since the last time they fell out of the view frustum. By detecting and sending only changed blocks, the data rate could be further improved.
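The grouping and compression step described in this subsection can be sketched as follows, assuming the gathered blocks have already been split into one byte array per data type. The zLib calls compressBound() and compress2() are the library's real API; the surrounding function names and the CompressedBatch container are illustrative assumptions.

#include <vector>
#include <zlib.h>

struct CompressedBatch {
    std::vector<unsigned char> positions;   // 6 bytes per block, sent uncompressed
    std::vector<unsigned char> sdf;         // deflated TSDF values
    std::vector<unsigned char> color;       // deflated RGB values
};

// Deflate one array with zLib; returns an empty vector on failure.
static std::vector<unsigned char> deflateBuffer(const std::vector<unsigned char>& src) {
    uLongf destLen = compressBound(static_cast<uLong>(src.size()));
    std::vector<unsigned char> dst(destLen);
    if (compress2(dst.data(), &destLen, src.data(),
                  static_cast<uLong>(src.size()), Z_DEFAULT_COMPRESSION) != Z_OK)
        dst.clear();
    else
        dst.resize(destLen);
    return dst;
}

// Group values of the same data type and compress SDF and color individually;
// the positions stay uncompressed because their size is negligible.
CompressedBatch compressBlocks(const std::vector<unsigned char>& positions,
                               const std::vector<unsigned char>& sdfValues,
                               const std::vector<unsigned char>& colorValues) {
    CompressedBatch batch;
    batch.positions = positions;
    batch.sdf = deflateBuffer(sdfValues);
    batch.color = deflateBuffer(colorValues);
    return batch;
}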
5.4.2 Client Side Decompression and Integration

Since the client does not have a data integration module, the ITMStreamingEngine takes care of the integration process itself. After receiving the voxel resolution of the server, it initializes the hash table and the voxel block array. Keep in mind that on the client side, this array stores the whole model on the CPU. After the initialization, it starts listening for incoming data in a separate thread. Whenever new data arrays arrive, they are decompressed using zLib and reassembled into voxel blocks. The missing voxel weights are filled with default values, since they are not required at the client. These fields could be removed from the voxel type at the client side in order to save memory. Afterwards, the arrived blocks are allocated into the hash table, if necessary, and the voxel block array is updated. Figure 5.6 shows the main steps of the client's streaming process.

Figure 5.6: Main steps of the ITMStreamingEngine at the client side.

The ITMStreamingEngine informs the ITMMeshingEngine about new data via a commonly accessible data structure, similar to the server-side ITMStreamingEngine and ITMSwappingEngine. The ITMMeshingEngine maintains the HashIDMap, a hash map containing key-value pairs of integer IDs (key) and integer timestamps (value). The ID identifies a mesh block and the timestamp refers to the time when the ID was added to the map. For each arrived voxel block, the ITMStreamingEngine computes the corresponding mesh block ID from the 3D voxel block position and adds it to the HashIDMap with the current timestamp. This timestamp is increased every second. Note that all voxel blocks within a mesh block result in the same mesh ID. When the map already contains a computed mesh ID, only the existing timestamp is updated with the newer one.

5.5 Dynamic Scene Update

The client is able to dynamically recompute the meshes of regions where voxels changed, using the ITMMeshingEngine. The ITMSharedMemoryEngine then passes the computed meshes to the UE4 process.

Figure 5.7: Main steps of the scene update in InfiniTAM.

5.5.1 Mesh Extraction

The ITMMeshingEngine runs in parallel and can use more than one thread for improved performance. The number of desired threads is specified when creating the ITMMeshingEngine, where four threads are used by default. Each meshing thread maintains its own mesh memory, but all threads access the same MeshIDMap. This map tells the threads which mesh blocks require a mesh update. Every thread regularly checks the MeshIDMap for new entries using a thread-safe function. With the aid of the stored timestamps, always the oldest entry is returned. This entry corresponds to the mesh block which did not change for the longest time and thus has the highest probability of remaining this way. Using this strategy, the available computing power is utilized in an efficient manner, since the required number of mesh updates is minimized. Note that if there is enough computing power available, mesh blocks are still updated even if they are changing from frame to frame. The ITMMeshingEngine also ensures that two threads cannot process the same mesh block at a time. This is achieved by maintaining a separate set of currently processed mesh block IDs. In case the oldest ID is currently processed by another thread, the thread updates the next oldest one. The returned mesh block ID is converted to a world position and the mesh for the corresponding cubic mesh block area is generated.
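A minimal sketch of this selection strategy is given below: a map from mesh block ID to the timestamp of its last change, plus a set of IDs that are currently being meshed so that two threads never process the same block. The class name, members and the mutex-based locking are assumptions made for illustration and do not reproduce the actual InfiniTAM code.

#include <mutex>
#include <unordered_map>
#include <unordered_set>

class MeshIDMap {
public:
    // Network thread: mark a mesh block as dirty (or refresh its timestamp).
    void markDirty(int meshId, int timestamp) {
        std::lock_guard<std::mutex> lock(mtx);
        dirty[meshId] = timestamp;
    }
    // Meshing thread: fetch the entry that has been unchanged the longest and
    // is not taken by another thread. Returns false if nothing is left.
    bool acquireOldest(int& meshId) {
        std::lock_guard<std::mutex> lock(mtx);
        int bestId = -1, bestTime = 0;
        for (const auto& entry : dirty)
            if (!inProgress.count(entry.first) &&
                (bestId == -1 || entry.second < bestTime)) {
                bestId = entry.first;
                bestTime = entry.second;
            }
        if (bestId == -1) return false;
        dirty.erase(bestId);
        inProgress.insert(bestId);
        meshId = bestId;
        return true;
    }
    // Meshing thread: called after the mesh was handed over to shared memory.
    void release(int meshId) {
        std::lock_guard<std::mutex> lock(mtx);
        inProgress.erase(meshId);
    }
private:
    std::mutex mtx;
    std::unordered_map<int, int> dirty;   // mesh block ID -> timestamp of last change
    std::unordered_set<int> inProgress;   // IDs currently being meshed
};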
5.5.2 Shared Memory Connection The computed meshes then have to be transferred to the UE4 process. The connection between the InfiniTAM client and UE4 is implemented with shared memory. For that purpose, the library Ultimate Shared Memory [Cho16] is used, which provides a straightforward interface for shared memory access on a Windows machine. An ITMSharedMemoryEngine is added to the InfiniTAM client, which performs the writing of mesh data to a shared memory region. This region is defined to have space for two meshes. It holds two vertex arrays, two color arrays, two normals arrays and two index arrays. As a consequence, one mesh can be written by InfiniTAM while the other mesh is read at the same time by the UE4 process. However, all meshing threads use the same ITMSharedMemoryEngine. Only one of the meshing threads can pass its computed mesh to UE4 using the ITMSharedMemoryEngine at a time. If multiple meshing threads finished processing at the same time, all but one have to wait. Besides the data arrays, the 58 5.6. Visualization and Exploration shared memory also holds an info field for each mesh. This info field tells the UE4 process, how many vertices and triangles are contained in the arrays, so that it knows how much data is needs to read. Moreover, the info field stores the mesh ID. Apart from identifying meshes, the ID field is also used to indicate, that UE4 is finished with reading the data from shared memory by setting the ID to -1. This way, the ITMSharedMemoryEngine of InfiniTAM is informed, that the corresponding mesh data can be overwritten with new data. Figure 5.7 illustrates the main steps of the scene update in InfiniTAM. 5.5.3 Scene Update in UE4 In order to update the scene in UE4, the experimental procedural meshing feature is used. This feature allows to create and change meshes at runtime. The procedural mesh class allows to store multiple mesh sections, which can be updated individually. This would fit the mesh block characteristic very well. However, it turned out, that updating the mesh sections of a single procedural mesh is slower than having multiple procedural meshes with only one section. As a result, the DynamicMesh class is created, which stores several procedural mesh objects with only a single section. At initialization of the DynamicMesh, 650 of such (empty) procedural meshes are created to avoid any performance issues at runtime. However, if more meshes are required, they are dynamically added in groups of ten meshes. Besides the DynamicMesh class, the UE4 process has a SharedMemoryEngine in order to access and read the mesh data. It runs in a separate thread in parallel to the UE4 game thread and regularly checks if new data is available in the shared memory region. Whenever the ID field is bigger or equal to 0, data is available and the UE4 scene has to be updated. Since the scene update cannot be performed from a parallel thread, the UE4 SharedMemoryEngine converts and saves the data into temporary UE4 arrays which are also accessible by the game thread. The game thread can then update the DynamicMesh in the next frame. 5.6 Visualization and Exploration The rendering and exploration of the scene is performed with UE4 per default. In order to explore the reconstruction in a VR setup, the hardware needs to be integrated into the pipeline. In contrast to the Oculus Rift, the Virtualizer is not supported natively by UE4. As a consequence, a plugin is developed. 
Besides the default UE4 exploration, the reconstruction pipeline can be also used with InfiniTAM only. The Virtualizer integration in the standard UE4 mode is highlighted first, followed by an explanation of the standalone rendering. 5.6.1 Virtualizer Integration in Unreal Engine 4 Within the thesis, a UE4 plugin is developed to integrate the Cyberith Virtualizer. It acts as a wrapper to the existing DLL from Cyberith and exposes the functionality to UE4. The DLL is able to connect to the Virtualizer hardware and retrieve information without any other drivers to be installed. 59 5. Implementation Figure 5.8: Class structure of the UE4 Virtualizer plugin. Figure 5.8 shows the class structure of the plugin. It is divided into two parts: The actual C++ plugin and a second interface which allows to call the C++ plugin functions via the UE4 blueprint system. The C++ plugin is added by extending UE4’s module interface with the IVirtualizer interface. The actual implementation of the IVirtualizer interface is contained in the VirtualizerDevice class. This class handles the connection to the underlying Cyberith DLL and forwards the function calls. Most importantly, the C++ plugin allows to retrieve following information from the Virtualizer: • Orientation of the user’s hip (GetPlayerOrientation) • Walking direction relative to current body rotation (GetMovementDirection_Local) • Walking direction (GetMovementDirection_Global) • Walking speed (GetMovementSpeed) • Height of the user’s hip (GetPlayerHeight) Besides passing the function calls to the DLL, the plugin handles any required type conversions since the Cyberith DLL returns scalar values for all functions. These values are converted to more meaningful types, such as 3D direction vectors for the player orientation or the movement direction. The retrieval of the global movement direction is also not offered by the DLL. It is added, in order to avoid the repeated manual computation from the player orientation and local movement direction. Note, that the Virtualizer plugin is not implemented as a standard input controller like a keyboard or a gamepad. As a consequence, one has to call the plugin functions manually every frame and adjust the players orientation and position accordingly. The additionally implemented Blueprint Function Library can be used for that purpose, which contains only static functions. Internally, it maintains a VirtualizerDevice object, which handles the function calls. Figure 5.9 shows the integration of the Virtualizer with a UE4 character using it’s blueprint. The characters position is updated at every frame by retrieving the Virtualizer information and passing it to the UE4 function AddMovementInput. 60 5.6. Visualization and Exploration Figure 5.9: Integration of the Virtualizer via UE4’s blueprint system. 5.6.2 Client Standalone Visualization Apart from the standard UE4 rendering and unlike previously explained, the client can be used with InfiniTAM only. In this case, the UI Engine is used the same way as on the server, where standard raycasting of the volumetric representation is performed. As a consequence, the ITMVisualiationEngine is used also on the client side. The ITMMeshingEngine and the ITMSharedMemoryEngine are not required and are disabled. Figure 5.10 shows the modified class structure of the client for this case. The standalone Figure 5.10: Class structure of the client when using the standalone visualization mode instead of UE4. 
mode is mainly used for testing and makes it possible to analyze the latency of the network streaming, because the individual voxel block updates are immediately visible. When using the UE4 mode, updates are only visible after the corresponding mesh is recomputed. In order to perform raycasting in real-time, the client has to be used with CUDA support, where the model is stored on the GPU. The CUDA support has to be switched on at compile time by disabling the flag #COMPILE_WITHOUT_CUDA. However, since the client does not integrate any form of GPU-CPU swapping, the whole model needs to fit inside the GPU memory and thus the voxel block array needs to be big enough. Note that the ITMMeshingEngine and the ITMSharedMemoryEngine are not implemented with CUDA support. Thus, when using the UE4 mode, CUDA must be disabled to avoid any errors. The visualization mode can be changed in the client's ITMLibSettings class. Within the thesis, the UI Engine is also adapted in such a way that it is possible to render the raycasted image to the Oculus Rift using the Oculus SDK 0.6. Note that this feature is disabled by default, because frame rates above 30 fps could not be achieved for the high definition resolution required by the Rift, even with CUDA support. The Oculus Rift support can be enabled using the flag #WithRift at compilation. The Rift support can be enabled both at the server and at the client; in the latter case, the standalone mode is of course mandatory.

CHAPTER 6 Experimental Results

This chapter covers the evaluation of the implemented reconstruction pipeline and consists of two parts. At first, the results of a technical evaluation of the system performance are presented in Section 6.2. Besides that, a user study is performed, which examines the effectiveness of the system regarding the spatial knowledge acquisition in unknown environments. This study is presented in Section 6.3 along with its results.

6.1 Test Data

While the used InfiniTAM framework allows large scale environments to be stored in memory effectively, the scanning itself is challenging and often not possible for scenes with few geometric details, like hallways. It has to be performed by skilled users, since the camera always needs to see enough features to avoid losing the tracked camera pose. A tracking failure requires the scan process to be started from the beginning, since InfiniTAM does not integrate a pose recovery feature. Even if one manages to reconstruct large scenes without tracking failure, the resulting models are not well aligned and overlap, because InfiniTAM suffers from camera drift and does not integrate any form of drift correction like loop closure. Kähler et al. propose to use the rotation estimate of an IMU to improve the camera tracking, but tests in this thesis only found minor improvements that do not solve the general problem. Since the aim is to explore the reconstructed environments, well aligned models are crucial. Thus, for evaluation purposes, live ICP camera tracking is suspended and precomputed camera trajectories are used instead. All other steps of the pipeline are still performed in real-time. In order to compute optimized trajectories for prerecorded camera streams, a state-of-the-art offline reconstruction system by Choi et al. [CZK15] is used. This system considers both color and depth information and tracks the camera using either the RGB-D odometry [SSC11] or the Kintinuous [WJK+13] approach.
The key idea is that the recording is divided into several smaller fragments, where in a first step, each fragment produces its 63 6. Experimental Results own geometry with its own trajectory. All fragments are then pairwise aligned using geometric registration. A pose graph for the whole recording is generated and globally optimized. The generated trajectory can be loaded with the developed trajectory reader. During evaluation, three different prerecorded camera streams are used in order to reproduce the same scene multiple times. The data sets are referred to as Copy Room, Lounge and Flat, where the first two sets are taken from Zhou et al. [ZK13] and represent smaller scenes consisting of a single room. These two sets already come with precomputed trajectories. The Copy Room stream consists of 5490 frames (around 2:45 min) and the Lounge recording has 3000 frames (around 1:30 min). The Flat data set is recorded for this thesis and represents a Viennese apartment with 93 m3 consisting of a hallway, four rooms, a bathroom and a toilet. The recording contains 22175 frames (around 11:10 min) and the corresponding camera trajectory is estimated in a prior step with the mentioned offline approach. The computation is, however, tedious and takes over 20 hours with modern hardware (Core i7-4940MX, Gefore GTX980M, 16GB RAM). Since one needs to find appropriate parameters to yield a well aligned reconstruction, the system has to be executed multiple times. As a consequence, only a single large scale scene is captured and optimized for this thesis evaluation. Figure 6.1 shows a top view on reconstructions of all three test scenes, where the Flat scene is shown at the top, the Lounge scene at the bottom left and the Copy Room at the bottom right. All data sets are OpenNI RGB-D sequences with both a depth and color resolution of 640 x 480 and are captured with an Asus Xtion Pro Live camera at 33 Hz. The depth and color image are already registered using the default internal calibration parameters. 6.2 System Performance Used Hardware The technical system evaluation is performed on a standard notebook with an Intel Core i7-4940MX processor (3.10 GHz) and a GeForce GTX 980M graphics card. Besides that, the notebook has 16 GB of memory and runs Windows 7 64 bit. Both client and server application are executed on this single notebook, so the network streaming happens only locally, which simulates a perfect network connection. This section examines the performance of the developed reconstruction pipeline, but note, that the accuracy of the camera pose estimation is not the focus of this thesis. Instead, the ability to store, transmit and explore large environments is at the heart of this technical evaluation. For this reason, the precomputed camera trajectories are also used in this section. Details regarding the camera tracking performance can be found in the original InfiniTAM paper [KPR+15]. The rendering performance of UE4 is also not analyzed in depth. In all test cases, the number of achieved frames per second lies above 200 and a drop of the frame rate during a mesh update is not noticeable. The following evaluation focuses on three parts: 3D Reconstruction, Network Transmission and Meshing. The reconstruction pipeline is executed eight times for each of the three test data sets with different scene parameters. The voxel size is changed between 0.5 cm and 1 cm and the width of the TSDF truncation band is set to either 2 cm or 4 64 6.2. 
System Performance Figure 6.1: Top view on reconstructions of the three used test data sets. The Flat data set can be seen at the top, the Lounge data set is illustrated at the bottom left and the Copy Room data set is shown at the bottom right. All scenes are represented with the same scale to compare the sizes. cm. Besides that, the scenes are reconstructed with and without color, but the meshing is only evaluated with color. The Copy Room and Flat data sets are processed at the 65 6. Experimental Results original camera frame rate of 33 fps. The Lounge recording is processed at only 25 fps since the original speed looks unnaturally fast. 6.2.1 3D Reconstruction The 3D reconstruction module must be able to process the data at least at the same speed as the camera’s frame rate in order to be real-time capable. Since the used Xtion Pro Live camera provides images at 33 Hz, the maximum processing time for a single frame is 30 ms. On the test system, InfiniTAM’s reconstruction process takes only 5 ms per frame in average to track the camera and integrate the data into the model. Another 4 ms are required for rendering the live feedback, consisting of the raycasted image and the input RGB-D image pair. Kähler et al. already analyzed InfiniTAM’s processing speed in more detail [KPR+15]. The required memory to store a reconstructed scene in the used volumetric data format is now analyzed in more detail. Both server and client need to be able to store this amount of data. Additionally, the client stores the mesh representation but the memory consumption of the mesh is not considered in this section. See Subsection 6.2.3 regarding details about the mesh generation. As per default, InfiniTAM uses a hash table with a maximum of 1179648 addressable entries, where the ordered part holds 1048576 entries (88.8%) and the unordered part stores 131072 (11.1%). In order to store this hash table and auxiliary data, only about 20 MB are necessary since each entry is represented with 16 bytes. The hash table size should be kept high, because, if either the ordered part or the unordered part is fully occupied, the reconstruction process cannot continue. Keep in mind, that the actual voxel data is not contained in the table directly. The global memory (or the voxel block array on the client side), which holds this data permanently on CPU, is allocated with 327680 entries. This results in a memory usage of around 640 MB for uncolored data (4 × 512 byte / voxel block) and 1280 MB for colored data (8 × 512 byte / voxel block). The global memory can be smaller than the hash table because big portions of the hash table usually remain empty. Note, that the last byte (4th or 8th) of each voxel is actually an empty byte and is only added for memory alignment reasons. By disabling this data padding, the memory footprint can be reduced by 13% (colored) or 25% (uncolored). However, this goes hand in hand with a reduced performance since processors are optimized for aligned data. Table 6.1 shows the occupancy of the hash table for each test scene after the reconstruction process has finished. The parameter µ stands for the width of the SDF truncation band, i.e. the maximum considered distance from an object surface. One can also see the amount of actually used voxel block memory, which constitutes the minimum size of the global memory. The single room environments require below 70 MB of memory for an uncolored reconstruction at a voxel resolution of 1 cm. 
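As a quick cross-check of the default memory figures quoted above (roughly 20 MB for the hash table, 640 MB or 1280 MB for the global memory), the following snippet reproduces them from the stated per-entry costs of 16 bytes per hash entry and 512 voxels of 4 or 8 bytes per block. The numbers come from the text; the code is only illustrative arithmetic and not part of InfiniTAM.

#include <cstdio>

int main() {
    const long long hashEntries  = 1048576LL + 131072LL;   // ordered + unordered part
    const long long hashMB       = hashEntries * 16LL / (1024 * 1024);          // ~18 MiB, quoted as about 20 MB
    const long long globalBlocks = 327680LL;
    const long long uncoloredMB  = globalBlocks * 512LL * 4LL / (1024 * 1024);  // 640 MB
    const long long coloredMB    = globalBlocks * 512LL * 8LL / (1024 * 1024);  // 1280 MB
    std::printf("hash table: %lld MB, global memory: %lld MB (uncolored) / %lld MB (colored)\n",
                hashMB, uncoloredMB, coloredMB);
    return 0;
}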
When color is enabled, twice as much global memory has to be provided. Furthermore, the required memory is raised by 30% to 40% on average if the truncation band width is increased from 2 cm to 4 cm. This is due to the fact that more voxel blocks fall into the truncation band and have to be maintained. An enlarged truncation band can lead to better camera tracking behavior and avoids artifacts in the mesh extraction process. The voxel size, however, has the biggest impact on the memory. By dividing the side length in half, one original voxel has to be replaced by eight new ones. But the finer resolution also allows some of the new voxels to be discarded if they fall out of the truncation band. As a consequence, a finer resolution of 0.5 cm increases the hash table occupancy by a factor of around six. The total occupancy for the single room scenes changes from around 2% up to 16%, while the Flat data set cannot even be processed completely, since it leads to a full unordered part of the hash table. To be able to reconstruct the whole scene, the hash table is enlarged. Both the unordered and the ordered part are increased by a factor of four. While a larger unordered part allows more hash collisions to be handled, a bigger ordered part reduces the probability that collisions appear in the first place. Since the memory requirement for the hash table is low, its size should rather be set too large than too small. Table 6.2 shows the case with the expanded hash table. Up to 1.85 million blocks are allocated for the Flat scene, which leads to a memory footprint between 5 and 7 GB (colored). The reconstruction of the same scene at a 1 cm resolution requires only 0.8 - 1.1 GB of memory.

Scene     Voxel Size   µ      Entries   Ordered   Unordered   Colored    Uncolored
CopyR.    1 cm         2 cm   25K       2.24%     1.03%       97 MB      49 MB
CopyR.    1 cm         4 cm   30K       2.66%     1.51%       117 MB     58 MB
CopyR.    0.5 cm       2 cm   139K      12.06%    9.59%       544 MB     272 MB
CopyR.    0.5 cm       4 cm   175K      14.79%    14.81%      682 MB     342 MB
Lounge    1 cm         2 cm   27K       2.49%     0.42%       104 MB     52 MB
Lounge    1 cm         4 cm   33K       3.10%     0.65%       130 MB     65 MB
Lounge    0.5 cm       2 cm   152K      13.44%    8.37%       594 MB     298 MB
Lounge    0.5 cm       4 cm   201K      17.37%    14.33%      786 MB     394 MB
Flat      1 cm         2 cm   207K      18.02%    13.94%      811 MB     406 MB
Flat      1 cm         4 cm   279K      23.55%    24.75%      1093 MB    547 MB
Flat      0.5 cm       2 cm   -         -         100.00%     -          -
Flat      0.5 cm       4 cm   -         -         100.00%     -          -

Table 6.1: Hash table occupancy with the default hash table size.

Scene     Voxel Size   µ      Entries   Ordered   Unordered   Colored    Uncolored
Flat      0.5 cm       2 cm   1.28M     26.36%    33.33%      5009 MB    2508 MB
Flat      0.5 cm       4 cm   1.85M     35.81%    66.89%      7247 MB    3629 MB

Table 6.2: Hash table occupancy with the enlarged hash table. Both the ordered and the unordered part are increased by a factor of four.

Scene      Voxel Size   µ      Sent Blocks   Sent Data (col.)   CR (col.)   Sent Data (uncol.)   CR (uncol.)
CopyRoom   1 cm         2 cm   70K           51 MB              3.34        28 MB                2.48
CopyRoom   1 cm         4 cm   87K           69 MB              3.09        40 MB                2.11
CopyRoom   0.5 cm       2 cm   379K          365 MB             2.54        202 MB               1.84
CopyRoom   0.5 cm       4 cm   490K          488 MB             2.46        295 MB               1.63
Lounge     1 cm         2 cm   80K           60 MB              3.24        31 MB                2.52
Lounge     1 cm         4 cm   104K          78 MB              3.24        43 MB                2.35
Lounge     0.5 cm       2 cm   440K          428 MB             2.52        225 MB               1.93
Lounge     0.5 cm       4 cm   608K          552 MB             2.70        312 MB               1.91
Flat       1 cm         2 cm   540K          238 MB             5.54        154 MB               3.44
Flat       1 cm         4 cm   776K          392 MB             4.85        229 MB               3.33
Flat       0.5 cm       2 cm   3.172M        2010 MB            3.86        1104 MB              2.82
Flat       0.5 cm       4 cm   4.913M        2768 MB            4.34        1652 MB              2.92

Table 6.3: Amount of transmitted voxel blocks and (compressed) data in MB, for colored and uncolored reconstructions, along with the achieved compression ratio (CR). The shown data does not contain any TCP/IP overhead.
6.2.2 Network Transmission

As described in Section 4.6, the voxel blocks that fall out of the current camera's view frustum are streamed to the client. Besides the TSDF and color information, each voxel contains weights, which can be excluded from the network transmission to save bandwidth. Each voxel then only uses 2 bytes (uncolored) or 5 bytes (colored) in an uncompressed format, instead of 4 or 8 bytes. Table 6.3 illustrates the results of the network transmission for all three test scenes. The amount of sent data shown in the table considers only the actual payload and does not include any overhead created by the TCP/IP protocol. One can notice that the number of sent voxel blocks is significantly higher than the number of all allocated blocks in the scene as shown in Tables 6.1 and 6.2. For example, in the CopyRoom scene with a voxel size of 1 cm and a truncation band of 2 cm, 70000 voxel blocks are transmitted to the client, but the scene consists of only 25000 blocks. On average over all test scenes, the number of streamed voxel blocks is 2.6 - 3.1 times higher than the number of allocated voxel blocks. This is due to the fact that the same voxel blocks fall out of the view frustum multiple times when areas are revisited during the scan process. Table 6.3 also shows that the compression ratio (CR) is higher when color is used compared to using no color. As a consequence, the cost of using color is mitigated, since not 2.5 times more data has to be sent but only 1.5 - 2.0 times more. Another visible trend is the positive effect of a lower voxel resolution on the compression ratio. In all test cases, a better compression is achieved at a voxel size of 1 cm compared to 0.5 cm. The width of the truncation band has no effect on the compression. Besides the parameters, the scene itself also has an influence on the compression, as the Flat data set leads to higher ratios compared to the other two scenes. A higher compression ratio is generally achieved when the data has many identical values or repeating patterns.

The resulting data rates during the transmission are listed in Table 6.4, where one can see the average and the maximum rates in MBit per second for all test cases. The average data rates are also illustrated in Figure 6.2 to inspect the differences visually.

Voxel Size   µ      Avg (col.)   Max (col.)   Avg (uncol.)   Max (uncol.)
1 cm         2 cm   3.1          16.5         1.7            11.7
1 cm         4 cm   4.3          24.2         2.5            14.8
0.5 cm       2 cm   23.0         113.8        12.4           68.8
0.5 cm       4 cm   30.7         175.0        18.0           110.6

Table 6.4: Average and maximum data rates in MBit/s of all three test scenes, for colored and uncolored reconstructions.

Figure 6.2: Average data rates. The first value of the bar label represents the voxel size and the second value represents the width of the truncation band. Blue bars indicate colored reconstructions.

A colored reconstruction with a voxel size of 1 cm and a truncation band of 4 cm led to an average data rate of 4.3 MBit/s; however, peaks of up to 24.2 MBit/s occurred. On average, these maximum data rates are 5 - 6 times higher than the normal data rate. As already explained above, the data rate for colored reconstructions is between 1.5 and 2.0 times higher than for uncolored reconstructions. Furthermore, it can be seen that the finer 0.5 cm resolution results in a seven times higher data rate.
Finally, the enlarged truncation band of 4 cm leads to a 40% increase of the bandwidth requirement. It has to be noted, that the data rate is very dependent on how the scene is scanned. If the robot or the person scanning the scene is moving very fast, the data rate also increases because more blocks fall out of the view frustum every second. Especially fast rotations lead to peak data rates and should be avoided. By decreasing the maximum 69 6. Experimental Results considered depth (i.e. moving the frustum’s far plane closer), one also minimizes the bandwidth since less blocks can fall out of the smaller view frustum. The full range of the depth camera should not be used anyway since the depth error increases quadratically with increased distance to the sensor. In all tests of this evaluation, the maximum depth is set to three meters. 6.2.3 Meshing In this section, the processing speed of the meshing engine as well as the extracted meshes are analyzed. Regarding the processing speed, the meshing module requires 65 ms in order to extract the mesh of one mesh block. Note, that this number represents an average value. The mesh extraction can involve peak values ranging from 700 ms to 1 second per mesh. When using only a single thread for the mesh extraction, the resulting mesh can be written to the shared memory with no delay. If multiple threads are used, a little overhead is introduced due to synchronization. For instance, four meshing threads increase the processing time for a single mesh by around 3 ms. The advantage of using more threads is the increased update rate of the mesh representation. When using four meshing threads during the reconstruction of the Flat scene, the whole mesh representation is updated around 40000 times in contrast to only 10000 times, when using a single thread. As a consequence, the mesh representation grows more smoothly and changes in the volumetric model are reflected earlier. With the used hardware, also one thread is enough to achieve a real-time capable update rate. However, the mesh representation expands in bigger blocks. The remaining part of this subsection presents details about the extracted meshes. In Table 6.5, one can see the number of individual meshes (which corresponds to the number of used mesh blocks) along with the total amount of vertices and triangles. Moreover, the required memory to store the meshes is listed. One vertex can be stored with 28 bytes, where 12 bytes are needed for the position, 12 bytes are used for the normal vector and 4 bytes are required for the color. A triangle defined by three vertex indices requires another 12 byte (4 byte per index). In average, one individual mesh consists of around 7000 vertices and 13000 triangles, however, there is the tendency, that the size of the individual meshes decreases with a bigger truncation band. For instance, the mesh representation for the Flat scene (0.5 cm voxel size) requires 1141 MB at a truncation band of 4 cm but 1245 MB at 2 cm. Note, that this is in contrast to the volumetric representation, which requires more memory at a larger truncation band. The total number of meshes is not affected by the truncation band parameter but by the voxel resolution. The Flat scene at 1 cm voxel size and 4 cm truncation band leads to 741 individual meshes, whereas a voxel size of 0.5 cm already results in 3353 meshes. In average, the number of meshes is four to five times higher when using the smaller voxel size of 0.5 cm. 
Note that by using the smaller voxel size, one mesh block is subdivided into eight new ones, but not all of the new ones actually contain meshes.

Scene      Voxel Size   µ      Meshes   Memory    Vertices   Triangles
CopyRoom   1 cm         2 cm   128      47 MB     0.98M      1.81M
CopyRoom   1 cm         4 cm   129      44 MB     0.92M      1.73M
CopyRoom   0.5 cm       2 cm   582      212 MB    4.42M      8.25M
CopyRoom   0.5 cm       4 cm   584      191 MB    3.95M      7.47M
Lounge     1 cm         2 cm   142      45 MB     0.95M      1.71M
Lounge     1 cm         4 cm   152      42 MB     0.88M      1.63M
Lounge     0.5 cm       2 cm   578      216 MB    4.52M      8.32M
Lounge     0.5 cm       4 cm   490      165 MB    3.42M      6.42M
Flat       1 cm         2 cm   724      270 MB    5.71M      10.29M
Flat       1 cm         4 cm   741      262 MB    5.47M      10.13M
Flat       0.5 cm       2 cm   3335     1245 MB   26.07M     47.97M
Flat       0.5 cm       4 cm   3353     1141 MB   23.69M     44.47M

Table 6.5: Number and size of the extracted meshes (for colored reconstructions only).

6.2.4 Streaming Meshes versus Streaming Voxels

When comparing the required memory for the mesh representation (Table 6.5) and the volumetric one (Table 6.1), it is apparent that the mesh is much cheaper. This is also illustrated by Figure 6.3, which shows the required memory to store the Flat scene as a mesh (blue) or as TSDF voxel blocks (green) for different scene parameters.

Figure 6.3: Required memory to store the Flat scene either as mesh (blue) or as TSDF voxel blocks (green). The first value of the bar label represents the voxel size and the second value represents the width of the truncation band. Only colored reconstructions are considered.

The network transmission of the extracted meshes instead of voxel blocks would therefore also require significantly less bandwidth, if the streaming happens after the reconstruction has finished. For instance, the Flat data set at 1 cm resolution and 4 cm truncation band requires 1093 MB (colored) of voxel block data to be streamed, in contrast to only 262 MB for the mesh representation. Both values represent the data before compression. The challenging part of mesh streaming, however, is the incremental update of the scene during scanning. While with voxel block streaming each voxel block can be updated and transmitted individually, a mesh has to be retransmitted completely whenever any of the underlying voxel blocks have changed. To avoid frequent mesh retransmissions, one could send a mesh only after it has not changed for a certain time. This strategy implies that the scene does not build up as continuously on the client side as with voxel block streaming. Another option is to triangulate and update each voxel block individually to avoid retransmission of unchanged mesh parts. In both cases, additional resources are required at the mobile server to perform the mesh update in real-time, on top of the reconstruction. An advantage of mesh streaming, besides the lower data rate, would be the decrease in the required memory on the client side. Only the mesh representation, but not the volumetric one, would need to be stored in this case.

6.3 User Study

Besides the technical analysis of the system performance, a user study based on the developed framework constitutes the second part of the evaluation. One aim of this thesis is to provide the means to explore and get to know unknown remote locations. The user should get an overview quickly and should be able to navigate in these scenes.
It was shown in previous studies, that the spatial knowledge acquisition is enhanced when performing virtual exploration as natural as possible [RL09]. With standard I/O devices, the user sits at a desk and navigates through the scene in a purely virtual way by pressing buttons. As a consequence, the user has to build a mental map of the environment solely from the visual information displayed on a desktop monitor. However, when walking in the real world, one can also resort to non-visual cues in order to track the orientation and position within the environment. Several studies have come to the conclusion, that especially the feedback of real body rotation, coming from the inner ear and the muscles, helps to orientate in the world [RB04]. By integrating a HMD like the Oculus Rift, it is possible to provide this feedback also in the virtual world. The user can control the virtual rotation by rotating the head in the real world. While real body rotation can be integrated very well, natural walking is more complex to achieve. Usually, the space available in the real world is limited and therefore, also the virtual space is constrained, when one wants to map real and virtual walking one to one. An ODT tries to solve this issue by enabling unlimited virtual walking, even though walking in the treadmill is not entirely natural. While real walking helps to better estimate distances, it is unclear if this is true for walking in a passive ODT. 72 6.3. User Study 6.3.1 Objective While related work has studied spatial knowledge acquisition in artificially-generated virtual environments using immersive VR input- and output devices [RL09], no prior art exist, that investigates user perception of streamed 3D reconstructions as well as spatial knowledge acquisition in streamed reconstruction. With this study, the gap should be closed. It should constitute a foundation for understanding users’ perception in remotely captured and incrementally reconstructed dense 3D scenes while using means for active navigation in immersive VR. Therefore, the focus lies on investigating the following questions: • How do input device and scene representation influence spatial knowledge acquisi- tion? • How do users perceive streamed 3D reconstructions in an immersive virtual envi- ronment? Besides analyzing the general perception and the spatial knowledge acquisition, it is interesting to see, if using the ODT has an effect on cybersickness. Walking virtually, while actually remaining seated in the real world, leads to a sensory conflict and promotes cybersickness. The user receives visual feedback which is in contrast to the information provided by the muscles and the vestibular system. By mapping virtual and real walking one to one, the conflict can be resolved. However, it is unclear, if this is also true for an ODT. In this case, the user still remains at the same place in the real world and receives a different sensory feedback compared to natural walking. 6.3.2 Means of Evaluating Spatial Knowledge Types of Spatial Knowledge This section introduces the used methodology to measure the spatial knowledge of the test persons. Before evaluating knowledge of the environment’s spatial arrangement, one needs to know, that this knowledge can be classified into three types: Landmark knowledge, route knowledge and survey knowledge [SW75]. Landmark knowledge can be considered as the lowest form of spatial knowledge. A person has this kind of knowledge if he or she is able to recognize already visited places. 
However, the location of two or more different places in relation to each other is unknown. With landmark knowledge only, the person is not able to find a way through the environment between place A and place B, even if both are known. Route knowledge enables exactly this wayfinding between two known locations, if a path connecting theses places was already traveled before. Note, that only known routes can be taken but no alternative ones because the overall layout of the scene is still unknown. The person only remembers which turns to take but not the exact distance or orientation of the whole path. The last type, the survey knowledge, constitutes the most complete form of spatial knowledge. Having a survey knowledge allows to imagine the environment from a bird’s eye perspective in the correct scale. As a consequence, one is able to find routes between 73 6. Experimental Results arbitrary places in the environment. Shortcuts can also be inferred, even if they were not traveled before. Evaluation Methods Depending on the spatial knowledge type to be tested, different evaluation tools can be used. Landmark knowledge is mainly evaluated using landmark recognition tests. Since landmark knowledge represents only a very rough idea of the environment, it is not considered to be evaluated within this thesis. The most common and natural manner to test route knowledge is, to let the user reproduce an already visited path between a starting point and a target point. The user performs this so called wayfinding task or route replication task while being in the test environment [BTG97]. In order to measure the degree of spatial knowledge, one can record the number of wrong turns, the amount of required time, the length of the traveled path or the body rotations performed. The more junctions are contained in the path, where the user has to decide which way to follow, the more expressive results can be obtained. The wayfinding tasks can also reveal survey knowledge if the user is able to find previously unknown shortcuts on purpose. A further means to test survey knowledge is sketch mapping[Moo76]. In sketch mapping, the test person has to draw a map of the explored environment from a bird’s eye perspective. In principle, this task can be considered an effective tool since it requires to reproduce exactly the user’s mental representation of the scene. However, the results depend very much on the drawing skills of the user and it is challenging to distinguish map distortions due to bad drawing skills and map distortions due to lack of spatial knowledge [Wal99]. Besides that, no standard means of evaluating the drawn maps exist. In many approaches, the maps are only evaluated qualitatively using independent reviewers. A method derived from sketch mapping, which does not involve any drawing, is called the map placement task [Wal99]. Within that task, the user has to arrange magnets on a white board, which represent landmarks of the scene. The goal is to place all landmarks in the correct relative location to each other. The final arrangement can be rotated and uniformly scaled in order to produce a normalized representation. The angles and distances between the magnets can then be used to infer the degree of spatial knowledge. Finally, spatial knowledge can be also tested using pointing tasks [Sie81]. As the name suggests, the test person has to point in the direction of an unseen target, where the angle between the pointed direction and the correct direction provides the main error measure. 
An advanced version of pointing to unseen objects is called projective convergence [Sie81]. Within that method, the user has to estimate the direction and distance from three different locations to the same target, as illustrated in Figure 6.4. The three distance-angle pairs result in three points, which form a hypothetical triangle shown in red. This triangle is then the basis for four spatial knowledge measures: Mean Angle Error, Mean Miss Distance, Locational Error and Consistency. The Mean Angle Error represents the average of the absolute angular errors, i.e. the difference between the estimated direction and the correct direction. The Mean Miss Distance, or locational accuracy, is given as the sum of the distances between each triangle corner and the target point. The Locational Error is very similar to the Mean Miss Distance and represents an alternative measure. It is defined as the distance between the triangle's centroid and the target point. Finally, the Consistency is given as the perimeter of the triangle.

Figure 6.4: Principle of the projective convergence task. The user has to estimate the direction and distance to a target (black dot) from three different locations (white dots). The three estimated positions form a triangle (red), which is the basis of the spatial knowledge measures.

6.3.3 Apparatus

Input Device and Scene Representation
The experiment is conducted with the developed 3D reconstruction system. The entire scene is presented to the users at once (referred to as Data 1) or the scene is incrementally updated in a frame-wise manner to simulate live exploration while the remote capturing is still running (referred to as Data 2). As navigational input devices, the two-handed Xbox gamepad with joystick functionality and the Virtualizer ODT are used. All tests are performed by viewing the scene with an Oculus Rift DK2, combined either with the gamepad (referred to as Setup 1) or with the ODT (referred to as Setup 2), as shown in Figure 6.5. When using Setup 2, the viewing direction and the walking direction are decoupled, i.e. the participant can walk in one direction while looking in another direction. When using Setup 1, the walking direction is determined by the orientation of the HMD. The test participant sits on a swivel chair, which can be rotated by 360 degrees, to be able to walk in all possible directions. The swivel chair is chosen according to Nybakke et al. [NRI12], so that vestibular cues of real body rotation are integrated in both setups. This way, only the actual walking in the ODT can be evaluated.

Test Environment
All participants explore the same environment, since scanning and reconstructing multiple large scale environments is challenging, especially if the scene complexity is to be comparable. The Flat data set from Section 6.1 is used for this study in combination with the precomputed camera trajectory to provide a well aligned model. The size and layout of the Flat with its three rooms, two bathrooms, kitchen and long nested hallway challenges the user in terms of navigation and orientation, and thus is well suited to examine users' subjective perception and spatial knowledge acquisition.

Figure 6.5: Two different input/output device setups, (a) Setup 1 and (b) Setup 2, each characterized by scene viewing, navigation and physical state.
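To make the decoupling of viewing and walking direction concrete, the following C++ sketch contrasts how the horizontal walking direction could be derived in the two setups. It is purely illustrative and not taken from the thesis implementation; the yaw angles and input values are hypothetical placeholders for data that would in practice come from the Oculus SDK and the Virtualizer SDK.

```cpp
#include <cmath>
#include <cstdio>

struct Vec2 { float x, y; };

// Converts a yaw angle (radians) into a unit direction vector on the ground plane.
static Vec2 yawToDirection(float yawRad) {
    return Vec2{ std::cos(yawRad), std::sin(yawRad) };
}

// Setup 1 (gamepad + HMD): the walking direction follows the HMD orientation,
// so the joystick deflection is interpreted relative to the viewing direction.
Vec2 walkingDirectionSetup1(float hmdYawRad, float joystickForward, float joystickStrafe) {
    Vec2 forward = yawToDirection(hmdYawRad);
    Vec2 strafe  = Vec2{ forward.y, -forward.x };   // axis perpendicular to forward
    return Vec2{ forward.x * joystickForward + strafe.x * joystickStrafe,
                 forward.y * joystickForward + strafe.y * joystickStrafe };
}

// Setup 2 (Virtualizer ODT + HMD): walking and viewing are decoupled; the walking
// direction is taken from the treadmill's body orientation, independent of the HMD yaw.
Vec2 walkingDirectionSetup2(float odtYawRad, float odtSpeed) {
    Vec2 bodyForward = yawToDirection(odtYawRad);
    return Vec2{ bodyForward.x * odtSpeed, bodyForward.y * odtSpeed };
}

int main() {
    // Example: the HMD is rotated by 90 degrees. With Setup 1 the forward input follows
    // the view; with Setup 2 the motion follows the treadmill's body orientation instead.
    Vec2 d1 = walkingDirectionSetup1(1.5708f, 1.0f, 0.0f);
    Vec2 d2 = walkingDirectionSetup2(0.0f, 1.0f);
    std::printf("Setup 1: (%.2f, %.2f)  Setup 2: (%.2f, %.2f)\n", d1.x, d1.y, d2.x, d2.y);
    return 0;
}
```

The sketch only mirrors the behavioral difference described above: in Setup 1 the movement always follows the viewing direction, whereas in Setup 2 the treadmill's body orientation alone defines where the user walks.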
Since the data is prerecorded, the reconstructed scene is also known beforehand. As a result, basic collision information is fitted manually to the reconstructed model in order to prevent the users from walking through any objects. Invisible cuboids are used for that purpose. Note that reconstruction errors do not pose a problem this way. If collision detection were performed against all reconstructed triangles instead, reconstruction errors could block the way. Besides preventing walking through objects, the invisible collision geometry also ensures that the user's height above the floor is the same all over the scene. Since the floor can be bent due to camera drift, it is necessary to adjust the user's position accordingly, using invisible floor planes.

System Settings
In the Flat data set, the scanning party proceeds very slowly through the environment in order to avoid any tracking failures. In consequence, the scene also expands only slowly. For the user study, the processing is sped up by a factor of two, so that the users have the full scene available within a reasonable amount of time. After five minutes and 30 seconds, the environment is fully reconstructed. The scene is reconstructed with color information at a voxel resolution of 1 cm³, and only camera information within a distance of three meters is integrated into the model. The width of the truncation band is defined to be 4 cm. The size of a single mesh block is set to the standard value of 80 cm³, i.e. one mesh block covers 10³ voxel blocks.

Used Hardware
Both server and client run on the same computer and streaming happens only locally on this single PC in order to avoid any networking problems during the user tests. The user study is performed on a desktop computer with an Intel Core i7-4790K processor (3.10 GHz) and a GeForce GTX 980Ti graphics card. Besides that, the PC has 32 GB of memory and runs Windows 8.1.

6.3.4 Participants

Forty-two (42) participants (12 females, 28.5%) were involved in the experiment, while forty (40) participants (12 females, 30%) successfully finished the experiments. Two male participants aborted their experiments due to cybersickness symptoms, both encountered while using the ODT. Participants' ages ranged between 18 and 57 years (mean µ = 31.15, standard deviation σ = 7.9 years). 25 participants reported previous experience with HMDs, while seven own an HMD.

6.3.5 Study Design

The study procedure for a single user consists of five stages: 1) introduction and pre-questionnaire, 2) training, 3) exploration, 4) evaluation, and 5) a post-questionnaire. At stage 1, users are informed about the study and the procedure. Additionally, they have to fill out a pre-questionnaire. Stages 2) to 4) are performed by each user wearing the immersive VR hardware. At stage 2, users are introduced to the input and output hardware – either Setup 1 or Setup 2 – by explanation and demonstration. Next, users have up to 5:00 minutes to familiarize themselves with the hardware by freely walking in a test environment, which comprises a simple UE4 scene with some artificial virtual objects. As soon as the user feels confident or the maximum training time is reached, the exploration stage (3) starts. For this stage, the 3D reconstruction is presented in either Data 1 or Data 2 mode. The users can walk freely for up to 6:00 minutes with the instruction to explore the scene; no further information is given. Upon completion of the exploration, users have to perform two tasks in the evaluation phase (4) to assess their spatial knowledge.
Therefore, they keep wearing the VR hardware and perform a projective convergence task and a wayfinding task as explained in Section 6.3.2. Projective convergence is done first to avoid further spatial knowledge acquisition by free walking during the wayfinding task. At the end of the procedure, users have to fill out a post-questionnaire (5). Projective Convergence and Wayfinding Figure 6.6 illustrates the target point as well as the three pointing locations for the projective convergence task. As spatial knowledge measures, Mean Angle Error, Consistency and Mean Miss Distance are applied. In Figure 6.7, one can see an example result for the projective convergence task. During the wayfinding task, the Total Trajectory Length, the Total Amount of Head Rotation and the Total Time is recorded. The starting point and the target location are also shown in Figure 6.6. Sketch mapping is not integrated in the user study due to the issues related to drawing skills and also due to the lack of a standard quantitative analysis. Map placement is not performed because the user is required to remember various landmarks. If the test person cannot recognize the test landmarks, the user is not able to perform the task, even if he or she has a good understanding of the overall layout of the scene. Note, that this is also true for the used wayfinding and projective convergence task, but in this case, only two landmarks are required. The toilet is selected as the wayfinding target and the washing machine in the bathroom is selected as the pointing target because both are unique and noticeable locations within the flat. Pre-and Post-Questionnaire The used questionnaires can be found in Appendix A and B. In both pre-and post-questionnaire, it is asked for cybersickness symptoms using the simulator sickness questionnaire (SSQ) [WS98]. The indicator Simulator Sickness is computed as the difference between the sums of perceived symptoms before and upon test completion. In the post-questionnaire, further subjective measures are included to collect participants’ perceptions of the immersive exploration. The measures are evaluated using 77 6. Experimental Results Figure 6.6: Locations used for the evaluation tasks. The filled green circles represent the three pointing locations for the projective convergence task and the green unfilled circle shows the corresponding target point. The red circles illustrate the starting (filled) and target location (unfilled) for the wayfinding task. Figure 6.7: Result of a single projective convergence task. 78 6.3. User Study 5-point Likert scale ratings, denoting participants’ degree of agreement with the following statements: • Perceived Satisfaction of Input measured as participants’ evaluations of the state- ment “Please rate how much you enjoyed using the device for navigation.” • Perceived Scene Quality measured as participants’ evaluations of the statement “Please rate how much you enjoyed the quality of the virtual environment.” • Perceived Spatial Understanding measured as participants’ evaluations of the state- ment “I was able to quickly get an overview of the indoor environment.” For both Perceived Satisfaction of Input and Perceived Scene Quality, scale levels are: not at all (1), not much (2), moderate (3), much (4), very much (5). For Perceived Spatial Understanding, scale levels are: strongly disagree (1), disagree (2), neither agree nor disagree (3), agree (4), and strongly agree (5). 
Furthermore, the task load and system usability are measured as follows:

• Perceived Task Load is measured with the NASA Task Load Index (TLX) [HS88] and is computed as the mean of the questionnaire's results. Scale levels are: very low (1), low (2), average (3), high (4), and very high (5).

• Perceived System Usability is measured with the System Usability Scale (SUS) [Bro96] and is computed as the mean of the questionnaire's results. Scale levels are: strongly disagree (1), disagree (2), neither agree nor disagree (3), agree (4), and strongly agree (5).

6.3.6 Results

The study is conducted using an independent factorial design with the independent variables Setup and Data, while the dependent variables are the computed performance measures Mean Angle Error, Consistency, Trajectory Length, Total Rotation and Total Time, and the subjective measures, as denoted in the previous subsection. The quantitative data is analyzed using two-way independent ANOVA; where suitable, pairwise t-tests with Bonferroni adjustment are employed. For each combination of Setup and Data, ten randomly assigned participants (three females, 33.3%) performed the test.

Upon explanation of the hardware at stage two, users took on average µ = 1:13 min (σ = 0:37 min) to familiarize themselves with the setup within the virtual test scene, with no significant difference between Setup 1 and Setup 2. For Perceived Satisfaction of Input, an average satisfaction of µ = 3.95 (σ = 0.876) is found, with no significant effect of either Setup (p = 0.377) or Data (p = 0.489). No participant using Setup 1 reported difficulties with using the input device, while only five participants using Setup 2 (25%) were able to constantly navigate without any reported or observed problems. Three of them were novice users. 75% of the users with Setup 2 had varying levels of difficulty exploring the scene with the Virtualizer. This observation is backed by the users' subjective measure Perceived Task Load, as highlighted in Figure 6.8.

Figure 6.8: Qualitative results of the Perceived Task Load.

Here, a significant main effect of Setup can be reported with p < 0.042, while no significance is found for Data (p = 0.068), nor for the interaction between Setup and Data (p = 0.205). When analyzing the TLX results in particular, a main effect of Setup for the question “How physically demanding was the task?” is found with p < 0.007, reflecting the higher perceived physical demand when using Setup 2. Furthermore, a main effect of Data for “How hard did you work to accomplish your level of performance?” can be reported with p < 0.03, reflecting the higher perceived mental load for Data 2.

Overall, users explored the scene on average for µ = 4:49 min (σ = 1:22 min). Users performing the test with Data 1 explored on average for µ = 3:59 min (σ = 1:18 min), while users with Data 2 required the entire 6:00 minutes due to the time it takes until the entire model is streamed. For Simulator Sickness, an increase of symptoms is found with µ = 3.60 (σ = 4.01); however, no significant effect of either Setup (p = 0.350) or Data (p = 0.277) can be reported. For the perceived visual quality of the 3D reconstruction, Perceived Scene Quality, an average satisfaction of µ = 3.78 (σ = 0.947) is found. All users were astonished by the size of the flat and constantly commented on details they saw during exploration, such as the coffee machine and pictures.
Three users reported perceiving the 3D geometry as being of low quality due to the HMD resolution, while two users commented on 3D model artifacts and the subjectively perceived low quality of the model's geometry and texture. Four users reported that they felt like snoopers and stated that they felt as if they were actually present in the flat. Finally, users reported similar Perceived System Usability for both Setup and Data, with no significant differences.

Analyzing the quantitative performance data of the spatial knowledge acquisition tasks, users performed similarly for the projective convergence measures Mean Angle Error, Consistency and Mean Miss Distance as well as for the wayfinding measures Trajectory Length, Total Rotation and Total Time. A main effect of Setup and Data cannot be determined for any of the measures (p > 0.1 for all measures). This finding is backed by the subjective measure Perceived Spatial Understanding, for which significant results can be reported neither for Setup (p = 0.453) nor for Data (p = 0.707). Even though there are no significant differences, the best results of the projective convergence task, i.e. the lowest values, are achieved with Setup 2. For instance, only one of the top ten Consistency values (the 4th) is accomplished with Setup 1 and only two of the top ten Mean Angle Error values are accomplished with Setup 1 (2nd and 7th). Figure 6.9 shows the results of the projective convergence task. It illustrates the tendency that Setup 2 leads to lower errors in the spatial knowledge tasks on average, but with a higher variance. In contrast to the Setup, the Data mode does not affect the results.

Figure 6.9: Quantitative results of the projective convergence task.

With Setup 1, users tend to navigate more quickly and along straighter paths than with Setup 2, as indicated by Total Time, Total Rotation and Trajectory Length. This is also visible in Figure 6.10, which shows typical trajectories for each setup. However, Trajectory Length varied heavily for Setup 1 due to wrong navigation turns that were observed, also indicated by Total Rotation. These findings can be observed in Figure 6.11, which illustrates the quantitative results of the wayfinding task. Similar to the projective convergence task, the Data mode does not change the results.

Figure 6.10: Example paths traveled during the wayfinding task. The left path represents a typical trajectory performed with the Virtualizer (Setup 2), whereas the right path shows a trajectory performed with the gamepad (Setup 1).

Figure 6.11: Quantitative results of the wayfinding task.

6.3.7 Discussion

The user study is performed to gain insights into general user perception of streamed 3D reconstructions as well as to investigate spatial knowledge acquisition in streamed reconstruction. The quantitative as well as qualitative feedback on the virtual 3D scene reveals high acceptance of the concept by the forty test subjects. Many users immediately suggested real-world use cases, such as “I would love to have such an application to be able to have a look into a hotel before I visit it”. Users' subjective measures as well as qualitative feedback indicated a high degree of immersion. Comments ranged from “I feel like being in the flat” to “I feel like a snooper”. Furthermore, test participants wanted to actively engage with the virtual 3D reconstruction, indicated by comments such as “I would like to open the entrance door.” and “Can I take something out of the shelf?”.
No participant using Data 2 reported feeling distracted by the data representation. Rather, they showed high excitement, indicated by comments such as “This is so Matrix-like”. They reported positively on the concept of being decoupled from the view of the remote capturing entity and stated that they explored the environment as in reality, by strolling around or investigating one room after the other. Three users asked for a functionality to control the view of the capturing, and four for a faster reconstruction.

Given the quantitative performance as well as the subjective measures, the evaluation did not indicate a significant influence on spatial knowledge acquisition, neither by the input device nor by the data representation. Regarding the input device, it can be noted that the majority of participants using Setup 2 had difficulties in navigating with the ODT, as they did not have prior experience with this input device. Thus, they had to concentrate on using the device and could not fully focus on the environment, which mitigates the advantages of this direct navigational input device. This is also supported by Waller [Wal99], who found that proficiency with the navigational interface strongly affects the ability to acquire spatial information. Therefore, the author cannot draw conclusions on the influence of a low-friction ODT with its proprioceptive feedback on spatial knowledge acquisition within a 3D real-world reconstruction. The user study revealed the tendency that experienced users are able to build up spatial knowledge more efficiently with the Virtualizer; however, further research is required to study this question. Promisingly, at the time of writing this thesis, a Virtualizer software update was released that filters sensor readings and thereby improves the ease of walking for novice users.

Furthermore, no significant effect of the data representation on spatial knowledge acquisition can be determined, which is an interesting finding. It indicates that spatial knowledge acquisition is not reduced when exploring the scene from the very beginning of the reconstruction process, compared to exploring the scene in its entirety after the reconstruction has finished. This result is in contrast to the finding of Passini [Pas92], who states that spatial knowledge is decreased when fewer decisions are made during exploration, such as when following a guide. With the proposed system, the map is expanded only at one place at a time, which also limits the decision of where to go. With a streamed representation, solely the perceived mental load is increased. Both aspects are vital findings for real-world use cases.

CHAPTER 7
Conclusion

In this thesis, a 3D reconstruction system is developed which makes it possible to reconstruct an environment and, at the same time, explore the acquired 3D model of the environment in Virtual Reality at a remote location. Existing systems in the field of 3D reconstruction already make it possible to create such 3D models of real environments by capturing the scene with an RGB-D camera. However, existing prior art lacks the ability to explore the model while it is still being scanned. A live visualization is provided, which shows the current state of the reconstruction. However, this visualization shows the model as seen from the scanning camera's viewpoint. A free exploration, independent of this viewpoint, is only supported after the reconstruction process has finished and a mesh has been extracted in a post-processing step.
The scanning of the environment and the exploration of the reconstructed model are usually performed by two different parties. At location A (server), a person or an autonomous robot, equipped with a mobile computer and an RGB-D camera, is reconstructing an unknown environment. At location B (client), another person wants to explore this remote environment in a virtual way. If the transmission of the reconstructed model and the mesh extraction happen only after the scanning is finished, the person at the client side has to wait a long time before any findings can be obtained.

7.1 Contribution

Within this thesis, the existing state-of-the-art 3D reconstruction framework InfiniTAM is adapted and extended in such a way that it allows scanning large scale environments and, at the same time, exploring these environments at a remote location. The support for large scale models is already provided by InfiniTAM's efficient data representation. It stores the model internally as a volumetric signed distance function. In order to save memory, empty space is ignored by applying Voxel Block Hashing. To be able to store a scene of 90 m², 800–1000 MB are required when the model is stored with color at a resolution of 1 cm. Apart from the low memory footprint, InfiniTAM is implemented efficiently regarding the computational performance and runs on a tablet (NVIDIA Shield, Google Nexus 9) at real-time rates (over 30 fps). Note, however, that the current thesis implementation only runs on Windows platforms.

The live and remote exploration is enabled by adding incremental transmission of the reconstructed 3D model and real-time mesh extraction. With these two additions, the waiting time is avoided and the reconstructed model can be remotely explored from the beginning of the scan process. Environmental knowledge is gained more quickly and decisions can be made earlier. Apart from that, the scanning of the environment could be influenced by the remote party if anything interesting is detected during exploration.

Incremental Model Transmission
The novel networking module streams all voxels that fall out of the scanning camera's view frustum to a remotely connected client. The voxels are compressed in a lossless way using DEFLATE and are transmitted with TCP. In combination with the already space-efficient model representation of InfiniTAM, average data rates between three and four MBit/s are achieved when reconstructing a scene with color at a voxel resolution of 1 cm. Note that this value corresponds only to the payload and does not include any networking overhead.

Dynamic Mesh Update
At the client side, the model is additionally stored in a mesh representation, since a mesh is favored for efficient rendering and interaction. The scene is divided into a regular 3D grid, where each grid element holds its own mesh. Each mesh is recomputed using CPU-based Marching Cubes whenever any underlying voxel changes. Even though the computation is performed on the CPU, the meshing module is able to update the mesh fast enough to keep up with the scanning and reconstruction process. On modern hardware (Core i7-4940MX, GeForce GTX 980M, 16 GB RAM), the computation time for a single mesh is below 100 ms on average.
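As a rough illustration of this per-block update strategy, the following C++ sketch shows the bookkeeping that restricts re-triangulation to those mesh blocks whose underlying voxel data actually changed. It is a simplified, hypothetical sketch rather than the thesis implementation: the BlockCoord type, the scheduler class and the runMarchingCubes callback are placeholders, and only the ratio of 10³ voxel blocks per mesh block is taken from the system settings described in Section 6.3.3.

```cpp
#include <unordered_set>
#include <cstddef>

// Hypothetical integer coordinate of a mesh block in the regular 3D grid.
struct BlockCoord {
    int x, y, z;
    bool operator==(const BlockCoord& o) const { return x == o.x && y == o.y && z == o.z; }
};

struct BlockCoordHash {
    std::size_t operator()(const BlockCoord& c) const {
        // Simple hash mixing the three coordinates.
        return (static_cast<std::size_t>(c.x) * 73856093u) ^
               (static_cast<std::size_t>(c.y) * 19349669u) ^
               (static_cast<std::size_t>(c.z) * 83492791u);
    }
};

// One mesh block covers 10 x 10 x 10 voxel blocks (80 cm mesh blocks, cf. Section 6.3.3).
constexpr int kVoxelBlocksPerMeshBlock = 10;

// Integer division that rounds towards negative infinity (needed for negative coordinates).
static int floorDiv(int a, int b) { return (a >= 0) ? a / b : -((-a + b - 1) / b); }

class MeshBlockScheduler {
public:
    // Called whenever a voxel block (given by its grid position) was created or updated.
    void markVoxelBlockChanged(int vx, int vy, int vz) {
        dirty_.insert(BlockCoord{ floorDiv(vx, kVoxelBlocksPerMeshBlock),
                                  floorDiv(vy, kVoxelBlocksPerMeshBlock),
                                  floorDiv(vz, kVoxelBlocksPerMeshBlock) });
    }

    // Re-triangulates all dirty mesh blocks and clears the dirty set.
    template <typename RemeshFn>
    void updateDirtyMeshes(RemeshFn runMarchingCubes) {
        for (const BlockCoord& block : dirty_)
            runMarchingCubes(block);   // recompute the triangle mesh of this block only
        dirty_.clear();
    }

private:
    std::unordered_set<BlockCoord, BlockCoordHash> dirty_;
};

int main() {
    MeshBlockScheduler scheduler;
    scheduler.markVoxelBlockChanged(3, 12, -4);   // two updates falling into the same mesh block
    scheduler.markVoxelBlockChanged(5, 14, -1);
    scheduler.updateDirtyMeshes([](const BlockCoord&) {
        // Placeholder for the real Marching Cubes pass over this mesh block.
    });
    return 0;
}
```

In the actual system this role is played by the client-side meshing module; the sketch only mirrors the idea of running Marching Cubes exclusively on mesh blocks whose underlying voxels changed.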
Integration of Unreal Engine 4
For visualization and exploration purposes, the extracted meshes are rendered using the existing Unreal Engine 4. A shared-memory-based communication between the client-side InfiniTAM application and the UE4 project is implemented in order to forward the mesh data. Within UE4, the rendering is performed at over 200 fps, even for large scenes consisting of over ten million triangles. The dynamic update of the mesh data during runtime does not lead to notable drops in the frame rate.

Exploration in Virtual Reality
The remote environment can be explored in an immersive Virtual Reality setup, consisting of an HMD and an omnidirectional treadmill. A UE4 plugin is developed in order to integrate the Virtualizer, an omnidirectional treadmill by Cyberith. Using the VR setup, the scene can be explored in a more natural way. A user study is performed to examine the users' general perception of and spatial understanding within 3D reconstructed environments using the VR devices. The results indicate a high acceptance of the general concept, combined with a high degree of perceived immersion into the virtual real-world environment. A significant effect of the input device on spatial knowledge acquisition could not be found, which might be due to problems users encountered when using the ODT for the first time. The results revealed that spatial understanding is not affected by the way the data is presented to users, which is promising for use cases that require exploration from the very beginning of the reconstruction process. The presented work provides a foundation for enabling immersive exploration of remotely captured and incrementally reconstructed dense 3D scenes and for understanding users' perception within these real-world virtual environments.

7.2 Future Work

The implemented system allows scanning large environments and storing them efficiently in memory. However, in order to be used by untrained users in arbitrary scenes, the camera pose estimation needs to be improved. Due to frequent tracking failures, it is challenging to process longer video sequences, especially in areas with few geometric features. Even if one manages to reconstruct the whole environment, camera drift inevitably leads to misaligned models. The effect of tracking failures can be mitigated by integrating a pose recovery procedure. If a valid camera pose is found again, the scanning process does not have to be started from the beginning, but can continue from the current position. Camera drift can be minimized by detecting already visited places and closing the loop. Challenging in this case is the fact that the models of both server and client need to be updated. One option, which avoids retransmitting all changed parts of the model, is to send only the loop closure instructions and to perform the actual deformation on both server and client. Apart from loop closure, the general tracking accuracy can be improved by fusing various types of sensors. Currently, only the depth images are used for tracking, but one can also use the color information, IMU data or the strength of Wi-Fi or Bluetooth signals.

As the technical evaluation revealed, a mesh (or a point cloud) representation is more memory efficient than the TSDF. Streaming the mesh data instead of the voxels can significantly reduce the required data rate. For that purpose, the meshing has to be performed on the limited mobile platform and therefore, a computationally more efficient mesh extraction has to be developed. Moreover, one needs to find a strategy for coping with the dynamic nature of the volumetric data structure.
Currently, changes in the volumetric representation are reflected at the client side right away by recomputing the corresponding meshes. If the mesh extraction happens on the server side, the meshes would have to be retransmitted after every update. This also includes the unchanged parts of the mesh. Independently of the data type to be sent, the data rate can be improved by choosing a proper network protocol. The current implementation uses the TCP protocol because it is widely supported. However, it contains unrequired features, which increase the data rate. In future work, a more efficient protocol with less overhead can be evaluated and 87 7. Conclusion integrated. Apart from that, the network transmission should be ported from Windows to a mobile operating system, like Android or iOS, by replacing Winsock2 with a multi- platform library. This way, the 3D reconstruction process can be performed conveniently with a handheld tablet. Regarding the mesh computation, a current drawback is the high number of triangles generated by Marching Cubes. In order to allow real-time interaction for larger scenes, the meshes need to be optimized. One possibility is geometry simplification, such as plane detection. The planes can then be represented with fewer triangles. Challenging is however the fact, that the volumetric data does not only expand, but can also change in existing regions. One could detect meshes, which did not change in the recent past and fuse those to bigger meshes, so that larger optimizations are possible. Whenever underlying voxels changed, the bigger mesh is replaced by the smaller ones again. Finally, the influence of navigational input on spatial knowledge acquisition requires further investigations, i.e. by examining standard input devices in a non-immersive VR, the updated version of the Virtualizer ODT as well as real walking in immersive VR. 88 APPENDIX A User Study Pre-Questionnaire 89 Participant Number: ________ Age ___________ (years / months) [ ] Male [ ] Female Do you have previous experience with Virtual Reality? [ ] Yes [ ] No Do you own a head mounted display (HMD)? [ ] Yes [ ] No Pre-Exposure Simulator Sickness Please circle below if any of the symptoms apply to you now. You will be asked to fill this again after the experiment. 1. General discomfort None Slight Moderate Severe 2. Fatigue None Slight Moderate Severe 3. Boredom None Slight Moderate Severe 4. Drowsiness None Slight Moderate Severe 5. Headache None Slight Moderate Severe 6. Eyestrain None Slight Moderate Severe 7. Difficulty focusing None Slight Moderate Severe 8. Salivation increase None Slight Moderate Severe 9. Salivation decrease None Slight Moderate Severe 10. Sweating None Slight Moderate Severe 11. Nausea None Slight Moderate Severe 12. Difficulty None Slight Moderate Severe concentrating 1/2 13. Mental depression No Yes ( Slight Moderate Severe ) 14. "Fullness of the head" No Yes ( Slight Moderate Severe ) 15. Blurred vision No Yes ( Slight Moderate Severe ) 16. Dizziness eyes open No Yes ( Slight Moderate Severe ) 17. Dizziness eyes close No Yes ( Slight Moderate Severe ) 18. Vertigo No Yes ( Slight Moderate Severe ) 19. Visual flashbacks No Yes ( Slight Moderate Severe ) 20. Faintness No Yes ( Slight Moderate Severe ) 21. Aware of breathing No Yes ( Slight Moderate Severe ) 22. Stomach awareness No Yes ( Slight Moderate Severe ) 23. Loss of appetite No Yes ( Slight Moderate Severe ) 24. Increased appetite No Yes ( Slight Moderate Severe ) 25. 
Desire to move bowels No Yes ( Slight Moderate Severe ) 26. Confusion No Yes ( Slight Moderate Severe ) 27. Burping No Yes ( Slight Moderate Severe ) 28. Vomiting No Yes ( Slight Moderate Severe ) 29. Other No Yes ( Slight Moderate Severe ) 2/2 APPENDIX B User Study Post-Questionnaire 93 1/6 Participant Number: ________ Please indicate what kind of navigation input you have been using during the experiement: [ ] Gamepad [ ] Omnidirectional Treadmill Please indicate how the virtual environment was presented to you: [ ] All at once [ ] Incrementally over time Post-Exposure Simulator Sickness Please circle below again if any of the symptoms apply to you now. 1. General discomfort None Slight Moderate Severe 2. Fatigue None Slight Moderate Severe 3. Boredom None Slight Moderate Severe 4. Drowsiness None Slight Moderate Severe 5. Headache None Slight Moderate Severe 6. Eyestrain None Slight Moderate Severe 7. Difficulty focusing None Slight Moderate Severe 8. Salivation increase None Slight Moderate Severe 9. Salivation decrease None Slight Moderate Severe 10. Sweating None Slight Moderate Severe 11. Nausea None Slight Moderate Severe 12. Difficulty None Slight Moderate Severe concentrating 13. Mental depression No Yes ( Slight Moderate Severe ) 2/6 14. "Fullness of the head" No Yes ( Slight Moderate Severe ) 15. Blurred vision No Yes ( Slight Moderate Severe ) 16. Dizziness eyes open No Yes ( Slight Moderate Severe ) 17. Dizziness eyes close No Yes ( Slight Moderate Severe ) 18. Vertigo No Yes ( Slight Moderate Severe ) 19. Visual flashbacks No Yes ( Slight Moderate Severe ) 20. Faintness No Yes ( Slight Moderate Severe ) 21. Aware of breathing No Yes ( Slight Moderate Severe ) 22. Stomach awareness No Yes ( Slight Moderate Severe ) 23. Loss of appetite No Yes ( Slight Moderate Severe ) 24. Increased appetite No Yes ( Slight Moderate Severe ) 25. Desire to move bowels No Yes ( Slight Moderate Severe ) 26. Confusion No Yes ( Slight Moderate Severe ) 27. Burping No Yes ( Slight Moderate Severe ) 28. Vomiting No Yes ( Slight Moderate Severe ) 29. Other No Yes ( Slight Moderate Severe ) 3/6 Exploration & Orientation 1. Please specify how much you agree with the following statement: “I was able to EASILY find my way the environment? Strongly disagree Disagree Neither agree nor disagree Agree Strongly agree 1 2 3 4 5 2. Please specify how much you agree with the following statement: “I was able to QUICKLY find my way through the environment? Strongly disagree Disagree Neither agree nor disagree Agree Strongly agree 1 2 3 4 5 3. Please specify how much you agree with the following statement: “I was able to QUICKLY get an overview of the entire indoor environment? Strongly disagree Disagree Neither agree nor disagree Agree Strongly agree 1 2 3 4 5 4. Please indicate how well you estimate your spatial understanding of the presented scene. Very poor Poor Acceptable Good Very Good 1 2 3 4 5 5. Please rate how much you enjoyed using the device for navigation: Not at all Not much Moderate Much Very much 1 2 3 4 5 6. Please rate how much you enjoyed the quality of the virtual environment: Not at all Not much Moderate Much Very much 1 2 3 4 5 4/6 Task Load (TLX) Please indicate your perceived amount of the following statements regarding the exploration task: 1. How mentally demanding was the task? Very low Low Average High Very High 1 2 3 4 5 2. How physically demanding was the task? Very low Low Average High Very High 1 2 3 4 5 3. How hurried or rushed was the pace of the task? 
Very low Low Average High Very High 1 2 3 4 5 4. How successful were you in accomplishing what you were asked to do? Very Poor Poor Acceptable Good Very Good 1 2 3 4 5 5. How hard did you work to accomplish your level of performance? Very low Low Average High Very High 1 2 3 4 5 6. How insecure, discouraged, irritated, stressed and annoyed were you? Very low Low Average High Very High 1 2 3 4 5 System Usability (SUS) Please specify how much you agree with the following statement: 1. I think that I would like to use this concept frequently Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 2. I found the concept unnecessarily complex Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 3. I thought the concept was easy to use Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 5/6 4. I think that I would need the support of a technical person to be able to use this concept Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 5. I found the various functions in this concept were well integrated Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 6. I thought there was too much inconsistency in this concept Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 7. I would imagine that most people would learn to use this concept very quickly Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 8. I found the concept very cumbersome to use Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 9. I felt very confident using the concept Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 10. I needed to learn a lot of things before I could get going with the concept Strongly disagree Disagree Neither agree or disagree Agree Strongly agree 1 2 3 4 5 6/6 Open Questions Please describe in a few words your experience with the application. Please describe in a few words your exploration strategy. What other uses can you suggest for the concept? (please think of at least 3 uses) List of Figures 1.1 Incremental transmission and exploration of a reconstructed scene while scanning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 RGB-D camera Xtion Pro Live from Asus. . . . . . . . . . . . . . . . . . . . 8 2.2 Distorted reconstruction due to camera drift. . . . . . . . . . . . . . . . . . . 10 3.1 Principle of a Moving Volume approach. . . . . . . . . . . . . . . . . . . . . . 12 3.2 Comparison of different volumetric data structures for 3D reconstruction. . . 14 3.3 Marching Cubes triangulation of a 3D reconstruction. . . . . . . . . . . . . . 18 4.1 Overview of the reconstruction pipeline. . . . . . . . . . . . . . . . . . . . . . 22 4.2 Example of a TSDF for a simple object. . . . . . . . . . . . . . . . . . . . . . 23 4.3 Principle of the Voxel Block Hashing data structure. . . . . . . . . . . . . . . 24 4.4 Hash table structure at Voxel Block Hashing. . . . . . . . . . . . . . . . . . . 24 4.5 Difference between point-to-point and point-to-plane distance. . . . . . . . . 26 4.6 Voxel block visibility check for GPU - CPU swapping. . . . . . . . . . . . . . 29 4.7 GPU - CPU swapping procedure. . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.8 Top view on the mesh block grid of a test scene. . . . . . . . . . . . . . . . . 
35 4.9 Principle of Marching Cubes in two dimensions (Marching Squares). . . . . . 38 4.10 Oculus Rift HMD (Developer Kit 2). . . . . . . . . . . . . . . . . . . . . . . . 41 4.11 Cyberith Virtualizer omnidirectional treadmill. . . . . . . . . . . . . . . . . . 42 5.1 Overview of the most important classes and their relations. . . . . . . . . . . 47 5.2 Overview of the individual threads and their communication. . . . . . . . . . 49 5.3 Screenshot of the InfiniTAM server application. . . . . . . . . . . . . . . . . . 51 5.4 Screenshot of the client application. . . . . . . . . . . . . . . . . . . . . . . . 53 5.5 Main steps of the ITMSwappingEngine and ITMStreamingEngine for network transmission at the server side. . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.6 Main steps of the ITMStreamingEngine at the client side. . . . . . . . . . . . 57 5.7 Main steps of the scene update in InfiniTAM. . . . . . . . . . . . . . . . . . . 58 5.8 Class structure of the UE4 Virtualizer plugin. . . . . . . . . . . . . . . . . . . 60 5.9 Integration of the Virtualizer via UE4’s blueprint system. . . . . . . . . . . . 61 5.10 Class structure of the client when using standalone visualization. . . . . . . . 61 101 6.1 Top view on reconstructions of the three used test data sets. . . . . . . . . . 65 6.2 Average data rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.3 Required memory to store the Flat scene either as mesh or as TSDF voxel blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.4 Principle of the projective convergence task . . . . . . . . . . . . . . . . . . . 75 6.5 Two different input/output device setups. . . . . . . . . . . . . . . . . . . . . 76 6.6 Locations used for projective convergence and wayfinding tasks. . . . . . . . . 78 6.7 Result of a single projective convergence task. . . . . . . . . . . . . . . . . . . 78 6.8 Qualitative results for the perceived task load. . . . . . . . . . . . . . . . . . 80 6.9 Quantitative results of the projective convergence task. . . . . . . . . . . . . . 81 6.10 Example paths traveled during the wayfinding task. . . . . . . . . . . . . . . 82 6.11 Quantitative results of the wayfinding task. . . . . . . . . . . . . . . . . . . . 82 List of Tables 6.1 Hash table occupancy with default hash table size. . . . . . . . . . . . . . . . 67 6.2 Hash table occupancy with enlarged hash table. . . . . . . . . . . . . . . . . . 67 6.3 Amount of transmitted voxel blocks and data in MB along with the achieved compression ratio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.4 Average and maximum data rates of all three test scenes. . . . . . . . . . . . 69 6.5 Number and size of the extracted meshes (for colored reconstructions only). 71 102 Bibliography [AFDM08] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. Fast and Incremental Method for Loop-Closure Detection using Bags of Visual Words. IEEE Transactions on Robotics, 24:1027–1037, 2008. [Asu16] Asus. Xtion Pro Live [Online]. https://www.asus.com/3D- Sensor/Xtion_PRO_LIVE/, Last accessed on February 10, 2016. [Bjö96] A. Björck. Numerical Methods for Least Squares Problems. Siam, 1996. [BKJP04] D. A. Bowman, E. Kruijff, LaViola J., and I. Poupyrev. 3D User Interfaces: Theory and Practice. Addison-Wesley, 2004. [BM92] P. J. Besl and N. D. McKay. A Method for Registration of 3-D Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:239–256, 1992. [Bro96] J. Brooke. 
Sus - a quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996. [BTG97] J. Bliss, P. D. Tidwell, and M. A. Guest. The Effectiveness of Virtual Reality for Administering Spatial Navigation Training to Firefighters. Presence: Teleoperators and Virtual Environments, 6:73–86, 1997. [Cam16] S. Campisi. Hyper IMU (version 1.5) [Software]. https://play.google.com/store/apps/details?id=com.ianovir.hyper_imu, Last accessed on February 11, 2016. [CBI13] J. Chen, D. Bautembach, and S. Izadi. Scalable Real-time Volumetric Surface Reconstruction. ACM Transactions on Graphics (TOG), 32:113:1– 113:16, 2013. [CH14] T. Cakmak and H. Hager. Cyberith Virtualizer: A Locomotion Device for Virtual Reality. In ACM SIGGRAPH Emerging Technologies, pages 6:1–6:1, 2014. 103 [Cho16] M. Chourdakis. Ultimate Shared Memory: A Flex- ible Class for Interprocess Memory Sharing [Software]. http://www.codeproject.com/Articles/835818/Ultimate-Shared-Memory- A-flexible-class-for-interp, Last accessed on February 10, 2016. [CL96] B. Curless and M. Levoy. A Volumetric Method for Building Complex Models from Range Images. In ACM Conference on Computer Graphics and Interactive Techniques, pages 303–312, 1996. [Cor16a] Microsoft Corporation. Microsoft Kinect [Online]. https://www.microsoft.com/en-us/kinectforwindows, Last accessed on February 10, 2016. [Cor16b] Nvidia Corporation. Cuda (version 7.0) [software]. https://developer.nvidia.com/cuda-toolkit, Last accessed on February 10, 2016. [CTF14] M. Coatsworth, J. Tran, and A. Ferworn. A Hybrid Lossless and Lossy Compression Scheme for Streaming RGB-D Data in Real-Time. In IEEE International Symposium on Safety, Security, and Rescue Robotics, pages 1–6, 2014. [CZK15] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust Reconstruction of Indoor Scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5556–5565, 2015. [Deu96] P. Deutsch. DEFLATE Compressed Data Format Specification Version 1.3. Technical report, RFC 1951, 1996. [DG96] P. Deutsch and J.-L. Gailly. Zlib Compressed Data Format Specification Version 3.3. Technical report, RFC 1950, 1996. [Epi16] Epic Games. Unreal Engine 4.9 [Software]. https://www.unrealengine.com, Last accessed on February 10, 2016. [FB81] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 24:381–395, 1981. [FCSS09] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Reconstructing Building Interiors from Images. In IEEE International Conference on Computer Vision (ICCV), pages 80–87, 2009. [For02] B. A. Forouzan. TCP/IP Protocol Suite. McGraw-Hill, Inc., 2002. [Fre16] FreeGLUT. Free OpenGL Utility Toolkit (version 3.0) [Software]. http://freeglut.sourceforge.net/, Last accessed on February 10, 2016. 104 [FTF+15] N. Fioraio, J. Taylor, A. Fitzgibbon, L. Di Stefano, and S. Izadi. Large-Scale and Drift-Free Surface Reconstruction Using Online Subvolume Registration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4475–4483, 2015. [FVDF+94] J. D. Foley, A. Van Dam, S. K. Feiner, J. F. Hughes, and R. L. Phillips. Introduction to Computer Graphics. Addison-Wesley Reading, 1994. [GG07] Y. Gu and R. L. Grossman. UDT: UDP-Based Data Transfer for High-Speed Wide Area Networks. Computer Networks, 51:1777–1799, 2007. [GK15] T. Golla and R. Klein. Real-time point cloud compression. 
In IEEE International Conference on Intelligent Robots and Systems (IROS), pages 5087–5092, 2015. [HF16] F. Heredia and R. Favier. Kinfu Large Scale [Online], Last accessed on February 10, 2016. [Hig16] High Tech Computer Corporation / Valve Corporation. HTC Vive [Online]. http://www.htcvr.com, Last accessed on February 10, 2016. [HKH+12] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D Mapping: Using Kinect-Style Depth Cameras for Dense 3D modeling of Environments. The International Journal of Robotics Research, 31:647–663, 2012. [HS88] S. Hart and L. Staveland. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. Advances Psychology, 52:139–183, 1988. [Inf16] Infinadeck. Infinadeck [Online]. http://infinadeck.com/, Last accessed on February 10, 2016. [KBH06] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson Surface Reconstruction. In Eurographics Symposium on Geometry Processing, pages 61–70, 2006. [KBR+12] J. Kammerl, N. Blodow, R. B. Rusu, S. Gedikli, M. Beetz, and E. Steinbach. Real-Time Compression of Point Cloud Streams. In IEEE International Conference on Robotics and Automation (ICRA), pages 778–785, 2012. [KPR+15] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. Torr, and D. Murray. Very High Frame Rate Volumetric Integration of Depth Images on Mobile Devices. IEEE Transactions on Visualization and Computer Graphics, 21:1241–1250, 2015. [LC87] W. E. Lorensen and H. E. Cline. Marching Cubes: A High Resolution 3D Surface Construction Algorithm. SIGGRAPH Comput. Graph., 21:163–169, 1987. 105 [LM13] M. Labbe and F. Michaud. Appearance-Based Loop Closure Detection for Online Large-Scale and Long-Term Operation. IEEE Transactions on Robotics, 29:734–745, 2013. [MLDH15] A. Maglo, G. Lavoué, F. Dupont, and C. Hudelot. 3d mesh compression: Survey, comparisons, and emerging trends. ACM Computing Surveys, 47:44:1–44:41, 2015. [MOCGR14] V. Morell, S. Orts, M. Cazorla, and J. Garcia-Rodriguez. Geometric 3D Point Cloud Compression. Pattern Recognition Letters, 50:55–62, 2014. [Moo76] G. Moore. Environmental Knowing, chapter Theory and Research on the Development of Environmental Knowing, pages 138–164. "Dowden, Hutchinson and Ross", 1976. [MRB09] Z. C. Marton, R. B. Rusu, and M. Beetz. On Fast Surface Reconstruction Methods for Large and Noisy Datasets. In IEEE International Conference on Robotics and Automation (ICRA), pages 3218–3223, 2009. [MS92] M. E. McCauley and T. J. Sharkey. Cybersickness: Perception of Self- Motion in Virtual Environments. Presence: Teleoperators & Virtual Envi- ronments, 1:311–318, 1992. [MWB+13] L. Ma, T. Whelan, E. Bondarev, P. HN de With, and J. McDonald. Planar Simplification and Texturing of Dense Point Cloud Maps. In IEEE European Conference on Mobile Robots (ECMR), pages 164–171, 2013. [NIH+11] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 127–136, 2011. [NIL12] C. V. Nguyen, S. Izadi, and D. Lovell. Modeling Kinect Sensor Noise for Improved 3d Reconstruction and Tracking. In IEEE International Confer- ence on 3D Imaging, Modeling, Processing, Visualization and Transmission, pages 524–530, 2012. [NM12] Nagesha and S. S. Manvi. Performance Analysis of SCTP Compared to TCP and UDP. In Advances in Computing and Information Technology. Springer Berlin Heidelberg, 2012. [NRI12] A. Nybakke, R. 
Ramakrishnan, and V. Interrante. From Virtual to Ac- tual Mobility: Assessing the Benefits of Active Locomotion through an Immersive Virtual Evironment using a Motorized Wheelchair. In IEEE Symposium on 3D User Interfaces (3DUI), pages 27–30, 2012. 106 [NSS14] F. Nenci, L. Spinello, and C. Stachniss. Effective Compression of Range Data Streams for Remote Robot Operations using H.264. In IEEE International Conference on Intelligent Robots and Systems (IROS), pages 3794–3799, 2014. [NY06] T. S. Newman and H. Yi. A Survey of the Marching Cubes Algorithm. Computers & Graphics, 30:854–879, 2006. [NZIS13] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D Re- construction at Scale using Voxel Hashing. ACM Transactions on Graphics (TOG), 32:169:1–169:11, 2013. [Occ16] PrimeSense / Occipital. OpenNI (version 2.2) [Software]. http://structure.io/openni, Last accessed on February 10, 2016. [Ocu16a] Oculus VR. Oculus Rift [Online]. https://www.oculus.com/en-us/rift, Last accessed on February 10, 2016. [Ocu16b] Oculus VR. Oculus Rift Requirements [Online]. https://www.oculus.com/en- us/blog/powering-the-rift/, Last accessed on February 10, 2016. [Pas92] R. Passini. Wayfinding in architecture. Environmental design series. Van Nostrand Reinhold, 1992. [PKC+14] V. A. Prisacariu, O. Kähler, M.-M. Ren C. Y. Cheng, J. Valentin, P. Torr, I. Reid, and D. Murray. A Framework for the Volumetric Integration of Depth Images. Computing Research Repository (CoRR), abs/1410.0925, 2014. [PPS13] P. M Panchal, S. R. Panchal, and S. K. Shah. A Comparison of SIFT and SURF. International Journal of Innovative Research in Computer and Communication Engineering, 1:323–327, 2013. [RB04] B. Riecke and H. Bülthoff. Spatial Updating in Real and Virtual Envi- ronments: Contribution and Interaction of Visual and Vestibular Cues. In ACM Symposium on Applied Perception in Graphics and Visualization, pages 9–17, 2004. [RL01] S. Rusinkiewicz and M. Levoy. Efficient Variants of the ICP Algorithm. In IEEE International Conference on 3-D Digital Imaging and Modeling (3DIM), pages 145–152, 2001. [RL09] R. A. Ruddle and S. Lessels. The Benefits of using a Walking Interface to Navigate Virtual Environments. ACM Transactions on Computer-Human Interaction (TOCHI), 16:5:1–5:18, 2009. [RM12] H. Roth and V. Marsette. Moving Volume KinectFusion. In Proceedings of the British Machine Vision Conference (BMVC), pages 1–11, 2012. 107 [Sal04] D. Salomon. Data Compression: The Complete Reference. Springer Science & Business Media, 2004. [Sie81] A. W. Siegel. The Externalization of Cognitive Maps by Children and Adults: In Search of Ways to Ask Better Questions. In Spatial Representa- tion and Behavior Across the Life Span. Academic Press, 1981. [SKCS13] F. Steinbrücker, C. Kerl, D. Cremers, and J. Sturm. Large-Scale Multi- Resolution Surface Reconstruction from RGB-D Sequences. In IEEE International Conference on Computer Vision (ICCV), pages 3264–3271, 2013. [SM01] R. Stewart and C. Metz. SCTP: New Transport Protocol for TCP/IP. IEEE Internet Computing, 5:64–69, 2001. [Son16] Sony. Playstation VR [Online]. https://www.playstation.com/en- us/explore/playstation-vr/, Last accessed on February 10, 2016. [SSC11] F. Steinbrücker, J. Sturm, and D. Cremers. Real-time Visual Odometry from Dense RGB-D Images. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 719–722, 2011. [SW75] A. W. Siegel and S. H. White. The Development of Spatial Representations of Large-Scale Environments. 
Advances in Child Development and Behavior, 10:9–55, 1975. [TL08] S. "Thrun and J." Leonard. "Springer Handbook of Robotics", chapter "Simultaneous Localization and Mapping", pages "871–889". "Springer Berlin Heidelberg", "2008". [Uni16] Unity Technologies. Unity 5 [Software]. https://unity3d.com, Last accessed on February 10, 2016. [Vir16] Virtuix. Virtuix Omni [Online]. http://www.virtuix.com, Last accessed on February 10, 2016. [Wal99] D. A. Waller. An Assessment of Individual Differences in Spatial Knowledge of Real and Virtual Environments. PhD thesis, University of Washington, 1999. [WJK+13] T. Whelan, H. Johannsson, M. Kaess, J. Leonard, and J. McDonald. Robust Real-Time Visual Odometry for Dense RGB-D Mapping. In IEEE International Conference on Robotics and Automation (ICRA), pages 5724– 5731, 2013. [WKJ+14] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald. Real-Time Large-Scale Dense RGB-D SLAM with Volumetric Fusion. The International Journal of Robotics Research, 34:598–626, 2014. 108 [WMB+15] T. Whelan, L. Ma, E. Bondarev, P. de With, and J. McDonald. Incremental and Batch Planar Simplification of Dense Point Cloud Maps. Robotics and Autonomous Systems, 69:3–14, 2015. [WMG+12] A. Wendel, M. Maurer, G. Graber, T. Pock, and H. Bischof. Dense Reconstruction on-the-fly. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1450–1457, 2012. [WS98] B. Witmer and M. Singer. Measuring Presence in Virtual Environments: A Presence Questionnaire. Presence: Teleoperators and Virtual Environments, 7(3):225–240, 1998. [ZK13] Q.-Y. Zhou and V. Koltun. "Dense Scene Reconstruction with Points of Interest". ACM Transactions on Graphics (TOG), 32:112, 2013. [ZZZL13] M. Zeng, F. Zhao, J. Zheng, and X. Liu. Octree-Based Fusion for Realtime 3D Reconstruction. Graphical Models, 75:126–136, 2013. 109