Automated Digital Content Creation from Point Clouds and Image Data

DIPLOMARBEIT zur Erlangung des akademischen Grades Diplom-Ingenieur im Rahmen des Studiums Visual Computing, eingereicht von Georg Schenzel, BSc, Matrikelnummer 01633078, an der Fakultät für Informatik der Technischen Universität Wien. Betreuung: Mag. Dr.techn. Peter Kán. Mitwirkung: Univ.Prof. Mag.rer.nat. Dr.techn. Hannes Kaufmann. Wien, 14. Oktober 2024. Georg Schenzel, Peter Kán

Automated Digital Content Creation from Point Clouds and Image Data

DIPLOMA THESIS submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Visual Computing by Georg Schenzel, BSc, Registration Number 01633078, to the Faculty of Informatics at the TU Wien. Advisor: Mag. Dr.techn. Peter Kán. Assistance: Univ.Prof. Mag.rer.nat. Dr.techn. Hannes Kaufmann. Vienna, October 14, 2024. Georg Schenzel, Peter Kán

Erklärung zur Verfassung der Arbeit

Georg Schenzel, BSc

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe. Ich erkläre weiters, dass ich mich generativer KI-Tools lediglich als Hilfsmittel bedient habe und in der vorliegenden Arbeit mein gestalterischer Einfluss überwiegt. Im Anhang „Übersicht verwendeter Hilfsmittel“ habe ich alle generativen KI-Tools gelistet, die verwendet wurden, und angegeben, wo und wie sie verwendet wurden. Für Textpassagen, die ohne substantielle Änderungen übernommen wurden, habe ich jeweils die von mir formulierten Eingaben (Prompts) und die verwendete IT-Anwendung mit ihrem Produktnamen und Versionsnummer/Datum angegeben.

Wien, 14. Oktober 2024
Georg Schenzel

Danksagung

Ich möchte diese Gelegenheit nutzen, um allen zu danken, die mich während dieser Diplomarbeit unterstützt haben. Als Allererstes geht meine tiefste Dankbarkeit an meinen Betreuer Peter Kán, der mir von Anfang an wertvolle Unterstützung gegeben hat. Er hat mir geholfen, meine Ideen zu entwickeln, und sein ständiges konstruktives und detailliertes Feedback hat wesentlich zur Qualität dieser Arbeit beigetragen. Ich möchte auch meinem Co-Betreuer Hannes Kaufmann meinen Dank aussprechen. Er hat mir vor allem beim Start dieses Projektes geholfen und die Zusammenarbeit mit Magna ermöglicht. Danke an alle Mitarbeiter von Magna, mit denen ich zusammenarbeiten durfte: Daniel Schleicher, Banu Bueyueker und Pavlo Tkachenko. Sie haben mich in die richtige Richtung gelenkt und mir wertvolles Feedback gegeben. Danke an Peter Widmer von NavVis. Ich weiß es sehr zu schätzen, dass er nach Wien gekommen ist und mir den VLX-Laserscanner zur Verfügung gestellt hat, der es mir ermöglicht hat, hochwertige Daten zu generieren, die für die erzielten Ergebnisse unerlässlich waren. Abschließend möchte ich mich bei meiner Familie und meinen Freunden herzlichst bedanken. Ihre Unterstützung und ihr Ansporn haben mir geholfen, während all der Monate, die ich mit dem Entwickeln, Experimentieren und Schreiben dieser Arbeit verbracht habe, motiviert und konzentriert zu bleiben.
Acknowledgements

I would like to take this opportunity to thank everyone who supported me throughout the process of writing this thesis. First and foremost, my deepest gratitude goes to my supervisor, Peter Kán, for providing invaluable guidance from the very beginning. He helped me develop my ideas, and his continuous, constructive, and detailed feedback greatly contributed to the quality of this work. I would also like to extend my thanks to my co-supervisor, Hannes Kaufmann, for his help in getting this project started and for facilitating the collaboration with Magna. Thanks to everyone from Magna with whom I had the pleasure of working: Daniel Schleicher, Banu Bueyueker, and Pavlo Tkachenko. They helped guide me in the right direction and provided valuable feedback. Thanks to Peter Widmer from NavVis. I highly appreciate that he came to Vienna and let me use the VLX laser scanner, which allowed me to generate the high-quality input data essential for the results I achieved. Lastly, I want to express my heartfelt thanks to my family and friends. Their support and encouragement helped me stay motivated and focused throughout all the months I spent developing, experimenting, and writing this thesis.

Kurzfassung

Diese Arbeit stellt eine Pipeline vor, die reale Daten wie Punktwolken und Bilder nutzt, um digitale Zwillinge für autonome Fahrsimulationen zu erstellen. Simulationen spielen eine entscheidende Rolle bei der Entwicklung sicherer autonomer Fahrsysteme, da sie kostengünstige und risikofreie Tests ermöglichen. Um die Zuverlässigkeit dieser Simulationen zu gewährleisten, müssen die virtuellen Umgebungen der realen Welt sehr ähnlich sein. Unsere Pipeline erzeugt aus den Eingabedaten hochwertige 3D-Meshes mit fotorealistischen Texturen. Zusätzlich wird eine semantische Segmentierung des rekonstruierten 3D-Modells durchgeführt, die als Grundlage für nachfolgende Simulationsanwendungen dient. Diese semantische Segmentierung wird durch einen Virtual-View-Ansatz erreicht, bei dem 2D-Renderings der Szene mit einem vortrainierten Modell segmentiert werden und die Ergebnisse dann auf die 3D-Szene zurückprojiziert werden. Informationen über den Straßenverlauf und die Fahrspuren werden aus OpenStreetMap bezogen und mit dem 3D-Modell überlagert. Schlussendlich wird das Ergebnis der Pipeline zur Erstellung einer virtuellen Umgebung im Fahrsimulator CARLA verwendet. Wir haben Bild- und Punktwolkendaten von drei verschiedenen Orten gesammelt und die Pipeline mit diesen Daten getestet. Wir haben die Unterschiede in den Rekonstruktionen aus beiden Eingabemodalitäten verglichen und ihre Effektivität für praktische Anwendungen evaluiert. Die Rekonstruktionen der Szenen wurden manuell semantisch annotiert, um Referenzwerte für die quantitative Evaluierung des semantischen 3D-Segmentierungsalgorithmus zu erhalten. Die Pipeline wurde in Python mit dem Ziel eines hohen Automatisierungsgrades implementiert. Sie ist in der Lage, innerhalb weniger Stunden einen qualitativ hochwertigen digitalen Zwilling zu erstellen, wobei der/die Benutzer/in weniger als 20 Minuten für manuelle Tätigkeiten benötigt. Der semantische Segmentierungsalgorithmus erreicht einen mIoU-Wert von 55,2 und einen F1-Wert von 67,1, was eine gute Leistung der Segmentierung der Gitterpunkte in unseren Datensätzen widerspiegelt. Dieser Ansatz ist ein Schritt nach vorne für eine sicherere und schnellere Entwicklung von automatisierten Fahrsystemen.
Abstract

This thesis presents a pipeline that leverages real-world data, such as point clouds and images, to create digital twins for autonomous driving simulations. Simulations play a crucial role in the development of safe automated driving systems, as they enable cost-effective and risk-free testing. To ensure the reliability of these simulations, virtual environments must closely resemble the real world. Our pipeline generates high-quality 3D meshes with photorealistic textures from the input data. Additionally, a 3D semantic segmentation of the reconstructed mesh is performed, providing ground truth data for downstream simulation tasks. This semantic segmentation is achieved using a virtual-view approach, where 2D renderings of the scene are segmented with an off-the-shelf model, and the predictions are projected back into the 3D scene. Information about the road layout and lanes is obtained from OpenStreetMap and aligned with the mesh. Finally, the pipeline output is used to create a virtual map in the driving simulator CARLA. We captured image and point cloud data from three locations and tested the pipeline using this input. We compared the differences in reconstructions from both input modalities, assessed their feasibility, and evaluated their effectiveness for practical applications. Reconstructions of the scenes were manually semantically annotated to provide ground truth for quantitative evaluation of the 3D semantic segmentation algorithm. The pipeline was implemented in Python with the goal of achieving a high degree of automation. It can produce a high-quality digital twin in a matter of hours, requiring minimal user intervention of under 20 minutes. The semantic segmentation algorithm achieves an mIoU of 55.2 and an F1 score of 67.1, reflecting a good performance for labeling the vertices of our datasets. This streamlined approach is a step forward for safer and faster development of automated driving systems.

Contents

Kurzfassung
Abstract
Contents
1 Introduction
1.1 Goals of the Thesis
1.2 Methodology
1.3 Structure of the Thesis
2 Related Work
2.1 Autonomous Driving
2.2 Driving Simulations
2.3 Mesh Reconstruction
2.4 3D Semantic Segmentation
2.5 Datasets
3 Pipeline
3.1 Requirements
3.2 Pipeline Overview
3.3 Input Data
3.4 Mesh Reconstruction
3.5 Semantic Segmentation
3.6 Georeferencing
3.7 Roads
3.8 Manual Actions
4 Implementation
4.1 Dependencies
4.2 Pipeline
4.3 Reconstruction
4.4 Semantic Segmentation
4.5 Road Networks
4.6 Mesh Splitting
5 Evaluation
5.1 Data
5.2 Reconstruction
5.3 Segmentation
5.4 Automation
6 Conclusion
6.1 Limitations
6.2 Future Work
Overview of Generative AI Tools Used
List of Figures
List of Tables
Bibliography

CHAPTER 1 Introduction

Advancements in computer vision, vehicle dynamics, and the availability of better sensor modalities are making automated driving systems (ADSs) increasingly relevant. Simulations play a crucial role in the development and testing of ADSs. These simulations require high-quality virtual environments to ensure realistic and effective testing scenarios. Digital twins of real-world scenes are valuable for simulating edge cases or identifying issues in specific scenarios. However, creating such a digital twin is both time-consuming and expensive. These scenes require high-quality, accurate meshes for physical driving simulations, watertight and artifact-free surfaces, and high-quality textures for photorealistic rendering. Additionally, the models should include semantic labels to provide ground truth for downstream tasks, as well as information about the road network and traffic signs to enable automatic navigation and simulation of vehicles and pedestrians. Instead of manually creating such scenes from scratch, data from laser scans or images can be used to aid their creation. Many tools can produce 3D reconstructions from this data, but they typically only produce a 3D mesh of the scene. Performing a semantic segmentation of the mesh and identifying the road metadata is a complex task with no out-of-the-box solution.

1.1 Goals of the Thesis

This thesis aims to produce a (semi-)automatic pipeline to aid in creating simulation-ready virtual scenes for the simulation platform CARLA [14]. CARLA requires two pieces of data to create a virtual world: a semantically segmented 3D mesh and a road network specification in the form of an OpenDrive [3] file. The mesh reconstruction should use either point clouds or images captured at the real-world scene as input and produce a mesh with high-quality textures. CARLA requires the meshes to be split into sub-meshes depending on their semantic class. The whole process should be as automatic as possible, requiring minimal manual intervention. The next goal was to explore the differences between the two input modalities, evaluate the resulting reconstructions, and assess the feasibility of using such data from real-world scenes. Doing so requires image and point-cloud input datasets and ground truth semantically annotated reconstructions. The last goal was to define requirements for the input data and produce guidelines on how to properly capture images and point clouds.
1.2 Methodology

1.2.1 Establishing Requirements

This thesis was done on behalf of Magna Engineering Center Steyr GmbH & Co KG (https://www.magna.com/company/company-information/magna-groups/magna-powertrain). We engaged with experts in autonomous driving simulations to understand their use cases, requirements, and data acquisition options. This collaborative approach ensured that the research was aligned with industry needs and relevant to the challenges faced in autonomous driving simulations.

1.2.2 Literature Research

This thesis spans multiple research fields, which were covered in an extensive literature review. This included research on use cases for autonomous driving simulations to understand how the output data of this thesis is used in the development and testing of ADSs. Furthermore, the state of the art in urban 3D datasets, 3D reconstruction from point clouds and images, and 3D semantic segmentation of meshes was researched.

1.2.3 Pipeline Implementation

Initially, we tested the whole process with a proof-of-concept partial implementation of the pipeline to assess the feasibility of our approach. We captured a small image-based dataset, manually performed a mesh reconstruction with RealityCapture, and then performed a semantic segmentation of the mesh using the preliminary implementation of the segmentation algorithm. This initial assessment showed promising results. The mesh and semantic segmentation showed potential, and the initial driving tests in CARLA were successful. Afterward, a modular and flexible pipeline was built to automate the various steps required for the task. This pipeline allowed for experimentation with different approaches and implementations, which made it possible to improve the produced results iteratively. The pipeline was adjusted based on feedback from Magna to ensure it met their specific needs. Two different reconstruction strategies were implemented with the pipeline: reconstruction from images and reconstruction from point clouds. A virtual-view semantic segmentation approach was implemented to split the mesh into semantic sub-meshes. With this method, we could leverage pre-trained 2D segmentation models to project labeled 2D renderings of the scene back onto the mesh.

1.2.4 Data Acquisition

Data acquisition involved capturing image-based and point cloud-based datasets. For the image-based data, we used a handheld DSLR camera to capture various intersections with hundreds to a few thousand images each. The point cloud data was collected using the VLX mobile mapping device, which was generously provided by NavVis (https://www.navvis.com/).

1.2.5 Dataset Creation

Following the pipeline development and data acquisition, the next step was to create datasets to allow for the evaluation of the pipeline. This involved generating 3D reconstructions from both the captured images and point clouds. These reconstructions were then cleaned and manually annotated to provide a ground truth to test the segmentation algorithm. Creating these datasets was labor-intensive but essential for producing reliable results in the evaluation phase.

1.2.6 Evaluation

The evaluation phase involved quantitative assessments and descriptive analysis of the pipeline and its reconstruction and segmentation results. We analyzed the segmentation quality with the help of our manually annotated 3D dataset. Reconstructions from both data sources, images and point clouds, were compared against each other.
Additionally, we examined the impact of various parameters on the pipeline performance to optimize the outcomes and better understand the strengths and limitations of different approaches.

1.3 Structure of the Thesis

In Chapter 2, the results of our extensive literature review are presented, highlighting state-of-the-art methods and technologies relevant to the different tasks at hand. In Chapter 3, the overall design of the pipeline is described. This includes coverage of our defined requirements, a high-level overview of the pipeline, and a detailed breakdown of each individual step performed. Chapter 4 goes into more detail about the implementation of the various tasks that are performed by the pipeline. In Chapter 5, our created dataset, the pipeline outputs, and our evaluation are presented. Chapter 6 concludes this work by giving a brief summary, highlighting some limitations of the implementation, and proposing related topics for future work.

CHAPTER 2 Related Work

2.1 Autonomous Driving

The field of autonomous driving is vast, with numerous approaches and emerging technologies for automation at varying levels [60]. Many advancements are due to progress in deep learning and artificial intelligence, with deep learning being applied to challenges such as scene perception, localization, and path planning [23, 27]. The ability of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to act as object detectors makes them suitable for use in autonomous driving systems [27]. Deep learning is also used to tackle the problem of localization and mapping. It offers a data-driven alternative to traditional model-based methods like SLAM, enabled by the increasing amount of data and computational power available [27]. Chib et al. [11] categorize autonomous driving systems using deep learning into modular and end-to-end architectures. Modular architectures divide the problem into different sub-tasks, such as object detection, localization, semantic segmentation, path planning, and vehicle control. Such a system relies on the sensor data and multiple algorithms and models to produce control outputs. This approach presents a few challenges: First, errors that occur in the output of a sub-task can be propagated to later tasks. For example, a misclassified object might negatively affect route planning, leading to dangerous situations. Second, managing the different modules increases the complexity of such systems. Third, these systems can be inefficient due to duplicate and unnecessary calculations. End-to-end architectures try to tackle these challenges. In an end-to-end system, sensor inputs are mapped directly to control outputs, and only a single model is trained. This eliminates the chance of error propagation and provides a more efficient system. Due to recent advancements in this field, such systems are no longer a black box. They can generate auxiliary outputs, attention maps, and interpretable maps that can help to reason about the system and identify sources of errors [11]. Self-driving cars have many advantages. For users, they provide a stress-free way of transportation with potentially faster commute times. They can assist governments in traffic enforcement, enhancing road capacities, and reducing the number of accidents by reducing distracted or drunk driving. Furthermore, self-driving cars are a greener mode of transportation.
By reducing car ownership, less parking space is needed, and optimal fuel consumption can be ensured. With shared access to self-driving cars, this can still be an efficient, personalized, and reliable way of transportation [27].

2.2 Driving Simulations

Testing and validating ADSs before deployment on the road is crucial to ensure the safety of all traffic participants. Performing these tests is expensive, highly regulated, and not risk-free. Simulations can be used to check how ADSs behave in various controlled virtual scenarios [31, 60]. The simulations are run on platforms built on modern game engines [14] or even in commercial video games [60]. This thesis focuses on simulations running inside CARLA [14]. CARLA (Car Learning to Act, https://carla.org/) is an open-source simulator for autonomous driving research. It is built on Unreal Engine 4 (https://www.unrealengine.com), which offers high-fidelity graphics and physics simulations. CARLA uses a client-server architecture, where the server handles the simulation, including physics and graphics. The client communicates with the server to interact with the simulation. This includes controlling vehicles, receiving sensor data, or changing the world. The client is implemented as a Python API [14]. OpenDrive [3] is used to declare the road networks used in the simulations. OpenDrive is a specification and file format containing information about road geometry, lanes, road markings, and additional features like signs and traffic lights. CARLA is used by various state-of-the-art autonomous driving systems. Osiński et al. [47] trained a reinforcement-learning-based model with simulations in CARLA and showed a successful sim-to-real policy transfer. Their model receives only RGB images and 2D semantic segmentation maps rendered by CARLA as input. It then outputs the vehicle controls. They tested their system in different driving scenarios, where the vehicle had to follow a route from a given list of checkpoints. For domain randomization, they use ten different weather conditions in CARLA, different simulation qualities, camera settings, and image augmentations. Gutiérrez-Moreno et al. [28] trained and tested another deep reinforcement learning model that handles difficult intersections with CARLA. Since CARLA is very computationally intensive, the model was trained using a simpler simulation platform before being refined using CARLA. To provide a suite of scenarios for training and testing autonomous driving systems, CARLA Real Traffic Scenarios (CRTS) has been built. It provides a suite of scenarios based on real-world traffic conditions and contains tactical tasks that last several seconds [46]. Gomez et al. [22] used CARLA to evaluate their fully autonomous driving architecture using another suite of challenging driving scenarios. These scenarios include stop signs, adaptive cruise control, pedestrians crossing, and pedestrians unexpectedly jumping onto the street. CARLA is also being used to create synthetic datasets by rendering the different sensors and various scenarios [45], lowering the cost of development of such systems [47].
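CARLA's client-server design and Python API, described above, can be illustrated with a short script. The following sketch is not taken from the cited works; it only shows the typical interaction pattern and assumes a CARLA 0.9.x server running locally on the default port:

```python
# Minimal illustration of talking to a running CARLA server through its Python API.
# Assumes a CARLA 0.9.x server listening on the default port 2000 and the matching
# "carla" Python package; output paths and blueprint choices are arbitrary examples.
import random
import carla

client = carla.Client("localhost", 2000)   # connect to the simulation server
client.set_timeout(10.0)

world = client.get_world()                  # handle to the currently loaded map
spawn_points = world.get_map().get_spawn_points()

# Spawn a vehicle and hand it over to CARLA's built-in autopilot.
blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
vehicle = world.spawn_actor(blueprint, random.choice(spawn_points))
vehicle.set_autopilot(True)

# Attach an RGB camera, the kind of rendered sensor data used in the cited works.
cam_bp = world.get_blueprint_library().find("sensor.camera.rgb")
camera = world.spawn_actor(cam_bp,
                           carla.Transform(carla.Location(x=1.5, z=2.4)),
                           attach_to=vehicle)
camera.listen(lambda image: image.save_to_disk("out/%06d.png" % image.frame))
```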
2.3 Mesh Reconstruction

Several methods exist for reconstructing 3D meshes from real-world measurements. Depending on the input modality, different solutions are used.

Point Clouds

Point clouds can be obtained from LiDAR scanners. Reconstructing meshes from point clouds is a well-established problem. Poisson surface reconstruction aims to reconstruct a smooth surface from a set of points and their normals. The goal is to find an implicit function that is zero at each point and whose gradient equals the corresponding normal. This function can be obtained by utilizing Poisson's equation. From this function, a mesh is created. Poisson surface reconstruction considers all points at once, creates watertight models, and produces smooth surfaces that approximate noisy data [37, 38]. The Ball Pivoting Algorithm reconstructs surfaces by rolling a virtual ball over the point cloud. The ball is rotated around edges until it hits a third vertex, forming a new triangle. This method is effective for point clouds with an even distribution. Watertightness of the resulting model is only guaranteed if the sampled points are denser than the radius of the ball [6].

Images

Images can be obtained manually with a handheld camera, from cameras mounted on a capture device [12] (e.g., a car or backpack), or from UAVs [39]. Reconstructing meshes from a set of images is a more complex task, solved with photogrammetry in software like Meshroom [24] or RealityCapture (https://www.capturingreality.com/). Photogrammetry is a process that encompasses many different steps. It starts with feature extraction, where distinctive points in the input images are detected. For robustness, these features must be scale and rotation invariant. The next step is to find pairs of images that overlap. This is done by comparing and matching the features of all images together. Once the images are matched together, the features of pairs of images must be matched. From such a matching, the 3D motion between camera positions can be calculated. Using Structure from Motion [8, 51], these matched features are turned into candidate points in 3D space. A depth map from each image is calculated, and then all depth maps are merged to form a point cloud of the whole scene. This point cloud is then meshed and textured by projecting the images onto the mesh [25].
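Both point-cloud meshing approaches discussed above are available in open-source libraries. The following sketch, which assumes the Open3D Python package and an input point cloud that already carries normals (the file name and radii are placeholders), only illustrates how the two algorithms are typically invoked:

```python
# Illustrative sketch (not the thesis implementation): meshing a point cloud
# with Open3D. Assumes Open3D is installed; "scene.ply" is a placeholder file.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.ply")
if not pcd.has_normals():
    pcd.estimate_normals()                           # may fail on thin objects

# Screened Poisson surface reconstruction: watertight, smooths noisy data.
mesh_poisson, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=10)

# Ball pivoting: rolls balls of the given radii (scene units) over the points.
radii = o3d.utility.DoubleVector([0.05, 0.1, 0.2])
mesh_bpa = o3d.geometry.TriangleMesh.create_from_point_cloud_ball_pivoting(
    pcd, radii)
```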
2.4 3D Semantic Segmentation

3D semantic segmentation is an active research field with various solutions for different domains and input modalities. We can differentiate between the segmentation of point clouds and meshes. Deep neural networks (DNNs) are a prevalent way to tackle this problem. Initially, the research focused on segmenting point clouds. Some models directly consume point clouds [9, 48, 55]. Other models use a view-based approach, where 2D views of the point cloud are fed into a convolutional neural network [41, 7]. 3D point cloud models can be improved by fusing the points with features of 2D views (images) of the scene [35, 21]. Self-attention can be used to enable scalability to scenes with millions of points [63]. Since meshes have some advantages over point clouds for scene representation [2, 17, 58, 29], research shifted to also performing semantic segmentation on 3D meshes. Meshes are more efficient than point clouds for storing large flat surfaces, requiring fewer data points and less memory. Furthermore, they allow for detailed texture storage. Designing a DNN to work on meshes presents the challenge of having to process their unstructured format while also implementing learning operations such as pooling, convolution, and feature aggregation [2, 29]. Hanocka et al. [29] solved this by applying convolutions on edges and the four edges of their incident triangles. Pooling is applied as an edge collapse operation that retains surface topology. Graph-based neural networks are often used for deep learning tasks involving meshes. Some models use a center of gravity (CoG) approach, where each face is considered as a node in the graph. Additionally, features are extracted from the local texture and fused with the nodes [58, 62, 61]. Gao et al. also use a graph-based neural network. Instead of CoGs, they perform a prior over-segmentation that looks at the planarity of the mesh and then use a graph convolutional network on a graph with the segments as nodes [19]. Another approach is to sample points on the mesh, segment the resulting point cloud, and then interpolate the labels back to faces, vertices, or texels [26]. Similarly to the segmentation of point clouds, a view-based approach can be used on meshes, where multiple renderings of the 3D scene are used as the input for the neural network [40, 1, 53]. A view-based method offers the advantage of being able to use 2D models for the segmentation task, for which much larger and more diverse datasets exist. Off-the-shelf 2D semantic segmentation models [64, 57, 59, 10] can be used for this. MMSegmentation (https://github.com/open-mmlab/mmsegmentation) is a semantic segmentation toolbox written in Python that provides various models pre-trained on a diverse range of datasets. Especially interesting for this thesis are models pre-trained on CityScapes [12].
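As an illustration of how such an off-the-shelf model can be queried, the following sketch runs a CityScapes-pretrained model through MMSegmentation's inference API. It assumes MMSegmentation 1.x (the function names differ in older 0.x releases), and the config/checkpoint paths are placeholders standing in for an entry of the MMSegmentation model zoo:

```python
# Illustrative sketch: 2D semantic segmentation with an off-the-shelf model.
# Assumes MMSegmentation 1.x; paths below are placeholders for a
# CityScapes-pretrained model from the MMSegmentation model zoo.
from mmseg.apis import init_model, inference_model

config = "configs/some_model_cityscapes.py"        # placeholder
checkpoint = "checkpoints/some_model_cityscapes.pth"  # placeholder

model = init_model(config, checkpoint, device="cuda:0")
result = inference_model(model, "street_view.png")

# The result holds a per-pixel class map using CityScapes train IDs; the exact
# attribute layout depends on the installed MMSegmentation version.
pred = result.pred_sem_seg.data.cpu().numpy()
```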
2.5 Datasets

Semantically labeled 3D datasets of urban scenes exist for both point clouds [5, 16, 30, 39, 17] and meshes [58, 42, 39, 18, 17]. However, we also need to consider the scale and resolution of these datasets. Existing urban-scale datasets lack the granularity of classes required for the task at hand. While most of them contain labels like building, car, and street, they often do not provide labels for traffic signs and sidewalks, which are of high interest to identify in this project. A few point cloud-based datasets provide this desired granularity [5, 16].

CHAPTER 3 Pipeline

This chapter outlines the design of the pipeline. We start by discussing the data, software, and usability requirements. We provide a high-level overview of the various pipeline configurations based on the different input modalities. Then, each step the pipeline performs and the corresponding design decisions are described in detail. The key topics covered include mesh reconstruction, semantic segmentation, georeferencing, road network generation, and manual operations.

3.1 Requirements

In this section, the pipeline's requirements regarding its usage, input data, and output data are presented. These requirements were established through discussions with Magna, aligning with their intended use cases and specifications. While some requirements were defined at the start of the thesis, they evolved during the development process as new insights emerged and access to more data became available. Additionally, insights were gained from preliminary manual tests involving mesh reconstructions and imports into CARLA. We define the requirements regarding the pipeline's outputs, which are a 3D mesh, a semantic segmentation of said mesh, and aligned road network information. Furthermore, we define requirements for the software itself, including automation, flexibility, and traceability.

3.1.1 Mesh Output

The mesh output of the pipeline is used for two different tasks: the rendering of virtual scenes and 3D vehicular physics simulations. These virtual scenes should be photorealistic and accurately represent their real-world counterpart. To achieve photorealistic rendering, the mesh requires high-resolution textures. Road markings, traffic signs, and text should be sharp and identifiable as such. This is important for downstream tasks, such as testing autonomous driving systems. Ideally, such textures would be extracted directly from the input data. The mesh should be smooth, manifold, and contain no holes. Artifacts such as cut-off objects, dents in the road, or added geometry should be reduced to a minimum. This is crucial to allow cars to drive in the scene during the physics simulation. The final mesh must be provided in the fbx format to be usable in the CARLA import process. This mesh must be split into sub-meshes based on their semantic classes. These sub-meshes should also be named appropriately, conforming to the naming convention CARLA uses for automatically assigning the correct class labels to the meshes.

3.1.2 Semantic Segmentation

To perform a semantic segmentation of the mesh, we first need to define the desired classes. The required granularity of the semantic segmentation comes from the target downstream tasks. One of them is being able to move, swap, or remove traffic signs and cars. To do so, we require classes for these objects. Knowing where the sidewalk is can help in performing pedestrian simulations. Autonomous driving simulations require further knowledge about the road and sidewalk. These use cases leave us with the required semantic classes of street, sidewalk, car, and traffic sign. Many downstream semantic segmentation tasks are trained on the CityScapes [12] dataset, which contains the classes mentioned above and classes of similar granularity. This is why the classes defined in CityScapes are a very good fit for this thesis.

3.1.3 Road Network

CARLA uses an OpenDrive file (xodr) to generate its internal representation of the road network. A graph is built from each node and connection defined in the file. This graph contains nodes for each lane in the street. The xodr file uses a local coordinate system that is then used by CARLA. Global positioning of the whole map is provided in this file to allow georeferencing. The road graph should be aligned with the mesh. To do so, the local coordinate systems of the mesh and the OpenDrive file must be aligned. The individual nodes of the road graph can be manually moved and corrected in CARLA, so a perfect alignment is not required. The level of detail of roads and lanes available on OpenStreetMap (https://www.openstreetmap.org) is sufficient for this task.
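To make the lane graph that CARLA derives from such a file a bit more tangible, the following hedged sketch parses an OpenDrive file with the carla Python package and samples waypoints along the lanes. It assumes a CARLA 0.9.x installation; the file name is an arbitrary placeholder:

```python
# Illustrative sketch: inspecting the lane graph CARLA builds from an xodr file.
# Assumes the CARLA 0.9.x Python package; "town.xodr" is a placeholder file name.
import carla

with open("town.xodr", "r", encoding="utf-8") as f:
    xodr_content = f.read()

# A carla.Map can be constructed directly from OpenDrive content, without a server.
road_map = carla.Map("digital_twin", xodr_content)

# Sample the lane graph every 2 meters; each waypoint knows its road and lane id.
waypoints = road_map.generate_waypoints(2.0)
for wp in waypoints[:5]:
    loc = wp.transform.location
    print(wp.road_id, wp.lane_id, round(loc.x, 1), round(loc.y, 1))
```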
3.1.4 The Pipeline

Automation

The primary goal of the pipeline is to provide a tool that significantly reduces the cost of creating digital twins of street scenes. Automation of each step, as well as execution of all steps together, is crucial in this cost reduction. While full automation is not a strict requirement, the integration of manual steps should be minimized and, where possible, consolidated to optimize efficiency. These manual steps can sometimes offer a more efficient solution for tasks beyond this thesis's scope. These tasks include the selection of a reconstruction region, selecting sampling points on the roads, fixing errors in the segmentation, and fine-tuning the alignment of the road and mesh. Ideally, these manual steps would be bundled together, thus reducing the need for constant monitoring and intervention throughout the pipeline's operation.

Flexibility

The pipeline is designed to serve as a robust platform for experimentation with various approaches and algorithms to address the diverse challenges involved. This flexibility allows for the evaluation of the performance of different strategies in this thesis. Furthermore, the pipeline may provide a foundation for future research and development. To achieve these objectives, the layout of the full pipeline should be easily modifiable. There must be a clearly defined way of managing the inputs and outputs of the different steps. Additionally, each step should be capable of exposing a set of parameters, allowing tuning of the behavior of its operation.

Traceability

Traceability is an important requirement, ensuring that each modification and process within the pipeline is observable and accountable. This involves making the output files of each step viewable. This is important for debugging purposes as well as the evaluation of the individual steps. To further increase the transparency of the pipeline, additional diagnostic data should be stored in the intermediate files. If the user identifies an error in the output, reviewing the intermediate files and identifying where it occurred must be possible. The issue could be addressed by adjusting the parameters or manually correcting the error in an external tool. Afterward, it must be possible to rerun the pipeline from this location.

Change Detection

The pipeline should have the capability to detect changes in intermediate files, serving two primary purposes. Firstly, it allows the pipeline to automatically skip steps that have already been completed, which is useful for recovering from interruptions or crashes. Secondly, it allows for the manual correction of intermediate files, as mentioned above. The pipeline should only run steps where the inputs are newer than the outputs. This avoids redundant processing and enhances the efficiency of the pipeline.

3.2 Pipeline Overview

Our proposed pipeline can work on different input modalities, each of which requires very different processing steps. Two pipeline layouts have been designed for this thesis. A high-level overview of both variants is given in Fig. 3.1. First, the input data is fed through a set of steps dependent on the chosen input modality. Two strategies are implemented: one for processing images using photogrammetry and another for handling point clouds. Each input-specific strategy produces two pieces of data: a reconstructed and textured 3D mesh and georeferencing information for said mesh. From then on, the pipeline is mostly independent of the input modality and continues by running the same set of steps. The remaining part handles the 3D semantic segmentation, xodr file generation, further processing of meshes and metadata, and the final import into CARLA.

3.2.1 Configuration

Running the pipeline requires many parameters to be set. This includes the choice of input modality, the project name, and potentially the geographic base point. Furthermore, the user may need to set the paths to some executables on which the pipeline depends. These parameters are passed in a yaml configuration file. The following executables can be configured:

• carla: The path to the directory of the CARLA repository, which must be built from source.
• VCVars64.bat: Launching CARLA requires the Microsoft C++ toolset, which is shipped with a Visual Studio installation. Default config value: default installation location of Visual Studio 2019 Community Edition.
• CloudCompare.exe: Default config value: default installation location of CloudCompare.
• Blender.exe: Default config value: default installation location of Blender 4.1.
• RealityCapture.exe: Default config value: default installation location of RealityCapture.

In addition to these required input arguments, each step of the pipeline exposes a multitude of parameters to allow fine-tuning of its behavior. When running the pipeline, these parameters can be passed via the command line interface or be set in the config file for easier use and reproducibility. Sensible default values have been set for each of them. These values have been established by the experiments performed in Chapter 5.
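To make the shape of such a configuration more concrete, the following sketch loads a hypothetical yaml config and merges it over built-in defaults. The key names (project, input_modality, base_point, executables) and paths are illustrative assumptions, not the pipeline's actual schema:

```python
# Illustrative sketch of loading a pipeline configuration from a yaml file.
# Requires PyYAML; all key names and paths below are hypothetical examples.
import yaml

DEFAULTS = {
    "input_modality": "point_cloud",     # or "images"
    "executables": {
        "CloudCompare.exe": r"C:\Program Files\CloudCompare\CloudCompare.exe",
        "Blender.exe": r"C:\Program Files\Blender Foundation\Blender 4.1\blender.exe",
    },
}

def load_config(path: str) -> dict:
    """Read the yaml config and merge it over the built-in defaults."""
    with open(path, "r", encoding="utf-8") as f:
        user_cfg = yaml.safe_load(f) or {}
    cfg = {**DEFAULTS, **user_cfg}
    cfg["executables"] = {**DEFAULTS["executables"],
                          **user_cfg.get("executables", {})}
    return cfg

# Example: config.yaml could contain keys such as
#   project: intersection_01
#   input_modality: images
#   base_point: [48.19, 16.37, 190.0]   # lat, lon, height (assumed format)
cfg = load_config("config.yaml")
print(cfg["input_modality"], cfg["executables"]["Blender.exe"])
```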
Figure 3.1: An overview of the pipeline, showing the steps performed for reconstructions from point clouds (top) and images (middle). On the bottom, the common subsequent steps are shown. Manual steps that are required (red) or optional (orange) are highlighted at the specific point of the pipeline where they occur.

3.3 Input Data

This section discusses the properties and requirements of the two types of input data: images and point clouds.

3.3.1 Images

Images are used for the pipeline's photogrammetry strategy. There are many factors to consider when selecting an appropriate camera, configuring it, and capturing the images to get good results in this process. This thesis uses RealityCapture for the photogrammetry step. Since it is a closed-source application, we do not know which algorithms are used. We assume that its implementation has requirements similar to those for structure from motion (SfM). We derive the following guidelines from the literature about camera considerations for SfM [44, 34] and recommendations in the RealityCapture documentation [49].

Camera

To extract as many features for alignment as possible, the camera needs to preserve a lot of detail, have a high dynamic range, and produce images with low noise. Using a high-resolution sensor for this is crucial. A resolution of at least 16 MP is recommended. Mosbrucker et al. [44] recommend a sensor size of at least 300 mm² to reduce noise. A bit depth of at least 14 bits allows information to be captured with a high dynamic range. Modern DSLR cameras fit these criteria.

Settings

The camera must be configured such that it provides a proper exposure of the images. Exposure refers to the amount of light captured by the sensor. A change in exposure is measured in exposure value (EV); a change of 1 EV corresponds to a factor of two in the amount of light being captured (a small numeric example is given at the end of this subsection). Good exposure in this context means preserving information in the scene's dark and bright areas. To achieve good exposure, three camera parameters need to be adjusted: aperture, shutter speed, and ISO. Aperture refers to the size of the opening in the lens. It is measured in f-stops, which are inversely proportional to the aperture. A larger aperture results in more light being captured at the cost of a shallow depth of field, meaning only a narrow area of the scene is in focus. Reducing the aperture allows more of the scene to be in focus. Shutter speed is the duration for which the sensor is capturing light. A shutter speed that is too low can result in blurry images, especially for handheld or mounted cameras. ISO is the sensitivity setting of the sensor; doubling the ISO value doubles the brightness of the captured image. However, with higher ISO values, the images get increasingly noisy. The optimal aperture is 1-3 f-stop values larger than the lens's minimum aperture. This reduces artifacts such as chromatic aberration and vignetting. Furthermore, it ensures more of the scene is in focus. Shutter speed should be chosen so that motion blur is minimized. This depends on how and where the camera is mounted, but generally, for lenses with a focal length of 50 mm or lower, a value of 1/100 s to 1/400 s is recommended for this application. ISO should be chosen as low as possible to reduce noise. The optimization of these parameters can only be done in bright lighting conditions. Disabling autofocus, sensor-based stabilization systems, distortion correction, and noise reduction filters is recommended.

Capturing

Once the camera and its settings are dialed in, we need to consider when and how to capture the images. Ideal lighting conditions are on bright, overcast days. This results in smooth lighting with fewer hard shadows and enough light for proper exposure. How the scene is captured is crucial for the photogrammetry process of aligning cameras and detecting features. Each part of the scene must be captured by at least three images to be reconstructed. Following the coarse-to-fine rule and covering the scene at different scales is required for optimal results. This is done by first taking pictures of the whole scene at a greater distance, trying to capture a lot of context, and then moving closer to capture more details. It is important to move closer gradually to allow the images at the different scales to be matched. Another crucial rule is to always move around the scene and not just pivot at one spot. SfM relies on the parallax of the movement for its reconstruction; only pivoting results in a panorama, from which no depth information can be extracted. Furthermore, when moving around an object, it is beneficial to close loops around it. Some of these guidelines are easily applied when capturing a single object. However, capturing urban scenes is a bit more complicated, as there are many different objects of interest and limited options for moving around the scene. Ideally, each important object in the scene, such as cars and traffic lights, is covered by orbiting around it at different heights. Wide context shots can be covered by walking around the sidewalk and street. Furthermore, it is important to also capture detail shots facing towards the ground; as street surfaces often have a low textural variety, more data is needed for a proper reconstruction. A more detailed explanation of capturing images for photogrammetry in an urban setting is given in Sections 5.1.1 and 5.1.3.

GPS

Locations embedded as EXIF metadata can be used for georeferencing. While some cameras have a GPS sensor built in, this is usually not the case. While capturing the images, a mobile phone can track the GPS positions and store them in a gpx file. Software like GeoSetter (https://geosetter.de/) can then be used to set the GPS location of each image by comparing the timestamp with the entries in the gpx file. The camera's internal clock must be calibrated precisely.

Format

The images should be captured in a RAW format, ensuring that the sensor captures the images with the highest possible quality. To avoid artifacts due to compression, the images should be converted to either TIFF or PNG.
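To make the exposure relationships from the Settings paragraph concrete, the following small example computes the standard exposure value EV = log2(N²/t) for an f-number N and shutter time t, and shows how a one-stop change in either parameter shifts the EV by roughly 1. This is generic photography math, not something prescribed by the RealityCapture documentation:

```python
# Worked example (generic photography math, not specific to RealityCapture):
# the exposure value of a camera setting is EV = log2(N^2 / t), where N is the
# f-number and t the shutter time in seconds. A difference of 1 EV corresponds
# to a factor of two in the amount of light reaching the sensor.
import math

def exposure_value(f_number: float, shutter_s: float) -> float:
    return math.log2(f_number ** 2 / shutter_s)

ev_base = exposure_value(8.0, 1 / 200)    # f/8 at 1/200 s          -> ~13.64 EV
ev_wider = exposure_value(5.6, 1 / 200)   # one stop wider aperture  -> ~12.61 EV
ev_slower = exposure_value(8.0, 1 / 100)  # twice the shutter time   -> ~12.64 EV

# Both changes lower the EV of the settings by about 1, i.e. they let in
# roughly twice as much light.
print(round(ev_base, 2), round(ev_wider, 2), round(ev_slower, 2))
```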
3.3.2 Point Clouds

When using a point cloud as the input to the pipeline, its quality directly influences the reconstruction quality. A single point cloud in the form of an E57 file [33] is the basis for further processing. While an organized point cloud, such as one generated by a single sweep of a LiDAR sensor, may be used, it is insufficient for comprehensive scene coverage due to its limited perspective. Therefore, an unorganized point cloud that encompasses the entire scene is required. The mesh reconstruction requires that per-point normals are available. Although normals can be calculated post-capture, in our tests this failed to produce correct normals for thin objects like traffic signs. Thus, it is advantageous if normals are generated during the capture or initial processing of the point cloud. The point cloud must also include RGB color information for each point, as the texturing step relies on this data. The maximum possible texture resolution depends on the point cloud density. Thus, the point cloud should be dense enough to preserve enough texture information. Artifacts such as moving vehicles or pedestrians should be removed from the point cloud to avoid them appearing in the reconstructed mesh. Alternatively, these artifacts can be removed from the reconstructed mesh by modifying the pipeline's intermediate files. Furthermore, the point cloud should be georeferenced. This can be done by providing the pipeline with a global base point, which aligns the local origin of the point cloud with its global geographical position.

3.4 Mesh Reconstruction

3.4.1 From Images

Mesh reconstruction with photogrammetry can be done with many commercial and non-commercial tools. In this thesis, RealityCapture (https://www.capturingreality.com/) was used and evaluated for this task. RealityCapture can handle large amounts of input data; thousands of images can be processed in a reasonable time of a few hours. Since RealityCapture is closed-source, we cannot go into detail about the exact algorithms used in their implementation. The whole reconstruction process can run completely automatically. RealityCapture is executed over the command line, and multiple commands are chained together. The images are loaded into the program and aligned. At this point, a point cloud of the scene can be seen. RealityCapture can estimate the ground plane and reconstruction region, but sometimes this fails to produce correct results. A manual step is required here to define these two properties correctly. A misaligned ground plane will hurt the performance of downstream steps. Furthermore, constraining the reconstruction region allows us to only reconstruct the area of interest. Areas outside the scene often produce more artifacts, so it is beneficial to cut them out. After this manual step, we continue by performing the remaining operations in RealityCapture. The mesh is reconstructed using the aligned images. RealityCapture produces a mesh with millions of polygons, so a simplification step is performed to get this to a reasonable level before calculating the textures. Finally, RealityCapture textures the simplified mesh by projecting the input images into the scene. In addition to the mesh output, RealityCapture can also export the camera registration as extrinsic camera parameters. This information can be used in downstream steps for georeferencing or semantic segmentation.
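Because RealityCapture is driven through its command-line interface, the automated part of this strategy essentially boils down to chaining CLI commands from Python. The sketch below only illustrates that pattern with subprocess; the flag names are rough approximations of RealityCapture's CLI vocabulary and should be treated as placeholders to be checked against the documentation of the installed version:

```python
# Illustrative sketch of chaining RealityCapture CLI commands from Python.
# The flag names and argument order are assumptions (placeholders), not a
# verified RealityCapture command line; consult the CLI reference of the
# installed version before relying on any of them.
import subprocess

RC_EXE = r"C:\Program Files\Capturing Reality\RealityCapture\RealityCapture.exe"

cmd = [
    RC_EXE, "-headless",
    "-addFolder", r"D:\capture\images",   # load the input images
    "-align",                             # align cameras / build sparse cloud
    "-calculateModel",                    # dense mesh reconstruction
    "-simplify", "1000000",               # reduce polycount before texturing
    "-calculateTexture",                  # project images onto the mesh
    "-exportModel", "scene", r"D:\out\scene.fbx",
    "-quit",
]
subprocess.run(cmd, check=True)           # raises if the process reports an error
```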
For a good reconstruction, it is important that RealityCapture can align as many images as possible. Experiments with reconstructions in RealityCapture from multiple datasets captured at different locations revealed what influences RealityCapture's ability to create high-quality results. While the alignment can be improved by fine-tuning different parameters or manually aligning detected components, this is not very economical. RealityCapture scales very well; often, the solution was simply to provide more pictures.

3.4.2 From Point Clouds

Mesh reconstruction from point clouds consists of three steps: Firstly, the point cloud is pre-processed. This is an optional step to ensure that normals are calculated, should they not already be present. Then, a mesh is reconstructed directly from the point cloud using screened Poisson surface reconstruction [38]. Lastly, the mesh is further post-processed to reduce artifacts that may emerge from the reconstruction process. Such artifacts are often visible as a "bubbling" effect. This happens in regions where the point cloud coverage is low, there are occlusions, or the normals are noisy, such as the tops of trees or traffic signs. It may also happen at the border of the scene. Examples of these artifacts can be seen in Fig. 3.2. To fix the artifacts, the reconstructed mesh is compared to the input point cloud. For each vertex, the distance to the closest point of the point cloud is calculated. Vertices further away than a certain threshold are removed from the mesh.

Figure 3.2: A mesh reconstructed with Poisson surface reconstruction, showing artifacts occurring in occluded areas. Yellow areas are more than a meter away from the closest point in the point cloud.

The spatial resolution of the reconstructed mesh is controlled by the depth of the Poisson surface reconstruction, which controls the level of detail by specifying the depth of the octree used by the algorithm. This depth defines how many voxels are used. The size of a voxel is given by s / 2^d, where d is the octree depth and s is the size of the scene. This fraction gives the distance between two sampled points. Table 3.1 shows the spatial resolution for different scene sizes and octree depths. To accurately depict small objects such as poles and traffic signs, a reconstruction depth of 12-13 is required, depending on the scene size. A more detailed comparison of different reconstruction depths is done in Section 5.2.2.

Octree depth | Spatial resolution | Voxel size (mm) for a 50 m scene | Voxel size (mm) for a 100 m scene
8  | 256³  | 192 | 384
9  | 512³  | 96  | 192
10 | 1024³ | 48  | 96
11 | 2048³ | 24  | 48
12 | 4096³ | 12  | 24
13 | 8192³ | 6   | 12

Table 3.1: Reconstruction resolution at different reconstruction depths for Poisson surface reconstruction.

The reconstructed mesh has a polycount in the millions. This is far higher than necessary and would significantly slow down downstream tasks. Depending on the scene size and complexity, the mesh is simplified to a more efficient polycount of 100,000 to 1,000,000 faces. Color information from the point cloud is only preserved as vertex color attributes. The simplified mesh does not have a vertex density high enough to provide the required radiometric resolution. Texture baking is performed in Blender to get the texture information from the high-resolution mesh onto the simplified mesh. UV unwrapping of the mesh is performed, and then the colors of the high-resolution mesh are baked to a texture on the low-resolution mesh. Since the high-resolution mesh is the source of the color information, the reconstruction depth does not only specify the spatial resolution but also the maximum texture resolution of the final mesh.
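The distance-based artifact removal and the subsequent simplification can be expressed compactly with Open3D. The following sketch is a simplified stand-in for the pipeline's actual implementation; it assumes the Open3D package, uses placeholder file names, and mirrors the one-meter threshold visualized in Fig. 3.2:

```python
# Illustrative sketch: removing "bubbling" artifacts by deleting mesh vertices
# that are farther than a threshold from the input point cloud, then simplifying.
# Assumes Open3D; a simplified stand-in, not the thesis implementation.
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.ply")            # placeholder file names
mesh = o3d.io.read_triangle_mesh("poisson_mesh.ply")

kdtree = o3d.geometry.KDTreeFlann(pcd)
vertices = np.asarray(mesh.vertices)

# Squared distance from every mesh vertex to its nearest point in the point cloud.
too_far = np.zeros(len(vertices), dtype=bool)
for i, v in enumerate(vertices):
    _, _, sq_dist = kdtree.search_knn_vector_3d(v, 1)
    too_far[i] = sq_dist[0] > 1.0 ** 2                # threshold: 1 m

mesh.remove_vertices_by_mask(too_far)                 # drops attached triangles too

# Reduce the polycount to a level downstream steps can handle.
simplified = mesh.simplify_quadric_decimation(target_number_of_triangles=500_000)
o3d.io.write_triangle_mesh("cleaned_mesh.ply", simplified)
```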
3.5 Semantic Segmentation

3.5.1 3D Semantic Segmentation

There are different approaches for performing 3D semantic segmentation. In order to choose an appropriate technique, the first question we have to answer is whether to segment the mesh or the point cloud. The segmentation of point clouds is a problem that is more mature in the literature than the segmentation of meshes. However, there are two arguments for doing the segmentation on the mesh. First, we desire a mesh as the output of the pipeline. Creating a segmentation of the point cloud would require the transfer of point cloud labels to the mesh, which in itself could be a source of errors. Second, meshes have many advantages over point clouds in this context [2, 17, 58, 29]. Meshes are better than point clouds for representing the geometry of scenes [2, 17]. They can represent flat surfaces and sharp edges more easily with fewer data points [58]. They can describe intricate details while disambiguating them from other close surfaces [29]. Furthermore, meshes consume less memory [17, 58], since they are more efficient in storing large, flat surfaces. Another advantage of meshes is that they can contain high-frequency color information via textures. This is crucial for identifying objects with a non-distinctive shape (like traffic signs on a wall or areas separated only by road markings). Many solutions for both domains are based on DNNs. Therefore, another important aspect to consider is the availability of datasets in the domain of urban scenes for each modality. Unfortunately, datasets in this domain are scarce, especially when considering the granularity of classes we require for the task at hand. There are two point cloud datasets available [16, 5] with semantic labels from the CityScapes dataset [12]. However, they do not quite match the previously defined requirements for point clouds: they do not have a high enough density, and RGB color information is not directly available. To the best of our knowledge, there is no mesh-based dataset that fits the requirements. The available datasets are either limited in scale, do not cover many different types of scenes, or have a low semantic richness [17]. This could be because mesh-based datasets are often created for large-scale urban applications, such as smart cities or urban planning [17], which operate at a larger scale and do not require such a fine granularity.

3.5.2 View-Based Semantic Segmentation

This leaves us with another approach: view-based 3D semantic segmentation. This technique generates 2D views of the scene, which are subsequently semantically segmented using a 2D model. The resulting 2D labels are then projected back into the 3D scene. Doing so for many different views allows for covering each part of the scene and accumulating predictions on the vertices of the mesh. This approach has been performed on point clouds [35, 41, 21, 7] and meshes [53, 40, 1]. Meshes are a better fit than point clouds for creating 2D views: firstly, because they describe surfaces, there are no holes in the views and occlusion works properly; secondly, they can provide a much higher radiometric frequency. Coincidentally, the dataset from which we derive our class labels, CityScapes [12], is a fitting domain for this task.
It is a large and diverse urban 2D dataset with pixel-level semantic labels, covering 50 cities with 5,000 fully annotated and 20,000 partially annotated images. The field of 2D semantic segmentation is rich with well-performing state-of-the-art models [10, 64, 57, 59]. We can use a model pre-trained on CityScapes for the 2D segmentation task. For a comparison of different 2D models used with this 3D segmentation technique, see Section 5.3.3.

3.5.3 Virtual View Rendering

The next question is how to obtain 2D views of the 3D mesh. When using the photogrammetry strategy, we could use the original images with their estimated registrations to perform the segmentation task. This has the advantage of feeding real images to the 2D semantic segmentation model and ensures that each part of the scene is covered. However, since the registrations are only estimates, any error will result in the bleeding of semantic labels onto more distant surfaces when performing the back projection (see Fig. 3.3). Furthermore, this approach would not work for the point cloud strategy.

Figure 3.3: Projection bleeding. The label "traffic sign" (yellow) is assigned to the wall due to inaccuracies in projecting it into the scene.

Kundu et al. [40] show that using virtual views (renderings) of the scene not only produces good results in a view-based segmentation but outperforms the usage of real images with estimated registrations. This is because they can have a smarter view selection, a higher field of view, look from behind walls, and have an exact back projection since the camera parameters are known precisely. Furthermore, they show that similar performance can be obtained with as few as 12 virtual views versus 1700 original views for an indoor scene. These arguments lead us to the usage of a virtual-view-based 3D semantic segmentation approach using 2D semantic segmentation models pre-trained on CityScapes.

3.5.4 Sampling Virtual Views

The selection of virtual views is critical for the segmentation performance [40]. Each part of the mesh needs to be covered by at least one view. However, just naively sampling the views does not produce optimal results. Since a pre-trained model is used, we need to consider the domain it was trained on. The CityScapes dataset is captured using a camera mounted on top of a vehicle driving on streets. Consequently, models trained on this dataset perform best with images taken from a similar perspective. Virtual views taken from the sidewalk often misclassify the ground below as road instead of sidewalk. We show this with the experiments done in Section 5.3.2. The best-performing sampling method uses only samples from a street perspective. Since we cannot automatically sample the road points before the semantic segmentation, we require the user to manually select points on the street. The direction is sampled by positioning the camera horizontally and rotating it uniformly around the up-axis in 8 steps. These manually selected points are used to generate samples that are similar to those of the 2D domain.
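The view sampling just described can be sketched in a few lines: for every manually picked street point, a virtual camera is placed at roughly vehicle height and rotated around the up-axis in eight uniform steps. The sketch below is a simplified illustration of that idea; the 1.5 m camera height, the z-up assumption, and the txt format are assumptions matching the manual step in Section 3.8.1:

```python
# Illustrative sketch: generating virtual camera poses from manually selected
# street points (x, y, z per line in a txt file, as exported from CloudCompare).
# Camera height, up-axis, and file delimiter are assumptions for illustration.
import numpy as np

def load_street_points(path: str) -> np.ndarray:
    # Adjust delimiter/columns to the actual export format of the point list.
    return np.loadtxt(path, delimiter=",")          # expected shape (n, 3)

def sample_views(points: np.ndarray, n_yaw: int = 8, cam_height: float = 1.5):
    """Yield (position, view_direction) pairs, 8 horizontal views per point."""
    yaws = np.linspace(0.0, 2.0 * np.pi, n_yaw, endpoint=False)
    for p in points:
        eye = p + np.array([0.0, 0.0, cam_height])  # assumes z is the up-axis
        for yaw in yaws:
            direction = np.array([np.cos(yaw), np.sin(yaw), 0.0])
            yield eye, direction

if __name__ == "__main__":
    pts = load_street_points("street_points.txt")
    views = list(sample_views(pts))
    print(f"{len(pts)} street points -> {len(views)} virtual views")
```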
Furthermore, we can look at the local curvature. Concave edges often form boundaries between objects, while convex areas usually belong to the same object [36]. To leverage these properties, the Felzenszwalb algorithm [15] can be adapted to segment 3D meshes [13, 36]. Other approaches train a random forest to estimate the planarity based on a multitude of features [19] or use region-growing approaches [50]. 23 3. Pipeline The Felzenszwalb algorithm is used in this thesis due to its simplicity and efficiency. It segments based on the difference of adjacent vertex normals. The smaller the difference between two normals the more likely they belong to the same segment. Fig. 3.4a shows how such an over-segmentation looks. (a) Over-segmentation (b) Under-segmentation issue. Components span across object boundaries. Figure 3.4: Examples of over-segmentation and under-segmentation One challenge when doing this over-segmentation is that we need to be careful not to under-segment the mesh, meaning that we get too few segments, with segments spanning multiple semantic regions (see Fig. 3.4b). This issue arises when semantic boundaries are not obvious from just the topology (i.e., when a sidewalk is separated by a different texture but has the same elevation as the street). To alleviate this issue, the color difference of the vertices is added to the weight to help form boundaries in these regions. However, since we only look at vertex colors, this becomes a sampling problem, and texture changes that do not fall exactly on the vertices might still be missed. Overall, this over-segmentation helps reduce noise and bleeding while forming smoother boundaries. However, the maximum segment size needs to be limited to avoid the issues mentioned above, reducing the effectiveness of this step. See Section 5.3.1 for a more detailed breakdown of this step and its parameters. 3.6 Georeferencing Georeferencing of the mesh is required for extracting and aligning with road network metadata. The way this data is obtained depends on the input modality. Images When using images as the input, the GPS information may be obtained by the GPS location embedded in the EXIF metadata. GPS tracking provided by consumer-grade devices such as cameras or smartphones is not accurate enough for precise georeferencing. They have an accuracy of only 7-13m in urban environments [43]. However, this data may still be used to obtain the scene’s rough geographical bounding box. 24 3.7. Roads The estimated camera registrations in the local space of the mesh, in conjunction with the corresponding global GPS positions, can be used to find a transformation from local space to global space. This matrix should only contain a translation, rotation, and scaling, as we do not want to deform the mesh. The translation part of the matrix is the base point of the mesh in relation to the global coordinate system. The rotation part can be applied to the mesh to align the rotation around the up axis. Point Cloud The point cloud strategy requires the base point to be known beforehand. The user must provide it when running the pipeline. Manually Should the images not contain GPS metadata, the georeferencing can be overridden manually by specifying the base point when running the pipeline. In this case, there is no way of knowing the correct heading or alignment of the scene. This only gives a very rough alignment. 3.7 Roads CARLA needs information about the road network in the scene as an xodr file. 
Creating such a file from just the 3D mesh reconstruction is a challenging task. Therefore, we utilized the publicly available source of road networks OpenStreetMap. The API can be queried with a geographical bounding box to obtain information about a region as an osm file. This file can be converted to xodr using CARLA. The data from OpenStreetMap contains information about where the roads and lanes run and how they are connected. This data is good enough but not perfect. Fig. 3.5a shows a road network aligned with the mesh. It can be observed that the lanes follow the actual lanes roughly but an exact alignment is not possible without having to deform the graph. (a) A good alignment after manually mov- ing the mesh. Note how lanes still do not perfectly align with the geometry. (b) The initial alignment before manual cor- rection. Figure 3.5: A reconstructed mesh with the road graph overlayed in CARLA 25 3. Pipeline 3.8 Manual Actions Some of the steps described in the previous sections require or can be improved by performing manual actions. Manual actions allow the user to correct the errors before the pipeline continues processing the data. These manual actions aim to be as simple and fast as possible. The goal is still to be efficient and produce results with minimal manual work. The manual actions only take a fraction of the work that it would take to create the whole digital twin manually. It is also important to consider when these actions need to be performed. Preferably, manual steps can be performed back to back or even at the start or end of the pipeline. We want to minimize the time a user has to wait until the next manual step needs to be performed. Furthermore, all intermediate files are exposed and can be modified should unexpected issues arise. The following sections describe common manual actions that are expected to be performed. 3.8.1 Selecting Street Points As described in Section 3.5.4, the 3D semantic segmentation performs best when knowledge about the road positions is available. To provide this information, the user must select a few points on the street that will be used as the sampling positions. Ideally, the points are placed spaced evenly and on all lanes. The process is very straightforward. The user opens the file in CloudCompare4. Then, using the point-picking tool, the user clicks on the scene to place the sample positions as they see fit. When done, the point list must be exported as a txt file containing a list of x, y, and z positions, which are used as the input for the pipeline. The whole process only takes a few minutes to perform. In Fig. 3.6, a scene opened in CloudCompare with a few street points selected can be seen. This action can be performed before running the pipeline when using a point cloud as the input. If using the photogrammetry strategy, the user must wait until the 3D reconstruction is complete, as this uses a new local coordinate system. 3.8.2 Setting the Reconstruction Region and Ground Plane This action is only needed for the photogrammetry strategy. At the edge of the scene, there might be less data available, resulting in more artifacts in the reconstruction. Manually setting a reconstruction region also ensures that the boundary of the mesh is a straight line. Furthermore, the user might not want to reconstruct the whole scene or want to be in control of exactly what region is reconstructed. 
While RealityCapture can automatically choose the reconstruction region, in our tests this region was often larger than required and included some artifacts. Fig. 3.7 shows a reconstruction region configured inside RealityCapture. RealityCapture might also fail to level the ground plane correctly. This step provides an opportunity for the user to fix this issue.

4https://www.danielgm.net/cc/

Figure 3.6: Points picked on the street in CloudCompare

Figure 3.7: The reconstruction region shown in RealityCapture

When the pipeline reaches the step where this action must be taken, RealityCapture is opened automatically. The user must set the reconstruction region and ground plane and close RealityCapture. The pipeline will continue running automatically afterward. This is also only a matter of minutes. However, the user has to wait until the pipeline reaches this step, which might take longer than an hour.

3.8.3 Adding Connecting Roads

The pipeline is designed to reconstruct individual street segments and intersections. Some downstream tasks that require driving simulations benefit from a larger drivable area. This allows for acceleration into the intersection and more freedom for creating different scenarios. There are different ways to extend the drivable surface, depending on the requirements for the scene and the user's preference. In this thesis, we use Blender for the task. First, the scene is embedded in a large flat plane, and the elevation is roughly matched with the outside of the scene. Then, the road mesh is connected to this plane to ensure cars can move smoothly between the two surfaces (see Fig. 3.8). This step is completely optional. It might be performed after the semantic segmentation or even after the full pipeline is complete and the scene is imported into CARLA.

Figure 3.8: A reconstructed mesh with an extended drivable area created in Blender

3.8.4 Correcting Semantic Segmentation

The semantic segmentation will have some mislabeled regions. When semantic segmentation is required for downstream tasks, it is important to be able to correct such errors. This correction can be performed in Blender with the aid of a custom addon.

Figure 3.9: A semantic segmentation of a reconstruction, showing the manual correction tool in Blender. Vertex labels can be adjusted using the vertex painting tool.

The automatic semantic segmentation is displayed on the mesh as per-vertex color information. It uses the common color scheme of the CityScapes dataset. The semantic segmentation can then be changed with the vertex painting tool. The addon helps the user by providing a color palette to easily change the brush's color to the desired class. After correcting the segmentation, the RGB colors are converted back to class labels and exported for further processing by the pipeline. In Fig. 3.9 a screenshot of this step can be seen.

3.8.5 Road Alignment

The automatic alignment of the road network data with the mesh is not perfect. The reasons for this are that the raw geographic input data is not accurate enough and that the data from OpenStreetMap is only an approximation. This alignment can be fine-tuned in Unreal Engine after the scene has been imported into CARLA. The meshes can simply be moved to better align with the road network. Individual nodes of the road graph can be moved to improve the alignment further. Fig. 3.5 shows how the scene looks before and after manually fixing the alignment.
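As a reference for the street-point selection described in Section 3.8.1, the exported point list can be read with a few lines of Python. This is only a sketch and assumes one point per line with the x, y, and z values as the last three numeric fields; the exact CloudCompare export format (for example, an additional index column or a different delimiter) may require small adjustments.

import numpy as np

def load_street_points(path):
    """Parse the exported point list into an (N, 3) array of x, y, z positions."""
    points = []
    with open(path) as f:
        for line in f:
            fields = line.replace(",", " ").split()
            if len(fields) < 3:
                continue  # skip empty or malformed lines
            points.append([float(v) for v in fields[-3:]])  # last three values: x, y, z
    return np.asarray(points, dtype=np.float64)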
CHAPTER 4 Implementation

4.1 Dependencies

The pipeline consists of a variety of tasks across different fields. It relies on numerous Python packages and external tools. The following section summarizes the most essential tools and libraries the pipeline utilizes.

Python The pipeline itself is written in Python1. The Python version has been chosen to be compatible with all dependencies. The biggest limiting factor is CARLA, which is only compatible with Python up to version 3.8. All other Python dependencies are compatible with this version, so 3.8 is the version used. Besides a set of Python packages that can be installed with pip, the installation of two packages requires special attention: PyTorch and mmSegmentation2 require matching versions and a compatible CUDA installation.

CARLA The largest dependency is CARLA3. To create custom CARLA maps from fbx and xodr files, CARLA and Unreal Engine 4 need to be compiled from source. This compilation process is resource-intensive: it requires over 165 GB of disk space and takes several hours to run.

Blender Blender 4.14 is used for mesh processing, visualization, and executing manual actions via a custom addon. Its scripting capabilities allow for the automation of different tasks in the pipeline, such as UV unwrapping, texture baking, mesh splitting, and file conversion.

1https://www.python.org/
2https://github.com/open-mmlab/mmsegmentation
3https://carla.org/
4https://www.blender.org/download/releases/4-1/

CloudCompare CloudCompare5 is used for the manual step of picking street points. Additionally, CloudCompare is a valuable tool for visualizing and debugging the input point cloud before processing.

MeshLab MeshLab6 is another tool utilized for mesh processing. It implements many algorithms and helper tools for working with meshes and point clouds. We use it to run Poisson surface reconstruction [38], clean up, and simplify the meshes.

RealityCapture RealityCapture7 is a commercial tool used for the photogrammetry step. It generates high-quality 3D models from photographic data and scales well with many images. Despite being a commercial tool, its integration into the pipeline is justified by its superior performance and output quality compared to open-source alternatives.

4.2 Pipeline

The pipeline is designed to handle complex workflows by coordinating various tasks and tools efficiently. The overall implementation can be divided into two main components: the individual steps of the pipeline and the orchestration mechanism that ties these steps together.

4.2.1 Steps

Each step of the pipeline is implemented as a distinct Python function. These functions are the building blocks of the pipeline, processing inputs and generating outputs based on the given requirements. The individual steps receive inputs given as file paths or configuration objects. File paths specify the location of data files that need to be processed or the destination where output files should be saved. Configuration objects provide additional parameters required for the execution of each step. These objects are instantiated with the parameters provided by the configuration file. This provides a way to adjust the steps' behavior without having to alter the codebase directly. These steps are implemented either directly in Python or by utilizing external tools like RealityCapture, Blender, or CARLA, which are launched as subprocesses. For Blender, a Python file is passed and executed in the Blender environment.
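As an illustration of how such an external tool can be driven from the pipeline, the following sketch launches Blender in background mode with a task-specific script. The script path and arguments are illustrative; the actual pipeline code may differ.

import subprocess
from pathlib import Path

def run_blender_script(script: Path, *args: str) -> None:
    """Run a Blender Python script headlessly; arguments after '--' are
    available to the script via sys.argv."""
    subprocess.run(
        ["blender", "--background", "--python", str(script), "--", *args],
        check=True,  # raise if Blender exits with an error
    )

# Example: texture baking for a given mesh (file names are placeholders).
run_blender_script(Path("scripts/texturize.py"), "scene.ply", "scene_high_res.ply")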
RealityCapture and CARLA’s behavior is controlled via CLI arguments. 4.2.2 Orchestration The requirements for the pipeline established in Section 3.1.4 are automation, flexibility, traceability, and change detection. To achieve traceability and change detection, we 5https://www.danielgm.net/cc/ 6https://www.meshlab.net/ 7https://www.capturingreality.com/ 32 4.2. Pipeline require the data passed between each step to be stored on the filesystem. Specifying the locations of these files and connecting the outputs to the inputs between steps is the responsibility of the pipeline. For flexibility, the definition of the pipeline must be simple and easily modifiable. It should be clear what steps run in what order. A general-purpose pipeline system has been implemented to address these requirements. The entry point for this system is the Pipeline class. This class orchestrates the exe- cution of various steps, ensuring that the outputs of one step can be utilized as inputs for subsequent steps. The core components of the pipeline include the Artifact, PipelineData, PipelineStep, and StepGroup classes. The Artifact class represents a piece of data produced or consumed by a pipeline step. Each Artifact is associated with a file path. This class exposes the filename of the corresponding file, forming the basis for output-to-input connection. PipelineData manages the collection of Artifacts and the initial arguments passed to the pipeline. It provides mechanisms for registering new artifacts and finding the latest artifact based on its file name. This ensures that data from the latest possible step is used. The PipelineStep class encapsulates a single step within the pipeline, including its identifier, name, a callable function, input artifacts, and output artifacts. It also receives the arguments for running the callback as a dictionary, which contains the paths to input and output artifacts and further arguments mapped to their corresponding function parameters. It includes methods for running the step, checking if the inputs are newer than the outputs, and verifying the existence of inputs and outputs. This allows steps to only be executed when necessary. Steps are grouped into StepGroup instances, which help organize related steps logically and hierarchically. Each group has a unique identifier, name, and a directory for storing its sub-steps. Steps can be added to a group using the add_step method, which automatically parses the callable function’s parameters to determine the necessary inputs and outputs. This parsing ensures that the correct artifacts can be mapped to this step. Artifacts are linked to their parameters by comparing the artifact name to the parameter name. The main Pipeline class orchestrates the entire process. It initializes with a root directory and a set of arguments. This class is responsible for adding and finding input files, adding steps, and executing the pipeline. The run method determines which steps need to be executed based on the existence and age of their inputs and outputs. Furthermore, it provides a mechanism to forcefully run specific steps if required. This method also ensures that each step is executed in the correct order, respecting dependencies and the defined sequence. The mapping of input and output artifacts to function arguments happens by a naming convention. For each step, the callable function is analyzed. Parameters that start with the out_ prefix are considered outputs. For each such argument, an artifact is created 33 4. 
Implementation with the file path in the step’s directory. The file name is derived from the argument name. Expecting an argument name of the form out_some_file_name_suffix will set the file path to ///some_file_name.suffix. Similarly, files with the in_ prefix are considered inputs. The filename is derived in the same way. The pipeline looks for an artifact produced in a previous step with the same name and links it to this argument. Any remaining arguments are filled from the pipeline arguments. These are used to provide configuration parameters for the individual steps. They are mapped by their type, so it is recommended to use classes holding configuration data. Multiple steps can produce artifacts with the same name. In this case, the pipeline ensures that the artifact from the latest step possible is used. The pipeline takes care of the directory structure. Inputs for the pipeline are expected to be in a _in directory inside the project’s root directory. Alternatively, an alternate path for this input directory can be specified in the configuration. For each StepGroup, the pipeline creates a directory, in which each PipelineStep gets its directory. These directories and subdirectories are numbered in their execution order. The directory structures created by the pipeline can be seen in Fig. 4.1. Point Cloud Project _in 1 reconstruction 1.1 reconstruct 1.2 cleanup 1.3 simplify 1.4 texturize 2 roads 2.1 extract geo region 2.2 fetch osm 2.3 convert to xodr 3 MANUAL select samples 4 segmentation 4.1 prior 4.2 semantic segmentation 4.3 MANUAL adjust segmentation 4.4 update segmentation 4.5 apply_segmentation 5 finalize 5.1 align 5.2 convert to fbx 5.3 create carla package (a) Point Cloud Photogrammetry Project _in 1 reconstruction 1.1 rc align 1.2 MANUAL set reconstruction region 1.3 rc reconstruct 1.4 rc export 2 roads 2.1 extract geo region 2.2 fetch osm 2.3 convert to xodr 3 MANUAL select samples 4 segmentation 4.1 prior 4.2 semantic segmentation 4.3 MANUAL adjust segmentation 4.4 update segmentation 4.5 apply_segmentation 5 finalize 5.1 align 5.2 convert to fbx 5.3 create carla package (b) Photogrammetry Figure 4.1: The folder structures created after running the pipeline The Python code in Fig. 4.2 shows how the pipeline can be used. A new pipeline with a single group "reconstruction" is created. This group contains three steps: pointcloud_reconstruct, simplify_mesh, and texturize_mesh that will be 34 4.2. Pipeline cfg_pointcloud = PointCloudConfig(...) cfg_reconstruction = ReconstructionConfig(...) args = [cfg_pointcloud, cfg_reconstruction] pipeline = Pipeline(Path("path/to/project/dir"), arguments=args) pipeline.add_input("pointcloud.e57") reconstruction = pipeline.add_group("reconstruction") reconstruction.add_step("reconstruct", pointcloud_reconstruct) reconstruction.add_step("simplify", simplify_mesh) reconstruction.add_step("texturize", texturize_mesh) Figure 4.2: Example pipeline usage. Demonstrating the creation of a new pipeline with a point cloud input, two config arguments, and a single group with 3 steps. def pointcloud_reconstruct( in_pointcloud_e57: Path, out_scene_high_res_ply: Path, cfg: PointCloudConfig): ... def simplify_mesh( in_scene_high_res_ply: Path, out_scene_ply: Path, cfg: ReconstructionConfig): ... def texturize_mesh( in_scene_ply: Path, in_scene_high_res_ply: Path, out_scene_obj: Path, cfg: ReconstructionConfig): ... Figure 4.3: Example step definitions. When executing the pipeline, the function arguments are filled in automatically. 
Input and output paths are derived from the parameter names. executed in order. An input to the pipeline is added, and the pipeline expects the file /_in/pointcloud.e57 to exist. An artifact is created for this file, which is used as the input for the first step. The function definitions for the three steps are shown in Fig. 4.3. The first function has one input argument, which is mapped to the existing pipeline input pointcloud.e57. It produces a scene_high_res.ply file as an output artifact. The next step takes 35 4. Implementation this artifact as input, the file path of the artifact created by the previous step will be passed as an argument by the pipeline. This step, in turn, produces a scene.ply file. The last step takes both artifacts and produces a scene.obj file. Note that each of the steps takes another argument. This argument provides configuration parameters used by the step. These arguments are passed when constructing the pipeline. 4.3 Reconstruction This section covers the implementation of the mesh reconstruction from both input modalities: point clouds and images. 4.3.1 From Point Clouds The reconstruction process of point clouds is done using Screened Poisson surface recon- struction [38]. The input for this stage is an e57 file, which is processed using MeshLab with the Python library. Per-vertex normals are required. If they are not present in the input data, they are optionally computed at this stage. This reconstruction might introduce errors in the form of additional geometry away from the input points. To clean this up, each vertex’s distance to the closest point in the input point cloud is calculated, and vertices that exceed a certain distance threshold are removed. Non-manifold edges, which can cause issues in subsequent processing steps, are repaired using MeshLab by removing faces until all edges are manifold. Additionally, MeshLab is used to close small holes in the mesh. After generating the high-resolution mesh, the next step is simplification, again using MeshLab with the quadratic edge collapse function [20]. This simplification process involves using a very low planar weight to favor the reduction of polygons in flat regions. This approach helps maintain important geometric details while reducing the overall complexity of the mesh. The simplified mesh is then saved for further processing. Optionally, for meshes in the size of millions of vertices, cluster decimation 8 can be used. While potentially introducing some artifacts, this algorithm runs significantly faster. In our tests on a 60 million vertex mesh, this algorithm took only 20 minutes, while the quadratic edge collapse algorithm had to be stopped after 12 hours. The next step is texturizing the mesh, which uses Blender as a subprocess. Blender receives a Python script that implements the texturizing process, with file paths passed via the command line. The goal is to transfer the vertex color information of the denser high-resolution mesh to a texture on the low-resolution mesh. This is achieved using Render Baking. Render baking is a process where the lighting information is pre-computed and stored in texture maps. We can tell Blender to perform this baking from one object to another. 8https://pymeshlab.readthedocs.io/en/latest/filter_list.html#meshing_ decimation_clustering 36 4.3. Reconstruction Rays are cast inwards from the low-resolution object onto the high-resolution object. A cage is used so that the rays originate outside the object to ensure they do not miss the target object. 
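A condensed sketch of this selected-to-active bake via Blender's Python API might look as follows. The object names and the cage extrusion value are illustrative, and the full script additionally creates the materials, UV layout, and target image before baking.

import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'  # baking is performed with the Cycles renderer

high = bpy.data.objects["scene_high_res"]  # vertex-colored source (name illustrative)
low = bpy.data.objects["scene"]            # UV-unwrapped target (name illustrative)

bpy.ops.object.select_all(action='DESELECT')
high.select_set(True)
low.select_set(True)
bpy.context.view_layer.objects.active = low  # the active object receives the bake

bpy.ops.object.bake(
    type='DIFFUSE',
    pass_filter={'COLOR'},         # color only, no direct/indirect lighting
    use_selected_to_active=True,   # cast rays from the low-res onto the high-res mesh
    use_cage=True,
    cage_extrusion=0.05,           # move ray origins outward so they do not miss
)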
In this process, the Cycles renderer is set to bake only the diffuse lighting pass, which captures only the color information without additional effects. To create the texture, the high-resolution and low-resolution meshes are loaded into Blender. A material and shader are created to render the high-resolution mesh from vertex colors. Then the low-resolution mesh is UV unwrapped and a new material and texture are created. Render baking is then performed from the high-resolution to the low-resolution mesh. Finally, the texturized mesh is exported as an obj file. Additionally, a blend file is saved, which retains all the Blender-specific settings and materials, making it easier to debug this process. 4.3.2 From Images The reconstruction from images is done with photogrammetry in RealityCapture. This process involves multiple steps, some requiring custom parameters to be set. RealityCap- ture provides a command-line interface that allows multiple commands and arguments to be chained together. The arguments for these commands are stored in xml files, which are maintained and distributed with the Python project. To allow dynamic arguments, these xml files are copied to a temporary location, and relevant parameters such as texture resolution and face count are changed. The photogrammetry process begins by aligning the images, a step where the software matches corresponding points in different photos to determine the positions and orienta- tions of the cameras. During this phase, RealityCapture creates components, which are groups of images that have been successfully aligned. These components are then tried to be merged, which can potentially form a larger component containing more aligned cameras. GPS information is not used during alignment due to the insufficient accuracy of consumer-grade devices. Once the initial alignment is complete, manual intervention is required. The user must select the reconstruction region, which ensures that the reconstruction efforts focus on the scene’s most relevant parts. Additionally, the user may need to adjust the ground plane to ensure the model’s accurate orientation. Following the manual adjustments, RealityCapture computes the 3D model using the default configuration. This model is then simplified to reduce its complexity while preserving details, making it more manageable for further processing and visualization. Texturizing the model is the next step, which involves UV unwrapping the mesh and then mapping high-resolution image data onto its surface. Finally, the model and the camera registrations are exported. The model export includes all the geometric and texture information, while the camera registrations provide the positions and orientations of the cameras used in the photogrammetry process. 37 4. Implementation 4.4 Semantic Segmentation The 3D semantic segmentation step uses a virtual view approach as described in section 3.5. In this approach, 2D views of the scene are rendered, semantically segmented, and the resulting per-pixel labels are projected back onto the mesh. To leverage the topology of the mesh, a prior over-segmentation is performed. 4.4.1 Over-Segmentation The first step in the segmentation process is performing an over-segmentation using the Felzenszwalb algorithm adapted for mesh segmentation. The Felzenszwalb segmentation algorithm is a graph-based method. It is designed to segment images by creating a graph where each node is a pixel and edges are obtained from pixel adjacency. 
The edges are weighted based on the color difference of the pixels. The Felzenszwalb algorithm creates a partition of the graph such that nodes within the same segment are more similar to each other than to nodes in other segments. The algorithm achieves this by iteratively merging nodes according to edge weights.

The Felzenszwalb algorithm can naturally be used with the graph-like structure of meshes. The mesh with its vertices and edges is directly used as the input graph. Edge weights are based on local properties of the adjoining vertices. Our implementation of the Felzenszwalb segmentation algorithm is adapted from the C++ implementation used in the ScanNet [13] project. It has been rewritten and optimized using Python and NumPy to vectorize operations where possible. The algorithm produces a mapping of each vertex in the mesh to a corresponding component ID.

The Universe class is a data structure used to manage the segmentation process. It initializes each vertex as its own segment and provides methods to find and merge segments efficiently. This helps track which vertices belong to which segments as they are merged.

Initially, a weight is calculated for each edge in the input graph. A larger weight means a larger dissimilarity between the vertices the edge connects. The edge weights are influenced by two properties:

1. The first one considers the difference in vertex normals. The calculation is based on their dot product. Let d = n1 · n2 be the dot product of the two normals. The dot product ranges from −1 for parallel vectors facing away from each other, over 0 for perpendicular vectors, to 1 for parallel vectors facing the same direction. The edge weight is set to w_geo = 1 − d, so that normals facing the same direction receive the lowest weight. We want concave regions to have a higher weight than convex regions. We therefore square the weight if the normals face away from each other, which decreases the weight for regions that are only slightly convex.

2. The next factor is calculated from the color difference of the two vertices. This is simply the Euclidean distance of the RGB color values.

The two weights are then added together. The influence of each factor can be controlled in the config file. Then, the edges are sorted by weight and processed in order. The algorithm looks at the two segments connected by each edge, merging them if the edge weight is below a certain threshold. This threshold is dynamically adjusted to ensure meaningful segments. When merging the components, a maximum size can be configured to reduce potential under-segmentation. After the initial segmentation, the algorithm merges small segments to meet size requirements and renumbers segments to ensure they are consecutive.

4.4.2 View Sampling

A good selection of virtual views is critical for good segmentation performance. A virtual view of the scene is given by the virtual camera's position, orientation, aspect ratio, and field of view. These parameters are stored together in a 4x4 transformation matrix. Each part of the scene must be covered. Furthermore, having perspectives similar to those used in training the 2D semantic segmentation model further improves the performance. The following approaches for sampling views have been implemented:

Uniform Uniformly sampling the scene is done by selecting points of the scene on a grid with even spacing. Per location, multiple orientations are obtained by rotating the camera at each point.
The problem with this sampling technique is that these grid points often fall within geometry or are too close to objects. This makes many of the sampled views unusable. Random Randomly sampling the scene is done by randomly selecting a vertex and then moving a certain distance away from it along its normal. This decreases the chance of the camera being too close to the mesh. However, in certain conditions, this problem can still occur. Manual (from the street) For the street sampling approach, the user has to specify a set of points on the road manually. Then, the camera is positioned at different heights above these points. The camera is rotated by 45° steps for each point to ensure each view is covered. 4.4.3 Rendering The next step is rendering the scene from each view. This is done using OpenGL, specifically the Python wrapper ModernGL. The scene mesh Wavefront obj is imported from the previous stages. It may contain multiple textures stored in the same directory. To render the scene, a shader receives the geometry, view matrix, projection matrix, and 39 4. Implementation texture. The rendering is very simple, without any lighting effects. The value of the sampled texture is directly written as the fragment color output. Before rendering, the color buffer is cleared with a light blue to simulate a simple sky background. This proved to produce better segmentation performance than using a transparent background (see Section 5.3.2). After rendering, the color and depth buffer are read back into Numpy arrays. The depth buffer is then analyzed. Pixels where the depth buffer equals 1 are regions where no geometry was rendered. The ratio between the number of these pixels and the total number of pixels in the image is calculated. Renderings where more than a certain percentage of pixels are empty will be thrown away, as they do not provide sufficient information to perform meaningful 2D semantic segmentation. 4.4.4 2D Segmentation Performing 2D semantic segmentation is done using models pre-trained on CityScapes. mmSegmentation is a segmentation toolbox written in Python. It provides access to many state-of-the-art models pre-trained on various different datasets. The model used can be changed in the configuration file. By default, mask2former [10] is used. The renderings obtained from OpenGL are fed into the model, which returns a 2D array containing the predicted class for each pixel. 4.4.5 Back Projection The final step involves projecting the per-pixel class labels back onto the 3D mesh, respecting the prior over-segmentation. We accumulate predictions on the segments by looking up the closest vertex for each predicted pixel and fetching its segment. Then, after all predictions are accumulated, we choose the best class for each segment and assign the label to all its vertices. A naive approach using raycasting for each pixel is very inefficient. Since we know the exact camera parameters, we can leverage OpenGL to create a mapping from pixels to vertex IDs. For this, we need to render the scene with each fragment being set to the vertex ID of the closest vertex. Since the vertex IDs are not available in the fragment shader, we need to use a geometry shader to pass them through. This geometry shader re-emits each triangle. For each emitted triangle, the shader assigns vertex IDs and barycentric coordinates to each triangle vertex. This approach ensures that each triangle has its own vertices, allowing us to store the three vertex IDs and barycentric coordinates for each vertex. 
This setup is necessary because we need access to all three vertices within the fragment shader to find the id of the closest vertex. In the fragment shader, barycentric coordinates are automatically interpolated. By examining these coordinates, we can determine the closest vertex to each fragment. The shader reads the corresponding vertex ID and writes it to the shader output. This method efficiently determines the nearest vertex for each fragment. 40 4.5. Road Networks The buffer output from the fragment shader is read as a Numpy array, mapping each pixel to the closest vertex of the mesh. We index the over-segmentation array using this mapping to create a pixel-to-segment mapping. The predicted 2D classes can be projected onto the mesh using this mapping. An array of size (segments, classes) stores the segmentation results. For each pixel in each 2D semantic segmentation, the corresponding segment’s class counter is incremented by one. To improve efficiency, instead of iterating over all pixels, the arrays are converted to tensors. Let’s call the flattened pixel-to-segment mapping seg and the flattened predicted segmentation pred. This gives us two tensors of equal length, where the i-th entry corresponds to a prediction of the class pred[i] for the segment seg[i]. The index_put_ function from PyTorch allows us to index the accumulation tensor by (seg, pred) and add one to each entry of accumulation[seg[i], pred[i]]. The method allows for duplicates and runs in parallel, which significantly speeds up the process. Once all views are processed, we are left with the accumulation array containing all sampled classifications for each component. To get per vertex class labels the accumulation array is indexed by the component mapping. Resulting in an array of size (vertices, classes), where each entry represents the count of samples recorded for each class- vertex combination. Weights can be applied to the class labels as an optional improvement. For example, if a particular class is frequently misclassified, applying a larger weight to said class can make positive identifications more impactful. To determine the final class label for each vertex, we select the class with the highest (weighted) sample count. This is accomplished by taking the array’s argmax, which identifies the class with the maximum value. Since this approach increments the sample count for each pixel, larger triangles on the image contribute more samples to each of their vertices than smaller ones. This in turn makes regions that are closer to the camera more impactful. 4.5 Road Networks The source for road network information is the OpenStreetMap9 (OSM) API. This API can be queried with a geographical bounding box to return map data as an osm file. As described in Section 3.6, georeferencing data comes either from the data source or user input. An important aspect to consider when working with geographical positions is the usage of different spatial reference systems. These systems define how locations on the Earth’s surface are measured. These coordinate systems are often given as an EPSG code, which refers to an entry in the Geodetic Parameter Dataset created by the European Petroleum 9https://www.openstreetmap.org 41 4. Implementation Survey Group10. OpenStreetMap uses the WGS-84 coordinate system (EPSG:4326), which is typically used for GPS applications and commonly found in EXIF metadata. For example, the Cumberlandstraße dataset located in Vienna (see Section 5.1.3) has its origin at 48.19212, 16.2955 in EPSG:4326. 
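To illustrate the bounding-box query described above, the following sketch requests an osm extract around this origin from the public OpenStreetMap API. The bounding box values are only illustrative, and the pipeline does not necessarily use this exact endpoint or client code.

import requests

# Bounding box as (min_lon, min_lat, max_lon, max_lat) around the scene origin.
bbox = (16.2945, 48.1915, 16.2965, 48.1928)

# The public OSM API v0.6 returns an XML extract for a bounding box.
response = requests.get(
    "https://api.openstreetmap.org/api/0.6/map",
    params={"bbox": ",".join(str(v) for v in bbox)},
    timeout=60,
)
response.raise_for_status()

with open("region.osm", "wb") as f:
    f.write(response.content)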
Conversion is necessary to support datasets recorded in different coordinate systems. This conversion can be performed using PyProj11, which creates a transformer between two coordinate systems. It ensures the georeferencing is given in the correct system required for the OpenStreetMap API. Pyproj is also used to align the mesh and the OpenDrive file, which have different local coordinate systems. For each, a global offset that places them within a specific region of a larger map is given. This offset allows the local coordinates (x, y) to be translated into global coordinates (latitude, longitude). The alignment is performed by translating the mesh. This is done by first transforming the mesh coordinate system into EPSG:4326. Then, the EPSG:4326 transformation is applied to convert this offset from lat/long to x/y in a local coordinate system. This is exactly the system in which the offsets in the OpenDrive file are given. Adding these offsets gives a 2D translation between the two local coordinate systems. Applying this offset to the mesh completes the alignment. 4.6 Mesh Splitting After obtaining the semantic segmentation for the mesh, the mesh needs to be split into individual sub-meshes, one for each class. CARLA requires each mesh to be separated to use the semantic information. The pipeline uses Blender to perform this splitting. The semantic segmentation is provided as a per-vertex class label in a NumPy array stored as a npy file. A Blender script receives the path to the mesh and the segmentation. The script then stores the class labels as vertex attributes and assigns vertex colors based on the standard Cityscapes color palette. The class labels must be stored as vertex attributes to preserve them when modifying the mesh, as the vertex ids change after each splitting operation. Additionally, another layer of vertex colors is added to store debugging metrics. The script iterates over each class label, selecting all vertices belonging to that class. It then uses the Blender "separate" operation to split the selection into a new mesh. The resulting meshes are named according to their class. CARLA can automatically detect some classes based on their mesh names, so they are assigned names that CARLA recognizes. The names that overlap with our defined classes are Road_Road for road and Road_Sidewalk for sidewalk. The remaining classes must be assigned manually inside CARLA if required. This can be done by moving the meshes into the corresponding sub-directory for each class. 10https://epsg.io/ 11https://pyproj4.github.io/pyproj/stable/ 42 CHAPTER 5 Evaluation This thesis aims to build a pipeline that can create a semantically labeled digital twin of streets and intersections. This digital twin should be photorealistic, have an accurate mesh, and have high-resolution textures. The semantic segmentation should accurately separate the mesh into relevant regions, such as roads, sidewalks, and buildings. The pipeline should automate as many steps as possible, requiring no or minimal user intervention. The pipeline’s running time is not a critical factor. If the tool has to run for many hours, it is acceptable if this can be done offline without relying on user input too often. To evaluate how well the created pipeline satisfies these requirements, various experiments on the individual steps and the pipeline itself were performed. These experiments were done through comparative analyses and controlled tests with quantitative metrics. 
The experiments were designed to answer the following research questions: • Does the pipeline produce high-quality meshes and textures from image inputs? • Does the pipeline produce high-quality meshes and textures from point cloud inputs? • How does the reconstruction quality differ between both input modalities? • What are good parameters for the mesh reconstruction from a point cloud? • What is an appropriate mesh resolution for the usage in ADSs? • What is an appropriate texture resolution for the usage in ADSs? • What influence do different image-capturing strategies have on the reconstructed mesh? • Can virtual view semantic segmentation produce an adequate semantic segmentation of the reconstructed mesh? 43 5. Evaluation • How do the parameters of the segmentation process (over-segmentation, sampling strategy, 2D segmentation model) influence the segmentation quality? • How long does the pipeline and each step run? • To what degree is the pipeline automated? The following steps were performed to answer these research questions: Point cloud and image datasets were captured using different hardware and strategies. The pipeline was used to create reconstructions of these datasets. Some of the reconstructions were further cleaned up and annotated manually to provide a ground truth dataset for benchmarking. Then, we demonstrate the visual quality of the reconstructions by analyzing the meshes and textures. By having point cloud and image data from the same scenes captured simultaneously, a direct comparison between reconstructions of both modalities could be performed. The effect of different parameters for the mesh reconstruction was observed by highlighting differences in the reconstructed scenes. The semantic segmentation performance was tested with the ground truth datasets, and the impact of different parameters and approaches was analyzed. Finally, we measured the running time of each step and analyzed the degree of automation of the pipeline. 5.1 Data This section discusses the data used to build and evaluate the pipeline. The quality of the input data is crucial for the quality of the reconstruction of the digital twin. The pipeline needs either a high-density and precise point cloud with RGB and normal information or a set of high-quality images covering the whole scene from multiple perspectives. To the best of our knowledge, no image-based datasets within our desired domain and quality exist. There are publicly available point cloud datasets, however, they either lack the required point density [5, 16] or normal information [52]. While normals can be calculated, it does not produce accurate results, especially for thin geoemtry like traffic signs. This thesis covers the entire process of creating a digital twin, starting from the data acquisition. We capture our own data, allowing us to investigate the feasibility of the different techniques and create guidelines for doing so with good results. 5.1.1 Handheld Camera The first datasets created for the pipeline were image-based datasets of different locations in Vienna. Using a handheld camera allowed us to be very flexible in terms of what locations we could capture. It gave us direct control over how the images were captured and what perspectives were covered. We show, that when strategically capturing images with just a handheld camera, good reconstructions can be achieved. 44 5.1. Data Capturing The images were captured with a DSLR, a Canon EOS 100D, which has an 18 MP crop sensor. 
An 18mm lens provided a broad field of view, ensuring an appropriate overlap between images. Capturing the first datasets quickly revealed problem areas that needed special attention. The photogrammetry pipeline had difficulty reconstructing low-texture and reflective regions. In particular, roads and cars were not reconstructed well. Other areas were reconstructed much better, showing promise for the technique and leading us to pursue it further.

Creating usable image data for the pipeline was an iterative process. The quality of the reconstructions was continuously improved by capturing new datasets with more images, better coverage, and more thought put into capturing the images. This revealed that many issues can be improved with more images covering problem areas from different viewpoints.

The process of capturing was the following: First, wide shots were taken to capture the context of the whole scene. This was done by walking a route around the scene multiple times. We walked on the sidewalk and the road, taking images roughly every meter. Four passes were performed, with the direction in which the images were taken being rotated by 90 degrees each time. This approach does not produce full 360° coverage with our camera but is sufficient for aligning these images with more close-up shots and covering most parts of the scenes from multiple viewpoints. Next, more close-up shots were taken, especially focusing on problem areas. This included shots facing the ground, with the street covering more than half of the image, helping the photogrammetry process find more features in these low-textured regions, which results in a better reconstruction of the ground. Finally, detail shots were taken of smaller objects with more complex geometry. This mainly included traffic signs, poles, and cars. To capture them in full detail, images were taken by orbiting around each of them multiple times at different heights and distances. This ensures that there is plenty of data available for a detailed reconstruction. Example images and the distribution of views can be seen in Fig. 5.1.

These datasets consist of 500 to 1200 images per scene covering 1000 to 2000 m2. Images were captured in a raw format and processed to even out the lighting. This was done by lifting the shadows and reducing the highlights, which reduces the hardness of shadows and creates more uniform lighting. Fig. 5.2 shows a comparison between a raw and an edited image. Capturing these datasets took about 30 minutes each. Most of the time was spent walking the loop four times while trying to avoid capturing pedestrians and cars, as moving objects can negatively impact the alignment process.

Figure 5.1: Example input images (a) and their registrations in the scene (b, shown in RealityCapture). The input shots are specifically taken to capture more features of the road surface. The strategies used to capture the images can be seen here: walking a path around the scene on the sidewalk, walking on the road, and orbiting around traffic signs.

Figure 5.2: A comparison of a captured image before (a, raw) and after (b, edited) processing.

5.1.2 NavVis VLX

The next device used for data acquisition was the NavVis VLX (as seen in Fig. 5.3). This mobile mapping device features two LiDAR scanners and four 18 MP cameras. The device uses simultaneous localization and mapping (SLAM) and loop-closing algorithms to create a complete point cloud of the scene.
The registration of the scans is done only with the IMU and LiDAR data. The images are stitched together to form a 360° panorama, which is then used to color the points. Offline processing further improves the results, filtering out moving objects such as cars and pedestrians, uniformly sampling the point cloud, and calculating normals. This process produces a very dense and high-precision point cloud.

The usage of the VLX is simple. The device is worn on the shoulders and can be controlled with a small display in the front. While recording, a live preview of the captured point cloud centered around the device is displayed. LiDAR data is captured automatically while the user walks around the scene. Images need to be captured manually by pressing a trigger button on the device.

This device allowed us to capture LiDAR and image data of the same scene at the same time, allowing for a direct comparison of the two strategies of the pipeline. Capturing images with a 360° view makes creating image-based datasets much easier and faster, eliminating the need to walk over each location multiple times. The images captured with the device replace the wide, establishing shots described in the previous section. This data was augmented by images captured with a handheld DSLR, focusing on traffic signs, the road surface, and cars.

Figure 5.3: The NavVis VLX mobile mapping device

Point Cloud The point cloud obtained after processing by NavVis can be seen in Fig. 5.4. The close-up render shows the density of the point cloud and how well it captures textural information. The geometry is very precise, with noise-free flat surfaces and detailed small objects such as poles. The captured datasets contain roughly 50 million to 150 million points for scenes of 800 m2 to 2000 m2 in size.

Images The images captured by this device are embedded in the e57 point cloud as panorama images. They have a resolution of 8192 by 4096 pixels. To use them in the photogrammetry pipeline, they have been projected to a cube map, with each side stored as an individual file with a resolution of 3072 by 3072 pixels. Images are captured manually by pressing a button on the device. This gives control over the exact position and frequency at which images are captured. We captured images roughly every two steps, giving us about 300 panorama images per scene, which results in 6 times as many unprojected images usable for the photogrammetry strategy.

Figure 5.4: A point cloud captured with the NavVis VLX

5.1.3 Datasets

The methods presented in this thesis are evaluated on three different custom datasets. All three have point cloud data captured with the VLX, and two of the datasets contain images captured by the VLX and additionally a handheld DSLR. Table 5.1 gives an overview of the datasets, their area, and input data count.

Dataset       Ground Area   Points          Images from VLX   Images from DSLR
Cumberland    2000 m2       133.6 million   1949              887
Jenullgasse   740 m2        78.3 million    1307              1396
Mex           810 m2        69.8 million    -                 -

Table 5.1: Statistics of our captured datasets

Cumberlandstraße The first dataset, Cumberlandstraße, spans an area of 2000 m2 around an intersection in Vienna. It features a complex intersection with many traffic signs, road markings, lanes, and vegetation. The scene was captured with the laser scanner and the handheld camera. The path taken through the scene can be seen in Fig. 5.5. Images and laser scans were captured by walking on the sidewalk and road.
Further images were captured by walking around complex objects such as traffic signs. The raw data contains 133.6 million points for the point cloud and 2836 images, 1949 of which come from the scanner and the remaining 887 from the DSLR.

Figure 5.5: Cumberlandstraße: camera registrations (a) and mapping path (b)

Jenullgasse The next scene, Jenullgasse, was also captured in Vienna. Since reflective and transparent surfaces can be challenging for LiDAR scanners and the photogrammetry process, this location was specifically chosen, as it contains many parked cars. Again, the data was captured by walking on the sidewalk or road and by orbiting around objects of interest. The path can be seen in Fig. 5.6. We captured each car from many different perspectives, providing the photogrammetry pipeline with ample data for reconstructing these challenging objects. In total, 2703 images were captured, 1307 of which stem from the VLX. The point cloud has 78.3 million points with a scene size of 740 m2.

Figure 5.6: Jenullgasse: camera registrations (a) and mapping path (b)

Mex The last scene is from an intersection in a town in Switzerland called Mex. This one was only captured with the VLX scanner. The point cloud has 69.8 million points, and it spans an area of 810 m2. This dataset was an example provided by NavVis; when capturing it, they did not focus on taking many panorama images of the scene. Thus, the number of images is insufficient for a good reconstruction using photogrammetry.

5.1.4 Ground Truth Semantic Labels

Reconstructed meshes were manually annotated with semantic labels for all three datasets to provide a benchmark for the segmentation part of the pipeline. These meshes are reconstructed from the point clouds, as they have a smoother and more precise topology. To create the annotation, first, a semantic segmentation was performed with the pipeline to get most of the vertices annotated correctly. Then, the manual step of the pipeline, which allows correcting the segmentation, was used (see Section 3.8.4). In Blender, each vertex was inspected and set to the correct label, adhering to the class definitions from CityScapes [12]. Fig. 5.8 shows a top-down view of each scene with the corresponding annotations.

Figure 5.7: Class distribution of the ground truth annotations for all three datasets combined

In Fig. 5.7, the distribution of the labels can be seen. Due to the nature of the domain, the datasets do not have an even class distribution, and some of the classes are not present in our dataset. Predictions for these classes are filtered out by default and are not considered when calculating the evaluation metrics. Our annotated ground truth datasets use the following subset of the CityScapes classes: road, sidewalk, building, wall, fence, pole, traffic sign, vegetation, terrain, car, motorcycle, and bicycle. Creating the ground truth annotations took a total of 10 hours for all three datasets.
Figure 5.8: Ground truth annotations on Mex (a, b), Cumberlandstraße (c, d), and Jenullgasse (e, f)

5.2 Reconstruction

This section examines the reconstructions generated by the pipeline. It begins with a visual demonstration and description of reconstructed meshes from both input modalities. Next, four experiments were performed to investigate the influence of differences in input data and configurations. This was done using quantitative and descriptive analysis. In doing so, good baseline parameters for the pipeline configuration were defined.

5.2.1 Results

In this section, we showcase reconstructions from images and point clouds and then highlight their differences and challenges.

From Images The image-based datasets described in Sections 5.1.1 and 5.1.2 produced usable reconstruction results, albeit not artifact-free. While most areas were reconstructed accurately, some scenes contain obvious issues. These issues are mainly visible on cars and the road. Fig. 5.9 highlights renderings of reconstructions from this data. In the reconstruction process, not all of the input images could be aligned by RealityCapture. The reconstructions of Cumberlandstraße and Jenullgasse used 68.8% and 51.8% of the images, respectively.

In Sub-figure 5.9a, a wall with a lot of textural and geometric variation can be seen. The reconstruction here is very accurate, and the resulting texture is sharp and has a high resolution. In Sub-figure 5.9b, a big hole in the street can be observed. This is a significant issue, as it interferes with the physics simulation of vehicles. In some cases, it might be impossible to drive on these areas without manually fixing the mesh. Such holes appear because these regions have less textural variation, which results in fewer features usable by the photogrammetry software. This issue did not occur with our later datasets, which contain more images focusing on capturing details on the ground. Furthermore, cars were most often reconstructed with many artifacts, especially around the windows. In Sub-figure 5.9c, two traffic signs can be seen. These signs were specifically targeted with orbiting detail shots. However, there are still artifacts in the form of holes, cut-off portions, and blurry textures. This shows how challenging the flat, uniform, and reflective surfaces of the signs themselves are.

Another common issue with image-based reconstructions is the appearance of additional, often floating geometry. This happens especially at the outer and upper parts of the mesh. The white/blue texture of these regions hints that they might be created by RealityCapture wrongly estimating the distance of the sky and clouds. Fig. 5.10 shows these artifacts in a reconstructed mesh.

Figure 5.9: Results of a reconstruction using an image-based dataset. (a) Good reconstruction in areas with high texture variation. (b) Deformations on cars and the road. (c) Traffic signs are sometimes only partially reconstructed.

Figure 5.10: A scene reconstructed from images, showcasing floating artifacts.

From Point Clouds Reconstructions from point clouds produced very smooth and precise meshes. Fig. 5.11 highlights such a reconstructed mesh. The ground was fully reconstructed without any holes or major defects.
Cars and bikes appear with detailed geometry, and there are only minor artifacts, such as their windows not being reconstructed. However, the texture of the reflective surfaces of cars does not accurately represent the cars' true color, showing visible reflections of the surrounding area. This is an artifact that is propagated from the input data. Thin objects like traffic signs and poles are sometimes only partially reconstructed. The generated textures are of a high quality, and road markings are clearly visible and sharp. This is made possible by the high-density input point cloud used for the reconstructions.

Figure 5.11: Jenullgasse reconstructed from a point cloud

Comparison

Having two datasets of the same scenes with both modalities allows for a direct comparison of the strategies. As seen in Fig. 5.12 and 5.13, both modalities produce accurate meshes with few striking visible artifacts when seen from far away.

Figure 5.12: Cumberlandstraße reconstructed. (a) From a point cloud; (b) from images.

With both strategies, the meshes contain no holes in the road, which is crucial for the driving simulation. The reconstruction from the point cloud produces a very uniform mesh, while the photogrammetry mesh is more noisy and jagged. However, the road surface is smooth enough and does not behave differently in manual simulated driving tests. Both meshes have high-resolution textures, allowing the identification of road markings and traffic signs.

However, the reconstructions are not perfect. Issues arise especially with regard to traffic signs and cars. In Fig. 5.14, renderings of reconstructions from both data sources are shown from the same perspective. Using these images, we highlight the key differences between the two results.

In both reconstructions, traffic signs often have holes or are missing entirely. Overall, the signs from the photogrammetry strategy show such artifacts less frequently. However, since the raw point cloud does contain the sections of the signs that are missing from the mesh, the artifacts are possibly due to the geometry being too thin and sometimes being dismissed in the Poisson surface reconstruction [38]. The text on the traffic signs is sharper and more readable in the image-based reconstruction.

Figure 5.13: Jenullgasse reconstructed. (a) From a point cloud; (b) from images.

The cars reconstructed from the LiDAR data have a mostly accurate geometry. Geometric artifacts are mainly visible in the transparent windows. The texturing is not accurate, showing the reflections of the surrounding area. These reflections are visible in the raw input data of the point cloud and arise due to the way color information is mapped from the RGB images to the points. The cars reconstructed from images show significant defects in all cases. Due to the reflection and transparency, the photogrammetry process fails to find enough matching features on these surfaces.

Overall, the reconstructions from both modalities produce very similar high-quality results. Both strategies proved to be viable for the creation of digital twins of streets and intersections.

Figure 5.14: Direct comparison of mesh and texture reconstructions from both modalities. (a) From images; (b) from a point cloud.

5.2.2 Poisson Reconstruction Depth

With this experiment, we explore the influence of the Poisson surface reconstruction [38] depth on the resulting reconstruction. The algorithm is only used with the point cloud strategy.
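A minimal sketch of this reconstruction step is given below, using Open3D's screened Poisson implementation as a stand-in (the pipeline's actual tooling is not shown here and may differ); the depth parameter examined in this experiment is passed directly to the solver, and the input point cloud must already carry RGB colors and normals:

```python
import open3d as o3d

# Load a colored point cloud that already contains per-point normals,
# e.g. an export of the NavVis VLX scan (file name is hypothetical).
pcd = o3d.io.read_point_cloud("jenullgasse.ply")

# Screened Poisson surface reconstruction; `depth` is the octree depth
# discussed in this experiment (each +1 doubles the spatial resolution).
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=12)

o3d.io.write_triangle_mesh("jenullgasse_mesh.ply", mesh)
```

The per-vertex densities returned by the solver are one possible handle for trimming low-support regions during the cleanup step mentioned below; whether the pipeline uses this particular mechanism is not specified here.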
The choice of the Poisson surface reconstruction depth controls the depth of the underlying octree used in the Poisson surface reconstruction algorithm [38]. Each increase in depth doubles the spatial resolution along each axis. It significantly impacts processing time and mesh resolution. This experiment aims to evaluate the differences in the resulting reconstruction at varying reconstruction depths. Different features become visible at different reconstruction depths; in particular, we are interested in traffic signs and road markings.

Methodology

For this experiment, the three point-cloud-based datasets obtained from the NavVis VLX were used (see Section 5.1.2). By running the first steps of the pipeline, the point cloud was first reconstructed to a mesh, then artifacts were cleaned up, and finally, the mesh was textured. This allowed us to measure processing and post-processing times. The scene was reconstructed at depths 10 to 13. The upper limit was set to 13 because reconstructions beyond that resulted in meshes of unmanageable size with hundreds of millions of faces. Critical areas of interest were highlighted and compared between the reconstruction depths. The reconstruction and further processing durations were recorded, and renderings of the scene were created for comparison.

To measure the difference between the point cloud and the reconstructed meshes, the Hausdorff distance [4] was used. The Hausdorff distance measures how far apart two meshes are. Given two meshes, A and B, the Hausdorff distance for a point on mesh A is given by the distance to the closest point on mesh B. We measured the distance by sampling points on the point cloud and calculating their Hausdorff distance to the reconstructed mesh. This direction was chosen to minimize errors due to the "bubbling" artifacts in the reconstruction.
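A minimal sketch of this one-sided measurement, approximating the point-to-surface distance by densely sampling the reconstructed mesh (library choice and array names are illustrative; the actual measurements may have been produced with an existing Hausdorff filter such as the one in MeshLab [4]):

```python
import numpy as np
from scipy.spatial import cKDTree

def one_sided_distances(cloud_pts: np.ndarray, mesh_pts: np.ndarray) -> np.ndarray:
    """For every point sampled from the input point cloud, return the distance
    to its nearest neighbour among points densely sampled on the mesh."""
    tree = cKDTree(mesh_pts)
    distances, _ = tree.query(cloud_pts)
    return distances

# Mean distance as reported in Fig. 5.15d and Fig. 5.17, converted from the
# coordinate unit (here assumed to be metres) to millimetres.
# mean_mm = one_sided_distances(cloud_pts, mesh_pts).mean() * 1000.0
```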
Results

Distance Sub-figure 5.15d shows the measured Hausdorff distances from the point cloud to each reconstructed mesh. As the depth increases, the mean distances decrease from 6 mm to 14 mm at a depth of 10 to only 1 mm to 2 mm at a depth of 13.

Vertex Count Increasing the reconstruction depth greatly increases the vertex count of the resulting mesh (see Sub-figure 5.15b). For each increase in depth, the count grows by a factor of 3 to 4. At a depth of 13, this reaches 60 to 80 million vertices. Meshes of this size create new challenges for downstream processing.

Figure 5.15: Quantitative results of the experiments with varying Poisson surface reconstruction depths. (a) Task durations; (b) vertex counts; (c) total reconstruction duration; (d) mean Hausdorff distance from the input point cloud.

Processing Time As seen in Sub-figures 5.15a and 5.15c, the reconstruction time, as well as the duration of the successive tasks, increases with a higher depth. At a depth of 13, reconstruction and cleanup took over two hours, which is acceptable for our purposes. However, the mesh size became unmanageable with our initial simplification implementation. The simplification ran for over 12 hours before we had to cancel it. This increase in computing time could be attributed to MeshLab running out of RAM, which is limited to 32 GB on our testing hardware. We used a different simplification method in this case to still be able to create reconstructions at this depth and evaluate them.

Instead of just using quadric edge collapse [20], we first performed Clustering Decimation¹, which runs significantly faster, even on meshes with millions of vertices. However, this technique introduces unwanted artifacts, so it is only enabled when necessary. This approach was only used for reconstructions created with a reconstruction depth of 13.

¹https://pymeshlab.readthedocs.io/en/0.1.9/filter_list.html#simplification_clustering_decimation

While the measured Hausdorff distances, processing times, and vertex counts provide valuable quantitative data, they do not paint a full picture. Small errors in the mesh may only slightly influence the Hausdorff distance while potentially being detrimental to the use in downstream tasks. We evaluate areas of interest in the reconstructed meshes to provide a more comprehensive assessment. For this, various renderings of the meshes (as seen in Fig. 5.16) are used.

Figure 5.16: Reconstructions from a point cloud at different Poisson depths. (a) Depth 10; (b) depth 11; (c) depth 12; (d) depth 13.

Traffic Signs The first row of images shows a stop sign at all reconstruction depths. At depths 10 and 11, the sign is not visible at all or only partially visible. At depth 12, the sign becomes visible and identifiable as such. However, there are minor artifacts in the form of holes. Interestingly, these artifacts did not occur for axis-aligned signs, as seen in the second row. At a reconstruction depth of 13, the sign is visible in high quality and free of holes or major deformations.

Poles Poles are visible at all reconstruction depths. However, at lower depths, some deformations are visible. With higher depths, they become smoother and more detailed.

Color Resolution The color information is available only as a per-vertex color attribute at this pipeline stage. This color is later used to bake a texture onto a lower-resolution mesh. Thus, the density of the vertices limits the final texture resolution. Features such as road markings are visible at every reconstruction depth. However, they are blurry at the lower depths of 10 and 11. Increasing the depth to 12 or 13 results in noticeably sharper textures.

Discussion

From the results presented above, we can conclude that the choice of reconstruction depth is a tradeoff between quality and performance. A value of 12 provides a good middle ground with small artifacts. Increasing the depth to 13 produces noticeable improvements in the quality of the textures and the mesh at the cost of significantly larger files, longer processing times, higher system requirements, and an additional source of artifacts being introduced. At both resolutions, traffic signs and road markings are clearly visible. Depths beyond 13 are not worth considering, as the input point clouds for this experiment have roughly the same number of points as the meshes reconstructed with a depth of 13 have vertices.

5.2.3 Mesh Resolution

The mesh reconstruction step produces meshes with millions of vertices and faces. Meshes of this size are inefficient and significantly increase the processing time and RAM usage of downstream tasks.
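A hedged sketch of how such a mesh can be simplified with PyMeshLab, combining quadric edge collapse [20] with the clustering-decimation fallback used for the depth-13 meshes above (the filter and parameter names follow the PyMeshLab filter list and may differ between versions; this is not the pipeline's actual code):

```python
import pymeshlab

def simplify(in_path: str, out_path: str, target_faces: int,
             use_clustering_first: bool = False) -> None:
    ms = pymeshlab.MeshSet()
    ms.load_new_mesh(in_path)

    if use_clustering_first:
        # Coarse but fast pre-reduction for very large meshes; introduces
        # artifacts, so it is only enabled when necessary.
        ms.apply_filter("simplification_clustering_decimation")

    # Quadric edge collapse decimation down to the requested face count.
    ms.apply_filter("simplification_quadric_edge_collapse_decimation",
                    targetfacenum=target_faces,
                    preservenormal=True)

    ms.save_current_mesh(out_path)

# e.g. simplify("mesh_depth12.ply", "mesh_500k.ply", target_faces=500_000)
```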
With this experiment, we investigate how simplifying the reconstructed mesh to different face counts influences the reconstruction quality. This is relevant for both strategies, meshes from point clouds and from photogrammetry.

Methodology

In this experiment, our three point-cloud-based datasets were used again. First, a mesh reconstruction was performed using a Poisson depth of 12, ensuring that high-quality meshes were created. The meshes were simplified to face counts of 100,000, 500,000, and 1,000,000. The simplification was done using Quadric Edge Collapse Decimation [20], which allows decimation to a specific target face count. The Hausdorff distance from the input point cloud to the simplified mesh was calculated. We are interested in the mean of the Hausdorff distances of all points, giving us the average distance from the point cloud.

Furthermore, simplified reconstructions at the same face counts were created for the image-based reconstructions. This allowed us to evaluate the differences caused by varying face counts for both modalities. Areas of interest were compared for all face counts. These areas include the quality of thin objects such as signs and poles and how efficiently flat areas are stored.

Results

Fig. 5.17 shows the Hausdorff distances measured from the point cloud to the reconstructed meshes for each point-cloud-based dataset. For comparison, we also calculated the Hausdorff distance to the non-simplified meshes, which have 16 to 52 million faces. The mean Hausdorff distance improves significantly with an increasing face count. At a face count of 1,000,000, the Hausdorff distance approaches the distance the full-resolution mesh achieves. When comparing the point counts of the datasets (see Table 5.1) with this figure, we observe that the Hausdorff distance correlates with the dataset size. This shows that larger scenes require a higher mesh resolution for a similar reconstruction quality.

Figure 5.17: Hausdorff distance for different face counts

Fig. 5.18 shows a pole with four traffic signs for each reconstructed mesh. The reconstructions from images and point clouds behave similarly for this object. A resolution of 100,000 faces is insufficient, as some signs are completely absent in the reconstruction. When the resolution is increased to 500,000 faces, the signs are reconstructed with a few polygons. This is enough to identify them, but increasing the face count to 1,000,000 further improves the quality of the signs. Fig. 5.19 shows a larger segment of the scene. Here, we can observe the difference between the two input modalities. The meshes reconstructed from the point cloud have a more uniform face distribution.

Figure 5.18: A pole with traffic signs simplified to different face counts for both modalities. (a) 100,000 faces; (b) 500,000 faces; (c) 1,000,000 faces.

Discussion

The reconstructions from both modalities produce similar results and artifacts at similar face counts. Higher resolutions increase the mesh quality while increasing the file size and reducing performance. Another factor to be considered is the topology of the mesh. In some cases, object boundaries might not align with the edges of the mesh. This is especially true for low sidewalks, which become flat at low mesh resolutions.
When that happens, the segmentation step cannot produce accurate and clear boundaries. In some cases, the mesh resolution thus constrains the maximum possible segmentation quality. Depending on the scene size and complexity, a face count of 500,000 to 1,000,000 is recommended. If the segmentation produces edges that are not aligned with object boundaries, increasing the face count might help.

Figure 5.19: A reconstruction simplified to different face counts for both modalities. (a) 100,000 faces; (b) 500,000 faces; (c) 1,000,000 faces.

5.2.4 Texture Resolutions

This experiment aims to compare the quality of different texture resolutions for both input modalities.

Methodology

Image-based and point-cloud-based reconstructions of Cumberlandstraße were used for this experiment. After performing a mesh reconstruction, clean-up, and a simplification to 500,000 faces, the texturing step was performed at resolutions of 4096 (4k), 8192 (8k), and 16384 (16k). The texturing from point clouds only supports a single texture, so only one texture was used with both modalities. The overall differences were highlighted, and the legibility and sharpness of road markings and text were compared between the different resolutions and modalities.

Results

Overall Fig. 5.20 shows a wall with a detailed texture and a foreground object. For the image-based approach, a resolution of 4k shows an erroneous-looking pattern on the bricks. At higher resolutions, this issue gets resolved. However, other artifacts are introduced on the sidewalk as darker-tinted blobs. The point-cloud-based approach shows no visible errors and becomes noticeably sharper with increasing resolution.

Figure 5.20: Comparison of different texture resolutions for both modalities. (a) 4K resolution; (b) 8K resolution; (c) 16K resolution.

Road Markings Fig. 5.21 highlights road markings with all created textures. For the image-based results at the lowest resolution of 4k, road markings are visibly blurry. At the next higher resolution of 8k, this is already greatly improved; the jump in quality from 4k to 8k is the most noticeable. Further increasing the resolution to 16k improves the edge sharpness even more. With the point-cloud-based reconstruction, an increase in quality can also be observed. However, this improvement is not as significant as with the image-based approach. The image-based reconstruction benefits from a 16k texture in this case: the line boundary is clearer, and more details on the road surface can be observed. The point-cloud-based reconstructions lack detail on the road surface.

Text Fig. 5.22 shows a close-up of a sign containing text. For the image-based approach, this text becomes barely readable at a resolution of 8k, with a significant improvement at a resolution of 16k. In the case of the point-cloud-based approach, the text is not readable, even at the highest resolution.

Figure 5.21: Close-up of a road marking, comparing different texture resolutions for both modalities. (a) 4K resolution; (b) 8K resolution; (c) 16K resolution.

Discussion

Depending on the total surface area of the scene, a texture resolution of 8k or 16k is sufficient. If performance is not a constraint, a resolution of 16k is best. The main difference between the textures of both modalities is their sharpness and consistency.
Textures obtained from photogrammetry can have a potentially much higher resolution. However, due to slight errors in the camera alignment, the texture projection may not align perfectly, which can result in some artifacts. The textures obtained from the point-cloud-based strategy are overall more consistent and artifact-free, at the cost of a lower maximum resolution and sharpness. In the photogrammetry process, multiple 16k textures could be used for an even higher fidelity.

Figure 5.22: Close-up of text, comparing different texture resolutions for both modalities. (a) 4K resolution; (b) 8K resolution; (c) 16K resolution.

5.2.5 Image Subsets

The last experiment we performed investigated whether the images captured by the NavVis VLX are enough for a good reconstruction and whether complementing them with further images can improve the reconstruction quality.

Methodology

We created two reconstructions with different images from the image-based dataset of Cumberlandstraße. This dataset contains 2836 images. Of these, 1949 were captured using the VLX, providing broad coverage of the scene. The remaining 887 images were captured using a handheld camera and focused on capturing details on the ground and around traffic signs and poles. One of the reconstructions used all images, and the other used just the images captured with the VLX. The resulting meshes were compared with a descriptive analysis. The key differences are highlighted in this section.

Figure 5.23: Comparison of an image-based reconstruction using only a subset of all images (a, only the 1949 images captured with the VLX) to a reconstruction using all 2836 images (b)

Results

Fig. 5.23 shows renderings of the reconstructions described above. When using only the images from the VLX, the pole of the traffic sign failed to be reconstructed, and the texture on the sign is very blurry. With the detail shots added, the traffic sign is reconstructed much better: the pole is fully visible, and the sign is mostly reconstructed. However, it still contains a missing segment in the middle. The texture of the sign is also a lot more accurate.

The pole in the second row shows more artifacts that occur when not using all the images. The signs are only partially visible, and the pole contains visible deformations. When using all images, the signs are nearly fully reconstructed, the textures are legible, and the pole is significantly smoother.

The last row shows the road without textures, which illustrates the roughness of the surface. When not using all images, the road is rougher. Using all images makes this surface smoother and reduces the number of bumps.

Discussion

While just using the images captured by the VLX produces usable results, the reconstructions contain noticeable artifacts. Using close-up shots focusing on capturing more details of roads and objects of interest significantly improves the reconstruction results, providing a smoother road surface, better-reconstructed meshes for detailed objects, and better texture quality in these areas.

5.3 Segmentation

In this section, the results of the segmentation evaluation are presented. First, we show what the resulting segmented meshes look like. Then, we present the evaluation of the influence of different parameters on the segmentation performance. This evaluation was done by using the manually labeled ground truth datasets to measure different metrics. The overall performance was measured as the mean per-class intersection over union (mIoU) and the mean F1 score (mF1). Furthermore, for each class, the intersection over union (IoU), precision, recall, and F1 score were calculated. IoU measures the similarity between the set of predicted vertices and the set of ground-truth-labeled vertices for each class; it is calculated by dividing the intersection of the sets by their union. mIoU refers to the mean IoU over all classes.
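A minimal sketch of how these per-vertex metrics can be computed, assuming the ground-truth and predicted labels are NumPy arrays of class indices and that classes absent from the ground truth have already been filtered out (names and structure are illustrative, not the pipeline's actual code):

```python
import numpy as np

def per_class_metrics(gt: np.ndarray, pred: np.ndarray, classes):
    """Per-class IoU, precision, recall and F1 for per-vertex labels."""
    metrics = {}
    for c in classes:
        tp = int(np.sum((pred == c) & (gt == c)))
        fp = int(np.sum((pred == c) & (gt != c)))
        fn = int(np.sum((pred != c) & (gt == c)))
        iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0.0)
        metrics[c] = {"IoU": iou, "precision": precision,
                      "recall": recall, "F1": f1}
    return metrics

def mean_scores(metrics):
    """mIoU and mF1 as reported in the tables of this section."""
    miou = float(np.mean([m["IoU"] for m in metrics.values()]))
    mf1 = float(np.mean([m["F1"] for m in metrics.values()]))
    return miou, mf1
```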
The parameters in question relate to the over-segmentation, the sampling techniques, and the choice of the 2D segmentation model.

Fig. 5.24 shows reconstructions of all three datasets segmented by the pipeline. Overall, the segmentations are mostly accurate. Roads and sidewalks are usually separated, traffic signs are identified as such, vegetation and terrain are recognized, and cars are identified. However, three types of issues can be observed:

1. Fuzzy boundaries: The border between classes sometimes does not follow the actual object boundaries. This often occurs with sidewalks, as there is little variation in the mesh and texture.

2. Mislabeled regions: Larger regions might be mislabeled. This happens especially in regions that are occluded from the virtual views used in the segmentation. It can also happen in regions that are misclassified in multiple views.

3. Noise: Smaller patches of objects may be fragmented into multiple classes. This can happen for classes that are very similar, such as terrain and vegetation, road and sidewalk, or wall and building.

Figure 5.24: Segmentations produced by the pipeline. (a) Cumberlandstraße: signs and poles are detected; fuzzy boundary between the sidewalk and road. (b) Jenullgasse: cars are detected well; the occluded sidewalk is only partially labeled correctly; an incorrectly labeled part of the building. (c) Mex: mislabeled patches of sidewalk on the road.

5.3.1 Over-Segmentation

Methodology

To find suitable parameters for the over-segmentation, segmentations were performed with different values of the parameters influencing it and checked against the ground truth. The parameters in question are listed below; a sketch of the underlying merge criterion follows the list.

k threshold The k threshold is a constant used in the Felzenszwalb algorithm [15]. It influences how large the difference between two components must be for the edge between them to be considered a boundary. Higher values for k tend to produce larger components.

min vertices Our implementation of the algorithm tries to make each segment at least this size if possible. This is done by naively merging small components without looking at the quality of the edge connecting them. Since this does not consider the local properties of the mesh, it might merge components that do not belong together, thus resulting in under-segmentation.

max vertices The maximum component size can be limited. This limits the possibility of under-segmentation, where a component spans a large region containing multiple classes.
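The following sketch shows how a graph-based over-segmentation with these parameters can be applied to the mesh's vertex graph, following Felzenszwalb and Huttenlocher's merge criterion [15] with the k threshold and a soft minimum segment size. The edge weights, data structures, and the exact merge rules of the pipeline are assumptions for illustration:

```python
import numpy as np

class UnionFind:
    def __init__(self, n):
        self.parent = np.arange(n)
        self.size = np.ones(n, dtype=int)
        self.internal = np.zeros(n)  # largest internal edge weight per component

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def union(self, a, b, w):
        a, b = self.find(a), self.find(b)
        if a == b:
            return
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.internal[a] = max(self.internal[a], self.internal[b], w)

def oversegment(edges, weights, n_vertices, k=3.0, min_vertices=1):
    """edges: (E, 2) vertex index pairs of the mesh graph; weights: per-edge
    dissimilarity (e.g. a colour difference). Returns a component id per vertex."""
    uf = UnionFind(n_vertices)
    order = np.argsort(weights)
    for e in order:                      # process edges from cheap to expensive
        a, b = edges[e]
        ra, rb = uf.find(a), uf.find(b)
        if ra == rb:
            continue
        # Felzenszwalb predicate: merge if the edge is no heavier than the
        # internal variation of either component plus k / component size.
        if weights[e] <= min(uf.internal[ra] + k / uf.size[ra],
                             uf.internal[rb] + k / uf.size[rb]):
            uf.union(ra, rb, weights[e])
    for e in order:                      # naive second pass: soft minimum size
        a, b = uf.find(edges[e][0]), uf.find(edges[e][1])
        if a != b and uf.size[a] < min_vertices and uf.size[b] < min_vertices:
            uf.union(a, b, weights[e])
    return np.array([uf.find(v) for v in range(n_vertices)])
```

The second pass mirrors the soft constraint described above: two adjacent components are only merged if both are still below the minimum size, regardless of the weight of the connecting edge.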
Results

Fig. 5.25a shows the influence of the k threshold on mIoU across different minimum vertex counts. Observing the progression of the curve without a minimum segment size constraint (blue line), it is clear that performance improves as the k threshold increases. This is likely because a higher k threshold leads to larger segment sizes, which helps the algorithm reduce noise, form smoother boundaries, and label occluded areas. The experiments with higher minimum vertex counts show unpredictable results for low k thresholds. This can be explained by the fact that with lower values, the segment sizes produced by the Felzenszwalb algorithm tend to be smaller (see Fig. 5.25b). Since the minimum vertex size constraint is then applied through a naive approach without considering edge weights, errors may be introduced; the more vertices are merged, the higher the potential error. Performance stabilizes across different minimum segment sizes at higher k thresholds, with the approach that does not enforce a minimum vertex count producing the most consistent results.

In Fig. 5.25b, it can be seen that the mean segment size is smaller than the minimum vertex size. This can be explained by the way the merging is implemented: the minimum vertex size is only a soft constraint, and two adjacent components are merged only if both are below this threshold. This also explains why the segment size decreases with higher k thresholds when using a minimum segment size of 200. As the segments produced in the initial segmentation become larger, fewer segments get merged in the second step because these segments would become too large.

5.3.2 Virtual View Sampling

This experiment analyzes the influence of different sampling techniques for selecting virtual views. The virtual views are used to perform a 2D semantic segmentation on renderings of the reconstructed mesh, and the predicted labels are then projected back onto the mesh. The methods in question are random sampling, uniform sampling in a grid, and sampling from manually specified points on the street. The goal was to check the hypothesis that using perspectives similar to the ones the 2D segmentation model was trained on improves the segmentation performance. Furthermore, the influence of the chosen number of samples and of combining the techniques was investigated.

Figure 5.25: The effect of different k thresholds and minimum vertex counts. (a) Influence of the k threshold on mIoU; (b) influence of the k threshold on the segment size.

Methodology

For all labeled datasets, points on the street were selected to provide street-perspective sampling positions. Each technique was analyzed separately with varying sample counts to investigate how more sampled views influence the segmentation performance. Street-perspective sampling used subsets of the selected points and duplicates at different heights. For each point, eight views were generated by rotating the camera around the up-axis. The uniform sampling count was adjusted by varying the grid size, which defines the distance between sampled points. The number of samples is inversely proportional to the grid size and depends on the dimensions of the scene. The random sample count was controlled directly. The resulting segmentations were then checked against the ground truth.

For the combined experiments, 40 different street sample points with vertical offsets of 1.5 m, 3 m, and 4.5 m were used, resulting in 960 views. Additionally, uniform sampling was configured with a grid size of 4 m, and 1000 random samples were used. Then, variations of this configuration were performed by disabling subsets of the sampling techniques.
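A minimal sketch of how the virtual camera positions for these three strategies can be generated (positions and yaw angles only; the rendering, orientation handling, and the pipeline's actual parameters are omitted, and the default heights and counts below are illustrative assumptions):

```python
import numpy as np

def street_views(street_points, heights=(1.5, 3.0, 4.5), yaw_steps=8):
    """Manually picked street points, duplicated at several heights and
    rotated in `yaw_steps` directions around the up-axis."""
    poses = []
    for x, y, z in street_points:
        for h in heights:
            for i in range(yaw_steps):
                yaw = 2.0 * np.pi * i / yaw_steps
                poses.append(((x, y, z + h), yaw))
    return poses

def uniform_views(bounds_min, bounds_max, grid_size=4.0, height=2.0, yaw_steps=8):
    """Grid of sample positions spanning the scene's bounding box."""
    xs = np.arange(bounds_min[0], bounds_max[0], grid_size)
    ys = np.arange(bounds_min[1], bounds_max[1], grid_size)
    return [((x, y, bounds_min[2] + height), 2.0 * np.pi * i / yaw_steps)
            for x in xs for y in ys for i in range(yaw_steps)]

def random_views(bounds_min, bounds_max, count=1000, rng=np.random.default_rng(0)):
    """Random positions and yaw angles inside the scene bounds."""
    pos = rng.uniform(bounds_min, bounds_max, size=(count, 3))
    yaw = rng.uniform(0.0, 2.0 * np.pi, size=count)
    return list(zip(map(tuple, pos), yaw))
```

After rendering, views in which most pixels show no geometry are discarded, as described in the results below.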
Results

Fig. 5.26 shows the mIoU and mF1 scores against the chosen sampling counts. The experiments were performed with and without the prior over-segmentation. In all cases, more samples produced better results, with diminishing returns when approaching the respectively chosen maximum sample sizes.

Random Sampling Randomly sampling the scene produced the worst results, approaching an mIoU of 44.9 and an mF1 score of 55.9. After filtering out views in which more than 60% of the image does not show any geometry, only 75% of the views were used in the segmentation.

Figure 5.26: Comparison of different sampling methods, showing metrics with over-segmentation (solid line) and without (dotted line). The x-axes show the sum of views used (after filtering bad views) for benchmarking all three datasets together. (a) Random sampling; (b) manual sampling; (c) uniform sampling.

Manual Sampling As expected, manual sampling from the street performs better, approaching an mIoU of 55.2 and an mF1 score of 67.1. Forty street points at three heights resulted in 960 views, of which 95% were usable.

Uniform Sampling Uniformly sampling the scene produces results comparable to those of street sampling. This is likely because the uniform sampling also covers the views selected on the street. The number of generated samples depends on the scene size. For our datasets, a uniform grid size of 4 m resulted in 3,000 to 8,000 views being considered. Only 40% of the views were used, which is not very resource-efficient.

Combined Table 5.2 shows the segmentation performance with different combinations of sampling techniques. The combined approach uses all views from the enabled sampling strategies. Using all sampling methods achieves a baseline mIoU of 53.3 and an mF1 score of 63.7. The best-performing configuration uses just street sampling, increasing the mIoU by 1.9 and the mF1 score by 3.5. A detailed evaluation per class is shown in Table 5.3. The IoU upper bound measures the maximal achievable IoU given the produced over-segmentation. This is calculated by finding an optimal assignment of components to classes and calculating the IoU of the resulting segmentation. The mean IoU upper bound is 89.3, which also indicates the maximum possible mIoU given the calculated over-segmentation.

Configuration                 mIoU   ∆ mIoU   F1     ∆ F1
With all (baseline)           53.3    0.0     63.7    0.0
Without random sampling       53.3   -0.0     64.0    0.3
Without street sampling       52.0   -1.3     62.4   -1.3
Without uniform sampling      52.1   -1.2     63.3   -0.4
Without over-segmentation     52.4   -0.9     62.8   -0.9
Only random sampling          44.9   -8.4     55.9   -7.7
Only street sampling          55.2   +1.9     67.1   +3.5
Only uniform sampling         51.0   -2.3     61.9   -1.8

Table 5.2: Benchmarks of different segmentation configurations. Mean intersection over union (mIoU, %) and F1 score (%) and their change against the baseline configuration are given.

Class          IoU    F1     precision   recall   IoU upper bound
road           41.8   59.0   45.1        85.2     92.0
sidewalk       60.9   75.7   73.6        77.8     84.6
building       87.5   93.3   92.9        93.8     98.7
wall           20.9   34.5   54.4        25.3     70.8
fence          14.5   25.4   42.5        18.1     87.5
pole           15.4   26.7   79.0        16.0     94.9
traffic sign   62.3   76.8   74.6        79.2     72.9
vegetation     90.6   95.0   97.8        92.4     98.4
terrain        68.0   81.0   80.1        81.8     93.2
car            73.8   84.9   92.2        78.7     96.7
motorcycle     50.3   66.9   51.1        96.9     91.5
bicycle        76.3   86.6   85.8        87.3     90.3

Table 5.3: Per-class metrics of a benchmark with the best-performing model. The metrics are intersection over union (IoU, %), F1 (%), precision (%), and recall (%).
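One straightforward way to compute such an upper bound is to assign every over-segmentation component the ground-truth label that is most frequent among its vertices and evaluate the resulting labeling with the IoU code sketched earlier. This is a simple approximation; the thesis' exact assignment procedure may differ, for example by optimizing the IoU directly:

```python
import numpy as np

def upper_bound_labels(gt: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Assign each over-segmentation component its majority ground-truth class.
    Evaluating these labels yields a (near-)upper bound on what any labeling
    of whole components can achieve, cf. the last column of Table 5.3."""
    best = np.empty_like(gt)
    for comp_id in np.unique(components):
        mask = components == comp_id
        labels, counts = np.unique(gt[mask], return_counts=True)
        best[mask] = labels[np.argmax(counts)]
    return best
```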
Discussion

Uniform sampling and street sampling showed comparable performance. Since street sampling requires manually selecting the sample points, uniform sampling can be used instead for further pipeline automation. In all cases, using a prior over-segmentation increased the performance considerably.

Removing random and uniform views and using just manually selected street-perspective views performs best. This indicates that it is more important to have good viewpoints than more viewpoints. Randomly and uniformly sampled viewpoints can be in unfavorable positions because they are either located within geometry, contain many occlusions, or show perspectives not seen in the 2D domain the models were trained on. This results in inaccurate 2D segmentations propagating to the 3D segmentation. The difference between good and bad sampling positions can be seen in Fig. 5.27.

Figure 5.27: Examples of 2D semantic segmentations. Rendered input images are overlaid with a semantic segmentation map generated with mask2former [10]. (a) Street sampling with accurate 2D segmentations. (b) Random sampling, containing many misclassified areas; in particular, sidewalk (pink) is often misclassified as road (purple).

5.3.3 2D Segmentation Models

There are many 2D semantic segmentation models trained on the CityScapes [12] dataset available [10, 59, 57, 64, 56, 54, 32]. The 3D semantic segmentation performance was evaluated with different state-of-the-art 2D semantic segmentation models to find the best model to be used in the pipeline. The best-performing configuration, as established in the previous sections, was used for this experiment. All of the models are trained on the CityScapes dataset. They are provided by mmSegmentation². The largest possible models that fit the available 8 GB of VRAM were chosen.

²https://github.com/open-mmlab/mmsegmentation

Table 5.4 shows the results of this experiment. The choice of the 2D segmentation model is crucial for the 3D segmentation performance. Out of the tested 2D segmentation models, mask2former is the model that performs best on the CityScapes [12] dataset, achieving an mIoU of 81.71. Likewise, mask2former shows the best performance in our 3D segmentation benchmark.

Model              mIoU   F1     CityScapes mIoU
mask2former [10]   55.2   67.1   81.7
segformer [56]     51.6   63.1   78.6
hrnet [54]         50.6   62.7   78.5
ddrnet [59]        45.7   57.8   80.0
pidnet [57]        41.0   52.2   80.9
pspnet [19]        38.9   49.1   79.5
isanet [32]        34.4   45.1   79.3

Table 5.4: Benchmark of the 3D segmentation performance using different 2D segmentation models. Also shows the respective mIoU on the CityScapes dataset.

5.4 Automation

5.4.1 Running Time

In this section, we investigate the degree of automation of the pipeline, including how long it and its individual steps run, when manual steps are necessary or beneficial, and how long they take.

Figure 5.28: Timeline of the pipeline, showing the key steps performed for both strategies, (a) image-based and (b) point-cloud-based. Manual actions are highlighted in orange. The longest-running steps are highlighted in blue.

Fig. 5.28 shows a timeline of the steps performed by the pipeline for both branches, image-based and point-cloud-based. The time measurements were taken from the reconstruction and segmentation of the three datasets described in Section 5.1.3. The benchmark was performed on a Windows 11 PC with an AMD Ryzen 7 3700X 8-Core Processor, an NVIDIA RTX 2080, and 32 GB of RAM.
The indicated times are only rough estimates. The actual running time can vary depending on a number of factors, such as the number of input images, the point cloud size, the number of views sampled for the 3D segmentation, and whether higher-resolution textures and meshes are reconstructed. Still, this evaluation gives a good overview of the level of automation and the running time of the whole pipeline and of each step.

Most of the steps are fully automated; manual steps are highlighted in orange (Fig. 5.28). In both cases, the longest part of the pipeline is the mesh reconstruction, which can take several hours. The remaining parts run relatively quickly, completing in under 15 minutes. The point-cloud-based strategy has the advantage that the bulk of the processing can run without interruption. This is because the selection of the street points can be performed on the input point cloud itself, as it has the same coordinate system as the reconstructed mesh. In the image-based strategy, the user has to wait until the mesh is reconstructed to do so. Another drawback of the image-based strategy is that a manual action must be performed in the middle of the mesh reconstruction: the user has to define the reconstruction region. For this strategy, the user therefore has to wait twice, causing the pipeline to stall.

5.4.2 Time Savings

Some of the pipeline steps could be performed manually. Photogrammetry can be performed in the GUI of RealityCapture. Point clouds can be reconstructed and cleaned within MeshLab, and textures can then be baked in Blender. However, this involves many tedious and repetitive operations: preparing and importing input data, exporting to different formats, configuring the parameters and operations in all these applications, and waiting for long-running tasks to complete, only to start the next one. This overhead would make the manual creation of a digital twin take even longer. The semantic segmentation could also be performed manually. However, this would be a time-consuming task: our ground truth annotation process took several hours per scene, even though it only involved fixing an existing annotation done by the pipeline.

The pipeline takes care of all these steps. Only where strictly necessary does the user have to perform a manual action. These actions are clearly defined, and the pipeline gives instructions on what to do and automatically watches the filesystem to detect when the action has been performed so that it can continue immediately. This shows how much manual work the pipeline automates and allows for a much more efficient creation of a digital twin.
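A minimal sketch of this wait-for-manual-action mechanism, reduced here to polling for a marker file (the pipeline's actual watcher, file names, and instruction texts are not shown and are assumptions for illustration):

```python
import time
from pathlib import Path

def wait_for_manual_step(expected_file: Path, instructions: str,
                         poll_seconds: float = 2.0) -> None:
    """Print the instructions for the manual action and block until the file
    the user is asked to produce (e.g. an exported mesh) appears on disk."""
    print(instructions)
    while not expected_file.exists():
        time.sleep(poll_seconds)
    print(f"Detected {expected_file}, continuing the pipeline.")

# e.g. wait_for_manual_step(Path("work/corrected_segmentation.ply"),
#                           "Correct the segmentation in Blender and export it.")
```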
CHAPTER 6

Conclusion

This thesis has demonstrated the feasibility of creating high-quality digital twins from both images and point clouds in a semi-automated manner. Photogrammetry, using RealityCapture, proved to be an effective method for reconstructing 3D meshes from image data. By capturing between 500 and 2,000 strategically taken images, a high-quality 3D model can be generated within a few hours. We also provided detailed guidelines on camera hardware, configuration, and proper capturing techniques to ensure the input data is suitable for reconstruction.

Point clouds were also utilized as input data. For this modality, Poisson surface reconstruction is used to convert them into 3D meshes. When using high-quality point clouds, this process is both efficient and accurate. The process requires the input data to contain RGB colors and normal information for each point. We captured this data using an existing mobile mapping unit that produces dense and precise point clouds.

A comparison between image-based and point-cloud-based reconstructions revealed that both methods yield high-quality meshes and textures. Image-based inputs allow for higher-resolution textures, while the texture resolution from point clouds is limited by the density of the point cloud data. Reconstructions from point clouds showed a higher mesh accuracy with fewer geometric artifacts. Despite these differences, both methods produce reliable results with few artifacts. The reconstructions were free of significant issues that could impede physics simulations. Roads were smooth and without holes, allowing simulated vehicles to navigate them without issue. However, some artifacts were observed, particularly in the reconstruction of cars and traffic signs, which occasionally exhibited holes or missing sections. This is caused by their reflective and low-texture surfaces, which are challenging for both the photogrammetry process and the capturing and coloring of point clouds.

The developed pipeline is mostly automated, running in a matter of hours with only a few well-defined manual interventions required, which take less than 20 minutes. This makes it a highly cost-effective and efficient method for creating digital twins of intersections. In addition to its efficiency, the pipeline is highly flexible. Intermediate files can be reviewed and corrected as needed, ensuring that any unexpected errors can be investigated and manually resolved. The pipeline's modularity also allows for easy modification and extension, opening up the project for future research and improvements.

This thesis also demonstrated that 3D semantic segmentation can be performed without the need for extensive ground truth data by leveraging pre-trained 2D segmentation models. This is done by segmenting 2D views of the 3D scene and back-projecting the predicted class labels onto the vertices of the mesh. Additionally, we created a ground truth semantically annotated dataset. It contains reconstructed meshes of three intersections with per-vertex semantic labels. This dataset is valuable for evaluating semantic segmentation algorithms for 3D meshes in urban settings and was used to evaluate the performance of the pipeline's semantic segmentation algorithm.

While our approach does not match the performance of state-of-the-art 3D semantic segmentation models trained on comparable domains, it still delivers highly usable results. This achievement is particularly significant given that we lacked sufficient ground truth data to train a model from scratch and instead had to rely on pre-trained 2D models for this task. Moreover, although the segmentation results are not perfect, the approach offers significant time savings for semantically segmenting a mesh for use in autonomous driving simulations compared to doing so manually, which was a key objective of this thesis.

6.1 Limitations

While the pipeline developed in this thesis is promising, there are several limitations that need to be addressed in future work:

Scene Size The current approach struggles with large scenes, which require longer processing times and have higher hardware requirements. This is especially true when point clouds are used as the input. A possible solution is to split the scene into smaller sections for processing and merge them afterward.
Such an approach would allow the creation of digital twins covering considerably larger scenes, spanning multiple intersections and roads, or even large parts of cities.

Road Accuracy The alignment of road data with the 3D mesh is currently only an approximation. To improve this, a smarter alignment algorithm could be developed that more accurately aligns the road network with the road mesh by looking at the geometry. Alternatively, a method for creating the road network directly from the road mesh could be explored.

Fixing Artifacts Certain scene elements, such as signs and cars, often contain significant artifacts. These issues could be mitigated by detecting these elements and replacing them with pre-made, high-quality meshes. This is especially important for traffic signs, as they play a crucial role in autonomous driving simulations.

6.2 Future Work

Building on the foundation laid in this thesis, several avenues for future research and development are apparent. Enhancing the pipeline's ability to handle larger and more complex scenes, improving the accuracy of road network creation, improving the 3D semantic segmentation performance, and refining artifact handling are key areas for further exploration.

Another possible direction for future work involves optimizing the data capture process. Developing dedicated capturing devices that streamline the acquisition of images, LiDAR data, or both could significantly enhance the input data quality and capturing efficiency. Currently, the pipeline handles these modalities as separate, independent branches, but fusing both data types presents an intriguing research opportunity. Such a fusion could leverage the strengths and mitigate the weaknesses of each modality.

The generation of a broader variety of scenes can be beneficial for downstream tasks. This could involve developing tools to easily modify existing reconstructions or even create automatic variations. Additionally, exploring methods for procedurally generating scenes that adhere to specific constraints could allow for testing specific scenarios with many variations.

In conclusion, this thesis has made significant strides toward automating the creation of digital twins for autonomous driving simulations. While challenges remain, the work presented here lays a solid foundation for future advancements in this field.

Overview of Generative AI Tools Used

ChatGPT

GPT-4o mini and GPT-4o were used only to aid the writing process. The tool was used to help me improve the writing of a handful of paragraphs. The output of the tool was not used verbatim. Instead, I used it as a feedback and recommendation system: I looked at the altered output and hand-picked changes to apply to my text. The tool was not used to generate text from nothing. I always included text I had written in my own words in the prompt, usually accompanied by instructions similar to: "Help me improve my writing, keep it similar to my own style."

List of Figures

3.1 Overview of the pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Artifacts on a reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Projection bleeding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.4 Examples of over-segmentation and under-segmentation . . . . . . . . . . 24 3.5 A reconstructed mesh with the road graph overlaid in CARLA . . . . . 25 3.6 Points picked on the street in CloudCompare . . . . . . . . . . . . . . . . 27 3.7 The reconstruction region shown in RealityCapture . . .
. . . . . . . . . . 27 3.8 A reconstructed mesh with an extended drivable area created in Blender . 28 3.9 Manual correction addon in Blender . . . . . . . . . . . . . . . . . . . . . 29 4.1 The folder structures created after running the pipeline . . . . . . . . . . 34 4.2 Example pipeline usage in Python . . . . . . . . . . . . . . . . . . . . . . 35 4.3 Example step definitions in Python . . . . . . . . . . . . . . . . . . . . . . 35 5.1 Example input images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.2 A comparison of a captured image before and after processing. . . . . . . 47 5.3 The NavVis VLX mobile mapping device . . . . . . . . . . . . . . . . . . 48 5.4 A point cloud captured with the NavVis VLX . . . . . . . . . . . . . . . 49 5.5 Cumberlandstraße: registrations and path . . . . . . . . . . . . . . . . . . 50 5.6 Jenullgasse: registrations and path . . . . . . . . . . . . . . . . . . . . . . 50 5.7 Class distribution of the ground truth annotations for all three datasets combined . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.8 Ground truth annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.9 Results of a reconstruction using an image based dataset. . . . . . . . . . 54 5.10 A scene reconstructed from images, showcasing floating artifacts. . . . . . 54 5.11 Jenullgasse reconstructed from a point cloud . . . . . . . . . . . . . . . . 55 5.12 Cumberlandstraße reconstructed . . . . . . . . . . . . . . . . . . . . . . . 56 5.13 Jenullgasse reconstructed . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.14 Direct comparison of mesh and texture reconstructions from both modalities 58 5.15 Quantitative results of the experiments with varying Poisson surface recon- struction depths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.16 Reconstructions from a point cloud at different Poisson depths . . . . . . 61 5.17 Hausdorff Distnace for different face counts . . . . . . . . . . . . . . . . . 63 5.18 A pole with traffic signs simplified to different face counts for both modalities 64 85 5.19 A reconstruction simplified to different face counts for both modalities . . 65 5.20 Comparison of different texture resolutions for both modalities . . . . . . 66 5.21 Close-up of a road marking, comparing different texture resolutions for both modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.22 Close-up of text, comparing different texture resolutions for both modalities 68 5.23 Comparison of an image-based reconstruction using different images . . . 69 5.24 Segmentations produced by the pipeline . . . . . . . . . . . . . . . . . . . 71 5.25 The effect of different k thresholds and minimum vertex counts . . . . . . 73 5.26 Comparison of different sampling methods . . . . . . . . . . . . . . . . . . 74 5.27 Examples of 2D semantic segmentations . . . . . . . . . . . . . . . . . . . 76 5.28 Timeline of the pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 86 List of Tables 3.1 Reconstruction resolution at different reconstruction depths for Poisson surface reconstruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5.1 Statistics of our captured datasets . . . . . . . . . . . . . . . . . . . . . . 49 5.2 Benchmarks of different segmentation configurations . . . . . . . . . . . . 75 5.3 Per-class metrics of a benchmark with the best performing model . . . . . 
75 5.4 Benchmark of the 3D segmentation performance using different 2D segmenta- tion models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 87 Bibliography [1] Ahmed Abdelreheem, Ivan Skorokhodov, Maks Ovsjanikov, and Peter Wonka. SATR: Zero-Shot Semantic Segmentation of 3D Shapes. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15120–15133, Paris, France, October 2023. IEEE. [2] Jibril Muhammad Adam, Weiquan Liu, Yu Zang, Muhammad Kamran Afzal, Sai- fullahi Aminu Bello, Abdullahi Uwaisu Muhammad, Cheng Wang, and Jonathan Li. Deep learning-based semantic segmentation of urban-scale 3D meshes in re- mote sensing: A survey. International Journal of Applied Earth Observation and Geoinformation, 121:103365, July 2023. [3] ASAM e.V. Opendrive format specification. https://www.asam.net/ standards/detail/opendrive/, January 2020. [Online; Accessed: 2024-07-14]. [4] N. Aspert, D. Santa-Cruz, and T. Ebrahimi. Mesh: measuring errors between surfaces using the hausdorff distance. In Proceedings. IEEE International Conference on Multimedia and Expo, volume 1, pages 705–708 vol.1, 2002. [5] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), pages 9297–9307, 2019. [6] F. Bernardini, J. Mittleman, H. Rushmeier, C. Silva, and G. Taubin. The ball- pivoting algorithm for surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, 5(4):349–359, October 1999. [7] Alexandre Boulch, Joris Guerry, Bertrand Le Saux, and Nicolas Audebert. SnapNet: 3D point cloud semantic labeling with 2D deep segmentation networks. Computers & Graphics, 71:189–198, April 2018. [8] M. Brown and D.G. Lowe. Unsupervised 3d object recognition and reconstruction in unordered datasets. In Fifth International Conference on 3-D Digital Imaging and Modeling (3DIM’05), pages 56–63, 2005. 89 [9] R. Qi Charles, Hao Su, Mo Kaichun, and Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85, Honolulu, HI, July 2017. IEEE. [10] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention Mask Transformer for Universal Image Segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1280–1289, New Orleans, LA, USA, June 2022. IEEE. [11] Pranav Singh Chib and Pravendra Singh. Recent advancements in end-to-end autonomous driving using deep learning: A survey. IEEE Transactions on Intelligent Vehicles, 9(1):103–118, 2024. [12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus En- zweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In 2016 IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, Las Vegas, NV, USA, June 2016. IEEE. [13] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. [14] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. 
In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg, editors, Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pages 1–16. PMLR, 13–15 Nov 2017. [15] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient Graph-Based Image Segmentation. International Journal of Computer Vision, 59(2):167–181, September 2004. [16] Whye Kit Fong, Rohit Mohan, Juana Valeria Hurtado, Lubing Zhou, Holger Caesar, Oscar Beijbom, and Abhinav Valada. Panoptic nuscenes: A large-scale benchmark for lidar panoptic segmentation and tracking. IEEE Robotics and Automation Letters, 7(2):3795–3802, 2022. [17] Lin Gao, Yu Liu, Xi Chen, Yuxiang Liu, Shen Yan, and Maojun Zhang. Cus3d: A new comprehensive urban-scale semantic-segmentation 3d benchmark dataset. Remote Sensing, 16(6):1079, 2024. [18] Weixiao Gao, Liangliang Nan, Bas Boom, and Hugo Ledoux. SUM: A benchmark dataset of Semantic Urban Meshes. ISPRS Journal of Photogrammetry and Remote Sensing, 179:108–120, September 2021. 90 [19] Weixiao Gao, Liangliang Nan, Bas Boom, and Hugo Ledoux. PSSNet: Planarity- sensible Semantic Segmentation of large-scale urban meshes. ISPRS Journal of Photogrammetry and Remote Sensing, 196:32–44, February 2023. [20] Michael Garland and Paul S Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 209–216, 1997. [21] Kyle Genova, Xiaoqi Yin, Abhijit Kundu, Caroline Pantofaru, Forrester Cole, Avneesh Sud, Brian Brewington, Brian Shucker, and Thomas Funkhouser. Learning 3d semantic segmentation with only 2d image supervision. In 2021 International Conference on 3D Vision (3DV), pages 361–372, 2021. [22] Carlos Gómez-Huélamo, Javier Del Egido, Luis M. Bergasa, Rafael Barea, Elena López-Guillén, Felipe Arango, Javier Araluce, and Joaquín López. Train here, drive there: Simulating real-world use cases with fully-autonomous driving architecture in carla simulator. In Luis M. Bergasa, Manuel Ocaña, Rafael Barea, Elena López- Guillén, and Pedro Revenga, editors, Advances in Physical Agents II, pages 44–59, Cham, 2021. Springer International Publishing. [23] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A Survey of Deep Learning Techniques for Autonomous Driving. Journal of Field Robotics, 37(3):362–386, April 2020. [24] Carsten Griwodz, Simone Gasparini, Lilian Calvet, Pierre Gurdjos, Fabien Castan, Benoit Maujean, Gregoire De Lillo, and Yann Lanthony. AliceVision Meshroom: An open-source 3D reconstruction pipeline. In Proceedings of the 12th ACM Multimedia Systems Conference, pages 241–247, Istanbul Turkey, June 2021. ACM. [25] Carsten Griwodz, Simone Gasparini, Lilian Calvet, Pierre Gurdjos, Fabien Castan, Benoit Maujean, Gregoire De Lillo, and Yann Lanthony. Alicevision Meshroom: An open-source 3D reconstruction pipeline. In Proc. 12th ACM Multimed. Syst. Conf. - MMSys ’21. ACM Press, 2021. [26] Grégoire Grzeczkowicz and Bruno Vallet. Semantic Segmentation of Urban Textured Meshes Through Point Sampling. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, V-2-2022:177–184, May 2022. [27] Abhishek Gupta, Alagan Anpalagan, Ling Guan, and Ahmed Shaharyar Khwaja. Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues. Array, 10:100057, 2021. [28] Rodrigo Gutiérrez-Moreno, Rafael Barea, Elena López-Guillén, Javier Araluce, and Luis M. Bergasa. 
Reinforcement learning-based autonomous driving at intersections in CARLA simulator. Sensors, 22(21), 2022.
[29] Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel Cohen-Or. MeshCNN: A Network with an Edge. ACM Transactions on Graphics, 38(4):1–12, August 2019.
[30] Qingyong Hu, Bo Yang, Sheikh Khalid, Wen Xiao, Niki Trigoni, and Andrew Markham. Towards semantic segmentation of urban-scale 3D point clouds: A dataset, benchmarks and challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4977–4987, June 2021.
[31] Xuemin Hu, Shen Li, Tingyu Huang, Bo Tang, Rouxing Huai, and Long Chen. How simulation helps autonomous driving: A survey of sim2real, digital twins, and parallel intelligence. IEEE Transactions on Intelligent Vehicles, 9(1):593–612, 2024.
[32] L. Huang, Y. Yuan, J. Guo, C. Zhang, X. Chen, and J. Wang. Interlaced sparse self-attention for semantic segmentation. arXiv preprint arXiv:1907.12273, 2019.
[33] Daniel Huber. The ASTM E57 file format for 3D imaging data exchange. In Proceedings of SPIE Electronics Imaging Science and Technology Conference (IS&T), 3D Imaging Metrology, volume 7864, January 2011.
[34] Michael R. James and Stuart Robson. Straightforward reconstruction of 3D surfaces and topography with a camera: Accuracy and geoscience application. Journal of Geophysical Research: Earth Surface, 117(F3), 2012.
[35] Maximilian Jaritz, Jiayuan Gu, and Hao Su. Multi-View PointNet for 3D Scene Understanding. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3995–4003, Seoul, Korea (South), October 2019. IEEE.
[36] Andrej Karpathy, Stephen Miller, and Li Fei-Fei. Object discovery in 3D scenes via shape analysis. In 2013 IEEE International Conference on Robotics and Automation, pages 2088–2095, Karlsruhe, Germany, May 2013. IEEE.
[37] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, volume 7, 2006.
[38] Michael Kazhdan and Hugues Hoppe. Screened Poisson surface reconstruction. ACM Transactions on Graphics, 32(3):1–13, June 2013.
[39] Michael Kölle, Dominik Laupheimer, Stefan Schmohl, Norbert Haala, Franz Rottensteiner, Jan Dirk Wegner, and Hugo Ledoux. The Hessigheim 3D (H3D) Benchmark on Semantic Segmentation of High-Resolution 3D Point Clouds and Textured Meshes from UAV LiDAR and Multi-View-Stereo. ISPRS Open Journal of Photogrammetry and Remote Sensing, 1:100001, October 2021.
[40] Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, and Caroline Pantofaru. Virtual multi-view fusion for 3D semantic segmentation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 518–535, Cham, 2020. Springer International Publishing.
[41] Felix Järemo Lawin, Martin Danelljan, Patrik Tosteberg, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Deep projective 3D semantic segmentation. In Michael Felsberg, Anders Heyden, and Norbert Krüger, editors, Computer Analysis of Images and Patterns, pages 95–107, Cham, 2017. Springer International Publishing.
[42] Liqiang Lin, Yilin Liu, Yue Hu, Xingguang Yan, Ke Xie, and Hui Huang. Capturing, Reconstructing, and Simulating: The UrbanScene3D Dataset. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, volume 13668, pages 93–109. Springer Nature Switzerland, Cham, 2022.
[43] Krista Merry and Pete Bettinger. Smartphone GPS accuracy study in an urban environment. PLOS ONE, 14(7):1–19, July 2019.
[44] Adam R. Mosbrucker, Jon J. Major, Kurt R. Spicer, and John Pitlick. Camera system considerations for geomorphic applications of SfM photogrammetry. Earth Surface Processes and Landforms, 42(6):969–986, 2017.
[45] D.R. Niranjan, B C VinayKarthik, and Mohana. Deep learning based object detection model for autonomous driving research using CARLA simulator. In 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), pages 1251–1258, 2021.
[46] Blazej Osinski, Piotr Milos, Adam Jakubowski, Pawel Ziecina, Michal Martyniak, Christopher Galias, Antonia Breuer, Silviu Homoceanu, and Henryk Michalewski. CARLA real traffic scenarios - novel training ground and benchmark for autonomous driving. CoRR, abs/2012.11329, 2020.
[47] Błażej Osiński, Adam Jakubowski, Paweł Zięcina, Piotr Miłoś, Christopher Galias, Silviu Homoceanu, and Henryk Michalewski. Simulation-based reinforcement learning for real-world autonomous driving. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 6411–6418, 2020.
[48] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[49] Capturing Reality. How to take photographs. https://rchelp.capturingreality.com/en-US/tutorials/takingpictures.htm. [Accessed: 2024-10-03].
[50] Mohammad Rouhani, Florent Lafarge, and Pierre Alliez. Semantic segmentation of 3D textured meshes for urban scene analysis. ISPRS Journal of Photogrammetry and Remote Sensing, 123:124–139, January 2017.
[51] Rajvi Shah, Aditya Deshpande, and P.J. Narayanan. Multistage SfM: Revisiting incremental structure from motion. In 2014 2nd International Conference on 3D Vision, volume 1, pages 417–424, 2014.
[52] Magistrat Wien Magistratsabteilung 41 Stadtvermessung. Kappazunder dataset, 2020. Data retrieved from Geodatenviewer der Stadtvermessung Wien, https://www.wien.gv.at/geodatenviewer/portal/wien/.
[53] Abubakar Sulaiman Gezawa, Qicong Wang, Haruna Chiroma, and Yunqi Lei. A Deep Learning Approach to Mesh Segmentation. Computer Modeling in Engineering & Sciences, 135(2):1745–1763, 2023.
[54] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[55] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois Goulette, and Leonidas J. Guibas. KPConv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[56] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 12077–12090. Curran Associates, Inc., 2021.
[57] Jiacong Xu, Zixiang Xiong, and Shankar P. Bhattacharyya. PIDNet: A Real-time Semantic Segmentation Network Inspired by PID Controllers. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19529–19539, Vancouver, BC, Canada, June 2023. IEEE.
[58] Yetao Yang, Rongkui Tang, Mengjiao Xia, and Chen Zhang. A surface graph based deep learning framework for large-scale urban mesh semantic segmentation. International Journal of Applied Earth Observation and Geoinformation, 119:103322, May 2023.
[59] Puyuan Yi, Shengkun Tang, and Jian Yao. DDR-Net: Learning multi-stage multi-view stereo with dynamic depth range. arXiv preprint arXiv:2103.14275, 2021.
[60] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. A Survey of Autonomous Driving: Common Practices and Emerging Technologies. IEEE Access, 8:58443–58469, 2020.
[61] Guangyun Zhang and Rongting Zhang. MeshNet-SP: A Semantic Urban 3D Mesh Segmentation Network with Sparse Prior. Remote Sensing, 15(22):5324, November 2023.
[62] Rongting Zhang, Guangyun Zhang, Jihao Yin, Xiuping Jia, and Ajmal Mian. Mesh-Based DGCNN: Semantic Segmentation of Textured 3-D Urban Scenes. IEEE Transactions on Geoscience and Remote Sensing, 61:1–12, 2023.
[63] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H.S. Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16259–16268, October 2021.
[64] Jingchun Zhou, Mingliang Hao, Dehuan Zhang, Peiyu Zou, and Weishi Zhang. Fusion PSPnet Image Segmentation Based Method for Multi-Focus Image Fusion. IEEE Photonics Journal, 11(6):1–12, December 2019.