Title: 6D pose estimation of objects using limited training data
Language: English
Authors: Park, Kiru 
Qualification level: Doctoral
Advisor: Vincze, Markus 
Assisting Advisor: Patten, Timothy Michael 
Issue Date: 2020
Park, K. (2020). 6D pose estimation of objects using limited training data [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2020.85042
Number of Pages: 109
Pose estimation of objects is an important task for understanding the surrounding environment when interacting with objects in robot manipulation and augmented reality applications. Major computer vision tasks, such as object detection and classification, have improved significantly with Convolutional Neural Networks (CNNs). Likewise, recent pose estimation methods using CNNs have achieved high performance with large amounts of training data, which are, however, difficult to obtain from real environments. This thesis presents multiple methods that overcome the limited availability of training data in practical scenarios while solving common challenges in object pose estimation.

Symmetry and occlusion of objects are the most common challenges that make estimates inaccurate. This thesis introduces a method that regresses pixel-wise coordinates of an object while resolving ambiguous views of symmetric poses with a novel loss function in the training process. Coordinates of occluded regions are also predicted regardless of visibility, which makes the method robust to occlusion. The method achieves state-of-the-art performance in evaluations while using only a limited number of real images.

Nevertheless, annotating object poses in images is a difficult and time-consuming task, which prevents pose estimation methods from learning a new object from cluttered real scenes. This thesis introduces an approach that leverages a few cluttered images of an object to learn its appearance in arbitrary poses. A novel refinement step updates the pose annotations of the input images to reduce pose errors, which are common when poses are self-annotated by camera tracking or manually annotated by humans. Evaluations show that the images generated by this method lead to state-of-the-art performance compared to methods that use 13 times as many real training images.

Domains such as retail shops encounter new objects frequently, so it is inefficient to train pose estimators for every new object.
Furthermore, it is difficult to build precise 3D models of all object instances in real-world environments. A template-based method in this thesis tackles these practical challenges by estimating the pose of a new object using previous observations of the same or similar objects. The nearest observations are used to determine the object's location, segmentation mask, and pose. The method is further extended to predict dense correspondences between the nearest observation and a target object, which allows grasp poses to be transferred from similar experiences. Evaluations on public datasets show that the template-based method outperforms baseline methods on segmentation and pose estimation tasks. Grasp experiments with a robot demonstrate the benefit of leveraging successful grasp experiences, which significantly improves grasp performance for familiar objects.
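The symmetry-handling idea from the abstract can be illustrated with a minimal numpy sketch. All names here (`symmetry_aware_loss`, `rot_z`, the 36-step candidate set) are hypothetical, and the thesis applies this idea per pixel inside CNN training; the sketch only shows the core principle: among all ground-truth poses that are equivalent under the object's symmetry, the loss penalises the prediction only against the closest one, so visually identical symmetric views are not punished.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z axis (toy symmetry axis)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def symmetry_aware_loss(pred_coords, gt_coords, sym_rotations):
    """Hypothetical sketch: evaluate the coordinate error against every
    symmetric version of the ground truth and keep only the minimum."""
    best = None
    for R in sym_rotations:  # candidate symmetry rotations (3x3)
        rotated = gt_coords @ R.T          # rotate ground-truth coordinates
        err = np.abs(pred_coords - rotated).mean()
        best = err if best is None else min(best, err)
    return best

np.random.seed(0)
# toy object: rotationally symmetric about z, discretised in 10-degree steps
sym = [rot_z(2 * np.pi * k / 36) for k in range(36)]
gt = np.random.rand(64, 3)                    # per-pixel 3D object coordinates
pred = gt @ rot_z(2 * np.pi * 5 / 36).T       # prediction off by a symmetric rotation

print(symmetry_aware_loss(pred, gt, sym))            # ~0: symmetric view not penalised
print(symmetry_aware_loss(pred, gt, [np.eye(3)]))    # plain loss: penalised
```

A plain per-pixel loss (the identity-only case) would punish this prediction even though the rendered view is indistinguishable from the ground truth, which is exactly the ambiguity the thesis's loss resolves.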
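The retrieval step of the template-based method can likewise be sketched as a nearest-neighbour lookup over stored feature embeddings of previous observations. The function name, the embedding dimensionality, and the choice of cosine similarity are illustrative assumptions, not the thesis's exact implementation; the point is that the closest stored observation supplies the mask, pose, or grasp annotation to transfer.

```python
import numpy as np

def nearest_template(query_feat, template_feats):
    """Hypothetical sketch: return the index of the stored observation
    whose embedding is most similar to the query (cosine similarity),
    so its annotation can be transferred to the new object."""
    q = query_feat / np.linalg.norm(query_feat)
    t = template_feats / np.linalg.norm(template_feats, axis=1, keepdims=True)
    sims = t @ q                       # cosine similarity to every template
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])

np.random.seed(0)
templates = np.random.rand(10, 128)    # embeddings of past observations
# a familiar object: close to stored template 3, slightly perturbed
query = templates[3] + 0.01 * np.random.rand(128)

idx, sim = nearest_template(query, templates)
print(idx, round(sim, 3))
```

Once the nearest observation is found, the dense correspondences described above map its stored grasp pose onto the target object.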
Keywords: Robot; Computer Vision; Object Recognition; Pose Estimation
URI: https://doi.org/10.34726/hss.2020.85042
DOI: 10.34726/hss.2020.85042
Library ID: AC16079462
Organisation: E376 - Institut für Automatisierungs- und Regelungstechnik 
Publication Type: Thesis
Appears in Collections: Thesis
