E376 - Institut für Automatisierungs- und Regelungstechnik
3D Vision; Pose Estimation; Object Recognition; Robotics
Machine Learning; 3D Vision; Object Recognition; Pose Estimation; Robots
Autonomous robots are expected to interact reliably with their environment, following user commands and manipulating objects. This requires a robot to understand its environment, determining the objects of which it is composed and how they relate to each other. Using object pose estimation, the robot may determine the 3D translation and 3D rotation of known object models with respect to its observation of the environment. Given the poses of all observed objects, the robot may create a 3D representation of the scene, consisting of the objects’ models and the spatial relations between them. Such an understanding allows the robot to, for example, reason about interactions with individual objects, synthesize novel views of the scene or interpret users’ commands.

However, the alignment of object models to the robot’s visual observation may suffer from sensor noise, partial observability and object symmetry, which lead to ambiguous situations and inaccurate poses. The resulting representation of the scene may thus contain implausibilities such as intersecting, floating or statically unstable objects. Resorting to physical relations alone also suffers from ambiguity, as there are, for example, numerous possibilities for two objects to plausibly interact. Accounting for such scene-level consistency is further complicated by multiple, potentially inaccurate hypotheses per object, which create a complex search space for resolving conflicting pose hypotheses.

To overcome these ambiguities and to resolve scene-level inconsistencies, we hypothesize that visual and physical plausibility complement each other and allow for more accurate and robust object pose estimation. We conjecture that the complexity of dealing with scenes of multiple objects, with multiple hypotheses each, may be tamed by considering the plausibility of the resulting configurations.
While we argue that such reasoning may be generally beneficial in robot vision, we focus on the task of object pose estimation and its sub-steps of refinement and verification. In this thesis, we provide definitions for the visual and physical plausibility of object poses in static scenes. Visual plausibility is assessed via rendering- or point-cloud-based alignment. Physical plausibility is determined by simulation or by evaluation of static equilibrium. We propose analytical and learning-based approaches to the object pose estimation task that leverage these definitions. We explore concepts from reinforcement learning to incorporate plausibility at different stages of the pose estimation pipeline and to efficiently consider vast numbers of scene-level combinations. Moreover, based on the plausibility information gathered by our proposed methods, we derive explanation strategies for human-robot interaction in case of robotic failure. Through evaluation on common datasets and by applying our methods to robotic grasping, we highlight the accuracy, robustness and efficiency of our proposed object pose estimation approaches and demonstrate the benefit of considering visual and physical plausibility for this task.
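To make the two plausibility notions concrete, the following is a minimal sketch, not the thesis's actual formulation: visual plausibility is approximated as a point-cloud inlier ratio (fraction of transformed model points close to an observed point), and physical plausibility as a crude static check for a tabletop scene (the object neither penetrates the ground plane nor floats above it). All function names, thresholds and the ground-plane assumption are illustrative.

```python
import numpy as np

def visual_plausibility(model_pts, observed_pts, pose, tau=0.01):
    """Score a pose hypothesis by point-cloud alignment: the fraction
    of transformed model points that have an observed point within
    distance tau (an illustrative inlier measure)."""
    R, t = pose  # 3x3 rotation matrix, 3-vector translation
    transformed = model_pts @ R.T + t
    # Brute-force nearest-neighbor distances (fine for small clouds).
    d = np.linalg.norm(
        transformed[:, None, :] - observed_pts[None, :, :], axis=2
    ).min(axis=1)
    return float((d < tau).mean())

def physical_plausibility(model_pts, pose, ground_z=0.0, eps=0.005):
    """Crude static-equilibrium proxy for a tabletop scene: the lowest
    model point must rest on the ground plane, neither penetrating it
    nor floating above it."""
    R, t = pose
    z_min = (model_pts @ R.T + t)[:, 2].min()
    no_penetration = z_min > ground_z - eps
    supported = z_min < ground_z + eps
    return float(no_penetration and supported)

def combined_score(model_pts, observed_pts, pose, w=0.5):
    """Blend both cues; conflicting pose hypotheses for a scene can
    then be ranked by this joint score."""
    return (w * visual_plausibility(model_pts, observed_pts, pose)
            + (1 - w) * physical_plausibility(model_pts, pose))
```

In this toy scoring, a hypothesis that aligns well with the observation but leaves the object floating is penalized, illustrating how the two cues can complement each other when disambiguating pose hypotheses.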
Deviating title according to the author’s translation