Self-supervised Vision Transformers for 3D pose estimation of novel objects

Thalhammer, Stefan; Weibel, Jean-Baptiste; Vincze, Markus; Rodriguez-Garcia, Jose

doi:10.1016/j.imavis.2023.104816

Record link:

http://hdl.handle.net/20.500.12708/189457

Citation:

Thalhammer, S., Weibel, J.-B., Vincze, M., & Rodriguez-Garcia, J. (2023). Self-supervised Vision Transformers for 3D pose estimation of novel objects. Image and Vision Computing, 139, Article 104816. https://doi.org/10.1016/j.imavis.2023.104816

CatalogPlus:

AC17202794

Publication Type:

Article - Original Research Article

Language:

English

Organisational Unit:

E376-02 - Research Unit of Complex Dynamical Systems

Journal:

Image and Vision Computing

ISSN:

0262-8856

Date (published):

Nov-2023

Publisher:

Elsevier

Peer reviewed:

Yes

Keywords:

Object pose estimation; Self-supervised learning; Template matching; Vision transformer

Abstract:

Object pose estimation is important for object manipulation and scene understanding. To improve the general applicability of pose estimators, recent research focuses on providing estimates for novel objects, that is, objects unseen during training. Such works use deep template matching strategies to retrieve the template closest to a query image, which implicitly provides object class and pose. Despite the recent success and improvements of Vision Transformers over CNNs for many vision tasks, the state of the art uses CNN-based approaches for novel object pose estimation. This work evaluates and demonstrates the differences between self-supervised CNNs and Vision Transformers for deep template matching. In detail, both types of approaches are trained using contrastive learning to match training images against rendered templates of isolated objects. At test time, such templates are matched against query images of known and novel objects under challenging settings, such as clutter, occlusion and object symmetries, using masked cosine similarity. The presented results demonstrate not only that Vision Transformers improve matching accuracy over CNNs but also that, in some cases, pre-trained Vision Transformers do not need fine-tuning to achieve the improvement. Furthermore, we highlight the differences in optimization and network architecture when comparing these two types of networks for deep template matching.
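
The template retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each image is represented by a list of per-patch feature vectors, that each template carries a boolean foreground mask, and that the score is the mean per-patch cosine similarity over masked patches. The names cosine, masked_cosine_similarity and retrieve_template are hypothetical.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def masked_cosine_similarity(query, template, mask):
    """Mean per-patch cosine similarity over patches where mask is True.

    Assumption: patches outside the template's object mask are ignored,
    so background and occluders in the query do not affect the score.
    """
    scores = [cosine(q, t) for q, t, m in zip(query, template, mask) if m]
    return sum(scores) / len(scores) if scores else 0.0


def retrieve_template(query, templates):
    """Return (index, score) of the best-matching template.

    templates is a list of (features, mask) pairs. In a full pipeline the
    index would map to the object class and pose of the rendered template.
    """
    scored = [masked_cosine_similarity(query, f, m) for f, m in templates]
    best = max(range(len(scored)), key=scored.__getitem__)
    return best, scored[best]


# Toy data: three patches with two-dimensional features (illustrative only).
query = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
templates = [
    ([[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]], [True, True, False]),
    ([[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]], [True, True, False]),
]
best_index, best_score = retrieve_template(query, templates)
```

In this toy example the third patch is masked out in both templates, so only the first two patches contribute to the score. Real features would come from a CNN or Vision Transformer backbone, which this sketch does not model.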

Project title:

Traceable Robotic Handling of Sterile Medical Products: 101017089 (European Commission)

Research Areas:

Automation and Robotics: 100%

Science Branch:

2020 - Electrical Engineering, Electronics, Information Engineering: 100%

License:

CC BY 4.0