<div class="csl-bib-body">
<div class="csl-entry">Hermosilla, P., Stippel, C., & Sick, L. (2025). Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding. In <i>Proceedings of the IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)</i> (pp. 14835–14844). https://doi.org/10.1109/CVPR52734.2025.01382</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/222797
-
dc.description.abstract
Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments demonstrate that our method not only achieves performance competitive with supervised models, but also surpasses existing self-supervised approaches by a large margin. The model and training code can be found at our GitHub repository.
en
dc.language.iso
en
-
dc.subject
3D scene understanding
en
dc.subject
point cloud processing
en
dc.subject
representation learning
en
dc.subject
self-supervised learning
en
dc.title
Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.contributor.affiliation
Universität Ulm, Germany
-
dc.relation.isbn
979-8-3315-4365-5
-
dc.relation.issn
1063-6919
-
dc.description.startpage
14835
-
dc.description.endpage
14844
-
dc.type.category
Full-Paper Contribution
-
dc.relation.eissn
2575-7075
-
tuw.booktitle
Proceedings of the IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)
-
tuw.peerreviewed
true
-
tuw.researchTopic.id
I5
-
tuw.researchTopic.name
Visual Computing and Human-Centered Technology
-
tuw.researchTopic.value
100
-
tuw.publication.orgunit
E193-01 - Forschungsbereich Computer Vision
-
tuw.publisher.doi
10.1109/CVPR52734.2025.01382
-
dc.description.numberOfPages
10
-
tuw.author.orcid
0009-0004-6524-0715
-
tuw.event.name
The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)
en
tuw.event.startdate
2025-06-11
-
tuw.event.enddate
2025-06-15
-
tuw.event.online
On Site
-
tuw.event.type
Event for scientific audience
-
tuw.event.country
US
-
tuw.event.presenter
Hermosilla, Pedro
-
wb.sciencebranch
Computer Science
-
wb.sciencebranch
Mathematics
-
wb.sciencebranch.oefos
1020
-
wb.sciencebranch.oefos
1010
-
wb.sciencebranch.value
90
-
wb.sciencebranch.value
10
-
item.openairetype
conference paper
-
item.openairecristype
http://purl.org/coar/resource_type/c_5794
-
item.cerifentitytype
Publications
-
item.languageiso639-1
en
-
item.grantfulltext
none
-
item.fulltext
no Fulltext
-
crisitem.author.dept
E193-01 - Forschungsbereich Computer Vision
-
crisitem.author.dept
E193-01 - Forschungsbereich Computer Vision
-
crisitem.author.dept
Universität Ulm, Germany
-
crisitem.author.orcid
0009-0004-6524-0715
-
crisitem.author.parentorg
E193 - Institut für Visual Computing and Human-Centered Technology
-
crisitem.author.parentorg
E193 - Institut für Visual Computing and Human-Centered Technology