<div class="csl-bib-body">
<div class="csl-entry">Koch, S., Hermosilla, P., Vaskevicius, N., Colosi, M., & Ropinski, T. (2024). Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction. In <i>2024 International Conference on 3D Vision (3DV)</i> (pp. 1037–1047). https://doi.org/10.1109/3DV62453.2024.00076</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/203928
-
dc.description.abstract
3D scene graphs are an emerging 3D scene representation that models both the objects present in a scene and their relationships. However, learning 3D scene graphs is a challenging task because it requires not only object labels but also relationship annotations, which are very scarce in datasets. While it is widely accepted that pre-training is an effective approach to improve model performance in low-data regimes, we find that existing pre-training methods are ill-suited for 3D scene graphs. To solve this issue, we present the first language-based pre-training approach for 3D scene graphs, which exploits the strong relationship between scene graphs and language. To this end, we distill the knowledge of the language encoder of CLIP, a popular vision-language model, into our graph-based network. We formulate a contrastive pre-training scheme that aligns the text embeddings of relationships (subject-predicate-object triplets) with predicted 3D graph features. Our method achieves state-of-the-art results on the main semantic 3D scene graph benchmark: it is more effective than pre-training baselines and outperforms all existing fully supervised scene graph prediction methods by a significant margin. Furthermore, because our scene graph features are language-aligned, we can query the language space of the features in a zero-shot manner. As an example, we use this property to predict the room type of a scene without further training.
en
dc.language.iso
en
-
dc.subject
3D representation learning
en
dc.subject
3D Scene Graph
en
dc.subject
CLIP
en
dc.subject
GCN
en
dc.subject
language + 3D vision
en
dc.subject
pre-training
en
dc.title
Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.contributor.affiliation
Universität Ulm, Germany
-
dc.contributor.affiliation
Robert Bosch (Germany), Germany
-
dc.contributor.affiliation
Robert Bosch (Germany), Germany
-
dc.contributor.affiliation
Universität Ulm, Germany
-
dc.relation.isbn
9798350362459
-
dc.description.startpage
1037
-
dc.description.endpage
1047
-
dc.type.category
Full-Paper Contribution
-
tuw.booktitle
2024 International Conference on 3D Vision (3DV)
-
tuw.peerreviewed
true
-
tuw.researchTopic.id
I5
-
tuw.researchTopic.name
Visual Computing and Human-Centered Technology
-
tuw.researchTopic.value
100
-
tuw.publication.orgunit
E193-01 - Forschungsbereich Computer Vision
-
tuw.publisher.doi
10.1109/3DV62453.2024.00076
-
dc.description.numberOfPages
11
-
tuw.author.orcid
0000-0002-1409-5114
-
tuw.author.orcid
0000-0001-8141-2725
-
tuw.author.orcid
0000-0002-7857-5512
-
tuw.event.name
International Conference on 3D Vision (3DV 2024)
en
tuw.event.startdate
18-03-2024
-
tuw.event.enddate
21-03-2024
-
tuw.event.online
On Site
-
tuw.event.type
Event for scientific audience
-
tuw.event.country
CH
-
tuw.event.presenter
Koch, Sebastian
-
wb.sciencebranch
Computer Science
-
wb.sciencebranch
Mathematics
-
wb.sciencebranch.oefos
1020
-
wb.sciencebranch.oefos
1010
-
wb.sciencebranch.value
90
-
wb.sciencebranch.value
10
-
item.grantfulltext
restricted
-
item.openairetype
conference paper
-
item.openairecristype
http://purl.org/coar/resource_type/c_5794
-
item.languageiso639-1
en
-
item.fulltext
no Fulltext
-
item.cerifentitytype
Publications
-
crisitem.author.dept
Universität Ulm, Germany
-
crisitem.author.dept
E193-01 - Forschungsbereich Computer Vision
-
crisitem.author.dept
Robert Bosch (Germany), Germany
-
crisitem.author.dept
Robert Bosch (Germany), Germany
-
crisitem.author.dept
Universität Ulm, Germany
-
crisitem.author.orcid
0000-0002-1409-5114
-
crisitem.author.orcid
0000-0001-8141-2725
-
crisitem.author.orcid
0000-0002-7857-5512
-
crisitem.author.parentorg
E193 - Institut für Visual Computing and Human-Centered Technology