<div class="csl-bib-body">
<div class="csl-entry">Tekaya, N., Waldner, M., & Zeppelzauer, M. (2025). A Matter of Time: Revealing the Structure of Time in Vision-Language Models. In <i>MM ’25: Proceedings of the 33rd ACM International Conference on Multimedia</i> (pp. 12371–12380). https://doi.org/10.1145/3746027.3758163</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/221607
-
dc.description.abstract
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
en
dc.description.sponsorship
FWF - Österr. Wissenschaftsfonds
-
dc.language.iso
en
-
dc.subject
Multimodal representations
en
dc.subject
Vision-language models
-
dc.subject
Time modeling
-
dc.subject
Time reasoning
-
dc.subject
Time estimation
-
dc.subject
Benchmark dataset
-
dc.title
A Matter of Time: Revealing the Structure of Time in Vision-Language Models
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.contributor.affiliation
St. Pölten University of Applied Sciences, Austria
-
dc.relation.isbn
979-8-4007-2035-2
-
dc.description.startpage
12371
-
dc.description.endpage
12380
-
dc.relation.grantno
DFH 37-N
-
dc.type.category
Full-Paper Contribution
-
tuw.booktitle
MM '25: Proceedings of the 33rd ACM International Conference on Multimedia
-
tuw.peerreviewed
true
-
tuw.project.title
Visuelle Analytik und Computer Vision treffen auf kulturelles Erbe
-
tuw.researchinfrastructure
Vienna Scientific Cluster
-
tuw.researchTopic.id
I5
-
tuw.researchTopic.id
C5
-
tuw.researchTopic.name
Visual Computing and Human-Centered Technology
-
tuw.researchTopic.name
Computer Science Foundations
-
tuw.researchTopic.value
80
-
tuw.researchTopic.value
20
-
tuw.linking
https://tekayanidham.github.io/timeline-page/
-
tuw.publication.orgunit
E193-02 - Forschungsbereich Computer Graphics
-
tuw.publication.orgunit
E056-18 - Fachbereich Visual Analytics and Computer Vision Meet Cultural Heritage
-
tuw.publisher.doi
10.1145/3746027.3758163
-
dc.description.numberOfPages
10
-
tuw.author.orcid
0009-0003-8679-4082
-
tuw.author.orcid
0000-0003-1387-5132
-
tuw.author.orcid
0000-0003-0413-4746
-
tuw.event.name
ACM International Conference on Multimedia 2025
en
tuw.event.startdate
27-10-2025
-
tuw.event.enddate
31-10-2025
-
tuw.event.online
On Site
-
tuw.event.type
Event for scientific audience
-
tuw.event.place
Dublin
-
tuw.event.country
IE
-
tuw.event.presenter
Tekaya, Nidham
-
tuw.event.track
Multi Track
-
wb.sciencebranch
Informatik
-
wb.sciencebranch
Mathematik
-
wb.sciencebranch.oefos
1020
-
wb.sciencebranch.oefos
1010
-
wb.sciencebranch.value
90
-
wb.sciencebranch.value
10
-
item.openairetype
conference paper
-
item.openairecristype
http://purl.org/coar/resource_type/c_5794
-
item.cerifentitytype
Publications
-
item.languageiso639-1
en
-
item.grantfulltext
restricted
-
item.fulltext
no Fulltext
-
crisitem.author.dept
St. Pölten University of Applied Sciences, Austria
-
crisitem.author.dept
E193-02 - Forschungsbereich Computer Graphics
-
crisitem.author.dept
E193-07 - Forschungsbereich Visual Analytics
-
crisitem.author.orcid
0009-0003-8679-4082
-
crisitem.author.orcid
0000-0003-1387-5132
-
crisitem.author.orcid
0000-0003-0413-4746
-
crisitem.author.parentorg
E193 - Institut für Visual Computing and Human-Centered Technology
-
crisitem.author.parentorg
E193 - Institut für Visual Computing and Human-Centered Technology