A Matter of Time: Revealing the Structure of Time in Vision-Language Models

Tekaya, Nidham; Waldner, Manuela; Zeppelzauer, Matthias

doi:10.1145/3746027.3758163

Record link:

http://hdl.handle.net/20.500.12708/221607

Title:

A Matter of Time: Revealing the Structure of Time in Vision-Language Models

Citation:

Tekaya, N., Waldner, M., & Zeppelzauer, M. (2025). A Matter of Time: Revealing the Structure of Time in Vision-Language Models. In MM ’25: Proceedings of the 33rd ACM International Conference on Multimedia (pp. 12371–12380). https://doi.org/10.1145/3746027.3758163

Publisher DOI:

10.1145/3746027.3758163

Publication Type:

Inproceedings - Full-Paper Contribution

Language:

English

Authors:

Tekaya, Nidham
Waldner, Manuela
Zeppelzauer, Matthias

Organisational Unit:

E193-02 - Research Unit of Computer Graphics
E056-18 - Research Area Visual Analytics and Computer Vision Meet Cultural Heritage

Published in:

MM '25: Proceedings of the 33rd ACM International Conference on Multimedia

ISBN:

979-8-4007-2035-2

Date (published):

2025

Event name:

ACM International Conference on Multimedia 2025

Event date:

27-Oct-2025 - 31-Oct-2025

Event place:

Dublin, Ireland

Number of Pages:

10

Peer reviewed:

Yes

Keywords:

Multimodal representations; Vision-language models; Time modeling; Time reasoning; Time estimation; Benchmark dataset

Abstract:

Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs using a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve accuracy that is competitive with or superior to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
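
This record does not specify how the prompt-based baseline mentioned in the abstract works. One common way to estimate a year with a CLIP-style model is to embed one text prompt per candidate year and pick the year whose prompt is most similar to the image embedding. The following is a minimal sketch under that assumption, not the authors' implementation: random vectors stand in for real model embeddings, and the names `l2_normalize` and `predict_year` are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def predict_year(image_emb: np.ndarray, prompt_embs: np.ndarray, years: np.ndarray) -> int:
    """Return the candidate year whose prompt embedding is closest to the image."""
    sims = l2_normalize(prompt_embs) @ l2_normalize(image_emb)  # shape: (num_years,)
    return int(years[np.argmax(sims)])


# Assumption: toy stand-ins for embeddings. In practice, each row of prompt_embs
# would come from a VLM text encoder (one prompt per candidate year) and
# image_emb from the matching image encoder.
rng = np.random.default_rng(0)
years = np.arange(1900, 2000, 10)                 # 1900, 1910, ..., 1990
prompt_embs = rng.normal(size=(len(years), 64))   # one 64-d vector per year

# Hypothetical image embedding: the 1950 prompt embedding plus small noise.
image_emb = prompt_embs[5] + 0.05 * rng.normal(size=64)

predicted = predict_year(image_emb, prompt_embs, years)
print(predicted)
```

A baseline of this kind compares each image against every candidate-year prompt, which is the sort of per-year cost that an explicit timeline representation could avoid.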

Research facilities:

Vienna Scientific Cluster

Project title:

Visual Analytics and Computer Vision Meet Cultural Heritage: DFH 37-N (FWF - Austrian Science Fund)

Link (external):

https://tekayanidham.github.io/timeline-page/

Research Areas:

Visual Computing and Human-Centered Technology: 80%
Computer Science Foundations: 20%

Science Branch:

1020 - Computer Sciences: 90%
1010 - Mathematics: 10%

Appears in Collections:

Conference Paper
