Tekaya, N., Waldner, M., & Zeppelzauer, M. (2025). A Matter of Time: Revealing the Structure of Time in Vision-Language Models. In MM ’25: Proceedings of the 33rd ACM International Conference on Multimedia (pp. 12371–12380). https://doi.org/10.1145/3746027.3758163
E193-02 - Forschungsbereich Computer Graphics
E056-18 - Fachbereich Visual Analytics and Computer Vision Meet Cultural Heritage
Published in:
MM '25: Proceedings of the 33rd ACM International Conference on Multimedia
ISBN:
979-8-4007-2035-2
Date (published):
2025
Event name:
ACM International Conference on Multimedia 2025
Event date:
27-Oct-2025 - 31-Oct-2025
Event place:
Dublin, Ireland
Number of Pages:
10
Peer reviewed:
Yes
Keywords:
Multimodal representations
Vision-language models; Time modeling; Time reasoning; Time estimation; Benchmark dataset
Abstract:
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
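Illustrative note (not part of the published abstract): the abstract does not specify how the prompt-based baseline works. One plausible reading, sketched below purely as an assumption, is zero-shot year estimation: embed one text prompt per candidate year, embed the image, and pick the year whose prompt embedding has the highest cosine similarity. The function names `cosine` and `estimate_year` and the two-dimensional toy vectors are hypothetical stand-ins for real VLM encoder outputs; they are not taken from the paper or its code release.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def estimate_year(image_emb, year_prompt_embs):
    """Return the candidate year whose prompt embedding is most similar
    to the image embedding (hypothetical prompt-based baseline)."""
    return max(
        year_prompt_embs,
        key=lambda year: cosine(image_emb, year_prompt_embs[year]),
    )


# Toy 2-D vectors standing in for text-encoder outputs of prompts such as
# "a photo taken in the year 1950" (assumed prompt wording).
toy_prompts = {1950: [1.0, 0.0], 1980: [0.7, 0.7], 2010: [0.0, 1.0]}

# Toy vector standing in for an image-encoder output.
toy_image = [0.6, 0.8]

predicted = estimate_year(toy_image, toy_prompts)
```

In a real setting the toy vectors would be replaced by embeddings from the VLM under evaluation. This sketch needs one prompt comparison per candidate year for every image; how the paper's timeline methods reduce that cost is not described in the abstract.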
Research facilities:
Vienna Scientific Cluster
Project title:
Visual Analytics and Computer Vision Meet Cultural Heritage: DFH 37-N (FWF - Austrian Science Fund)