A Matter of Time: Revealing the Structure of Time in Vision-Language Models

Tekaya, Nidham; Waldner, Manuela; Zeppelzauer, Matthias

doi:10.1145/3746027.3758163

DC Field

Value

Language

dc.contributor.author

Tekaya, Nidham

dc.contributor.author

Waldner, Manuela

dc.contributor.author

Zeppelzauer, Matthias

dc.date.accessioned

2025-11-25T10:42:14Z

dc.date.available

2025-11-25T10:42:14Z

dc.date.issued

2025

dc.identifier.citation

Tekaya, N., Waldner, M., & Zeppelzauer, M. (2025). A Matter of Time: Revealing the Structure of Time in Vision-Language Models. In MM ’25: Proceedings of the 33rd ACM International Conference on Multimedia (pp. 12371–12380). https://doi.org/10.1145/3746027.3758163

dc.identifier.uri

http://hdl.handle.net/20.500.12708/221607

dc.description.abstract

Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs using a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.

dc.description.sponsorship

FWF - Österr. Wissenschaftsfonds

dc.language.iso

dc.subject

Multimodal representations

dc.subject

Vision-language models

dc.subject

Time modeling

dc.subject

Time reasoning

dc.subject

Time estimation

dc.subject

Benchmark dataset

dc.title

A Matter of Time: Revealing the Structure of Time in Vision-Language Models

dc.type

Inproceedings

dc.type

Conference contribution

dc.contributor.affiliation

University of Applied Sciences St Pölten, Austria

dc.relation.isbn

979-8-4007-2035-2

dc.description.startpage

12371

dc.description.endpage

12380

dc.relation.grantno

DFH 37-N

dc.type.category

Full-Paper Contribution

tuw.booktitle

MM '25: Proceedings of the 33rd ACM International Conference on Multimedia

tuw.peerreviewed

true

tuw.project.title

Visuelle Analytik und Computer Vision treffen auf kulturelles Erbe

tuw.researchinfrastructure

Vienna Scientific Cluster

tuw.researchTopic.id

tuw.researchTopic.name

Visual Computing and Human-Centered Technology

tuw.researchTopic.name

Computer Science Foundations

tuw.researchTopic.value

tuw.linking

https://tekayanidham.github.io/timeline-page/

tuw.publication.orgunit

E193-02 - Forschungsbereich Computer Graphics

tuw.publication.orgunit

E056-18 - Fachbereich Visual Analytics and Computer Vision Meet Cultural Heritage

tuw.publisher.doi

10.1145/3746027.3758163

dc.description.numberOfPages

tuw.author.orcid

0009-0003-8679-4082

tuw.author.orcid

0000-0003-1387-5132

tuw.author.orcid

0000-0003-0413-4746

tuw.event.name

ACM International Conference on Multimedia 2025

tuw.event.startdate

27-10-2025

tuw.event.enddate

31-10-2025

tuw.event.online

On Site

tuw.event.type

Event for scientific audience

tuw.event.place

Dublin

tuw.event.country

tuw.event.presenter

Tekaya, Nidham

tuw.event.track

Multi Track

wb.sciencebranch

Computer Science

wb.sciencebranch

Mathematics

wb.sciencebranch.oefos

1020

wb.sciencebranch.oefos

1010

wb.sciencebranch.value

item.grantfulltext

restricted

item.fulltext

no Fulltext

item.cerifentitytype

Publications

item.languageiso639-1

item.openairetype

conference paper

item.openairecristype

http://purl.org/coar/resource_type/c_5794

crisitem.author.dept

University of Applied Sciences St Pölten, Austria

crisitem.author.dept

E193-02 - Forschungsbereich Computer Graphics

crisitem.author.dept

E193-07 - Forschungsbereich Visual Analytics

crisitem.author.orcid

0009-0003-8679-4082

crisitem.author.orcid

0000-0003-1387-5132

crisitem.author.orcid

0000-0003-0413-4746

crisitem.author.parentorg

E193 - Institut für Visual Computing and Human-Centered Technology

crisitem.author.parentorg

E193 - Institut für Visual Computing and Human-Centered Technology

crisitem.project.funder

FWF - Österr. Wissenschaftsfonds

crisitem.project.grantno

DFH 37-N

Appears in Collections:

Conference Paper
