Robotic-CLIP: Fine-Tuning CLIP on Action Data for Robotic Applications

Nguyen, Nghia; Vu, Minh Nhat; Tung, D. Ta; Baoru, Huang; Vo, Thieu; Ngan, Le; Nguyen, Anh

doi:10.1109/ICRA55743.2025.11127829

DC Field

Value

Language

dc.contributor.author

Nguyen, Nghia

dc.contributor.author

Vu, Minh Nhat

dc.contributor.author

Tung, D. Ta

dc.contributor.author

Baoru, Huang

dc.contributor.author

Vo, Thieu

dc.contributor.author

Ngan, Le

dc.contributor.author

Nguyen, Anh

dc.date.accessioned

2026-01-07T09:32:36Z

dc.date.available

2026-01-07T09:32:36Z

dc.date.issued

2025-09-02

dc.identifier.citation

Nguyen, N., Vu, M. N., Tung, D. T., Baoru, H., Vo, T., Ngan, L., & Nguyen, A. (2025). Robotic-CLIP: Fine-Tuning CLIP on Action Data for Robotic Applications. In 2025 IEEE International Conference on Robotics and Automation (ICRA) (pp. 5930–5936). IEEE. https://doi.org/10.1109/ICRA55743.2025.11127829

dc.identifier.uri

http://hdl.handle.net/20.500.12708/223620

dc.description.abstract

Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (≈ 7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP's strong image performance while gaining the ability to understand actions in robotic contexts. Intensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.

dc.language.iso

dc.subject

vision language models

dc.subject

Contrastive Language-Image Pretraining

dc.subject

grasping

dc.title

Robotic-CLIP: Fine-Tuning CLIP on Action Data for Robotic Applications

dc.type

Inproceedings

dc.type

Conference paper

dc.contributor.affiliation

The University of Tokyo, Japan

dc.contributor.affiliation

University of Liverpool, United Kingdom of Great Britain and Northern Ireland (the)

dc.contributor.affiliation

University of Arkansas System, United States of America (the)

dc.contributor.affiliation

University of Liverpool, United Kingdom of Great Britain and Northern Ireland (the)

dc.relation.isbn

979-8-3315-4139-2

dc.relation.doi

10.1109/ICRA55743.2025

dc.description.startpage

5930

dc.description.endpage

5936

dc.type.category

Full-Paper Contribution

tuw.booktitle

2025 IEEE International Conference on Robotics and Automation (ICRA)

tuw.peerreviewed

true

tuw.relation.publisher

IEEE

tuw.researchTopic.id

tuw.researchTopic.name

Mathematical and Algorithmic Foundations

tuw.researchTopic.name

Computer Science Foundations

tuw.researchTopic.name

Modeling and Simulation

tuw.researchTopic.value

tuw.publication.orgunit

E376-02 - Forschungsbereich Komplexe Dynamische Systeme

tuw.publisher.doi

10.1109/ICRA55743.2025.11127829

dc.description.numberOfPages

tuw.author.orcid

0000-0002-2382-3339

tuw.author.orcid

0000-0002-2342-1364

tuw.author.orcid

0000-0002-1896-3038

tuw.event.name

IEEE International Conference on Robotics and Automation (ICRA 2025)

tuw.event.startdate

19-05-2025

tuw.event.enddate

23-05-2025

tuw.event.online

On Site

tuw.event.type

Event for scientific audience

tuw.event.place

Atlanta, GA

tuw.event.country

tuw.event.presenter

Nguyen, Nghia

wb.sciencebranch

Electrical Engineering, Electronics, Information Engineering

wb.sciencebranch.oefos

2020

wb.sciencebranch.value

100

item.languageiso639-1

item.openairecristype

http://purl.org/coar/resource_type/c_5794

item.fulltext

no Fulltext

item.cerifentitytype

Publications

item.grantfulltext

none

item.openairetype

conference paper

crisitem.author.dept

E376-02 - Forschungsbereich Komplexe Dynamische Systeme

crisitem.author.dept

The University of Tokyo, Japan

crisitem.author.dept

University of Liverpool, United Kingdom of Great Britain and Northern Ireland (the)

crisitem.author.dept

University of Arkansas System, United States of America (the)

crisitem.author.dept

University of Liverpool, United Kingdom of Great Britain and Northern Ireland (the)

crisitem.author.orcid

0000-0002-2382-3339

crisitem.author.parentorg

E376 - Institut für Automatisierungs- und Regelungstechnik

Appears in Collections:

Conference Paper
