<div class="csl-bib-body">
<div class="csl-entry">Varga, Z., Mascaro, E. V., Sliwowski, D. J., & Lee, D. (2025). Multimodal Transformer Models for Human Action Classification. In D. Park, C. Liu, D.-Y. Lee, & M. J. Kim (Eds.), <i>Robot Intelligence Technology and Applications 9 : Results from the 12th International Conference on Robot Intelligence Technology and Applications</i> (pp. 52–63). Springer. https://doi.org/10.1007/978-3-031-92011-0_5</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/223723
-
dc.description.abstract
Most research in deep learning focuses on a single modality, such as images, text, or proprioceptive data. Humans, however, routinely combine information from diverse senses to acquire richer information. Inspired by this, we design a transformer-based multimodal model for human action recognition and thoroughly evaluate its performance and robustness. Furthermore, we explore fusion methods to assess how modalities are best combined. Lastly, a model is trained to infer (generate) a missing modality. Our study shows that multimodal transformers perform better than their modality-specific equivalents. We achieve an improvement of 10.1% when using multiple data modalities over our vision-only baseline and outperform current state-of-the-art approaches by 32.8%. Furthermore, the model achieves a mean squared error of 9.6% on the tactile force reconstruction task. The implemented model can be applied in scenarios where robotic assistance depends on recognising human actions for decision-making, addressing situations where vision is limited or where audio and other modalities are required for deeper understanding.
en
dc.description.sponsorship
European Commission
-
dc.language.iso
en
-
dc.subject
Artificial intelligence
en
dc.subject
Perception
en
dc.subject
Human action recognition
en
dc.subject
Machine learning
en
dc.title
Multimodal Transformer Models for Human Action Classification
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.contributor.editoraffiliation
Korea Advanced Institute of Science and Technology, Korea (the Republic of)
-
dc.contributor.editoraffiliation
Loughborough University, United Kingdom of Great Britain and Northern Ireland (the)
-
dc.contributor.editoraffiliation
Aerospace Engineering - Korea Advanced Institute of Science and Technology (Daejeon, KR)
-
dc.contributor.editoraffiliation
Korea Advanced Institute of Science and Technology
-
dc.relation.isbn
978-3-031-92011-0
-
dc.relation.doi
10.1007/978-3-031-92011-0
-
dc.relation.issn
2367-3370
-
dc.description.startpage
52
-
dc.description.endpage
63
-
dc.relation.grantno
GAP-101136067
-
dc.type.category
Full-Paper Contribution
-
dc.relation.eissn
2367-3389
-
tuw.booktitle
Robot Intelligence Technology and Applications 9 : Results from the 12th International Conference on Robot Intelligence Technology and Applications
-
tuw.container.volume
1419
-
tuw.peerreviewed
true
-
tuw.book.ispartofseries
Lecture Notes in Networks and Systems
-
tuw.relation.publisher
Springer
-
tuw.relation.publisherplace
Cham
-
tuw.project.title
INteractive robots that intuitiVely lEarn to inVErt tasks ReaSoning about their Execution
-
tuw.researchTopic.id
I3
-
tuw.researchTopic.name
Automation and Robotics
-
tuw.researchTopic.value
100
-
tuw.publication.orgunit
E384-03 - Forschungsbereich Autonomous Systems
-
tuw.publisher.doi
10.1007/978-3-031-92011-0_5
-
dc.description.numberOfPages
12
-
tuw.author.orcid
0009-0004-9121-1694
-
tuw.author.orcid
0000-0003-1897-7664
-
tuw.editor.orcid
0000-0002-1287-9433
-
tuw.editor.orcid
0000-0003-2829-9369
-
tuw.editor.orcid
0000-0003-3839-2700
-
tuw.event.name
International Conference on Robot Intelligence Technology and Applications