<div class="csl-bib-body">
<div class="csl-entry">Varga, Z., Mascaro, E. V., Sliwowski, D. J., & Lee, D. (2025). Multimodal Transformer Models for Human Action Classification. In D. Park, C. Liu, D.-Y. Lee, & M. J. Kim (Eds.), <i>Robot Intelligence Technology and Applications 9 : Results from the 12th International Conference on Robot Intelligence Technology and Applications</i> (pp. 52–63). Springer. https://doi.org/10.1007/978-3-031-92011-0_5</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/223723
-
dc.description.abstract
Most research in deep learning focuses on a single modality, such as images, text, or proprioceptive data. Humans, however, routinely combine information from diverse senses to acquire richer information. Inspired by this, we design a transformer-based multimodal model for human action recognition and thoroughly evaluate its performance and robustness. Furthermore, we explore fusion methods to assess how modalities are best combined. Lastly, a model is trained to infer (generate) a missing modality. Our study shows that multimodal transformers perform better than their modality-specific equivalents. We achieve an improvement of 10.1% when using multiple data modalities over our vision-only baseline and outperform current state-of-the-art approaches by 32.8%. Furthermore, the model achieves a mean squared error of 9.6% on the tactile force reconstruction task. The implemented model can be applied in scenarios where robotic assistance depends on recognising human actions for decision-making, addressing situations where vision is limited or where audio and other modalities are required for deeper understanding.
en
dc.description.sponsorship
European Commission
-
dc.language.iso
en
-
dc.subject
Artificial intelligence
en
dc.subject
Perception
en
dc.subject
Human action recognition
en
dc.subject
Machine learning
en
dc.title
Multimodal Transformer Models for Human Action Classification
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.contributor.editoraffiliation
Korea Advanced Institute of Science and Technology, Korea (the Republic of)
-
dc.contributor.editoraffiliation
Loughborough University, United Kingdom of Great Britain and Northern Ireland (the)
-
dc.contributor.editoraffiliation
Aerospace Engineering - Korea Advanced Institute of Science and Technology (Daejeon, KR)
-
dc.contributor.editoraffiliation
Korea Advanced Institute of Science and Technology
-
dc.relation.isbn
978-3-031-92011-0
-
dc.relation.doi
10.1007/978-3-031-92011-0
-
dc.relation.issn
2367-3370
-
dc.description.startpage
52
-
dc.description.endpage
63
-
dc.relation.grantno
GAP-101136067
-
dc.type.category
Full-Paper Contribution
-
dc.relation.eissn
2367-3389
-
tuw.booktitle
Robot Intelligence Technology and Applications 9 : Results from the 12th International Conference on Robot Intelligence Technology and Applications
-
tuw.container.volume
1419
-
tuw.peerreviewed
true
-
tuw.book.ispartofseries
Lecture Notes in Networks and Systems
-
tuw.relation.publisher
Springer
-
tuw.relation.publisherplace
Cham
-
tuw.project.title
INteractive robots that intuitiVely lEarn to inVErt tasks ReaSoning about their Execution
-
tuw.researchTopic.id
I3
-
tuw.researchTopic.name
Automation and Robotics
-
tuw.researchTopic.value
100
-
tuw.publication.orgunit
E384-03 - Forschungsbereich Autonomous Systems
-
tuw.publisher.doi
10.1007/978-3-031-92011-0_5
-
dc.description.numberOfPages
12
-
tuw.author.orcid
0009-0004-9121-1694
-
tuw.author.orcid
0000-0003-1897-7664
-
tuw.editor.orcid
0000-0002-1287-9433
-
tuw.editor.orcid
0000-0003-2829-9369
-
tuw.editor.orcid
0000-0003-3839-2700
-
tuw.event.name
International Conference on Robot Intelligence Technology and Applications