Real-world human-robot interaction behavior generation using latent diffusion models

Stanovcic, Sergej

doi:10.34726/hss.2026.123206

Record link:

https://doi.org/10.34726/hss.2026.123206
http://hdl.handle.net/20.500.12708/228390

Title:

Real-world human-robot interaction behavior generation using latent diffusion models

Citation:

Stanovcic, S. (2026). Real-world human-robot interaction behavior generation using latent diffusion models [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.123206

reposiTUm DOI:

10.34726/hss.2026.123206

CatalogPlus:

AC17883648

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Stanovcic, Sergej

Advisor:

Lee, Dongheui

Co-advisor:

Valls Mascaro, Esteve

Organisational Unit:

E384 - Institut für Computertechnik

Date (published):

2026

Number of Pages:

Keywords:

Human Robot Integration; Robot Behavior Generation; Deep Learning; Diffusion Models

Abstract:

Natural social interactions involve two agents exhibiting smooth and diverse behaviors that align with each other's intent in real time. Creating this level of expressiveness in human–robot interaction (HRI) requires a robot to go beyond simple reactive behaviors and instead anticipate the rich distribution of possible human actions, enabling responses that are diverse, human-like, and socially aligned. This thesis bridges the gap between complex generative modeling and actual robotic deployment by integrating visual perception, context-aware motion generation, and execution on physical hardware into a single coherent system. At the core of the system lies a latent diffusion framework designed for the joint generation of two-person social interactions. Given past context and a high-level interaction description, the model generates potential future motions for both agents in an interdependent manner. By operating within a temporally coherent latent space, the framework ensures smooth, aligned motion segments while significantly reducing computational overhead to support live interaction. To achieve real-time generation, the model is integrated into a continuous streaming pipeline that combines chunked diffusion inference with real-time SMPL-X pose estimation from a single RGB-D camera, eliminating the need for restrictive motion capture systems and enabling continuous prediction from live human input. The framework is demonstrated both in simulation and through real-world experiments with Tiago++ and Unitree G1 robots, with generated reactor motion retargeted online to each platform's embodiment. Ultimately, this thesis provides a robust solution for diverse and responsive motion generation, advancing the development of socially aware robots capable of engaging with humans naturally and adaptively under realistic conditions.

License:

In Copyright

Appears in Collections:

Thesis