DoublyAware : Dual Planning and Policy Awareness for Temporal Difference Learning in Humanoid Locomotion

Nguyen, Khang; Le, An T.; Peters, Jan; Vu, Minh Nhat

doi:10.1109/LRA.2025.3648611

DC Field

Value

Language

dc.contributor.author

Nguyen, Khang

dc.contributor.author

Le, An T.

dc.contributor.author

Peters, Jan

dc.contributor.author

Vu, Minh Nhat

dc.date.accessioned

2026-02-09T12:16:35Z

dc.date.available

2026-02-09T12:16:35Z

dc.date.issued

2025-01-01

dc.identifier.citation

<div class="csl-bib-body"> <div class="csl-entry">Nguyen, K., Le, A. T., Peters, J., & Vu, M. N. (2025). DoublyAware : Dual Planning and Policy Awareness for Temporal Difference Learning in Humanoid Locomotion. <i>IEEE Robotics and Automation Letters</i>, <i>11</i>(2), 2162–2169. https://doi.org/10.1109/LRA.2025.3648611</div> </div>

dc.identifier.issn

2377-3766

dc.identifier.uri

http://hdl.handle.net/20.500.12708/226129

dc.description.abstract

Achieving robust robot learning for humanoid locomotion is a fundamental challenge in model-based reinforcement learning (MBRL), where environmental stochasticity can hinder efficient exploration and learning stability. This environmental (aleatoric) uncertainty can be amplified in high-dimensional action spaces with complex contact dynamics and can become entangled with epistemic uncertainty in the models during learning. In this work, we propose DoublyAware, an uncertainty-aware extension of Temporal Difference Model Predictive Control (TD-MPC) that explicitly decomposes uncertainty into two disjoint, interpretable components: planning uncertainty and policy uncertainty. To handle planning uncertainty, DoublyAware employs conformal prediction to filter candidate trajectories using quantile-calibrated risk bounds, ensuring statistical consistency and robustness against stochastic dynamics. Meanwhile, policy rollouts are leveraged as structured, informative priors to support the learning phase with Group-Relative Policy Constraint (GRPC) optimizers, which impose a group-based adaptive trust region in the latent action space. This combination enables the robot agent to prioritize high-confidence, high-reward behavior while maintaining effective, targeted exploration under uncertainty. Evaluated on the HumanoidBench locomotion suite with the Unitree 26-DoF H1-2 humanoid, DoublyAware demonstrates improved sample efficiency, accelerated convergence, and enhanced motion feasibility compared to RL baselines. Our results emphasize the significance of structured uncertainty modeling for data-efficient and reliable decision-making in TD-MPC-based humanoid locomotion learning.

dc.language.iso

dc.publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

dc.relation.ispartof

IEEE Robotics and Automation Letters

dc.subject

humanoid and bipedal locomotion

dc.subject

reinforcement learning

dc.title

DoublyAware : Dual Planning and Policy Awareness for Temporal Difference Learning in Humanoid Locomotion

dc.type

Article

dc.type

Artikel

dc.identifier.scopus

2-s2.0-105025945917

dc.identifier.url

https://api.elsevier.com/content/abstract/scopus_id/105025945917

dc.contributor.affiliation

Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates

dc.contributor.affiliation

VinUniversity, Viet Nam

dc.contributor.affiliation

Technical University of Darmstadt, Germany

dc.description.startpage

2162

dc.description.endpage

2169

dc.type.category

Original Research Article

tuw.container.volume

tuw.container.issue

tuw.journal.peerreviewed

true

tuw.peerreviewed

true

wb.publication.intCoWork

International Co-publication

tuw.researchTopic.id

tuw.researchTopic.name

Modeling and Simulation

tuw.researchTopic.name

Automation and Robotics

tuw.researchTopic.value

dcterms.isPartOf.title

IEEE Robotics and Automation Letters

tuw.publication.orgunit

E376-02 - Forschungsbereich Komplexe Dynamische Systeme

tuw.publisher.doi

10.1109/LRA.2025.3648611

dc.identifier.eissn

2377-3766

dc.description.numberOfPages

tuw.author.orcid

0000-0003-3471-5533

tuw.author.orcid

0000-0003-0929-3316

tuw.author.orcid

0000-0002-5266-8091

wb.sci

true

wb.sciencebranch

Electrical Engineering, Electronics, Information Engineering

wb.sciencebranch.oefos

2020

wb.sciencebranch.value

100

item.openairecristype

http://purl.org/coar/resource_type/c_2df8fbb1

item.fulltext

no Fulltext

item.languageiso639-1

item.grantfulltext

none

item.openairetype

research article

item.cerifentitytype

Publications

crisitem.author.dept

Technical University of Darmstadt, Germany

crisitem.author.dept

E376-02 - Forschungsbereich Komplexe Dynamische Systeme

crisitem.author.orcid

0000-0003-0929-3316

crisitem.author.orcid

0000-0002-5266-8091

crisitem.author.parentorg

E376 - Institut für Automatisierungs- und Regelungstechnik

Appears in Collections:

Article
