Built to Behave: Constraining Reinforcement Learning Agents with Restraining Bolts

Tulcan, Radu Florin

doi:10.34726/hss.2026.130021

Record link:

https://doi.org/10.34726/hss.2026.130021
http://hdl.handle.net/20.500.12708/227997

Title:

Built to Behave: Constraining Reinforcement Learning Agents with Restraining Bolts

Citation:

Tulcan, R. F. (2026). Built to Behave: Constraining Reinforcement Learning Agents with Restraining Bolts [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.130021

reposiTUm DOI:

10.34726/hss.2026.130021

CatalogPlus:

AC17856894

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Tulcan, Radu Florin

Advisor:

Ciabattoni, Agata

Organisational Unit:

E192 - Institut für Logic and Computation

Date (published):

2026

Number of Pages:

121

Keywords:

Reinforcement Learning | Normative Reasoning | Restraining Bolt | Temporal Logic

Abstract:

Over the past decade, we have discovered the potential and versatility of autonomous Artificial Intelligence (AI) agents, which has led to applications such as clinical robots, autonomous driving systems, and smart factory robots. As these agents are increasingly deployed in environments where they operate alongside and interact with humans, their behavior must align with our regulatory framework as well as with social and ethical norms.
In this thesis, we focus on Reinforcement Learning (RL) as a framework for designing autonomous agents and investigate methods for ensuring norm-compliant behavior. To this end, we employ and refine the Restraining Bolt (RB) technique, originally developed for safe RL. The RB allows an RL agent to learn behavior that adheres to a set of constraint specifications, expressed as temporal logic formulas, by assigning a positive reward whenever the agent satisfies the given specifications. To learn norm-compliant behavior, we introduce the Normative Restraining Bolt (NRB), which reinterprets the RB by considering violation specifications of norms rather than just constraint specifications. By assigning a negative reward whenever a violation specification is satisfied, the NRB signals and punishes norm violations, enabling an agent to learn norm-compliant behavior in complex normative scenarios, including conditional obligations, Contrary-to-Duty obligations (CTDs), conflicting norms, and permissions-as-exceptions.
Since the NRB still relies on trial and error to determine the appropriate punishments for the violation specifications, we introduce an extension of the NRB, the Ordered Normative Restraining Bolt (ONRB). In addition to accommodating complex normative scenarios, the ONRB enables the automatic computation of these punishments, as well as the integration of norm priorities and norm updates, thereby establishing formal guarantees for maximal norm adherence.
Although effective in ensuring norm adherence, the NRB and ONRB permit norm violations only insofar as they are conditioned by the normative system. However, normative systems may not account for all possible circumstances under which norm violations may arise. To handle norm violations more flexibly, we introduce the Value-Based Ordered Normative Restraining Bolt (VBONRB), which views norms as emerging from human values and permits norm violations whenever justified by the values underlying those norms.
We showcase the capabilities of the ONRB and VBONRB through a series of case studies involving CTDs, conflicting norms, and conditional obligations, and highlight the advantages and limitations of all four techniques through a comparative analysis.

Additional information:

Thesis not yet received by the library - data not verified

License:

In Copyright

Appears in Collections:

Thesis