Safe Policy Improvement in Constrained Markov Decision Processes

Berducci, Luigi; Grosu, Radu

doi:10.1007/978-3-031-19849-6_21

DC Field

Value

Language

dc.contributor.author

Berducci, Luigi

dc.contributor.author

Grosu, Radu

dc.contributor.editor

Margaria, Tiziana

dc.contributor.editor

Steffen, Bernhard

dc.date.accessioned

2022-11-21T15:25:31Z

dc.date.available

2022-11-21T15:25:31Z

dc.date.issued

2022-10-17

dc.identifier.citation

Berducci, L., & Grosu, R. (2022). Safe Policy Improvement in Constrained Markov Decision Processes. In T. Margaria & B. Steffen (Eds.), Leveraging Applications of Formal Methods, Verification and Validation. Verification Principles (ISoLA 2022), Proceedings, Part I (pp. 360–381). Springer. https://doi.org/10.1007/978-3-031-19849-6_21

dc.identifier.uri

http://hdl.handle.net/20.500.12708/135876

dc.description.abstract

The automatic synthesis of a policy through reinforcement learning (RL) from a given set of formal requirements depends on the construction of a reward signal and consists of the iterative application of many policy-improvement steps. The synthesis algorithm has to balance target, safety, and comfort requirements in a single objective and to guarantee that the policy improvement does not increase the number of safety-requirements violations, especially for safety-critical applications. In this work, we present a solution to the synthesis problem by solving its two main challenges: reward-shaping from a set of formal requirements and safe policy update. For the first, we propose an automatic reward-shaping procedure, defining a scalar reward signal compliant with the task specification. For the second, we introduce an algorithm ensuring that the policy is improved in a safe fashion, with high-confidence guarantees. We also discuss the adoption of a model-based RL algorithm to efficiently use the collected data and train a model-free agent on the predicted trajectories, where the safety violation does not have the same impact as in the real world. Finally, we demonstrate in standard control benchmarks that the resulting learning procedure is effective and robust even under heavy perturbations of the hyperparameters.

dc.description.sponsorship

FFG - Österr. Forschungsförderungsgesellschaft mbH

dc.language.iso

dc.relation.ispartofseries

Lecture Notes in Computer Science

dc.subject

reinforcement learning

dc.subject

safe policy improvement

dc.subject

formal specification

dc.title

Safe Policy Improvement in Constrained Markov Decision Processes

dc.type

Inproceedings

dc.type

Conference contribution

dc.contributor.editoraffiliation

University of Potsdam, Germany

dc.contributor.editoraffiliation

TU Dortmund University, Germany

dc.relation.isbn

978-3-031-19849-6

dc.relation.doi

10.1007/978-3-031-19849-6

dc.relation.issn

0302-9743

dc.description.startpage

360

dc.description.endpage

381

dc.relation.grantno

FFG Projektnummer: 880811

dc.type.category

Full-Paper Contribution

dc.relation.eissn

1611-3349

tuw.booktitle

Leveraging Applications of Formal Methods, Verification and Validation. Verification Principles (ISoLA 2022), Proceedings, Part I

tuw.container.volume

13701

tuw.peerreviewed

true

tuw.book.ispartofseries

Lecture Notes in Computer Science

tuw.relation.publisher

Springer

tuw.relation.publisherplace

Cham

tuw.book.chapter

tuw.publication.invited

invited

tuw.project.title

Autonomous-Driving Examiner

tuw.researchTopic.id

tuw.researchTopic.name

Computer Engineering and Software-Intensive Systems

tuw.researchTopic.value

100

tuw.linking

https://arxiv.org/abs/2210.11259

tuw.publication.orgunit

E191-01 - Forschungsbereich Cyber-Physical Systems

tuw.publisher.doi

10.1007/978-3-031-19849-6_21

dc.description.numberOfPages

tuw.author.orcid

0000-0002-3497-6007

tuw.editor.orcid

0000-0002-5547-9739

tuw.editor.orcid

0000-0001-9619-1558

tuw.event.name

11th International Symposium on Leveraging Applications of Formal Methods (ISoLA 2022)

tuw.event.startdate

22-10-2022

tuw.event.enddate

30-10-2022

tuw.event.online

On Site

tuw.event.type

Event for scientific audience

tuw.event.place

Rhodes

tuw.event.country

tuw.event.presenter

Berducci, Luigi

tuw.event.track

Multi Track

wb.sciencebranch

Computer science

wb.sciencebranch.oefos

1020

wb.sciencebranch.value

100

item.grantfulltext

none

item.openairecristype

http://purl.org/coar/resource_type/c_5794

item.openairetype

conference paper

item.languageiso639-1

item.cerifentitytype

Publications

item.fulltext

no Fulltext

crisitem.author.dept

E191-01 - Forschungsbereich Cyber-Physical Systems

crisitem.author.dept

E191-01 - Forschungsbereich Cyber-Physical Systems

crisitem.author.orcid

0000-0002-3497-6007

crisitem.author.orcid

0000-0001-5715-2142

crisitem.author.parentorg

E191 - Institut für Computer Engineering

crisitem.author.parentorg

E191 - Institut für Computer Engineering

crisitem.project.funder

FFG - Österr. Forschungsförderungsgesellschaft mbH

crisitem.project.grantno

FFG Projektnummer: 880811

Appears in Collections:

Conference Paper
