BECEL: Benchmark for Consistency Evaluation of Language Models

Jang, Myeongjun; Kwon, Deuk Sin; Lukasiewicz, Thomas

DC Field

Value

Language

dc.contributor.author

Jang, Myeongjun

dc.contributor.author

Kwon, Deuk Sin

dc.contributor.author

Lukasiewicz, Thomas

dc.contributor.editor

Calzolari, Nicoletta

dc.contributor.editor

Huang, Chu-Ren

dc.contributor.editor

Kim, Hansaem

dc.date.accessioned

2024-01-25T09:13:11Z

dc.date.available

2024-01-25T09:13:11Z

dc.date.issued

2022-10

dc.identifier.citation

Jang, M., Kwon, D. S., & Lukasiewicz, T. (2022). BECEL: Benchmark for Consistency Evaluation of Language Models. In N. Calzolari, C.-R. Huang, & H. Kim (Eds.), Proceedings of the 29th International Conference on Computational Linguistics (pp. 3680–3696). International Committee on Computational Linguistics. http://hdl.handle.net/20.500.12708/192675

dc.identifier.uri

http://hdl.handle.net/20.500.12708/192675

dc.description.abstract

Behavioural consistency is a critical condition for a language model (LM) to become trustworthy like humans. Despite its importance, however, there is little consensus on the definition of LM consistency, resulting in different definitions across many studies. In this paper, we first propose the idea of LM consistency based on behavioural consistency and establish a taxonomy that classifies previously studied consistencies into several sub-categories. Next, we create a new benchmark that allows us to evaluate a model on 19 test cases, distinguished by multiple types of consistency and diverse downstream tasks. Through extensive experiments on the new benchmark, we ascertain that none of the modern pre-trained language models (PLMs) performs well in every test case, while exhibiting high inconsistency in many cases. Our experimental results suggest that a unified benchmark that covers broad aspects (i.e., multiple consistency types and tasks) is essential for a more precise evaluation.

dc.language.iso

dc.subject

behavioural consistency

dc.subject

language models

dc.title

BECEL: Benchmark for Consistency Evaluation of Language Models

dc.type

Inproceedings

dc.type

Konferenzbeitrag

dc.contributor.affiliation

University of Oxford, United Kingdom of Great Britain and Northern Ireland (the)

dc.contributor.affiliation

Korea Telecom (South Korea), Korea (the Republic of)

dc.description.startpage

3680

dc.description.endpage

3696

dc.type.category

Full-Paper Contribution

tuw.booktitle

Proceedings of the 29th International Conference on Computational Linguistics

tuw.relation.publisher

International Committee on Computational Linguistics

tuw.researchTopic.id

tuw.researchTopic.name

Information Systems Engineering

tuw.researchTopic.value

100

tuw.publication.orgunit

E192-07 - Forschungsbereich Artificial Intelligence Techniques

tuw.publication.orgunit

E192-03 - Forschungsbereich Knowledge Based Systems

dc.description.numberOfPages

tuw.editor.orcid

0000-0002-8526-5520

tuw.editor.orcid

0000-0001-8536-5276

tuw.event.name

29th International Conference on Computational Linguistics

tuw.event.startdate

12-10-2022

tuw.event.enddate

17-10-2022

tuw.event.online

On Site

tuw.event.type

Event for scientific audience

tuw.event.place

Gyeongju

tuw.event.country

tuw.event.presenter

Jang, Myeongjun

wb.sciencebranch

Informatik

wb.sciencebranch

Mathematik

wb.sciencebranch.oefos

1020

wb.sciencebranch.oefos

1010

wb.sciencebranch.value

item.openairetype

conference paper

item.cerifentitytype

Publications

item.grantfulltext

none

item.languageiso639-1

item.openairecristype

http://purl.org/coar/resource_type/c_5794

item.fulltext

no Fulltext

crisitem.author.dept

University of Oxford

crisitem.author.dept

Korea Telecom (South Korea)

crisitem.author.dept

E192-07 - Forschungsbereich Artificial Intelligence Techniques

crisitem.author.parentorg

E192 - Institut für Logic and Computation

Appears in Collections:

Conference Paper


