<div class="csl-bib-body">
<div class="csl-entry">Arzt, V., &amp; Hanbury, A. (2024). Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards. In D. Hupkes, V. Dankers, K. Batsuren, A. Kazemnejad, C. Christodoulopoulos, M. Giulianelli, &amp; R. Cotterell (Eds.), <i>Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP</i> (pp. 120–130). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.genbench-1.8</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/209895
-
dc.description.abstract
This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP, with a focus on the relation extraction (RE) task. Existing RE benchmarks often suffer from insufficient documentation, lacking crucial details such as data sources, inter-annotator agreement, the algorithms used for the selection of instances for datasets, and information on potential biases like dataset imbalance. Progress in RE is frequently measured by leaderboards that rank systems based on evaluation methods, typically limited to aggregate metrics like F1-score. However, the absence of detailed performance analysis beyond these metrics can obscure the true generalisation capabilities of models. Our analysis reveals that widely used RE benchmarks, such as TACRED and NYT, tend to be highly imbalanced and contain noisy labels. Moreover, the lack of class-based performance metrics fails to accurately reflect model performance across datasets with a large number of relation types. These limitations should be carefully considered when reporting progress in RE. While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well. Rather than undermining the significance and value of existing RE benchmarks and the development of new models, this paper advocates for improved documentation and more rigorous evaluation to advance the field.
en
dc.language.iso
en
-
dc.subject
Relation Extraction
en
dc.subject
Benchmarks
en
dc.subject
Leaderboards
en
dc.subject
Transparency
en
dc.title
Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.relation.publication
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
-
dc.contributor.editoraffiliation
University of Edinburgh, United Kingdom of Great Britain and Northern Ireland
-
dc.contributor.editoraffiliation
ETH Zurich, Switzerland
-
dc.contributor.editoraffiliation
ETH Zurich, Switzerland
-
dc.relation.isbn
979-8-89176-182-7
-
dc.relation.doi
10.18653/v1/2024.genbench-1
-
dc.description.startpage
120
-
dc.description.endpage
130
-
dc.type.category
Full-Paper Contribution
-
tuw.booktitle
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
-
tuw.peerreviewed
true
-
tuw.relation.publisher
Association for Computational Linguistics
-
tuw.researchTopic.id
I4
-
tuw.researchTopic.name
Information Systems Engineering
-
tuw.researchTopic.value
100
-
tuw.publication.orgunit
E194-04 - Forschungsbereich Data Science
-
tuw.publisher.doi
10.18653/v1/2024.genbench-1.8
-
dc.description.numberOfPages
11
-
tuw.author.orcid
0000-0002-7149-5843
-
tuw.editor.orcid
0000-0002-9883-3304
-
tuw.editor.orcid
0000-0001-5955-6896
-
tuw.editor.orcid
0000-0001-7708-0051
-
tuw.event.name
GenBench Workshop 2024
en
tuw.event.startdate
16-11-2024
-
tuw.event.enddate
16-11-2024
-
tuw.event.online
On Site
-
tuw.event.type
Event for scientific audience
-
tuw.event.place
Miami
-
tuw.event.country
US
-
tuw.event.presenter
Arzt, Varvara
-
wb.sciencebranch
Sprach- und Literaturwissenschaften
-
wb.sciencebranch
Informatik
-
wb.sciencebranch.oefos
6020
-
wb.sciencebranch.oefos
1020
-
wb.sciencebranch.value
10
-
wb.sciencebranch.value
90
-
item.openairetype
conference paper
-
item.fulltext
no Fulltext
-
item.cerifentitytype
Publications
-
item.languageiso639-1
en
-
item.openairecristype
http://purl.org/coar/resource_type/c_5794
-
item.grantfulltext
none
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.orcid
0000-0002-7149-5843
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering