<div class="csl-bib-body">
<div class="csl-entry">Arzt, V., &amp; Hanbury, A. (2024). Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards. In D. Hupkes, V. Dankers, K. Batsuren, A. Kazemnejad, C. Christodoulopoulos, M. Giulianelli, &amp; R. Cotterell (Eds.), <i>Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP</i> (pp. 120–130). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.genbench-1.8</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/209895
-
dc.description.abstract
This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP, with a focus on the relation extraction (RE) task. Existing RE benchmarks often suffer from insufficient documentation, lacking crucial details such as data sources, inter-annotator agreement, the algorithms used for the selection of instances for datasets, and information on potential biases like dataset imbalance. Progress in RE is frequently measured by leaderboards that rank systems based on evaluation methods, typically limited to aggregate metrics like F1-score. However, the absence of detailed performance analysis beyond these metrics can obscure the true generalisation capabilities of models. Our analysis reveals that widely used RE benchmarks, such as TACRED and NYT, tend to be highly imbalanced and contain noisy labels. Moreover, the lack of class-based performance metrics fails to accurately reflect model performance across datasets with a large number of relation types. These limitations should be carefully considered when reporting progress in RE. While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well. Rather than undermining the significance and value of existing RE benchmarks and the development of new models, this paper advocates for improved documentation and more rigorous evaluation to advance the field.
en
dc.language.iso
en
-
dc.subject
Relation Extraction
en
dc.subject
Benchmarks
en
dc.subject
Leaderboards
en
dc.subject
Transparency
en
dc.title
Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.relation.publication
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
-
dc.contributor.editoraffiliation
University of Edinburgh, United Kingdom of Great Britain and Northern Ireland
-
dc.contributor.editoraffiliation
ETH Zurich, Switzerland
-
dc.contributor.editoraffiliation
ETH Zurich, Switzerland
-
dc.relation.isbn
979-8-89176-182-7
-
dc.relation.doi
10.18653/v1/2024.genbench-1
-
dc.description.startpage
120
-
dc.description.endpage
130
-
dc.type.category
Full-Paper Contribution
-
tuw.booktitle
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
-
tuw.peerreviewed
true
-
tuw.relation.publisher
Association for Computational Linguistics
-
tuw.researchTopic.id
I4
-
tuw.researchTopic.name
Information Systems Engineering
-
tuw.researchTopic.value
100
-
tuw.publication.orgunit
E194-04 - Forschungsbereich Data Science
-
tuw.publisher.doi
10.18653/v1/2024.genbench-1.8
-
dc.description.numberOfPages
11
-
tuw.author.orcid
0000-0002-7149-5843
-
tuw.editor.orcid
0000-0002-9883-3304
-
tuw.editor.orcid
0000-0001-5955-6896
-
tuw.editor.orcid
0000-0001-7708-0051
-
tuw.event.name
GenBench Workshop 2024
en
tuw.event.startdate
16-11-2024
-
tuw.event.enddate
16-11-2024
-
tuw.event.online
On Site
-
tuw.event.type
Event for scientific audience
-
tuw.event.place
Miami
-
tuw.event.country
US
-
tuw.event.presenter
Arzt, Varvara
-
wb.sciencebranch
Sprach- und Literaturwissenschaften
-
wb.sciencebranch
Informatik
-
wb.sciencebranch.oefos
6020
-
wb.sciencebranch.oefos
1020
-
wb.sciencebranch.value
10
-
wb.sciencebranch.value
90
-
item.openairetype
conference paper
-
item.fulltext
no Fulltext
-
item.cerifentitytype
Publications
-
item.languageiso639-1
en
-
item.openairecristype
http://purl.org/coar/resource_type/c_5794
-
item.grantfulltext
none
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.orcid
0000-0002-7149-5843
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering