Zlabinger, M. (2021). Efficient and effective manual corpus annotation to create resources for evaluation and machine learning [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.100624
dc.identifier.uri: https://doi.org/10.34726/hss.2022.100624
dc.identifier.uri: http://hdl.handle.net/20.500.12708/19730
dc.description.abstract (en): Manually annotated text corpora are an indispensable part of Information Retrieval (IR) and Natural Language Processing (NLP): we depend on them for evaluation and for supervised machine learning. Creating a new annotated corpus, however, is a time-consuming procedure in which annotators go through the texts of the corpus and assign labels. To make matters worse, annotators can be inaccurate in assigning labels, especially in domain-specific annotation tasks, which usually require expert annotators for accurate labeling. This thesis introduces novel methodologies that help annotators assign labels more quickly and more accurately. The methodologies are applied and evaluated on text annotation tasks from the IR and NLP research areas, such as question answering and named-entity recognition. In addition, we focus on annotation tasks from the biomedical domain, which are difficult to annotate accurately due to the domain's complex jargon. The first methodology aims to help annotators assign labels quickly: we propose to pre-group texts based on their semantic similarity before they are annotated. Annotating a group of semantically similar texts, such as questions, sentences, or phrases, saves time, especially when each text requires similar labeling. The second methodology supports non-expert annotators in conducting domain-specific annotation tasks. Annotators are commonly prepared for such tasks with instructions and examples. Providing examples is essential; however, examples are usually defined globally for an entire task and might not be helpful for labeling an individual text. Instead, we propose showing examples that are similar to the text currently being labeled, supporting annotators dynamically. We systematically evaluate both methodologies and measure substantial improvements in annotation efficiency and effectiveness.
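The abstract describes two algorithmic ideas: pre-grouping texts by semantic similarity before annotation, and retrieving labeled examples similar to the text currently being annotated. Below is a minimal sketch of both, assuming sentence embeddings (sentence-transformers) and scikit-learn clustering; the model name, the distance threshold, and the labeled_pool data are illustrative assumptions, not taken from the thesis:

```python
# Illustrative sketch of the two ideas from the abstract, NOT the thesis's
# exact method. Model name, threshold, and example data are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

texts = [
    "What is the recommended aspirin dose for adults?",
    "How much aspirin should an adult take?",
    "What are common side effects of ibuprofen?",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
vecs = encoder.encode(texts, normalize_embeddings=True)

# Idea 1: pre-group semantically similar texts so an annotator labels them
# together instead of encountering them scattered across the corpus.
groups = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8, linkage="average", metric="cosine"
).fit_predict(vecs)
for g, t in sorted(zip(groups, texts)):
    print(f"group {g}: {t}")

# Idea 2: instead of fixed, task-global examples, retrieve already-labeled
# examples most similar to the text currently being annotated.
labeled_pool = [                       # hypothetical labeled examples
    ("aspirin reduces fever", "Drug-Effect"),
    ("ibuprofen is an NSAID", "Drug-Class"),
]
pool_vecs = encoder.encode([t for t, _ in labeled_pool], normalize_embeddings=True)
current = encoder.encode(["aspirin lowers body temperature"], normalize_embeddings=True)[0]
sims = pool_vecs @ current             # cosine similarity (vectors are normalized)
best = int(np.argmax(sims))
print("show annotator:", labeled_pool[best], f"(sim={sims[best]:.2f})")
```

In this reading, each group would be shown to an annotator as one batch, and the retrieved examples would stand in for the static, task-global examples mentioned in the abstract.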
dc.language: English
dc.language.iso: en
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject (en): Machine Learning
dc.subject (en): Corpus Annotation
dc.subject (en): Information Retrieval
dc.title (en): Efficient and effective manual corpus annotation to create resources for evaluation and machine learning
dc.type (en): Thesis
dc.type (de): Hochschulschrift
dc.rights.license (en): In Copyright
dc.rights.license (de): Urheberrechtsschutz
dc.identifier.doi: 10.34726/hss.2022.100624
dc.contributor.affiliation: TU Wien, Österreich
dc.rights.holder: Markus Zlabinger
dc.publisher.place: Wien
tuw.version: vor (version of record)
tuw.thesisinformation: Technische Universität Wien
tuw.publication.orgunit: E194 - Institut für Information Systems Engineering
dc.type.qualificationlevel: Doctoral
dc.identifier.libraryid: AC16465261
dc.description.numberOfPages: 139
dc.thesistype (de): Dissertation
dc.thesistype (en): Dissertation
tuw.author.orcid: 0000-0003-2733-3043
dc.rights.identifier (en): In Copyright
dc.rights.identifier (de): Urheberrechtsschutz
tuw.advisor.staffStatus: staff
tuw.advisor.orcid: 0000-0002-7149-5843
item.languageiso639-1: en
item.openairetype: doctoral thesis
item.grantfulltext: open
item.fulltext: with Fulltext
item.cerifentitytype: Publications
item.mimetype: application/pdf
item.openairecristype: http://purl.org/coar/resource_type/c_db06
item.openaccessfulltext: Open Access
crisitem.author.dept: E194-04 - Forschungsbereich Data Science
crisitem.author.parentorg: E194 - Institut für Information Systems Engineering