Zlabinger, M. (2021). Efficient and effective manual corpus annotation to create resources for evaluation and machine learning [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.100624
dc.identifier.uri: https://doi.org/10.34726/hss.2022.100624
dc.identifier.uri: http://hdl.handle.net/20.500.12708/19730
dc.description.abstract (en): Manually annotated text corpora are an indispensable part of Information Retrieval (IR) and Natural Language Processing (NLP): we depend on them for evaluation and for supervised machine learning. Creating a new annotated corpus, however, is a time-consuming procedure in which annotators go through the texts of the corpus and assign labels. To make matters worse, annotators can be inaccurate in assigning labels, especially in domain-specific annotation tasks, which usually require expert annotators for accurate labeling. This thesis introduces novel methodologies that help annotators assign labels more quickly and more accurately. The methodologies are applied and evaluated on text annotation tasks from the IR and NLP research areas, such as question answering and named-entity recognition. In addition, we focus on annotation tasks from the biomedical domain, which are difficult to annotate accurately due to the domain's complex jargon. The first methodology aims to help annotators assign labels quickly: we propose to pre-group texts based on their semantic similarity before they are annotated. Annotating a group of semantically similar texts, such as questions, sentences, or phrases, saves time, especially when each text requires similar labeling. The second methodology supports non-expert annotators in conducting domain-specific annotation tasks. Annotators are commonly prepared for such tasks with instructions and examples. Providing examples is essential; however, examples are usually defined globally for an entire task and might not be helpful for labeling an individual text. Instead, we propose showing examples that are similar to the text currently being labeled, supporting annotators dynamically. We systematically evaluate both methodologies and measure substantial improvements in annotation efficiency and effectiveness.
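The abstract describes two algorithmic ideas: pre-grouping texts by semantic similarity before annotation, and retrieving labeled examples similar to the text currently being annotated. Below is a minimal sketch of both, assuming sentence embeddings (sentence-transformers) and scikit-learn clustering; the model name, the distance threshold, and the labeled_pool data are illustrative assumptions, not taken from the thesis:

```python
# Illustrative sketch of the two ideas from the abstract, NOT the thesis's
# exact method. Model name, threshold, and example data are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

texts = [
    "What is the recommended aspirin dose for adults?",
    "How much aspirin should an adult take?",
    "What are common side effects of ibuprofen?",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
vecs = encoder.encode(texts, normalize_embeddings=True)

# Idea 1: pre-group semantically similar texts so an annotator labels them
# together instead of encountering them scattered across the corpus.
groups = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8, linkage="average", metric="cosine"
).fit_predict(vecs)
for g, t in sorted(zip(groups, texts)):
    print(f"group {g}: {t}")

# Idea 2: instead of fixed, task-global examples, retrieve already-labeled
# examples most similar to the text currently being annotated.
labeled_pool = [                       # hypothetical labeled examples
    ("aspirin reduces fever", "Drug-Effect"),
    ("ibuprofen is an NSAID", "Drug-Class"),
]
pool_vecs = encoder.encode([t for t, _ in labeled_pool], normalize_embeddings=True)
current = encoder.encode(["aspirin lowers body temperature"], normalize_embeddings=True)[0]
sims = pool_vecs @ current             # cosine similarity (vectors are normalized)
best = int(np.argmax(sims))
print("show annotator:", labeled_pool[best], f"(sim={sims[best]:.2f})")
```

In this reading, each group would be shown to an annotator as one batch, and the retrieved examples would stand in for the static, task-global examples mentioned in the abstract.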
dc.language: English
dc.language.iso: en
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject (en): Machine Learning
dc.subject (en): Corpus Annotation
dc.subject (en): Information Retrieval
dc.title (en): Efficient and effective manual corpus annotation to create resources for evaluation and machine learning
dc.type (en): Thesis
dc.type (de): Hochschulschrift
dc.rights.license (en): In Copyright
dc.rights.license (de): Urheberrechtsschutz
dc.identifier.doi: 10.34726/hss.2022.100624
dc.contributor.affiliation: TU Wien, Österreich
dc.rights.holder: Markus Zlabinger
dc.publisher.place: Wien
tuw.version: vor (version of record)
tuw.thesisinformation: Technische Universität Wien
tuw.publication.orgunit: E194 - Institut für Information Systems Engineering
dc.type.qualificationlevel: Doctoral
dc.identifier.libraryid: AC16465261
dc.description.numberOfPages: 139
dc.thesistype (de): Dissertation
dc.thesistype (en): Dissertation
tuw.author.orcid: 0000-0003-2733-3043
dc.rights.identifier (en): In Copyright
dc.rights.identifier (de): Urheberrechtsschutz
tuw.advisor.staffStatus: staff
tuw.advisor.orcid: 0000-0002-7149-5843
item.languageiso639-1: en
item.openairetype: doctoral thesis
item.grantfulltext: open
item.fulltext: with Fulltext
item.cerifentitytype: Publications
item.mimetype: application/pdf
item.openairecristype: http://purl.org/coar/resource_type/c_db06
item.openaccessfulltext: Open Access
crisitem.author.dept: E194-04 - Forschungsbereich Data Science
crisitem.author.parentorg: E194 - Institut für Information Systems Engineering