<div class="csl-bib-body">
<div class="csl-entry">Schiegl, A. (2021). <i>Disease-Symptom relation extraction from medical text corpora with BERT</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2021.77705</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2021.77705
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/17874
-
dc.description.abstract
To this day, vast amounts of medical knowledge are still published in unstructured form, e.g., case reports and clinical notes. The automated extraction of relations between symptoms, diseases and other patient-related information from unstructured sources plays an important role in areas such as Evidence-Based Medicine. For example, effective disease-symptom relation extraction accelerates tasks such as reviewing large amounts of medical literature to learn new disease characteristics. In this work we present a relation extraction model based on BERT and MetaMap that extracts disease-symptom relations from over 20,000 BMJ Case Reports. Case reports are medical publications that contain clinically important information about the course of patients with specific medical conditions. Our model exploits the fact that a case report focuses on a single disease, which is mentioned in the case report title. This lets us recast the problem of relation extraction as a named entity recognition problem, which simplifies the model and the annotation of the training dataset. We evaluate our model using the Disease Symptom Relation Collection (DSR), a set of graded disease-symptom relations for 20 diseases that was curated by medical doctors. We evaluate our model by measuring the relevance of the disease-symptom relations it extracted from BMJ Case Reports. We measure relevance as the agreement with the ground truth provided by the medical doctors, using the metrics nDCG@k, precision@k and recall@k. Furthermore, we compare the relevance our model achieved with the relevance of two baseline models: a word2vec model and a co-occurrence model trained on 1.5 million PubMed Central articles. Our results show that our approach outperforms baselines by up to 25% nDCG, 27% precision and 10% recall. The agreement between our model and the ground truth is up to 64% nDCG@5 and 66% precision@5.
Furthermore, our results also show that case reports are a high quality source of disease-symptom relations. Despite that, we find that they are of limited use due to the small number of openly accessible case reports.
en
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
NLP
en
dc.subject
Biomedical Relation Extraction
en
dc.subject
Information Retrieval
en
dc.subject
Text Mining
en
dc.title
Disease-Symptom relation extraction from medical text corpora with BERT
en
dc.title.alternative
Krankheit-Symptom Relationsextraktion aus medizinischen Texten mit BERT
de
dc.type
Thesis
en
dc.type
Hochschulschrift
de
dc.rights.license
In Copyright
en
dc.rights.license
Urheberrechtsschutz
de
dc.identifier.doi
10.34726/hss.2021.77705
-
dc.contributor.affiliation
TU Wien, Österreich
-
dc.rights.holder
Adrian Schiegl
-
dc.publisher.place
Wien
-
tuw.version
vor
-
tuw.thesisinformation
Technische Universität Wien
-
dc.contributor.assistant
Zlabinger, Markus
-
tuw.publication.orgunit
E194 - Institut für Information Systems Engineering
-
dc.type.qualificationlevel
Diploma
-
dc.identifier.libraryid
AC16235613
-
dc.description.numberOfPages
68
-
dc.thesistype
Diplomarbeit
de
dc.thesistype
Diploma Thesis
en
dc.rights.identifier
In Copyright
en
dc.rights.identifier
Urheberrechtsschutz
de
tuw.advisor.staffStatus
staff
-
tuw.assistant.staffStatus
staff
-
tuw.advisor.orcid
0000-0002-7149-5843
-
item.languageiso639-1
en
-
item.openairetype
master thesis
-
item.grantfulltext
open
-
item.fulltext
with Fulltext
-
item.cerifentitytype
Publications
-
item.mimetype
application/pdf
-
item.openairecristype
http://purl.org/coar/resource_type/c_bdcc
-
item.openaccessfulltext
Open Access
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering