Disease-Symptom relation extraction from medical text corpora with BERT

Schiegl, Adrian

doi:10.34726/hss.2021.77705

DC Field

Value

Language

dc.contributor.advisor

Hanbury, Allan

dc.contributor.author

Schiegl, Adrian

dc.date.accessioned

2021-06-24T05:44:26Z

dc.date.issued

2021

dc.date.submitted

2021-06

dc.identifier.citation

Schiegl, A. (2021). Disease-Symptom relation extraction from medical text corpora with BERT [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2021.77705

dc.identifier.uri

https://doi.org/10.34726/hss.2021.77705

dc.identifier.uri

http://hdl.handle.net/20.500.12708/17874

dc.description.abstract

To this day, vast amounts of medical knowledge are still published in unstructured form, e.g., case reports and clinical notes. The automated extraction of relations between symptoms, diseases and other patient-related information from unstructured sources plays an important role in areas such as Evidence Based Medicine. For example, effective disease-symptom relation extraction accelerates tasks such as reviewing large amounts of medical literature to learn new disease characteristics.
In this work we present a relation extraction model based on BERT and MetaMap that extracts disease-symptom relations from over 20,000 BMJ Case Reports. Case reports are medical publications that contain clinically important information about the course of patients with specific medical conditions. Our model exploits the fact that a case report focuses on a single disease, which is mentioned in the case report title. By doing so, we represent the problem of relation extraction as a named entity recognition problem, which simplifies the model and the annotation of the training dataset.
We evaluate our model using the Disease Symptom Relation Collection (DSR). DSR is a set of graded disease-symptom relations for 20 diseases, curated by medical doctors. We evaluate our model by measuring the relevance of the disease-symptom relations it extracted from BMJ Case Reports. We measure relevance by calculating the agreement with the ground truth provided by the medical doctors, using the metrics nDCG@k, precision@k and recall@k. Furthermore, we compare the relevance our model achieved with the relevance of two baseline models: a word2vec model and a co-occurrence model trained on 1.5 million PubMed Central articles.
Our results show that our approach outperforms baselines by up to 25% nDCG, 27% precision and 10% recall. The agreement between our model and the ground truth is up to 64% nDCG@5 and 66% precision@5.
Furthermore, our results also show that case reports are a high quality source of disease-symptom relations. Despite that, we find that they are of limited use due to the small number of openly accessible case reports.
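
The evaluation described in the abstract ranks extracted symptoms for a disease and compares the top k against graded ground truth. The following is a minimal sketch of how precision@k, recall@k and nDCG@k could be computed for one disease. It is not taken from the thesis: the function names, the example symptoms, the 0/1/2 grading scale and the linear-gain, log2-discount form of DCG are all assumptions made for illustration, and the ranked list is assumed to contain no duplicates.

```python
import math

def precision_at_k(ranked, relevance, k):
    """Fraction of the top-k ranked symptoms that have a grade above 0."""
    return sum(1 for s in ranked[:k] if relevance.get(s, 0) > 0) / k

def recall_at_k(ranked, relevance, k):
    """Fraction of all relevant symptoms that appear in the top k."""
    relevant = {s for s, grade in relevance.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked[:k])) / len(relevant)

def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 discount (assumed form)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked, relevance, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(s, 0) for s in ranked[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded ground truth for one disease (2 = primary, 1 = secondary).
relevance = {"fever": 2, "cough": 2, "fatigue": 1, "rash": 1}
# Hypothetical model output, best-scored symptom first.
ranked = ["fever", "headache", "fatigue", "cough", "nausea"]

p5 = precision_at_k(ranked, relevance, 5)
r5 = recall_at_k(ranked, relevance, 5)
n5 = ndcg_at_k(ranked, relevance, 5)
print(f"precision@5={p5:.2f} recall@5={r5:.2f} nDCG@5={n5:.3f}")
```

In this toy example three of the five ranked symptoms are relevant and three of the four ground-truth symptoms are retrieved. nDCG additionally penalizes the ranking for placing the irrelevant "headache" above two relevant symptoms.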

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

NLP

dc.subject

Biomedical Relation Extraction

dc.subject

Information Retrieval

dc.subject

Text Mining

dc.title

Disease-Symptom relation extraction from medical text corpora with BERT

dc.title.alternative

Krankheit-Symptom Relationsextraktion aus medizinischen Texten mit BERT

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2021.77705

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Adrian Schiegl

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Zlabinger, Markus

tuw.publication.orgunit

E194 - Institut für Information Systems Engineering

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC16235613

dc.description.numberOfPages

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.advisor.orcid

0000-0002-7149-5843

item.languageiso639-1

item.openairetype

master thesis

item.grantfulltext

open

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.openaccessfulltext

Open Access

crisitem.author.dept

E194-04 - Forschungsbereich Data Science

crisitem.author.parentorg

E194 - Institut für Information Systems Engineering

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(1.04 MB)

In Copyright

Page view(s)

530

checked on Nov 21, 2023

Download(s)

411

checked on Nov 21, 2023
