Title: Disease-Symptom relation extraction from medical text corpora with BERT
Other Titles: Krankheit-Symptom Relationsextraktion aus medizinischen Texten mit BERT
Language: English
Authors: Schiegl, Adrian 
Qualification level: Diploma
Advisor: Hanbury, Allan  
Assisting Advisor: Zlabinger, Markus 
Issue Date: 2021
Schiegl, A. (2021). Disease-Symptom relation extraction from medical text corpora with BERT [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2021.77705
Number of Pages: 68
To this day, vast amounts of medical knowledge are still published in unstructured form, e.g., case reports and clinical notes. The automated extraction of relations between symptoms, diseases and other patient-related information from unstructured sources plays an important role in areas such as Evidence-Based Medicine. For example, effective disease-symptom relation extraction accelerates tasks such as reviewing large amounts of medical literature to learn new disease characteristics.
In this work we present a relation extraction model based on BERT and MetaMap that extracts disease-symptom relations from over 20,000 BMJ Case Reports. Case reports are medical publications that contain clinically important information about the course of patients with specific medical conditions. Our model exploits the fact that a case report focuses on a single disease, which is mentioned in the case report title. This lets us represent the problem of relation extraction as a named entity recognition problem, which simplifies both the model and the annotation of the training dataset.
We evaluate our model using the Disease Symptom Relation Collection (DSR), a set of graded disease-symptom relations for 20 diseases that was curated by medical doctors. We measure the relevance of the disease-symptom relations our model extracted from BMJ Case Reports by calculating their agreement with the ground truth provided by the medical doctors, using the metrics nDCG@k, precision@k and recall@k. Furthermore, we compare the relevance our model achieved with that of two baseline models: a word2vec model and a co-occurrence model trained on 1.5 million PubMed Central articles.
Our results show that our approach outperforms the baselines by up to 25% nDCG, 27% precision and 10% recall. The agreement between our model and the ground truth is up to 64% nDCG@5 and 66% precision@5.
Our results also show that case reports are a high-quality source of disease-symptom relations. Despite that, we find that they are of limited use due to the small number of openly accessible case reports.
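For readers unfamiliar with the ranking metrics named in the abstract, the following is a minimal Python sketch of precision@k, recall@k and nDCG@k for one disease. It is an illustration under stated assumptions, not the evaluation code used in the thesis: the function names are hypothetical, relevance grades are assumed to be non-negative integers with 0 meaning "not relevant", the ranked list is assumed to contain no duplicates, and nDCG uses a linear gain with a log2 discount, which may differ from the exact variant the thesis applies.

```python
import math


def precision_at_k(ranked, grades, k):
    """Fraction of the top-k ranked symptoms that have a grade above 0."""
    top = ranked[:k]
    return sum(1 for s in top if grades.get(s, 0) > 0) / k


def recall_at_k(ranked, grades, k):
    """Fraction of all relevant symptoms that appear in the top k."""
    relevant = {s for s, g in grades.items() if g > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for s in ranked[:k] if s in relevant)
    return hits / len(relevant)


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 discount (assumed variant)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, grades, k):
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ordering."""
    gains = [grades.get(s, 0) for s in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Made-up grades and ranking, for illustration only (not taken from DSR).
    example_grades = {"fever": 2, "cough": 1, "headache": 1}
    example_ranking = ["fever", "rash", "cough", "fatigue"]
    for name, fn in [("precision", precision_at_k),
                     ("recall", recall_at_k),
                     ("nDCG", ndcg_at_k)]:
        print(f"{name}@3 = {fn(example_ranking, example_grades, 3):.3f}")
```

In an evaluation over several diseases, each metric would typically be computed per disease and then averaged; whether the thesis aggregates in exactly this way is not stated in the abstract.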
Keywords: NLP; Biomedical Relation Extraction; Information Retrieval; Text Mining
URI: https://doi.org/10.34726/hss.2021.77705
DOI: 10.34726/hss.2021.77705
Library ID: AC16235613
Organisation: E194 - Institut für Information Systems Engineering 
Publication Type: Thesis
Appears in Collections: Thesis

Items in reposiTUm are protected by copyright, with all rights reserved, unless otherwise indicated.