Schiegl, A. (2021). Disease-Symptom relation extraction from medical text corpora with BERT [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2021.77705
E194 - Institut für Information Systems Engineering
Date (published):
2021
Number of Pages:
68
Keywords:
NLP; Biomedical Relation Extraction; Information Retrieval; Text Mining
Abstract:
To this day, vast amounts of medical knowledge are still published in unstructured form, e.g., case reports and clinical notes. The automated extraction of relations between symptoms, diseases, and other patient-related information from such unstructured sources plays an important role in areas such as Evidence-Based Medicine. For example, effective disease-symptom relation extraction accelerates tasks such as reviewing large amounts of medical literature to learn new disease characteristics.

In this work we present a relation extraction model based on BERT and MetaMap that extracts disease-symptom relations from over 20,000 BMJ Case Reports. Case reports are medical publications that contain clinically important information about the course of patients with specific medical conditions. Our model exploits the fact that a case report focuses on a single disease, which is mentioned in the case report title. This allows us to represent the problem of relation extraction as a named entity recognition problem, which simplifies both the model and the annotation of the training dataset.

We evaluate our model using the Disease Symptom Relation Collection (DSR), a set of graded disease-symptom relations for 20 diseases curated by medical doctors. We measure the relevance of the disease-symptom relations our model extracted from BMJ Case Reports by calculating the agreement with the ground truth provided by the medical doctors, using the metrics nDCG@k, precision@k, and recall@k. Furthermore, we compare the relevance achieved by our model with that of two baseline models: a word2vec model and a co-occurrence model trained on 1.5 million PubMed Central articles.

Our results show that our approach outperforms the baselines by up to 25% nDCG, 27% precision, and 10% recall. The agreement between our model and the ground truth reaches up to 64% nDCG@5 and 66% precision@5. Furthermore, our results show that case reports are a high-quality source of disease-symptom relations. Despite that, we find that they are of limited use due to the small number of openly accessible case reports.
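The ranked-retrieval metrics used in the evaluation (nDCG@k, precision@k, recall@k) can be illustrated with a minimal sketch. The symptom names, relevance grades, and cutoff below are hypothetical toy values, not data from the thesis; the functions only show the standard definitions of the metrics.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k extracted symptoms that appear in the ground truth."""
    return sum(1 for s in ranked[:k] if s in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all ground-truth symptoms recovered within the top k."""
    return sum(1 for s in ranked[:k] if s in relevant) / len(relevant)

def ndcg_at_k(ranked, grades, k):
    """nDCG@k with graded relevance: DCG of the ranking divided by the DCG
    of the ideal (relevance-sorted) ranking."""
    dcg = sum(grades.get(s, 0) / math.log2(i + 2) for i, s in enumerate(ranked[:k]))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: a model's ranked symptom list for one disease
# versus graded ground-truth relevance (DSR-style grades).
ranked = ["fever", "cough", "rash", "fatigue", "headache"]
grades = {"fever": 3, "fatigue": 2, "cough": 1}
relevant = set(grades)  # binary view of the ground truth for P@k / R@k

k = 5
print(f"precision@{k} = {precision_at_k(ranked, relevant, k):.2f}")
print(f"recall@{k}    = {recall_at_k(ranked, relevant, k):.2f}")
print(f"nDCG@{k}      = {ndcg_at_k(ranked, grades, k):.2f}")
```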