Cross-dataset medical entity recognition

Kopali, Nils

doi:10.34726/hss.2025.121281

Record link:

https://doi.org/10.34726/hss.2025.121281
http://hdl.handle.net/20.500.12708/209809

Title:

Cross-dataset medical entity recognition

Citation:

Kopali, N. (2024). Cross-dataset medical entity recognition [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.121281

reposiTUm DOI:

10.34726/hss.2025.121281

CatalogPlus:

AC17418714

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Kopali, Nils

Advisor:

Recski, Gábor

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2024

Keywords:

Named Entity Recognition (NER); Machine Learning; Cross-dataset analysis; Generalizability

Abstract:

This thesis investigates the robustness and generalizability of the Named Entity Recognition (NER) models BioBERT and KeBioLM across biomedical datasets. As biomedical texts grow more complex, the ability of these models to adapt to different datasets is crucial. The study analyzes model performance on specific datasets, focusing on how unseen entities and annotation inconsistencies affect accuracy and generalization. Through systematic experiments on two benchmark biomedical NER datasets, BC5CDR and NCBI, it examines the impact of dataset differences on the precision, recall, and F1 scores of BioBERT and KeBioLM. Both models perform strongly in-dataset; however, their effectiveness declines significantly in cross-dataset scenarios, highlighting the difficulty of generalizing to new, unseen data. The analysis also examines the influence of unseen entities and dataset-specific biases, revealing the models' limitations and areas for improvement. The findings emphasize the need for more adaptable NER systems capable of handling the dynamic and diverse nature of biomedical texts. The scripts used for the analysis in this thesis are available in the GitHub repository.
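
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of entity-level precision, recall, and F1, plus recall restricted to entities whose surface form never appeared in training. It is not taken from the thesis or its repository: exact-match scoring is an assumption, and the names `Entity`, `prf1`, and `unseen_recall` are hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One annotated mention (hypothetical layout, not the thesis's data format)."""
    doc_id: str
    start: int
    end: int
    label: str
    text: str


def prf1(gold: set, pred: set) -> tuple:
    """Exact-match precision, recall, and F1 over sets of entities."""
    tp = len(gold & pred)  # a prediction counts only if every field matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def unseen_recall(gold: set, pred: set, train_forms: set) -> float:
    """Recall on gold entities whose lowercased text is absent from training."""
    unseen = {e for e in gold if e.text.lower() not in train_forms}
    if not unseen:
        return 0.0
    return len(unseen & pred) / len(unseen)
```

In a cross-dataset setting, `gold` would come from the test split of one corpus and `train_forms` from the training split of the other, so that `unseen_recall` isolates how well a model handles mentions it has never seen.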

License:

In Copyright

Appears in Collections:

Thesis