Csakvari, T. R. (2025). Large Language Model-based framework for Open Information Extraction [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.131626
E194 - Institut für Information Systems Engineering
Date (published):
2025
Number of Pages:
85
Keywords:
Open Information Extraction; Semantic Triplet Matching; Large Language Models; Knowledge Extraction; Legal Texts; Automated Assessment; Text Comparison; Information Retrieval; Natural Language Processing; Evaluation Framework
en
Abstract:
The growth of unstructured digital text demands effective knowledge extraction methods. While traditional Information Extraction is limited by rigid schemas, Open Information Extraction (OIE) provides needed flexibility. Large Language Models (LLMs) show promise for OIE, but their application to both OIE and semantic triplet matching remains underexplored.

This thesis introduces and evaluates a novel, modular LLM-based framework designed for OIE, subsequent semantic triplet matching, and text comparison, with validation performed on a German legal education dataset of student responses. The framework employs LLMs to first extract (subject, relation, object) triplets from the German legal texts. These extracted candidate triplets are then semantically compared against predefined target triplets (representing key legal contents) using an LLM-based triplet matching process. The system's performance was quantitatively and qualitatively evaluated on the dataset of student answers to a specific legal case, comparing LLM-based triplet matching outputs against human-annotated ground truth. Several state-of-the-art LLMs (including GPT-4 series, Llama, DeepSeek) were benchmarked, alongside alternative methods such as end-to-end LLM evaluation, rule-based OIE, and string-based triplet matching for comparison.

Results demonstrate the framework's considerable proficiency, with the top-performing configuration (GPT-4.1-mini for both OIE and triplet matching) achieving 80.0% accuracy and a Matthews Correlation Coefficient (MCC) of 0.589. This modular LLM-OIE plus LLM-matching approach generally outperformed holistic end-to-end LLM methods and simpler rule-based or string-matching techniques, highlighting the value of structured intermediate representations.

This research validates the utility of LLMs for OIE and semantic comparison in a specialized, non-English domain. The developed open-source, modular framework serves as a practical tool and contributes to understanding LLM capabilities and limitations in structured knowledge extraction, offering a foundation for advanced automated assessment and information retrieval systems.
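
The abstract reports accuracy and the Matthews Correlation Coefficient (MCC) for match decisions compared against human annotations. As an illustration only, the following is a minimal Python sketch of how these two metrics can be computed from binary decisions. It is not the thesis's implementation: the function names and the toy data are hypothetical, and it assumes each target triplet receives a single matched/not-matched decision.

```python
from math import sqrt


def confusion_counts(predicted, gold):
    """Count (tp, tn, fp, fn) from two equal-length lists of booleans."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(p and g for p, g in zip(predicted, gold))
    tn = sum((not p) and (not g) for p, g in zip(predicted, gold))
    fp = sum(p and (not g) for p, g in zip(predicted, gold))
    fn = sum((not p) and g for p, g in zip(predicted, gold))
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of decisions that agree with the gold annotation."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (hypothetical): True means "target triplet found in the answer".
predicted = [True, True, False, False, True, False]
gold = [True, False, False, False, True, True]

counts = confusion_counts(predicted, gold)
print("accuracy:", round(accuracy(*counts), 3))
print("MCC:", round(mcc(*counts), 3))
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated than accuracy when matched and unmatched triplets are unevenly distributed.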
en
Additional information:
Thesis not yet received at the library; data not verified. Deviating title according to the author's translation.