Extracting tabular data from utility value appraisals

Rirsch, Klaus

doi:10.34726/hss.2023.77704

Record link:

https://doi.org/10.34726/hss.2023.77704
http://hdl.handle.net/20.500.12708/141972

Title:

Extracting tabular data from utility value appraisals

Citation:

Rirsch, K. (2021). Extracting tabular data from utility value appraisals [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2023.77704

reposiTUm DOI:

10.34726/hss.2023.77704

CatalogPlus:

AC16743501

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Rirsch, Klaus

Advisor:

Hanbury, Allan

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2021

Keywords:

Nutzwertgutachten; table extraction; scanned images; extraction system; utility value appraisal; rule-based; heuristics

Abstract:

Utility value appraisals (Nutzwertgutachten) contain data with many applications in the marketing and analysis of real estate in Austria. The relevant information is predominantly in tabular form, and the individual documents are available as PDFs consisting of scanned images. Rule-based methods for extracting certain target data are proposed, and their output is compared with the results of a commercial product. A sample of utility value appraisals is used for ground-truthing and to derive an ontology for the relevant data. The aim was to find out whether heuristics that do not rely on the availability of labelled data sets can outperform a modern deep-learning approach. To answer this question, precision and recall were measured for three extraction systems in the areas of table recognition, table structure recognition and character recognition. Examples are used to describe the development and processing steps and to highlight areas for improvement based on the output of the different approaches. The impact of different table attributes on extraction results is examined using a model for representing different types of tables and a sample of utility value appraisals. Although the rule-based prototypes outperformed the commercial product in some cases, the commercial product achieved better results overall. We found that the format of a table and its complexity can affect extraction results, but that other factors, such as scan quality, the environment of a table and text formatting, also pose challenges for all of the software artefacts that were examined.

License:

In Copyright

Appears in Collections:

Thesis