Automated semantic annotation of historical catalogues

Körner, David

doi:10.34726/hss.2020.47722

DC Field

Value

Language

dc.contributor.advisor

Sablatnig, Robert

dc.contributor.author

Körner, David

dc.date.accessioned

2020-07-23T16:47:49Z

dc.date.issued

2020

dc.date.submitted

2020-07

dc.identifier.citation

Körner, D. (2020). Automated semantic annotation of historical catalogues [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2020.47722

dc.identifier.uri

https://doi.org/10.34726/hss.2020.47722

dc.identifier.uri

http://hdl.handle.net/20.500.12708/15172

dc.description.abstract

Historical documents contain all kinds of information that can be used to gain knowledge about certain periods in time. Art exhibition catalogues are a special type of historical document that contains information valuable for research on the history of art. The research project "Exhibitions of Modern European Painting 1905-1915" at the University of Vienna strives to gather and digitize art exhibition catalogues in order to perform research on the history of modern painting. The project deals with a collection of more than 1300 catalogues. The manual digitization of this collection is a cumbersome process, even when utilizing additional state-of-the-art software such as Tesseract. In this thesis, an automated system for the extraction of specific information is proposed. The system is limited to the collection of exhibition catalogues and provides means to improve the digitization process in combination with Tesseract. The first step of the system is page segmentation. For this purpose, an approach based on Maximally Stable Extremal Regions with a subsequent text region grouping is used. The resulting text regions are then further refined by applying a word-level font style classification. This classification is done using a texture analysis of the word regions based on Gabor filtering. The computed font style information of the text regions is then used to identify specific categories of information that are formatted in a unique font style. Finally, by combining these steps with the optical character recognition of Tesseract, it is possible to automatically extract different categories of information from the catalogues. The proposed page segmentation methodology is evaluated on the data set of the ICDAR2013 Competition on Historical Book Recognition and outperforms the segmentation results of Tesseract.
In addition, the proposed Gabor filtering approach used for font style classification is evaluated on a variety of exhibition catalogues and achieves recognition rates above 90% for cropped word images. Using the proposed stages in combination with the optical character recognition of Tesseract eases the recognition of the exhibition catalogues and reduces the manual effort needed in the digitization process.
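
The texture analysis mentioned in the abstract can be illustrated with a small, hedged sketch. It is not the thesis implementation: the abstract gives no filter parameters or classifier details, so the kernel size, wavelength, sigma, the choice of mean and standard deviation as features, and the function names gabor_kernel, correlate_valid and gabor_features are all assumptions made for illustration. Synthetic stripe images stand in for cropped word images, and only the feature extraction step is shown, using the Python standard library.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma):
    """Real (cosine) part of a Gabor kernel as a size x size nested list.

    size must be odd. The kernel mean is subtracted so that flat image
    regions produce no response. All parameters are illustrative.
    """
    half = size // 2
    rows = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the direction in which the sinusoid varies.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        rows.append(row)
    mean = sum(sum(row) for row in rows) / float(size * size)
    return [[v - mean for v in row] for row in rows]


def correlate_valid(image, kernel):
    """2D correlation restricted to positions where the kernel fits fully."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    out = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    acc += image[i + u][j + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out


def gabor_features(image, thetas, wavelengths, size=7, sigma=2.0):
    """Mean and standard deviation of |response| for each filter in the bank.

    Layout: [mean, std] per (wavelength, theta) pair, wavelengths outermost.
    """
    features = []
    for wavelength in wavelengths:
        for theta in thetas:
            kernel = gabor_kernel(size, wavelength, theta, sigma)
            response = correlate_valid(image, kernel)
            values = [abs(v) for row in response for v in row]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            features.extend([mean, math.sqrt(var)])
    return features


# Synthetic stand-ins for word images: stripes with a period of 4 pixels.
vertical = [[1.0 if x % 4 < 2 else 0.0 for x in range(16)] for y in range(16)]
horizontal = [[1.0 if y % 4 < 2 else 0.0 for x in range(16)] for y in range(16)]

thetas = [0.0, math.pi / 2.0]  # sinusoid varying along x, then along y
feats_v = gabor_features(vertical, thetas, [4.0])
feats_h = gabor_features(horizontal, thetas, [4.0])
print("vertical stripes:  ", feats_v)
print("horizontal stripes:", feats_h)
```

In this toy setup the filter whose sinusoid varies across the stripes responds much more strongly than the perpendicular one, which is the kind of orientation cue a font style classifier could use to separate, for example, upright from slanted strokes. A real pipeline would compute such feature vectors for cropped word images and feed them to a trained classifier, which is beyond this sketch.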

dc.description.abstract

Historical documents contain many different kinds of information that can be used to better understand particular periods of history. Art exhibition catalogues are a special type of historical document containing valuable information about the history of art. The research project "Exhibitions of Modern European Painting 1905-1915" at the University of Vienna works to collect and digitize art exhibition catalogues in order to study the history of modern painting. The project covers a collection of more than 1300 catalogues. Manually digitizing this collection is a laborious process, even when additional software such as Tesseract is used. This thesis presents an automated system for extracting specific information. The system is restricted to the collection of exhibition catalogues and, in combination with Tesseract, simplifies the digitization process. The first step of the system is page segmentation. For this purpose, an approach based on Maximally Stable Extremal Regions followed by a grouping of the text regions is used. The resulting text regions are further refined by applying a word-level font classification. This classification is carried out by a texture analysis of the word regions based on Gabor filtering. The font information obtained in this way is then used to identify specific categories of information that are distinguished by unique font styles. Finally, combining these steps with the optical character recognition of Tesseract makes it possible to automatically extract different categories of information from the catalogues.
The proposed page segmentation method is evaluated on the data set of the ICDAR2013 Competition on Historical Book Recognition and is able to outperform the segmentation results of Tesseract. In addition, the Gabor filtering approach used for font classification is evaluated on different exhibition catalogues and achieves a recognition rate above 90% for cropped word images. Using the proposed steps in combination with the text recognition of Tesseract makes it possible to ease the digital capture of the exhibition catalogues and reduce the manual effort in the digitization process.

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Layout Analyse

dc.subject

Seitensegmentierung

dc.subject

Fontstilklassifikation

dc.subject

semantische Annotation

dc.subject

historische Dokumente

dc.subject

layout analysis

dc.subject

page segmentation

dc.subject

font style classification

dc.subject

semantic annotation

dc.subject

historical documents

dc.title

Automated semantic annotation of historical catalogues

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2020.47722

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

David Körner

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Diem, Markus

tuw.publication.orgunit

E193 - Institut für Visual Computing and Human-Centered Technology

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC15693176

dc.description.numberOfPages

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.advisor.orcid

0000-0003-4195-1593

tuw.assistant.orcid

0000-0002-5048-5128

item.languageiso639-1

item.openairetype

master thesis

item.grantfulltext

open

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.openaccessfulltext

Open Access

crisitem.author.dept

TU Wien

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(13.03 MB)

In Copyright
