Text classification and layout analysis for document reassembling

Diem, Markus

doi:10.34726/hss.2014.23117

DC Field

Value

Language

dc.contributor.advisor

Sablatnig, Robert

dc.contributor.author

Diem, Markus

dc.date.accessioned

2020-06-29T16:41:57Z

dc.date.issued

2014

dc.date.submitted

2014-05

dc.identifier.citation

Diem, M. (2014). Text classification and layout analysis for document reassembling [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2014.23117

dc.identifier.uri

https://doi.org/10.34726/hss.2014.23117

dc.identifier.uri

http://hdl.handle.net/20.500.12708/7464

dc.description

Abstract in German

dc.description.abstract

Automatic reconstruction of hand-torn files makes it possible to recover content that was believed lost. The underlying data comprises 600 million Stasi snippets that were destroyed when the Berlin Wall fell. Contour-based approaches cannot reassemble snippets correctly when several pages were torn at the same time. In addition, the complexity of automatic reconstruction is too high if every snippet has to be compared with every other one. Therefore, the visual content of the snippets is analyzed before reconstruction. On the one hand, this permits pre-sorting; on the other hand, it makes it possible to distinguish snippets with identical tear edges. This doctoral thesis presents algorithms for automatic text localization and paper analysis, which are combined with contour-based approaches during reconstruction. The paper analysis classifies paper as lined, checked, or blank. If a snippet is lined or checked, the lines are localized precisely so that neighboring snippets can be aligned based on their ruling. Furthermore, a new text localization was developed that represents words compactly and accurately reflects their local orientation. All elements of a document are classified as printed text, handwriting, or non-text using so-called Gradient Shape Features (GSF). Detecting non-text makes it possible to discard falsely binarized elements and thereby achieve good results on noisy documents. After text has been classified and localized, these elements are merged into hierarchically higher structures. A bottom-up procedure is used that can also be applied to torn documents. The methodology was evaluated empirically on publicly available datasets and compared with existing document analysis systems.
Text classification and localization improved on previous results on a dataset with modern printed layouts and on a handwriting dataset. The layout analysis was evaluated on the three most recent Page Segmentation Contest datasets. There, results similar to those of state-of-the-art methods were achieved, although Bangla in particular proved difficult for the method developed.

dc.description.abstract

In the context of automated reassembling of manually torn document snippets, contour-based approaches are insufficient because snippets have the same rupture edges if more than one page is torn at the same time. Moreover, jigsaw puzzling is NP-hard, which calls for grouping document snippets beforehand so that the complexity of reassembling is reduced and its computational speed improved. Analyzing the visual content of document snippets makes it possible to distinguish snippets with the same contours. In addition, visual content extraction enables fine alignment of snippets with the same content and grouping of snippets. The document analysis approaches presented in this thesis are part of a combined reassembling approach that uses content and contour to reconstruct about 600 million Stasi snippets. The ruling analysis classifies the supporting material into void, lined, and checked paper. If a ruling is detected, the lines are localized accurately, which allows snippets to be aligned. Snippets might have sparse visual content, depending on how conscientiously the pages were torn. Therefore, a new word localization (the so-called Profile Box) is introduced, which keeps a compact word representation while accounting for anticipated deformations such as a word's local skew. These word boxes are further classified into printed, manuscript, and non-text elements by means of Gradient Shape Features (GSF), which are newly designed for this task. The latter class allows falsely binarized elements to be rejected, which improves robustness in the presence of degraded or noisy documents. Finally, a layout analysis is performed that is based on a bottom-up approach to keep the element clustering flexible even if a global text structure is not present. Results on various publicly available databases show that the methodology can be adapted to different document analysis scenarios.
A synthetic database for ruling line removal is created and made publicly available, which allows comparisons between the proposed approach and other state-of-the-art methodologies. The text classification is compared to other approaches by means of the PRImA benchmarking database and the IAM database, a handwriting database written by multiple authors. The methodology presented achieves the best results in both empirical evaluations. On real-world Stasi snippets, the recognition rate is lower because of the heterogeneity and sparseness of content in the data. The layout analysis is additionally evaluated on the most recent Handwriting Segmentation Contests, where it competes with state-of-the-art methods, and on a medieval database.
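
The abstract's argument that grouping snippets beforehand lowers the cost of reassembling can be illustrated with a small counting sketch. It is not taken from the thesis: the function names and the toy labels are assumptions chosen for illustration, with the three ruling classes (void, lined, checked) standing in for any content-based grouping. It only counts candidate pair comparisons and does not perform any matching.

```python
from collections import Counter
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered snippet pairs when every snippet is compared with every other."""
    return comb(n, 2)


def grouped_comparisons(labels) -> int:
    """Number of unordered pairs when snippets are only compared within their own group."""
    group_sizes = Counter(labels).values()
    return sum(comb(size, 2) for size in group_sizes)


# Hypothetical toy data: ten snippets labeled by ruling class.
labels = ["lined"] * 4 + ["checked"] * 3 + ["void"] * 3

print(pairwise_comparisons(len(labels)))  # 45 pairs without grouping
print(grouped_comparisons(labels))        # 12 pairs with grouping
```

With these toy numbers the candidate pairs drop from 45 to 12. The saving grows with the number of groups and with how evenly the snippets are spread across them.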

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Document Analysis

dc.subject

Layout Analysis

dc.subject

Text Classification

dc.subject

Document Reassembling

dc.title

Text classification and layout analysis for document reassembling

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2014.23117

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Markus Diem

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

tuw.publication.orgunit

E183 - Institut für Rechnergestützte Automation

dc.type.qualificationlevel

Doctoral

dc.identifier.libraryid

AC11682908

dc.description.numberOfPages

159

dc.identifier.urn

urn:nbn:at:at-ubtuw:1-68094

dc.thesistype

Dissertation

tuw.author.orcid

0000-0002-5048-5128

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.advisor.orcid

0000-0003-4195-1593

item.languageiso639-1

item.openairetype

doctoral thesis

item.grantfulltext

open

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_db06

item.openaccessfulltext

Open Access

crisitem.author.dept

E193-01 - Forschungsbereich Computer Vision

crisitem.author.orcid

0000-0002-5048-5128

crisitem.author.parentorg

E193 - Institut für Visual Computing and Human-Centered Technology

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(17.11 MB)

In Copyright

Page view(s)

351

checked on Nov 21, 2023

Download(s)

176

checked on Nov 21, 2023
