Text classification and layout analysis for document reassembling

Diem, Markus

doi:10.34726/hss.2014.23117

Record link:

https://doi.org/10.34726/hss.2014.23117
http://hdl.handle.net/20.500.12708/7464

Title:

Text classification and layout analysis for document reassembling

Citation:

Diem, M. (2014). Text classification and layout analysis for document reassembling [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2014.23117

reposiTUm DOI:

10.34726/hss.2014.23117

CatalogPlus:

AC11682908

Publication Type:

Thesis - Dissertation

Language:

English

Authors:

Diem, Markus

Advisor:

Sablatnig, Robert

Organisational Unit:

E183 - Institut für Rechnergestützte Automation

Date (published):

2014

Number of Pages:

159

Keywords:

Document Analysis; Layout Analysis; Text Classification; Document Reassembling

Abstract:

Automatic document reconstruction of hand-torn files makes it possible to recover content that was believed lost. The underlying data comprises 600 million Stasi snippets that were destroyed when the Berlin Wall fell. Contour-based approaches cannot assemble snippets correctly if several pages were torn at the same time. In addition, the complexity of automatic reconstruction is too high if every snippet has to be compared with every other one. Therefore, the visual content of the snippets is analyzed before reconstruction. On the one hand, this allows the snippets to be pre-sorted; on the other hand, it makes it possible to distinguish snippets with identical tear edges from one another. This doctoral thesis presents algorithms for automatic text localization and paper analysis, which are combined with contour-based approaches during reconstruction. The paper analysis classifies paper as lined, checked, or blank. If a snippet is lined or checked, the lines are localized precisely so that neighboring snippets can be aligned based on their ruling. Furthermore, a new text localization was developed that represents words compactly and accurately reflects their local orientation. All elements of a document are classified as printed text, handwriting, or non-text using so-called Gradient Shape Features (GSF). Detecting non-text makes it possible to discard falsely binarized elements and thus achieve good results on noisy documents. After text has been classified and localized, these elements are merged into hierarchically higher structures. A bottom-up method is used for this, which can also be applied to torn documents. The methodology was evaluated empirically on publicly available datasets and compared with existing document analysis systems.
Text classification and localization improved on previous results on a dataset of modern printed layouts and on a handwriting dataset. The layout analysis was evaluated on the three most recent Page Segmentation Contest datasets. There, results similar to those of state-of-the-art methods were achieved, although it turned out that Bangla in particular poses a difficulty for the method developed.

In the automated reassembling of manually torn document snippets, contour-based approaches are insufficient because snippets share the same rupture edges if more than one page is torn at the same time. Moreover, jigsaw puzzling is NP-hard, which calls for grouping document snippets beforehand so that the complexity of reassembling is reduced and its computation is sped up. Analyzing the visual content of document snippets makes it possible to distinguish snippets with the same contours. In addition, extracting visual content enables the fine alignment of snippets with the same content and the grouping of snippets. The document analysis approaches presented in this thesis are part of a combined reassembling system that uses both content and contour to reconstruct about 600 million Stasi snippets. The ruling analysis classifies the supporting material into void, lined, and checked paper. If a ruling is detected, the lines are localized accurately, which allows snippets to be aligned. Snippets might have sparse visual content, depending on how thoroughly the pages were torn. Therefore, a new word localization (the so-called Profile Box) is introduced, which keeps a compact word representation while accounting for anticipated deformations such as a word's local skew. These word boxes are further classified into printed, manuscript, and non-text elements by means of Gradient Shape Features (GSF), which were newly designed for this task. The non-text class allows falsely binarized elements to be rejected, which improves robustness on degraded or noisy documents. Finally, a layout analysis is performed that is based on a bottom-up approach to keep the element clustering flexible even if a global text structure is not present. Results on various publicly available databases show that the methodology can be adapted to different document analysis scenarios.
A synthetic database for ruling line removal is created and made publicly available, which allows the proposed approach to be compared with other state-of-the-art methodologies. The text classification is compared to other approaches by means of the PRImA benchmarking database and the IAM database, a handwriting database written by multiple authors. The methodology presented achieves the best results in both empirical evaluations. On real-world Stasi snippets, the recognition rate is lower because of the heterogeneity and sparseness of content in the data. The layout analysis is additionally evaluated on the most recent Handwriting Segmentation Contests, where it competes with state-of-the-art methods, and on a medieval database.
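
The abstract's argument for grouping snippets before reassembling can be illustrated with a rough count of candidate snippet pairs. The sketch below is not taken from the thesis: it uses the number of unordered pairs only as a simple proxy for matching effort (not as a model of the NP-hard puzzle itself), and the split into three equally sized groups is a hypothetical assumption chosen for illustration. The function names are likewise invented for this example.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered snippet pairs among n snippets: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def grouped_comparisons(group_sizes: list[int]) -> int:
    """Pairs left to compare when matching is restricted to snippets of the same group."""
    return sum(pairwise_comparisons(size) for size in group_sizes)


total = 600_000_000  # snippet count quoted in the abstract
groups = 3           # hypothetical assumption: three equally sized groups
sizes = [total // groups] * groups

all_pairs = pairwise_comparisons(total)
within_groups = grouped_comparisons(sizes)

print(f"all pairs:           {all_pairs:,}")
print(f"pairs within groups: {within_groups:,}")
print(f"reduction factor:    {all_pairs / within_groups:.2f}")
```

Under these assumptions the pair count shrinks roughly by the number of groups; real groups would be uneven, so the actual saving depends on how the content-based sorting distributes the snippets.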

Additional information:

Abstract in German

License:

In Copyright

Appears in Collections:

Thesis