<div class="csl-bib-body">
<div class="csl-entry">Duhan, A. (2022). <i>A modular model combining visual and textual features for document image classification</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.88103</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2022.88103
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/20197
-
dc.description.abstract
Document image classification is the classification of digitized documents. Typically, these documents are either scanned or photographed. One page of such a document is referred to as a document image. Classifying document images is a crucial task since it is an initial step in downstream applications. Companies currently perform this step manually, which takes considerable time and financial resources. The main reason is that document image classification systems need very high accuracy (well beyond 90%), because classification is potentially the first step in a series of downstream applications. However, achieving high accuracy without a dataset of millions of annotated documents is not trivial. The current state-of-the-art document image classification model is based on a transformer network, which is pretrained on more than 11 million scanned document images and thus requires a huge amount of resources to train. Additionally, this and other state-of-the-art document image classification models have well beyond 100 million parameters. In this work, we address both challenges. First, we create a model that is capable of competing with the current state-of-the-art model without pretraining on millions of scanned document images. Second, we create a model that is smaller than current state-of-the-art models in terms of parameters. To achieve this, current state-of-the-art models that are relatively small in size are used as base models. Their optimal settings are first found through hyperparameter tuning on a subset of the data. The input to these models is based on image and text features. Their outputs are then combined, and a final meta-classifier is trained on these outputs to generate the final result. The results show that the developed approach can compete with current state-of-the-art models in terms of accuracy. Additionally, it requires fewer parameters to train, and it is easily parallelizable due to its modular nature. 
Moreover, the results show which specific parts of document images are important for classification. Depending on how efficient the system needs to be, fewer modules of the system can be selected.
en
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
Document Image Classification
en
dc.subject
Modular
en
dc.subject
Multimodal
en
dc.subject
Convolutional Neural Networks
en
dc.subject
Transformer
en
dc.subject
RVL-CDIP
en
dc.subject
Tobacco3482
en
dc.title
A modular model combining visual and textual features for document image classification
en
dc.title.alternative
Ein modulares Modell, das visuelle und textuelle Merkmale zur Klassifizierung von Dokumentenbildern kombiniert
de
dc.type
Thesis
en
dc.type
Hochschulschrift
de
dc.rights.license
In Copyright
en
dc.rights.license
Urheberrechtsschutz
de
dc.identifier.doi
10.34726/hss.2022.88103
-
dc.contributor.affiliation
TU Wien, Österreich
-
dc.rights.holder
Amer Duhan
-
dc.publisher.place
Wien
-
tuw.version
vor
-
tuw.thesisinformation
Technische Universität Wien
-
dc.contributor.assistant
Hanbury, Allan
-
tuw.publication.orgunit
E193 - Institut für Visual Computing and Human-Centered Technology