Duhan, A. (2022). A modular model combining visual and textual features for document image classification [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.88103
Document image classification is the task of classifying digitized documents, which are typically either scanned or photographed. One page of such a document is referred to as a document image. Classifying document images is a crucial task, since it is often an initial step for downstream applications. At present, companies perform this step manually, which takes considerable time and financial resources. The main reason is the need for very high accuracy (well beyond 90%) in document image classification systems, precisely because classification is potentially the first step in a series of downstream applications. However, achieving high accuracy without a dataset of millions of annotated documents is not trivial. The current state-of-the-art document image classification model is based on a transformer network that is pretrained on more than 11 million scanned document images and thus requires a huge amount of resources to train. Additionally, this and other state-of-the-art document image classification models have well beyond 100 million parameters. In this work, we address both challenges. First, we create a model that is able to compete with the current state-of-the-art model without pretraining on millions of scanned document images. Second, we create a model that is smaller than current state-of-the-art models in terms of parameters. To achieve this, current state-of-the-art models that are relatively small in size are used as base models. Their optimal settings are first determined by hyperparameter tuning on a subset of the data. The inputs to these models are based on image and text features. Their outputs are then combined, and a meta-classifier is trained on the combined outputs to produce the final result. The results show that the developed approach can compete with current state-of-the-art models in terms of accuracy. Additionally, it requires fewer parameters to train and is easily parallelizable due to its modular nature.
Moreover, the results show which specific parts of document images are important for classification. Depending on how efficient the system needs to be, fewer modules of the system can be selected.
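
The modular design described above (base models on image and text features, whose outputs are combined and passed to a meta-classifier) can be illustrated with a minimal sketch. The abstract does not state which meta-classifier is used, so softmax regression is swapped in here purely as an assumption. All function names are hypothetical, and the class-probability vectors are made-up toy values, not results from the thesis.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stack_features(module_outputs):
    """Concatenate the class-probability vectors of all modules for one document."""
    return [p for probs in module_outputs for p in probs]

def predict_proba(weights, biases, x):
    """Class probabilities of the meta-classifier for one stacked feature vector."""
    scores = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

def predict(weights, biases, x):
    """Index of the most probable class."""
    probs = predict_proba(weights, biases, x)
    return max(range(len(probs)), key=probs.__getitem__)

def train_meta_classifier(features, labels, num_classes, epochs=500, lr=0.5):
    """Fit softmax regression (an assumed meta-classifier) by batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = predict_proba(weights, biases, x)
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases

# Toy, made-up outputs of two hypothetical base modules for three documents.
# Each entry is (image-module probabilities, text-module probabilities).
# The image module cannot tell classes 1 and 2 apart; the text module can.
base_outputs = [
    ([0.8, 0.1, 0.1], [0.4, 0.3, 0.3]),  # true class 0
    ([0.2, 0.4, 0.4], [0.1, 0.8, 0.1]),  # true class 1
    ([0.2, 0.4, 0.4], [0.1, 0.1, 0.8]),  # true class 2
]
train_labels = [0, 1, 2]
train_features = [stack_features(outputs) for outputs in base_outputs]

weights, biases = train_meta_classifier(train_features, train_labels, num_classes=3)

# Classify a new document from its (again made-up) module outputs.
new_doc = stack_features([[0.25, 0.40, 0.35], [0.15, 0.20, 0.65]])
print(predict(weights, biases, new_doc))
```

Because each base module only contributes a probability vector, modules can be trained in parallel, and dropping one simply shortens the stacked feature vector that the meta-classifier is trained on.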