A modular model combining visual and textual features for document image classification

Duhan, Amer

doi:10.34726/hss.2022.88103

DC Field

Value

Language

dc.contributor.advisor

Sablatnig, Robert

dc.contributor.author

Duhan, Amer

dc.date.accessioned

2022-05-19T07:32:56Z

dc.date.issued

2022

dc.date.submitted

2022-05

dc.identifier.citation

Duhan, A. (2022). A modular model combining visual and textual features for document image classification [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.88103

dc.identifier.uri

https://doi.org/10.34726/hss.2022.88103

dc.identifier.uri

http://hdl.handle.net/20.500.12708/20197

dc.description.abstract

Document image classification is the classification of digitized documents, which are typically either scanned or photographed. One page of such a document is referred to as a document image. Classifying document images is a crucial task because it is an initial step in downstream applications. Companies currently perform this step manually, which takes considerable time and financial resources. The main reason is that document image classification systems need very high accuracy (well beyond 90%), precisely because classification is potentially the first step in a series of downstream applications. However, achieving high accuracy without a dataset of millions of annotated documents is not trivial. The current state-of-the-art document image classification model is based on a transformer network that is pretrained on more than 11 million scanned document images and thus requires a huge amount of resources to train. Additionally, this and other state-of-the-art document image classification models have well beyond 100 million parameters. In this work, we address both challenges. First, we create a model that is capable of competing with the current state-of-the-art model without pretraining on millions of scanned document images. Second, we create a model that is smaller than current state-of-the-art models in terms of parameters. To achieve this, current state-of-the-art models that are relatively small in size are used as base models. Their optimal settings are first found through hyperparameter tuning on a subset of the data. The input to these models is based on image and text features. Their outputs are then combined, and a final meta-classifier is trained on these outputs to generate the final result. The results show that the developed approach can compete with current state-of-the-art models in terms of accuracy. Additionally, it requires fewer parameters to train, and it is easily parallelizable due to its modular nature.
Moreover, the results show which specific parts of document images are important for classification. Depending on how efficient the system needs to be, fewer modules of the system can be chosen.
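
The combination step described in the abstract (base models whose outputs feed a final meta-classifier) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy probability vectors, and the choice of a softmax-regression meta-classifier are all assumptions made for illustration, and the abstract does not say which meta-classifier was used.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stack_features(visual_probs, textual_probs):
    """Concatenate the class probabilities of two (hypothetical) base models."""
    return list(visual_probs) + list(textual_probs)

def class_scores(weights, biases, x):
    """Linear score per class for one stacked feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def train_meta_classifier(features, labels, num_classes, epochs=200, lr=0.5):
    """Fit a softmax-regression meta-classifier with per-sample gradient steps.

    Softmax regression is an assumed stand-in for the meta-classifier.
    """
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. the score of class c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
                biases[c] -= lr * grad
    return weights, biases

def predict(weights, biases, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy, made-up base-model outputs for a two-class problem (not thesis data).
visual_outputs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.45, 0.55]]
textual_outputs = [[0.7, 0.3], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]
labels = [0, 0, 1, 1]

features = [stack_features(v, t) for v, t in zip(visual_outputs, textual_outputs)]
weights, biases = train_meta_classifier(features, labels, num_classes=2)
predictions = [predict(weights, biases, x) for x in features]
```

Because each base model only contributes a block of the stacked feature vector, a module can be dropped by leaving its block out of `stack_features`, which mirrors the trade-off between efficiency and number of modules mentioned in the abstract.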

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Document Image Classification

dc.subject

Modular

dc.subject

Multimodal

dc.subject

Convolutional Neural Networks

dc.subject

Transformer

dc.subject

RVL-CDIP

dc.subject

Tobacco3482

dc.title

A modular model combining visual and textual features for document image classification

dc.title.alternative

Ein modulares Modell, das visuelle und textuelle Merkmale zur Klassifizierung von Dokumentenbildern kombiniert

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2022.88103

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Amer Duhan

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Hanbury, Allan

tuw.publication.orgunit

E193 - Institut für Visual Computing and Human-Centered Technology

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC16528748

dc.description.numberOfPages

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.advisor.orcid

0000-0003-4195-1593

tuw.assistant.orcid

0000-0002-7149-5843

item.languageiso639-1

item.openairetype

master thesis

item.grantfulltext

open

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.openaccessfulltext

Open Access

crisitem.author.dept

E120-01-1 - Forschungsgruppe Mikrowellenfernerkundung

crisitem.author.parentorg

E120-01 - Forschungsbereich Fernerkundung

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(2.74 MB)

In Copyright
