Synthetic data for applications in document analysis

Muth, Markus Michael

doi:10.34726/hss.2023.103190

DC Field

Value

Language

dc.contributor.advisor

Sablatnig, Robert

dc.contributor.author

Muth, Markus Michael

dc.date.accessioned

2023-10-04T11:15:32Z

dc.date.issued

2023

dc.date.submitted

2023-09

dc.identifier.citation

Muth, M. M. (2023). Synthetic data for applications in document analysis [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2023.103190

dc.identifier.uri

https://doi.org/10.34726/hss.2023.103190

dc.identifier.uri

http://hdl.handle.net/20.500.12708/188733

dc.description.abstract

The use of synthetic handwriting to improve machine learning methods is analyzed for two application areas: finding handwriting in images (Handwritten Text Detection, HTD) and character recognition for handwriting (Handwritten Text Recognition, HTR). For HTD, synthetic handwriting is generated with an existing machine learning model [DMP+20] and added to scanned documents to imitate handwritten notes. Object detection models (YOLOv5 [JAS+] and YOLOv8 [JCQ]) are trained to distinguish the handwriting from the remaining content of the documents. These models are then evaluated on real data: for the CVL data set [KFDS13], an mAP@50 of 0.88 and a pixel-level F1@50 of 0.96 are achieved; for real notes on a scientific publication, an mAP@50 of 0.72 and a pixel-level F1@50 of 0.89 are obtained. The synthetically generated handwriting is further used to train an existing recognition model [CCP21]. This model is then applied to recognize the content of images with real handwriting. This leads to a Character Error Rate (CER) of 28.3% and a Word Error Rate (WER) of 65.5% for images from the IAM data set [MB02], which is more than three times as high as the error rates without synthetic data. However, using synthetic images in combination with real data reduces the error rates compared to using real data alone, especially for small data sets. Using only 10% of the training data (113 images) from the CVL data set [KFDS13] leads to a CER of 54.5% and a WER of 88.8%. If the model is pre-trained with synthetic data, however, the result is a CER of 14.6% and a WER of 43.4%.

dc.description.abstract

The usability of synthetic HandWritten Text (HWT) to improve machine learning models is assessed for two domains: Handwritten Text Detection (HTD) and Handwritten Text Recognition (HTR). Synthetic HWT is generated using an existing model [DMP+20] and added to scanned documents to mimic handwritten annotations. Object detection models (YOLOv5 [JAS+] and YOLOv8 [JCQ]) are trained to distinguish HWT from the remaining content. Applying those models to real data results in an mAP@50 of 0.88 and a pixel-level F1@50 of 0.96 for the CVL data set [KFDS13], and an mAP@50 of 0.72 and a pixel-level F1@50 of 0.89 for a scientific paper with real handwritten annotations. The synthetic HWT from [DMP+20] is further used to train the HTR model described in [CCP21], which is then applied to recognize the content of real HWT data sets. This results in a Character Error Rate (CER) of 28.3% and a Word Error Rate (WER) of 65.5% for line images of the IAM data set [MB02], which is more than three times higher than the state-of-the-art results. However, mixing synthetic with real data reduces the CER and WER compared to using real data only, especially for small data sets. Using only 10% of the training data (113 images) from the CVL data set [KFDS13] results in a CER of 54.5% and a WER of 88.8%, whereas pre-training the model with synthetic data first reduces this to a CER of 14.6% and a WER of 43.4%.
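
The CER and WER figures quoted in the abstract are edit-distance based error rates. The thesis's own evaluation code is not part of this record, so the following is only a minimal sketch, assuming the common definition (Levenshtein distance between reference and prediction, divided by the reference length, counted in characters for CER and in whitespace-separated words for WER). The function names `edit_distance`, `cer`, and `wer` are illustrative and not taken from the thesis.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate; assumes a non-empty reference string."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate; assumes the reference contains at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition both rates can exceed 100% when the prediction contains many insertions, and a single wrong character makes the whole word count as an error, which is why a WER is typically much higher than the corresponding CER.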

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

HandWritten Text

dc.subject

IAM dataset

dc.subject

Synthesized HandWritten Text

dc.subject

GAN

dc.title

Synthetic data for applications in document analysis

dc.title.alternative

Verwendung von Synthetischen Datensätzen für Anwendungen im Bereich der Dokumentenanalyse

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2023.103190

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Markus Michael Muth

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Kleber, Florian

dc.contributor.assistant

Peer, Marco

tuw.publication.orgunit

E193 - Institut für Visual Computing and Human-Centered Technology

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC16959647

dc.description.numberOfPages

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.advisor.orcid

0000-0003-4195-1593

tuw.assistant.orcid

0000-0001-8351-5066

tuw.assistant.orcid

0000-0001-6843-0830

item.mimetype

application/pdf

item.cerifentitytype

Publications

item.fulltext

with Fulltext

item.openaccessfulltext

Open Access

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.grantfulltext

open

item.openairetype

master thesis

item.languageiso639-1

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(2.61 MB)

In Copyright


Page view(s)

508

checked on Nov 23, 2023

Download(s)

566

checked on Nov 23, 2023
