Novel methods for writer identification and retrieval

Fiel, Stefan

doi:10.34726/hss.2015.25468

Record link:

https://doi.org/10.34726/hss.2015.25468
http://hdl.handle.net/20.500.12708/3591

Title:

Novel methods for writer identification and retrieval

Citation:

Fiel, S. (2015). Novel methods for writer identification and retrieval [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.25468

reposiTUm DOI:

10.34726/hss.2015.25468

CatalogPlus:

AC13006439

Publication Type:

Thesis - Dissertation

Language:

English

Authors:

Fiel, Stefan

Advisor:

Sablatnig, Robert

Organisational Unit:

E183 - Institut für Rechnergestützte Automation

Date (published):

2015

Number of Pages:

118

Keywords:

Writer Identification; Writer Retrieval; Fisher Vector; Deep Learning; Document Analysis

Abstract:

Writer identification is the task of determining the writer of a handwritten document from a set of documents whose writers are known. It is used, for example, in forensics and in historical document analysis. Writer retrieval, in contrast, returns a ranking of the pages in a document set, sorted by handwriting similarity; it can be used to cluster an unindexed set of documents by individual handwriting. State-of-the-art methods compute features on the contours of the characters, so pre-processing steps are needed to extract these contours. In contrast, this thesis presents three novel approaches to writer identification and writer retrieval. The first is based on the bag-of-words approach, which is well known from object recognition. SIFT features are computed on the handwriting, and an occurrence histogram is generated from them and used to identify the writer. The second method is based on the Fisher vector. SIFT features are again computed on the handwriting, but this time the gradient vectors of a Gaussian Mixture Model (GMM) are used to form the feature vector for writer identification. The third method is based on a Convolutional Neural Network (CNN). A CNN is trained on image patches; the classification layer is then cut off, and the output of the second-to-last layer serves as the feature vector for each patch. The mean vector of all patches on a page is the feature vector for the handwriting and is used for identification and retrieval. The presented methods are evaluated and compared with the state of the art on different scientific databases, and additionally on a historical dataset, using common evaluation metrics for writer identification. The evaluations show that the three proposed methods outperform the state of the art on many of the tasks on these datasets. Advantages and possible weaknesses are discussed. The proposed methods achieve good results (>90%) on every dataset used for evaluation.
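
The bag-of-words pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. All function and variable names are hypothetical, the descriptors are synthetic 2-D points standing in for 128-dimensional SIFT features, the codebook is fixed by hand rather than learned by clustering, and cosine similarity is an assumed choice of similarity measure.

```python
import math
import random

# Hypothetical toy codebook: three 2-D "visual words". Real SIFT descriptors
# are 128-dimensional and a codebook would be learned from training data.
CODEBOOK = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]


def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor."""
    def sq_dist(k):
        return sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k]))
    return min(range(len(codebook)), key=sq_dist)


def occurrence_histogram(descriptors, codebook):
    """L1-normalised histogram of codeword occurrences for one page."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts) or 1  # avoid division by zero on an empty page
    return [c / total for c in counts]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_pages(query_hist, page_hists):
    """Writer retrieval: page ids sorted from most to least similar."""
    return sorted(
        page_hists,
        key=lambda pid: cosine_similarity(query_hist, page_hists[pid]),
        reverse=True,
    )


def synthetic_page(word_pattern, rng, noise=0.1):
    """Stand-in for SIFT extraction: jittered copies of the chosen codewords."""
    return [
        tuple(c + rng.gauss(0.0, noise) for c in CODEBOOK[k])
        for k in word_pattern
    ]


rng = random.Random(0)
# Pages A1 and A2 mimic one writer, B1 a second writer with different strokes.
pages = {
    "A1": synthetic_page([0, 0, 0, 1] * 5, rng),
    "A2": synthetic_page([0, 0, 1, 1] * 5, rng),
    "B1": synthetic_page([2, 2, 2, 1] * 5, rng),
}
page_hists = {pid: occurrence_histogram(d, CODEBOOK) for pid, d in pages.items()}

# Query page written in the style of writer A.
query_hist = occurrence_histogram(synthetic_page([0, 0, 0, 1, 1] * 4, rng), CODEBOOK)
ranking = rank_pages(query_hist, page_hists)
print(ranking)
```

In this toy setup the two pages of writer A rank ahead of the page of writer B. Taking the writer of the top-ranked page gives identification, while the full ranking corresponds to retrieval. Under the same assumptions, the other two methods in the abstract would replace only the page encoding (a Fisher vector built from GMM gradients, or the mean of CNN patch activations), not the ranking step.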

Additional information:

Summary in German

License:

In Copyright

Appears in Collections:

Thesis