Open Set Klassifizierung von Nummerntafeltypen in Bildern mittels Deep Learning DIPLOMARBEIT zur Erlangung des akademischen Grades Diplom-Ingenieur im Rahmen des Studiums Visual Computing eingereicht von Sebastian Haushofer, BSc Matrikelnummer 01525970 an der Fakultät für Informatik der Technischen Universität Wien Betreuung: Walter Kropatsch, O.Univ.Prof. Dipl.-Ing. Dr.techn. Mitwirkung: Jiří Hladůvka, Univ.Ass. Dr.techn. Wien, 31. Jänner 2021 Sebastian Haushofer Walter Kropatsch Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Open Set Classification in the Domain of License Plate Type Images using Deep Learning DIPLOMA THESIS submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Visual Computing by Sebastian Haushofer, BSc Registration Number 01525970 to the Faculty of Informatics at the TU Wien Advisor: Walter Kropatsch, O.Univ.Prof. Dipl.-Ing. Dr.techn. Assistance: Jiří Hladůvka, Univ.Ass. Dr.techn. Vienna, 31st January, 2021 Sebastian Haushofer Walter Kropatsch Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Erklärung zur Verfassung der Arbeit Sebastian Haushofer, BSc Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwen- deten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe. Wien, 31. Jänner 2021 Sebastian Haushofer v Acknowledgements Above all, I want to thank my family and friends who always support me and give me a lot of energy and motivation. Besides, I want to express my gratitude to my supervisors Jiří Hladůvka and Walter Kropatsch for answering my technical questions and providing valuable feedback on my work. Last but not least, I am grateful to everybody around the world who inspires me by coming up with new and original ideas. This includes artists and scientists of all kind. vii Kurzfassung Bei der Erkennung des Nummerntafeltyps handelt es sich um ein Klassifizierungsproblem auf einer offenen Menge von Klassen. Das Ziel ist zwischen verschiedenen Hintergründen von Nummerntafeln, welche an Fahrzeugen angebracht sind, zu unterscheiden. Wenn der Nummerntafeltyp bekannt ist, lässt sich damit auf den Herkunftsstaat, in dem das Kennzeichen registriert ist, rückschließen. Moderne Techniken, die Klassifikation auf einer offenen Menge lösen, nutzen Deep Lear- ning, um einen eingebetteten Raum als Zwischenprodukt zu trainieren. In diesem wird anschließend die Klassifikation durchgeführt. Der eingebettete Raum ist im Wesentlichen das Ergebnis einer nichtlinearen Merkmalsreduktion. In ihm ist jede Klasse durch einen kompakten Cluster repräsentiert, der von den anderen Klassen trennbar ist. Der eingebet- tete Raum wird nur mit einer Teilmenge aller bekannten Klassen trainiert. Die übrigen bekannten Klassen werden integriert, ohne das Neuronale Netzwerk neu zu trainieren. Instanzen von Klassen, die für das gesamte System unbekannt sind, werden mittels Thresholding identifiziert. Ein wesentlicher Beitrag dieser Arbeit ist eine Analyse der Klassenverteilungen im einge- betteten Raum, welche von bisherigen Arbeiten weitgehend außer Acht gelassen wurde. 
Für diesen Zweck werden einige neue Bewertungskriterien vorgestellt, welche es ermögli- chen die Qualität verschiedener gelernter Räume zu vergleichen. Die identifizierenden Zeichen einer Nummerntafel sind eine Herausforderung für die maschinelle Erkennung des Nummerntafeltyps, da Neuronale Netzwerke dazu neigen sie als Merkmale zu lernen. Um dieses Problem zu lösen, werden die identifizierenden Zeichen während des Trainings per Zufall ausmaskiert. Außerdem wird mit Hilfe sogenannter Saliency Maps untersucht, auf welche Merkmale das trainierte Neuronale Netzwerk besonders sensibel reagiert. Ein bestimmter Klassifikator, welcher häufig in eingebetteten Räumen verwendet wird, nimmt Verteilungen mit homogener Varianz als Vorbedingung an. Die Analyse der Verteilungen im eingebetteten Raum zeigt, dass diese Annahme für den verwendeten Nummerntafeltyp Datensatz nicht erfüllt ist. Die Experimente zeigen, dass die Klassen- verteilungen auch einen signifikanten Unterschied zu multivariaten Normalverteilungen, welche die Form von Hyperellipsen besitzen, aufweisen. Die Saliency Maps visualisieren, dass das Neuronale Netzwerk für den Großteil der Nummerntafeltypen vernünftige Merk- male lernt. Allerdings werden für wenige Klassen primitive Merkmale ausgewählt, welche von einem Menschen nicht in Betracht gezogen werden würden. ix Abstract License plate type recognition is a classification problem on an open set of classes. Its purpose is to distinguish between different backgrounds of license plates mounted on vehicles. Knowing the license plate type allows to identify the country in which the respective license plate is registered. Modern approaches which solve open set classification problems make use of deep learning to train an intermediate embedding space in which a subsequent classification is performed. This embedding space essentially represents the result of a non-linear feature reduction. Within it, each class is represented by a compact cluster which is separable from the other classes. The embedding space is only trained with a subset of the known classes. The remaining known classes are incorporated without retraining the network. Instances of classes which are unknown to the system are identified using thresholding. The main contribution of this work is the analysis of the class distributions within the embedding space, which has been largely neglected by previous work. For this purpose, new benchmarks which allow a qualitative comparison between different learned embedding spaces are introduced. The identifying characters of a license plate are challenging for license plate type recognition because the model tends towards learning specific characters as features. To overcome this problem, randomized masking of the characters during training is proposed. Furthermore, investigations about which features the trained network is sensitive to are carried out using gradient-based saliency map techniques. A commonly used embedding space classifier assumes class distributions of homogeneous variance as a prior under the open set restriction. The analysis of the class distributions in the embedding space shows that this prior is not met for the used license plate type dataset. The experiments reveal that the class distributions even show a significant difference from multivariate Gaussians, which are capable of modeling more complex distributions in the shape of hyper-ellipsoids. The computed saliency maps visualize that the model learns reasonable features for most of the license plate types. 
However, for a few of them, primitive features which a human would not consider for classification are exploited.

Contents

Kurzfassung
Abstract
Contents
1 Introduction
1.1 Problem Statement
1.2 Motivation
1.3 Who this Thesis is for
1.4 Methodological Approach
1.5 Aim of the Work
2 State of the Art
2.1 License Plate Recognition (LPR)
2.2 Embedding Spaces
2.3 Saliency Maps
3 Methodology
3.1 Training the Embedding Space
3.2 Classification in the Embedding Space
3.3 Evaluation Metrics
3.4 Saliency Map Extensions
4 Results
4.1 Experimental Setup
4.2 Data Augmentation and Character Masking
4.3 Performance Benchmarks
4.4 Embedding Space Metrics
4.5 Interpreting the Saliency Maps
4.6 Summary
4.7 Discussion
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
List of Figures
List of Tables
Acronyms
Bibliography

CHAPTER 1 Introduction

1.1 Problem Statement

License Plate Type Recognition (LPTR) is a subtask for solving the broader problem of License Plate Recognition (LPR). The goal of LPR is to read the characters written on a License Plate (LP) of a vehicle, as well as the country or state of origin for which the LP is registered. These two properties are a unique identifier of a vehicle in most settings. In many countries such as the USA, an LP does not only contain characters but also a background image which is shared amongst many different LPs of a certain state. This background image is also known as the License Plate Type (LPT) and may contain additional fixed characters, coats of arms, symbols, or textures. In general, the LPT is defined as the parts of the LP which remain once the characters identifying the LP are masked out. Some example LPs and their corresponding LPTs from the US state of Virginia are shown in Figure 1.1. The aim of LPTR is to identify the LPT based on a given image of an LP, which fundamentally makes it a classification problem. This task is challenging for two reasons. Firstly, there are hundreds of different LPTs and any system solving the LPTR task should be able to classify new LPTs, which have not been considered when initially developing the system. This should be achieved with as little adaptation of the system, cost and time effort as possible.
Another goal of an LPTR system is to recognize if a certain LPT is unknown to itself, as opposed to assigning an incorrect class label to it. Because of these two properties, LPTR is a classification problem on an open set of classes. The second aspect which makes LPTR challenging is that captured images of LPTs have very high variance along many different dimensions in real world applications. These include illumination, contrast, reflections, occlusions, motion blur, dirt, image resolution and other noise in general. The images shown in Figure 1.1 are idealized. Real world data typically looks like the images shown in Figure 1.2. In fact, these images are part of the dataset which has been used for this work.

Figure 1.1: Some example LPs and their corresponding LPTs from the US state of Virginia. (a) Original LPs containing the identifying characters. (b) LPTs of the LPs shown in (a), obtained by masking out the relevant characters. Images by Kustermann [Kus20].

Figure 1.2: Different images of the same LPT from the real world dataset used in this work: (a) neutral conditions, (b) overlit, (c) poor contrast, (d) blurred, (e) dirty, (f) occluded. The LPT shown in these images is the same as the one at the bottom left in Figure 1.1. The images show some of the variances which occur in the dataset.

1.2 Motivation

LPR systems are heavily used for tolling and speed control stations. To reduce the costs of human evaluation, people have put a lot of effort and money into developing Optical Character Recognition (OCR) systems to solve the LPR task automatically over the past decades [AAP+08]. In many cases LPTR is very promising for further improving the reliability and performance of such a system. First of all, the LPT implies the state of origin of an LP. This is due to the fact that there is a many-to-one correspondence between LPTs and states. Using a database which stores the LPTs for each state of interest, it is possible to identify the state of origin, given the LPT. As the LPT depends on the number of identifying characters written on it and their relative positions, knowing the LPT makes it possible to perform sanity checks on the read LP characters and vice versa. Notice that this property is only useful if the LPTR system is capable of identifying the LPT without requiring any special handling (e.g. masking) of the characters in advance. Otherwise, this leads to a chicken-and-egg situation, as character validation depends on the LPT and classifying the LPT depends on the characters. Lastly, LPTs may contain additional information which would not be accessible by traditional OCR means otherwise. For example, in the USA there are certain groups of people who receive special rates at tolling stations. These groups include veterans or persons with disabilities. A group is indicated by special symbols or signs which are part of the LPT. Just as with the states, there is a many-to-one correspondence between LPTs and a certain group.

1.3 Who this Thesis is for

This thesis is written for anyone who is interested in computer vision, pattern recognition, deep learning and/or machine learning. It might be of particular interest to people who work on classification problems on open sets. The thesis is written in such a way that a graduate student of computer science who has taken an introductory course on machine learning should be able to understand it. Thus, there is quite a stack of prior knowledge in different fields the reader should be familiar with.
However, in all the different domains in which prior knowledge is assumed, the fundamental principles should be enough to follow along. More advanced topics are explained within the thesis. Regarding mathematics, knowledge about basic concepts of linear algebra, multivariable calculus and statistics is assumed. Particularly important for this thesis are transformations between different coordinate systems, eigenvalues and eigenvectors, partial derivatives, Probability Density Functions (PDFs) and Cumulative Distribution Functions (CDFs), and common probability distributions. Furthermore, as the title of the thesis suggests, deep learning and machine learning play a central role in general. Basic knowledge about artificial neural networks, especially the Convolutional Neural Network (CNN) architecture, is assumed. This includes the common layer types (convolutional, pooling, fully connected, dropout, batch normalization), activation functions, loss functions, regularization, optimizers, gradient descent and back-propagation. Regarding pattern recognition and classical machine learning, Principal Component Analysis (PCA), multivariate Gaussians and various common metrics for evaluating classifiers are especially relevant. All evaluation metrics which have been used in this work are briefly explained in Section 3.3. For all of these fields, any introductory textbook about the respective topic that is commonly used for teaching at universities should cover the necessary aspects. However, we offer some recommendations for books which we consider a particularly good fit. For a great modern introduction to deep learning, the reader may have a look at the "Deep learning" book by Goodfellow et al. [GBC16]. As a reference book for pattern recognition and machine learning in general, we recommend Bishop's "Pattern recognition and machine learning" [Bis06]. Both books also cover a brief refresher of most of the mathematical foundations mentioned above. Two more great books worth mentioning are "Introduction to linear algebra" by Strang [Str16] and "Vector calculus, linear algebra, and differential forms: a unified approach" by Hubbard and Hubbard [HH15].

1.4 Methodological Approach

First, it is justified why deep learning is well suited for solving LPTR and what benefits it has over traditional approaches. Then, the basic principle behind using deep learning for solving the problem is explained. At the end of this section, an illustration of how such a deep learning based system can be used in practice is given.

1.4.1 The Difficulties with Traditional Approaches

Although all images of the same LPT should look the same once the respective LP's identifying characters are masked out, this rather theoretical idealization does not hold for real world data. This is due to different kinds of variances of the captured images, as mentioned in Section 1.1. In an idealized world, all images of LPTs are captured directly from the front with the same camera, without any perspective distortion and at neutral illumination conditions without any specular highlights. Under these circumstances, a pixel-wise comparison between two different images of the same LPT would show no difference at all. A common metric for computing pixel-wise similarity between any pair of two images is the Normalized Cross Correlation (NCC).
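As an illustration of such a purely pixel-wise baseline, a zero-mean NCC comparison between two crops could look like the minimal sketch below. The function names are ours and the snippet assumes equally sized grayscale crops; it is not the implementation used in this work.

```python
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Zero-mean normalized cross correlation between two equally sized
    grayscale images; returns a similarity score in [-1, 1]."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def classify_by_ncc(query: np.ndarray, templates: dict) -> str:
    """Naive template matching: the candidate LPT whose reference image
    correlates best with the query crop wins."""
    return max(templates, key=lambda label: ncc(query, templates[label]))
```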
Due to the given variances illustrated in Figure 1.2, this naive approach alone is not capable of solving the LPTR task reliably once more than just a handful of different LPTs are involved. For example, tiny alignment differences obtained by shifting one of the images slightly along the x or y axis would already cause faulty results. A problem like this can be overcome with traditional computer vision approaches for aligning the images. Many other problems caused by variances such as noise may also be solved by hard-coded computer vision solutions designed by a human expert. However, there will most likely always be edge cases which have not been considered during the initial development phase of the system, and these are potential error sources. A particularly hard problem in general is occlusions of any kind. Methods which are based on simple pixel-wise comparisons are not capable of overcoming a problem like this. This is because such algorithms have no understanding of what characterizes the LPT shown in a given image. For a human being, an LPT remains uniquely identifiable as long as enough of the characteristics distinguishing it from other LPTs are not occluded. Being able to handle occlusions therefore requires a deeper understanding of the underlying data than just raw pixels. Researchers have developed many algorithms to compute handcrafted features from images. The main idea is that certain combinations of computed features uniquely identify a specific class of interest. A major disadvantage of handcrafted features is that different features are useful for different datasets. A human expert must find the right features which are capable of solving the underlying problem. This is both time consuming and expensive.

1.4.2 Why Deep Learning?

Deep learning is very promising for a problem like LPTR because it extracts the relevant features identifying a certain object by itself during training. It does not require any sort of expert human knowledge to define meaningful features for the problem at hand. Deep learning only needs a precise definition of the task to be solved in the form of a differentiable optimization problem and a lot of training data as input. However, it must be noted that collecting and selecting the right training data is often very challenging, as unwanted biases can easily occur within the data. Notice that the previously mentioned problem of edge cases which have not been considered during development does not disappear. Deep learning will only learn to solve a task for the data it is trained with. Although it can learn to generalize to similar data, its generalization capabilities for edge cases are quite limited. Thus, the network will most likely have trouble interpreting edge cases correctly if they are not part of the training set. However, once the developer becomes aware of these cases, retraining the network with the corresponding data can often solve the problem. If it does, the relevant features will then be discovered automatically during retraining. This gives another advantage over traditional approaches, where the chosen handcrafted features might not be expressive enough to describe edge cases correctly. In this scenario, selecting new or additional handcrafted features is necessary to overcome the problem. For example, if occlusions are not considered at first, simple features such as contours might be enough for correctly classifying different objects.
But once occlusions are introduced, characterizing parts of the contours may not be visible anymore, which could lead to errors.

1.4.3 Deep Learning for Open Set License Plate Type Recognition

To summarize the previous section, deep learning is capable of overcoming or at least mitigating the problems with all sorts of variances if enough training data is available. For an LPTR classification system, there are two important requirements, as described in Section 1.1.

New Classes. The LPTR system must be able to incorporate new classes which have not been considered during the initial development of the system. Furthermore, adding these new classes should be quick and inexpensive. Therefore, retraining the network after a limited number of new classes has been added should be avoided whenever possible. Not using new classes for retraining the network is regarded as a given constraint for the LPTR system in this work. Every class which is initially available and which is used for training the network is defined as a Train Class (TRC) in this work. In contrast, each class which is added to the system afterwards is defined as a Test Class (TEC). Together, the TRCs and TECs make up the known classes. Overcoming the constraint explained above can be achieved by using techniques from Few-Shot Learning (FSL). Wang et al. [WYKN20] give a great survey on the topic. The goal of FSL is to correctly classify instances of classes for which only a few training samples are available. This is achieved by using an additional dataset for training. The main idea is to learn features from the additional dataset which can also be used for recognizing the classes which only consist of a few samples. In FSL, this additional dataset is often only a means for training. In the LPTR context, the additional dataset consists of the TRCs. The classes with few samples correspond to the TECs. Various techniques for FSL exist which also use the classes with few samples for training the network in addition to the extra dataset. However, this violates the LPTR constraint. Approaches which comply with the constraint are based on embedding spaces. This work focuses on such methods. The basic idea behind using embedding spaces for image classification with deep learning is the following [WYKN20]: A CNN is trained to learn a non-linear function mapping \mathbb{R}^m \rightarrow \mathbb{R}^n, where n \ll m, from the input pixels of an image to a lower dimensional embedding space, which represents the characteristics of different classes. For the actual classification, a simple classifier is applied in the embedding space instead of using it on the raw pixels. The embedding space and its classifier are used due to the open set restriction. For each new TEC, it is only necessary to transform its representative images into the embedding space and to compute a classifier with the obtained embeddings. With this setup, the CNN calculating the embeddings does not require retraining. Notice that in this context, TEC refers to classes which are unknown during embedding space training. It does not refer to classes which are unknown to the entire system. TECs still require training samples in order to compute the parameters of the embedding space classifier. The exact type of classifier which is used in the embedding space is interchangeable, but it is usually a statistical model of simple nature. Different options are described in Section 3.2. These classifiers do not require iterative training; their parameters can be estimated directly.
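To make this concrete, the sketch below shows one of the simplest possible embedding space classifiers, a nearest-class-mean rule with a distance threshold for rejecting unknowns. The class and parameter names are ours and only illustrate the principle; the classifiers actually used in this work are described in Section 3.2 and the thresholding in Section 3.3.1.

```python
import numpy as np

class NearestMeanClassifier:
    """Each class is represented by the mean of its embeddings; a query is
    assigned to the nearest mean or rejected as unknown if too far away."""

    def __init__(self, reject_distance: float):
        self.centers = {}                      # class label -> mean embedding
        self.reject_distance = reject_distance

    def add_class(self, label: str, embeddings: np.ndarray) -> None:
        # "Training" is a single mean computation over the class's embeddings,
        # so new TECs are added without retraining the CNN.
        self.centers[label] = embeddings.mean(axis=0)

    def predict(self, embedding: np.ndarray) -> str:
        label, dist = min(
            ((l, float(np.linalg.norm(embedding - c))) for l, c in self.centers.items()),
            key=lambda pair: pair[1],
        )
        return label if dist <= self.reject_distance else "unknown"
```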
Therefore, updating an embedding space classifier for recognizing additional TECs is simple and quick, and does not violate the LPTR constraint. It must be noted that the TECs are not used in the traditional FSL context in this work. This is because there is no restriction on the maximum number of TEC samples which can be used for training the embedding space classifier. In contrast, FSL can be defined more rigorously by limiting the exact number of samples which are available per class. In this work, more samples of a TEC are always desirable. This closely mimics how an LPTR system would be used in the real world, as data for new TECs would be collected over time. Not using all of this data if it was available would be a clear disadvantage. One benefit of having more data for each TEC is that it allows more complex embedding space classifiers to be used, which is further discussed in Section 1.5.2 and Section 3.2.

Unknown Classes. The LPTR system should also recognize instances of classes which are unknown to itself. In this context, unknown class means that it is neither a TRC nor a TEC, i.e. an LPT which the system has not seen for training the embedding space classifier. Geng et al. [GHC20] give a comprehensive survey on the underlying problem, which is known as Open Set Recognition (OSR). The aim of OSR is to distinguish between known and unknown classes, which fundamentally makes it a binary classification task. Performing OSR can be achieved by setting a threshold for the required confidence in the classifier's prediction for a known class. If the highest computed class confidence is below this threshold, the input is identified as belonging to an unknown class. The threshold is computed empirically using the training datasets of the known and unknown classes. Notice that all samples of the unknown classes can share the same label, as long as it is distinguishable from the labels of the known classes. Combining OSR with embedding spaces is straightforward, as the embedding space classifiers directly output class confidences. In fact, it is also common to use embedding spaces solely for OSR [GHC20]. Embedding spaces also allow thresholding on parameters other than confidence, such as distance, which is further discussed in Section 3.3.1. Learning an embedding space which is capable of distinguishing TECs from each other as well as from TRCs, while still recognizing unknown classes, is challenging. There are various algorithms to train an embedding space which use different loss functions. Recent state-of-the-art approaches are elaborated in Section 2.2. However, the underlying idea remains the same for all of them.

1.4.4 Desired Properties of Embedding Spaces for Classification

The natural question to ask at this point is how a high-quality embedding space must be structured after training. The classification performed within this space should be as simple as possible and TECs must also be recognizable. A desired representation for every class is a compact and homogeneous cluster which has large empty gaps to the clusters of other classes in the embedding space. Intuitively, large gaps between clusters of TRCs allow the cluster of a TEC to be located somewhere in between existing clusters. This enables distinguishing it from all other classes. In other words, it is desirable to only fill a fraction of the available high dimensional embedding space with the TRCs.
Moreover, each TRC should have its own dedicated subspace, preferably of uniform variance, to enable the usage of simpler classifiers. The majority of the embedding space should be left empty after training to leave space for TECs. For these reasons, it is desirable that the classification problem is linearly separable within the embedding space. Two different three dimensional embedding spaces are shown in Figure 1.3. The embedding space in Figure 1.3a is qualitatively better than the one in Figure 1.3b for open set problems, as it is more compact and has larger gaps between different TRCs. (a) This embedding space is obtained by using the Arcface softmax exten- sion [DGXZ19], which is explained in Section 2.2.1. The resulting cluster of each class is more compact and the gaps between different classes are larger than in (b). (b) The cluster of each class is spread out more widely than in (a), leav- ing smaller gaps between distinct classes. This embedding space has been trained with a traditional soft- max classifier enforcing no explicit constraints on the embeddings. Figure 1.3: Points within two 3D embedding spaces trained with the same CNN and dataset, using a different loss function. Each colored point represents the embedding of one image and each color one class. The used dataset is a subset of the license plate type dataset described in Section 4.1 and consists of 9 classes. 8 1.5. Aim of the Work 1.4.5 License Plate Type Recognition using Embedding Spaces in Practice Using an embedding space based classification algorithm in a real world LPTR system can be achieved as follows, once the embedding space has been trained. First, all classes of interest for the specific application are identified. These classes can contain both TRCs and TECs. Then, for every class, a certain amount of embeddings are computed, given a set of representative images for the respective class. The embeddings of each class are then used to compute the relevant parameters of a simple classifier, which can be as trivial as the mean vector of all embeddings. Notice once again that TRC refers to the training of the embedding space. Both TRCs and TECs consist of a train and test set for the embedding space classifiers. The calculated parameters of each classifier are stored in a database and associated with the according class label. All steps mentioned so far only have to be done once in advance, during the system setup. During operation, the system is given an image for which the according class should be computed. To do so, the embedding of the input image is calculated and then given to all precomputed classifiers stored in the database. Assuming that the system is only confronted with images of classes known to its database, the class with the highest score always wins and the according label is assigned to the input image. In scenarios where images of classes unknown to the database can also be an input, i.e. an open set, the system must be able to identify that these images belong to an unknown class. As mentioned before, this is particularly relevant for LPTR in practice, as many rare LPTs exist for which little to no data could be collected in advance. Identifying unknown classes can be achieved by thresholding [GHC20]. This is further discussed in Section 3.3.1. 1.5 Aim of the Work One obvious goal is to develop a system which is capable of solving the LPTR task as described in Section 1.1 as well as possible. 
Apart from that, many interesting scientific questions arise when thinking about this problem. We distinguish between those regarding LPTR specifically and those concerning embedding spaces. To the best of our knowledge, no scientific work which is specifically targeted at solving LPTR has been done yet. A lot of recent research on embedding spaces using deep learning has been conducted on face recognition problems [SKP15], [DGXZ19], [GZJ+15]. These algorithms are further discussed in Section 2.2. Many concepts which are introduced in the face recognition domain can be directly applied to LPTR. Conversely, the analysis methods used in this work which do not directly concern LPTR can be applied to embedding space based methods in general. Face recognition is heavily used as a point of reference in this work, as it operates under similar conditions to LPTR. Methodologically speaking, recognizing unknown LPTs is similar to verifying whether or not two faces are identical. Furthermore, classifying new faces without requiring expensive retraining is a standard use case in face recognition.

1.5.1 Research Questions in the License Plate Type Recognition Domain

The biggest question about LPTR using deep learning is how to deal with the identifying characters of an LP, as they vary between different LPs. The easiest idea which comes to mind is to simply ignore their presence. However, this could lead to severe problems if a certain character was placed at the same position for most images of a particular LPT in the training dataset. This is because it would introduce an unwanted bias. Consider the following scenario of a country or LPT implementing a policy which systematically iterates through characters over time. For example, the LP syntax could have the form annn, where a stands for a letter and n for a number. According to this system, the first registered LP would be A000. Then, each time a new LP was registered, the number would increase until it eventually reached A999. The next LP in order would be B000. A system assigning characters in a way like this can be problematic for developing a learning based LPTR algorithm if the training data has been collected over a short period of time. In our example, the CNN could learn the letter A as a reliable indicator for identifying a certain LPT. This is because deep learning algorithms do not make it possible to formally implement hard, abstract constraints such as "ignore all characters when making a decision". In our example, using the presence of A as the first character for making a decision is an unwanted bias. What makes this problem worse is its time dependent nature, as it may stay unrecognized for a long period of time. A good indicator for a system suffering from this problem could be a slowly decreasing performance over time. In our example, more and more vehicles whose syntax starts with a different letter than A would be registered. The main research question associated with this problem is how to prevent the CNN from learning the presence of particular characters as reliable decision rules. Now, the next question is how to prove that the CNN does not learn such faulty behavior. Many people criticize neural networks for being black box systems, as the exact reasons why they make certain decisions are unknown. Proving that a neural network has a property like "ignores all characters" formally is impossible with current mathematical methods.
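To give the prevention side of this question a concrete shape, the randomized masking evaluated later in this work (Section 4.2) could be realized as an augmentation step along the following lines. This is a hedged sketch of ours: the helper name and the assumption that character bounding boxes are available, for example from an existing OCR stage, are purely illustrative.

```python
import numpy as np

def mask_characters(image: np.ndarray, char_boxes, rng: np.random.Generator,
                    p: float = 0.5) -> np.ndarray:
    """Randomly fill a subset of the identifying-character regions with noise
    so the network cannot rely on specific characters as class evidence.

    image      -- H x W (x C) array of the cropped license plate
    char_boxes -- iterable of (x0, y0, x1, y1) boxes around the characters
    p          -- probability of masking each individual character
    """
    out = image.copy()
    for x0, y0, x1, y1 in char_boxes:
        if rng.random() < p:
            out[y0:y1, x0:x1] = rng.integers(0, 256, out[y0:y1, x0:x1].shape)
    return out
```

Masking of this kind removes the incentive to memorize particular characters, but it cannot by itself prove that the trained network actually ignores them.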
However, many techniques have been developed in recent years which try to explain based on which features a neural network makes a certain decision [SVZ13], [SCD+17], [AGM+18]. Those which are relevant to this work are discussed in Section 2.3. Thus, a crucial part for verifying that the CNN behaves as desired to some degree is to make use of these techniques. Conveniently, the same algorithms can also be used to partially answer another relevant question: Which features does the CNN learn and extract from different LPTs? It is not clear in advance which of the LPT characteristics (symbols, textures, additional texts, positions of the identifying characters) the network will identify as discriminative to what degree. 10 1.5. Aim of the Work 1.5.2 Research Questions about Embedding Spaces Most deep learning research on classification focuses on reaching better performance than previous methods regarding metrics such as Mean Average Precision (mAP) and accuracy. As time has shown, this approach is great for closing the performance gap between humans and machines in many areas such as image classification. However, this strategy does not bring science closer to a real understanding what exactly artificial neural networks learn in order to solve a certain task. Regarding embedding spaces specifically, to the best of our knowledge, there is little to no research which analyzes what a learned embedding space actually looks like. The research about training an embedding space mentioned in Section 2.2 goes into detail about the constraints which are enforced on the embedding space, i.e. under which conditions it is obtained. However, it does not analyze how the data is actually distributed in the embedding space once training has finished. State-of-the-art systems such as triplet loss [SKP15] and Arcface [DGXZ19] neglect the shapes of the true distributions. Instead, they simply compute the mean embedding as center for each class, given a certain number of samples. For classification, they choose the nearest center to the input of interest and assign the respective class label to it. This method only guarantees to select the correct class if each sample’s intra-class distance to the center is always smaller than its inter-class distances to the other centers. As long as this condition is fulfilled, the distributions can have arbitrary shapes. In order to fulfill this criterion for stretched or differently sized distributions, the gaps between distinct classes need to be sufficiently large. While this is fine for closed sets, it is problematic for open sets because TECs will lie within empty gaps between TRCs. Consequently, this condition is likely to be broken for stretched TRCs once TECs are added. Thus, it is desirable that all classes can be recognized independently of the size of the empty gaps between their distributions. This can be achieved with a more restrictive condition, where the classes’ respective distributions must share the same variance in all directions. In this case, the class distributions are hyper-spherical and have the same reach. To summarize, under the open set restriction, a nearest center based classifier must assume the restrictive condition as a prior to guarantee flawless results. An example which violates both the more restrictive and the general condition is illustrated in Figure 1.4. One of the main goals of this thesis is to analyze the class distributions in the embedding space. 
The results of this investigation depend on the training algorithm, dataset and parameters used. It is unclear in advance how similar a class's distribution is to the desired hyper-spherical shape. However, the assumption that a TRC has the shape of a hyper-ellipsoid seems plausible, just by looking at the shapes of the class distributions in Figure 1.3. Besides, the central idea of algorithms learning an embedding space is to minimize intra-class distances and maximize inter-class distances. Thus, each class must be represented by a single cluster in the embedding space, assuming the training has been successful and has converged. A cluster in the shape of a hyper-ellipsoid is a natural way of achieving this.

Figure 1.4: The image shows two distinct classes, where each class for itself has the same variance in all directions. However, the variance of the left class is larger. The yellow dots represent the samples based on which the parameters (mean and variance) of each class have been calculated. The blue circle symbolizes a new sample which should be classified as either belonging to the left or to the right class. The point's Euclidean distance Y to the right class is slightly smaller than its distance X to the left class. However, because the left class has higher variance, it is more likely that the blue point belongs to this class. If only the class means are considered without the variances, the blue circle will be incorrectly assigned to the right class.

Consider the following scenario as an analogy for the emergence of a particular cluster during training. Imagine individual atoms or molecules in outer space flying towards each other due to the force of gravity. Each atom is equivalent to one training sample of the same class in this scenario. Gravity pulls each atom towards every other atom, which is equivalent to minimizing intra-class distances. The point of equilibrium for this situation is one joint spherical object. This object could be a planet or a sun, depending on the kind and number of atoms. This analogy is incomplete, though, because additional forces pushing the atoms of one planet away from the atoms of other planets are missing. These forces would represent maximizing inter-class distances and could potentially cause an elongation along some axes of a planet. In our analogy, such an additional force could be a centrifugal force, caused by the rotation of the planet. Earth is slightly elliptical due to its rotation, for example. The arguments about a cluster's shape presented above only apply to TRCs but not to TECs. The shape of a TEC has far greater uncertainty in the embedding space, as the network has not directly been trained on it. The more similar instances of a TEC are to those of the TRCs, the higher the probability that the TEC's shape will be similar to those of the TRCs. This is because the network knows how to interpret most underlying patterns of a TEC if it is similar enough to the TRCs. It cannot be expected that a TEC which shares zero commonalities with the TRCs will form one compact cluster. In this case, the neural network will most certainly not be able to identify the class's patterns correctly, as it has not encountered them during training. Therefore, analyzing the shapes of TEC distributions seems promising for evaluating whether or not the network has learned to generalize well enough to make sense of the TECs. If it has, the shape of a TEC cluster should look similar to that of a TRC.
To summarize, another goal of this work is to see how strongly the shapes of TEC distributions vary from those of TRC distributions. Note that for the reasons mentioned above, this property depends on the data used and does not generalize to arbitrary datasets.

CHAPTER 2 State of the Art

This chapter is split into three parts. First, the general setup for solving LPR and recent research which uses deep learning for this purpose are briefly discussed in Section 2.1. Then, various algorithms for training embedding spaces are elaborated in Section 2.2. The final part of this chapter (Section 2.3) summarizes different techniques for creating saliency maps, which visualize which parts of input images are most relevant to a trained network. The focus lies on gradient-based saliency techniques.

2.1 License Plate Recognition (LPR)

As LPTR is only a subtask of LPR, it is worth discussing how such a system is typically set up. Broadly speaking, LPR consists of two main subtasks. First, the LP has to be located in the image, which is known as License Plate Detection (LPD). This is necessary because it is not possible to capture only the LP, without any additional background and perspective distortion, in practice. In real world applications, images of LPs are taken with stationary cameras while the respective cars on which the LPs are mounted are driving. This leads to a lot of uncertainty about a car's precise position in image coordinates. As a result, the LP only fills a fraction of the captured image. This ensures that the car will be contained within the camera's view frustum in almost all situations. A captured image typically contains large parts of the car, the street and in many cases even other cars. LPTR operates on crops of LPs and therefore causally follows LPD. The second stage of LPR is responsible for segmenting and classifying the characters on the extracted LP. Precise details about how these two steps are solved are not directly relevant for this work. However, this paragraph points out references for the interested reader. Until the recent advances in deep learning, LPR was dominated by traditional computer vision approaches. Such systems made heavy use of thresholding, edge detection, morphological operations, color, texture and more. A great summary about how these techniques are used for LPR is given by Anagnostopoulos et al. [AAP+08] and by Du et al. [DISB12]. In more recent years, most research regarding LPR has been carried out using deep learning. Laroca et al. [LZG+19], [LSZ+18] use a pipeline consisting of three different neural network models for solving the entire LPR task. They use the You Only Look Once (YOLO) network architecture, introduced by Redmon et al. [RDGF16], for each stage. In a nutshell, YOLO is capable of classifying several objects in an image. The underlying network also outputs the respective axis-aligned bounding box in image coordinates for each classified object. In the approach by Laroca et al., the first network is in charge of detecting all vehicles in the scene to reduce the search space for plates. The second network then searches for license plates on each detected vehicle in the scene individually. Interestingly, they apply what they call layout classification within this step. This can be viewed as a simple closed set variation of LPTR which coarsely distinguishes between different LP layouts.
However, their classification system is only used as a means to identify the approximate shape and overall layout of an LP for further processing. Its goal is not to distinguish between different LPTs. The final stage of their system processes each detected plate one by one and detects and classifies all characters on the plate. Knowing the plate layout from the previous step, it is possible to modify the read characters to fit the constraints implied by the layout. Note that this method performs all computer vision stages of the LPR task using neural networks and does not require any traditional computer vision approaches. Jorgensen [Jør17] and Silva et al. [SJ17] develop very similar systems which are based on a YOLO network for LPD as well.

2.2 Embedding Spaces

Using intermediate embedding spaces is state of the art in both OSR [GHC20] and FSL [WYKN20]. This section discusses various methods for training embedding spaces. The goal of each technique is to create empty gaps between distinct classes by maximizing inter-class distances and minimizing intra-class distances.

2.2.1 Softmax Extensions

First, traditional classification using the softmax activation function is discussed from a geometric viewpoint. This new perspective allows the underlying concept to be modified such that margin constraints can be added between distinct classes, leading to more compact embedding spaces. Three state-of-the-art approaches, on which the derivations and explanations in this section are based and by which they are inspired, are introduced and compared [LWY+17], [WWZ+18], [DGXZ19].

2.2.1.1 Softmax from a Geometric Perspective

The standard way of training a classifier with a neural network is to use a separate output for each possible class j, i.e. one-hot-encoded labels. The raw output values computed by the network for a given input a are also known as logits. To obtain a confidence score p_j for each class based on the logits, the softmax activation function is used as defined in Equation 2.1. Intuitively, it can be understood as interpreting the output logits of the network as unnormalized log likelihoods. This observation follows directly from the equation, which first transforms the logits from log space to linear space and then normalizes over all N classes.

p_j = \frac{e^{\mathrm{logit}_j}}{\sum_{k=1}^{N} e^{\mathrm{logit}_k}} \qquad (2.1)

The loss function L used for training, which is based on these confidences, is given by Equation 2.2. There are two different interpretations of this function, which are maximizing the log likelihood of the target class y and minimizing the cross entropy H between the target outputs z_j and the actual outputs p_j. The former is easy to see, as maximizing the log likelihood is equal to minimizing the negative log likelihood. The latter can be derived from the general equation for cross entropy by setting z_j = 0 for all j \neq y and z_y = 1, as shown in Equation 2.3.

L = -\log(p_y) \qquad (2.2)

L = H(z, p) = -\sum_{j=1}^{N} z_j \log(p_j) = -\log(p_y) \qquad (2.3)

Independently of the interpretation, the loss will decrease if p_y increases and will converge to 0 as p_y approaches 1. p_y in turn increases if either logit_y increases or the logits of all other classes decrease, which is exactly what happens during training.
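Equations 2.1 to 2.3 translate directly into a few lines of code. The following sketch of ours, using an arbitrary three-class logit vector, makes the relation between logits, confidences and the loss explicit:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Equation 2.1; subtracting the maximum does not change the result but
    # keeps the exponentials numerically stable.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy_loss(logits: np.ndarray, target: int) -> float:
    # Equations 2.2/2.3: the negative log confidence of the target class y.
    return float(-np.log(softmax(logits)[target]))

logits = np.array([2.0, -1.0, 0.5])   # raw network outputs for three classes
print(softmax(logits))                 # class confidences p_j
print(cross_entropy_loss(logits, target=0))
```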
The logits themselves are the final outputs computed over many neural network layers with non-linearities in between. Let x be the activation feature vector of the final layer before the individual class outputs. To keep the equations clearer, indices are not considered for different inputs, and x can represent the feature vector of any single input a. In this case, the individual output logits are computed by Equation 2.4. W_j represents the weights of the final fully connected layer which are used for computing the respective class's output logit_j. More specifically, W_j is the jth row vector of the last fully connected layer's weight matrix W, and b represents the additive scalar bias. Note that Equation 2.4 is also written in terms of the dot product, which incorporates the angle \varphi_j \in [0, \pi] between the vectors W_j and x.

\mathrm{logit}_j = W_j \cdot x + b = \|W_j\| \cdot \|x\| \cdot \cos\varphi_j + b \qquad (2.4)

The computation of the logits by these means, without the bias, is visualized in Figure 2.1.

Figure 2.1: A schematic drawing of a typical architecture for a neural network classifier. The input a is transformed to a feature vector x over many layers. This vector is then multiplied by the vector W_j to obtain the respective output logit for class j.

Equation 2.1 implies that the highest confidence is always assigned to the logit with the highest value. This is because the exponential function is strictly monotonically increasing over \mathbb{R}. Thus, an input a is assigned to the target class y if Inequality 2.5 holds.

\|W_y\| \cdot \|x\| \cdot \cos\varphi_y + b > \|W_j\| \cdot \|x\| \cdot \cos\varphi_j + b \quad \forall j \neq y \qquad (2.5)

Notice that when both \|W_j\| and \|x\| are set to 1, b is set to 0 and the logits are multiplied by a constant scalar s instead, the only remaining variable in Equation 2.4 is the angle \varphi_j. Therefore, it simplifies to Equation 2.6. The effects of s are discussed in Section 2.2.1.3.

\mathrm{logit}_j = s \cdot \cos\varphi_j \qquad (2.6)

By substituting Equation 2.6 into Inequality 2.5, it is easy to see that the classification decision only depends on the angles \varphi_j. To classify an input correctly, Inequality 2.7 must be fulfilled.

\cos\varphi_y > \cos\varphi_j \quad \forall j \neq y \qquad (2.7)

This can be further simplified to Inequality 2.8 by taking the inverse cosine function, which flips the inequality, as the arccosine is strictly monotonically decreasing on the range [-1, 1].

\varphi_y < \varphi_j \quad \forall j \neq y \qquad (2.8)

Given this inequality, one can observe that in order to fulfill the conditions, the angle \varphi_y of the target class must decrease, while the angles \varphi_j of all other classes should increase. In fact, this is exactly what the neural network optimizes for during training, as decreasing \varphi_y increases logit_y and increasing \varphi_j decreases logit_j. Notice that the training operates in angular space on the manifold of a hyper-sphere and not in Euclidean space, because both W_j and x are normalized to unit length. This is why the network must learn to distinguish between classes solely based on the angles \varphi_j. Intuitively, the training process can be imagined as pulling all x of the same class closer together while pushing them further away from the x of all other classes. More specifically, this process happens on the manifold of a hyper-sphere. At this point, the reader might notice the similarity between the desired properties for the space in which all x lie and those of an embedding space. In fact, the space of all x corresponds to an embedding space. One possible outcome of a converged training for various normalized x in an embedding space of dimensionality 3 is visualized in Figure 2.2. This figure shows the normalized versions of the same x shown in Figure 1.3a.
Another way to think about the difference is that Figure 1.3a shows the values of x as they truly are before normalization, as opposed to Figure 2.2, which shows how they are interpreted during training.

Figure 2.2: A 3D embedding space obtained by training a classifier using Arcface [DGXZ19]. Each point represents the normalized feature vector x of an input a and each class is illustrated by one color. Due to the normalization, all points lie on the surface of a sphere. During training, the angular distance of each x to its respective class center W_j is minimized.

The angle \varphi_j is given by Equation 2.9 and simplifies to Equation 2.10 if \|W_j\| = \|x\| = 1 holds.

\varphi_j = \arccos\left(\frac{W_j \cdot x}{\|W_j\| \cdot \|x\|}\right) \qquad (2.9)

\varphi_j = \arccos(W_j \cdot x) \qquad (2.10)

As W_j is a weight vector of the network, it is constant for all inputs x at any fixed point during training. The network tries to minimize the angles \varphi_j between W_j and all x which belong to class j. This allows W_j to be interpreted as the center of class j.

2.2.1.2 Incorporating the Margin

Leaving the computation of the logits as in Equation 2.6 will result in a separation between distinct classes without enforcing empty gaps in between them, as Inequality 2.8 illustrates. Thus, we want to add a penalty in the form of a margin for the target class y, which incorporates the desired gaps. The underlying idea is to pretend that a feature vector x is further away from its target center W_y than it truly is. This forces the training to decrease the angle \varphi_y even further to classify the input correctly, leading to more intra-class compactness. As each class adds the margin to its respective samples, this leads to more compactness for all classes. As a consequence, empty gaps between distinct classes will emerge. When making use of the margin, all equations mentioned in Section 2.2.1.1 remain the same except for the computation of the logits. Regarding the logits, only the computation for the target class y changes; all other logit_j, \forall j \neq y, remain unchanged and are computed as in Equation 2.6. Sphereface by Liu et al. [LWY+17] uses a multiplicative margin, while both Cosface by Wang et al. [WWZ+18] and Arcface by Deng et al. [DGXZ19] apply additive margins in different ways. The respective formulas are given by Equations 2.11, 2.12 and 2.13. The reason why all three have the term face in their name is that they were originally developed in the domain of face recognition.

\mathrm{logit}_y = \|x\| \cdot \cos(m_s \cdot \varphi_y) \qquad (2.11)

\mathrm{logit}_y = s \cdot (\cos(\varphi_y) - m_c) \qquad (2.12)

\mathrm{logit}_y = s \cdot \cos(\varphi_y + m_a) \qquad (2.13)

Notice how Sphereface, given by Equation 2.11, does not multiply by a constant scalar s but by \|x\| instead. This method was the first of the three approaches, and the others later discovered that fixing the value for \|x\| by using the constant s instead gives better results. This is explained in more detail in Section 2.2.1.3.
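For concreteness, the three margin formulations can be written as one small function. The sketch below is ours: the margin and scale values are arbitrary illustrative picks, and the constant scale s is used for all three variants even though Sphereface, as noted above, originally scales by ‖x‖.

```python
import numpy as np

def margin_logits(cos_phi: np.ndarray, target: int, s: float = 30.0,
                  m_s: float = 1.35, m_c: float = 0.35, m_a: float = 0.5,
                  kind: str = "arcface") -> np.ndarray:
    """Apply one of the margin penalties (Equations 2.11-2.13) to the target
    class; all other logits stay s * cos(phi_j) as in Equation 2.6."""
    phi = np.arccos(np.clip(cos_phi, -1.0, 1.0))
    logits = s * cos_phi
    if kind == "sphereface":          # multiplicative angular margin
        logits[target] = s * np.cos(m_s * phi[target])
    elif kind == "cosface":           # additive cosine margin
        logits[target] = s * (cos_phi[target] - m_c)
    elif kind == "arcface":           # additive angular margin
        logits[target] = s * np.cos(phi[target] + m_a)
    return logits
```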
Now the question is which of the three margins works best empirically. To understand their difference intuitively, a geometric interpretation is helpful. Let us consider a binary classification problem with classes A and B for simplicity and visualization purposes. It is easy to generalize the following observations to multiple classes. For binary classification, the decision region for class A is given by Inequalities 2.14, 2.15 and 2.16 for Sphereface, Cosface and Arcface, respectively. All three can be derived in the same way as Inequality 2.8. To obtain the decision region for B instead of A, \varphi_B and \varphi_A simply have to be exchanged with one another. The resulting decision regions are visualized in Figure 2.3.

\varphi_A \cdot m_s < \varphi_B \qquad (2.14)

\arccos(\cos(\varphi_A) - m_c) < \varphi_B \qquad (2.15)

\varphi_A + m_a < \varphi_B \qquad (2.16)

Figure 2.3: Decision regions for a binary classification problem for the traditional softmax classifier and its extensions which make use of a margin parameter. The blue regions represent the combinations of angle values which classify the respective input as A and the red regions illustrate the equivalent for B. The algorithm and margin value used are written above each respective graph. (a) If there is no margin, both classes share the same boundary and no gap is formed between the classes. (b) The enforced gap between the region boundaries increases linearly as \varphi increases. (c) A non-linear decision boundary which corresponds to wider gaps for particularly small and large \varphi. (d) The gap between both classes has the same constant value which is independent of \varphi.

The enforced gap between any pair of two classes in a multi-class scenario behaves equivalently to the binary classification setup. However, this is difficult to visualize, as each added class introduces another dimension. Furthermore, it seems intuitive that the enforced gap should be independent of the exact value of the target class's angle \varphi_y. For example, why should it be desirable that a small angle requires less margin than a larger angle, as enforced by Sphereface and shown in Figure 2.3b? Consider that if \varphi_y is small, all other angles \varphi_j should still be significantly larger. Empirical experiments by Deng et al. [DGXZ19] confirm the intuition that the constant gap enforced by Arcface indeed gives better results than the gaps which emerge from Sphereface or Cosface. In their study, the quality of the learned embedding space is evaluated in the open set face recognition context for various datasets. The authors also experiment with applying all three different types of margins at the same time, as they are not mutually exclusive. Their results indicate that the Arcface margin alone works best.

2.2.1.3 Impact of the Scale Parameter

Finally, let us consider the effect of the scale parameter s. Wang et al. [WWZ+18] formally derive a lower bound for its value which depends on the number of classes. We try to offer an intuitive, alternative explanation for why this value is necessary and how it can be chosen. Without applying any scaling to the logits, their value will always be within the range [-1, 1] if both W_j and x have unit length. If we assume that the target class y scores the maximal value 1 while all other N - 1 classes take the minimal value -1, the resulting confidence p_y can be computed by Equation 2.17. It can be derived by plugging the respective values into Equation 2.1. Notice that Equation 2.17 assumes the best case scenario, yielding the highest possible value for p_y. In practice, the unscaled logit_y will be smaller than 1 and the other unscaled logits will be larger than -1.

p_y = \frac{e^s}{e^s + (N-1) \cdot e^{-s}} \qquad (2.17)
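Equation 2.17 is easy to evaluate numerically. The small script below (ours; the specific values of s and N are arbitrary picks, not the settings used for Figure 2.4) prints the best-case confidence for a few combinations:

```python
import numpy as np

def best_case_confidence(s: float, n_classes: int) -> float:
    # Equation 2.17: scaled target logit +s, all other scaled logits -s.
    return float(np.exp(s) / (np.exp(s) + (n_classes - 1) * np.exp(-s)))

for n in (10, 100, 1000):
    for s in (1, 10, 30):
        print(f"N={n:5d}  s={s:3d}  p_y={best_case_confidence(s, n):.4f}")
```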
2.2.2 Triplet Loss

Another popular method for learning an embedding space is to optimize directly on the embeddings, without using an additional bottleneck classification layer as in Section 2.2.1. In order to do so, constraints for intra-class compactness and inter-class gaps are incorporated into the loss function, which operates on concrete vectors in the embedding space. Broadly speaking, there are two different types of loss functions of this kind: contrastive loss and triplet loss. Examples of the former are given by [TYRW14] and by Sun et al. [SCWT14], using Siamese networks. However, here we focus on the latter, which has been introduced by Schroff et al. [SKP15]. This is because its results outperform contrastive loss in most domains, such as face recognition [MD18]. It is also the most widely used method which operates directly on the embedding space.

The main idea behind triplet loss is to make use of three different inputs at once for a single loss value computation by putting them in relation to one another. The inputs are the anchor aa, positive ap and negative an. Their respective representations in the embedding space are xa, xp and xn. The anchor can be viewed as the equivalent of the target class in classification. The positive represents an input of the same class as the anchor and the negative an input of a different class. The aim is to pull the positive closer to the anchor and to push the negative further away from the anchor. The concept of triplet loss is illustrated in Figure 2.5.

Figure 2.5: A schematic drawing of the principle behind triplet loss. The inputs anchor aa, positive ap and negative an are transformed to their respective embedding representations xa, xp and xn, using the same neural network. The green arrow indicates that the network aims to pull xp closer to xa, whereas the red arrow symbolizes that xn is being pushed away from xa during training.

The desired outcome is that xp is a certain amount m closer to xa than xn, for all possible configurations of the triplet T(xa, xp, xn). The hyper-parameter m represents the margin or desired gap between distinct classes. Mathematically, this is expressed by Equation 2.19.

dist(x_a, x_p) + m < dist(x_a, x_n) \quad \forall (x_a, x_p, x_n) \in T   (2.19)

Originally, the distance between different embeddings was measured in Euclidean space, but Deng et al. [DGXZ19] later pointed out that better results are obtained in angular space. This is because all x are normalized for computing the triplet loss, which constrains them to lie on the manifold of a hyper-sphere, just like for all methods described in Section 2.2.1. Due to the constant curvature of the hyper-sphere's manifold, it is more natural to measure distances with angles. To account for different metrics and to generalize, dist(a, b) is used as a placeholder for an appropriate distance metric in this section. As mentioned, two examples are the Euclidean distance dist(a, b) = ‖a − b‖ and the angular distance dist(a, b) = arccos(a · b) with ‖a‖ = ‖b‖ = 1. The loss function which achieves the desired constraints is given by Equation 2.20. The max function is used for clamping, in order to avoid negative loss in cases where the constraint is already fulfilled.

L = \max(0, \; dist(x_a, x_p) + m - dist(x_a, x_n))   (2.20)

The challenging aspect about triplet loss is to select triplets from which the optimization benefits most. For example, if a specific triplet already fulfills the constraint of Equation 2.19, there is no loss. This results in no improvement and no change of the network's parameters. The training benefits most from hard positives and hard negatives. In this context, the hardest positive is the positive furthest away from the anchor, argmax_{x_p} dist(x_a, x_p), and the hardest negative is the negative closest to the anchor, argmin_{x_n} dist(x_a, x_n). Schroff et al. [SKP15] note that consistently choosing a negative which is closer to the anchor than the positive, dist(x_a, x_n) < dist(x_a, x_p), can result in convergence to a poor local minimum. Finding the optimal triplets for each parameter update during training scales poorly, as it requires searching through the entire dataset each time. What makes it worse is that the complexity of finding the best triplets for n samples is not O(n) but O(n³), because all permutations need to be considered for the triplet. Therefore, online search strategies are often used for triplet mining in practice. One simple method is to increase the batch size and to only pick a subset of the samples within the batch which contains quality triplets. Some more advanced strategies for online triplet mining are given by Hermans et al. [HBL17].
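The following sketch illustrates Equation 2.20 together with a simple batch-hard online mining step in the spirit of the strategies cited above. It is a PyTorch-style illustration with the Euclidean distance as placeholder metric; the function names and the margin value are not taken from this thesis, and degenerate batches (e.g. a class with a single sample) are not handled.

import torch

def triplet_loss(x_a, x_p, x_n, m=0.2):
    # Equation 2.20 with dist(a, b) = ||a - b||, averaged over the batch.
    d_ap = (x_a - x_p).norm(dim=1)
    d_an = (x_a - x_n).norm(dim=1)
    return torch.clamp(d_ap + m - d_an, min=0).mean()

def batch_hard_triplets(x, labels):
    # For every anchor in the batch, pick the hardest positive (furthest same-class
    # sample) and the hardest negative (closest other-class sample).
    dist = torch.cdist(x, x)                        # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = dist.clone(); pos[~same] = -1.0           # ignore other classes
    neg = dist.clone(); neg[same] = float("inf")    # ignore the own class (incl. the anchor)
    return pos.argmax(dim=1), neg.argmin(dim=1)     # indices of hardest positive / negative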
2.2.3 Auto-encoders

One of the most popular and researched unsupervised deep learning methods is the auto-encoder. This section is about the supervised auto-encoder by Gao et al. [GZJ+15], which adds supervision in the form of class annotations to a traditional auto-encoder. The goal of the supervision is to add a constraint for intra-class compactness in addition to the reconstruction target, in order to learn a qualitative embedding space.

The aim of a traditional auto-encoder is to learn an encoding function f(a) = x and a decoding function g(x) = o such that the loss L = ‖a − o‖² = ‖a − g(f(a))‖² is minimized. This means that the reconstructed output should be as similar to the input as possible. An auto-encoder only makes sense if the dimensionality of the encoding x is significantly smaller than the dimensionality of the input a, i.e. the information is compressed. The architecture of an auto-encoder is schematically illustrated in Figure 2.6.

Figure 2.6: A schematic drawing of an auto-encoder. The input a is compressed into x, which in turn is used for reconstructing o. The aim of training is to make o as similar to a as possible.

Typically, auto-encoders are used for compression, feature reduction or clustering, which means that only the encoder is of interest after training. The decoder is only a remedy for learning. The elegance of the auto-encoder originates from its unsupervised nature, as its only target is to reconstruct the input as closely as possible. Furthermore, it can be trained in a layer-by-layer fashion. First, an encoding f1(a) = x1 and decoding g1(x1) = o are learned. Afterwards, another encoding f2(x1) = x2 and decoding g2(x2) = x1 are trained. This process can be repeated iteratively until a desired compression factor is reached.

Notice that for open set problems, the encoded space represents the embedding space. However, no constraints on the desired shape of the embedding space are enforced by this traditional target function formulation. The only goal is to reconstruct inputs, but there are many different ways of achieving this. For example, it is plausible that inputs which show the same type of variance, such as Gaussian noise, are summarized by similar embeddings. However, for classification it is desired that inputs of the same class are described by similar embeddings, as opposed to inputs exposed to the same variance.

The main idea behind the supervised auto-encoder by Gao et al. [GZJ+15] is to add a class similarity constraint in addition to the reconstruction target. To do so, a reference input a_j^r, which represents the perfect instance of a particular class j with as little variance as possible, is used. The goal of training is slightly modified, such that the reconstructed output o_j of any input a_j should match the reference input a_j^r as closely as possible, instead of the actual input a_j. In addition, the encoding x_j should be as similar as possible to the reference encoding x_j^r. The complete resulting loss function is summarized in Equation 2.21, where λ is a hyper-parameter controlling the weighting between the two loss components. The authors also add another loss component which aims for sparsity in the network's activations. However, as we do not use this term for any of the other losses in Section 2.2, we also omit it here. Notice that a key difference compared to the other losses in this section is that this approach only enforces intra-class compactness. It does not explicitly aim for increased inter-class distances.

L = \lambda \, \lVert a_j^r - g(f(a_j)) \rVert^2 + (1 - \lambda) \, \lVert f(a_j^r) - f(a_j) \rVert^2   (2.21)

The supervised auto-encoder shares many similarities with the denoising auto-encoder [VLL+10]. The latter is an unsupervised method which adds artificial noise to the inputs. Similar to the former, it tries to match the reconstructed output of the noisy input to the unnoisy (reference) input. The main difference is that the supervised auto-encoder does not use noisy inputs, but any input of the same class as the reference input. This also explains why it requires supervision while the denoising auto-encoder does not.
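A minimal sketch of the supervised reconstruction objective in Equation 2.21 is given below. It assumes encoder and decoder are the two halves of the auto-encoder and that the reference input is broadcast against a batch of same-class inputs; names and shapes are illustrative only.

import torch

def supervised_autoencoder_loss(a_j, a_ref, encoder, decoder, lam=0.5):
    # Equation 2.21: reconstruct the class reference instead of the input itself,
    # and pull the encoding towards the reference encoding.
    # a_j: batch of inputs of class j, a_ref: reference input of class j (batch of size 1).
    x_j = encoder(a_j)
    x_ref = encoder(a_ref)
    recon = ((a_ref - decoder(x_j)) ** 2).sum(dim=tuple(range(1, a_ref.dim()))).mean()
    embed = ((x_ref - x_j) ** 2).sum(dim=1).mean()
    return lam * recon + (1 - lam) * embed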
2.3 Saliency Maps

The biggest criticism neural networks receive, amongst experts and the public in general, is that they are black box or at least dark-gray box systems. Their decisions and inner workings are not fully understandable for a human because there are simply too many parameters to be considered. Trying to mitigate this issue, an entirely new branch studying the interpretability and comprehensibility of neural networks has emerged. One key question which this research is trying to answer is "Which parts of an image are actually important when the network makes a decision?". For example, in the context of LPTR, one could expect that special symbols and textures of an LP are good indicators for classification. This question can be answered by saliency map techniques, which this section and this work mainly focus on. A different subbranch of this research, called feature visualization, views the problem from a different perspective and asks "Which features maximize the activation of a single neuron or a group of neurons in general?". By "in general", features which do not necessarily have to be part of an actual image of the dataset are meant. This question is briefly discussed in Section 2.3.5, but left for more rigorous exploration in future work.

2.3.1 Why Saliency Maps?

Because neural networks learn patterns from data, any bias in the training data can trick the network into believing that the wrong patterns are relevant for making a certain decision. Ribeiro et al. [RSG16] show an example where a classifier is trained to distinguish between huskies and wolves. But the training data is biased, such that all images of wolves show them in snow scenes whereas the huskies are never shown in such a setting. Testing the trained model with data showing huskies in snow scenes and wolves in different scenes shows that the network has learned to classify between huskies and wolves depending on whether or not snow is present in the background. In this context, this means that huskies in snow scenes are classified as wolves and the other way around. This is clearly undesirable and unacceptable. The problem emerges because the network is only given the task which needs to be solved during training. There are no constraints on how to solve the problem. The snow bias in the example above is quite obvious and arguably not too difficult to identify. However, not all biases are this easy to spot and some may hide deep within the data. The usage of saliency map techniques often allows to find such learned biases via careful human inspection.
2.3.2 Vanilla Saliency Maps

Saliency maps as introduced by Simonyan et al. [SVZ13] are a popular and simple method for investigating to which parts of the input image the network reacts strongly after training. Consequently, they can be used to spot learned biases by careful human inspection, to some degree. The idea behind vanilla saliency maps for a classification network is to first compute which class j has the highest score, logit_max = max_j(logit_j), after a regular forward pass. Then, the gradients which emerge solely from back-propagating the signal of the neuron with the highest class score are computed. The result of this step contains a partial derivative G_{xy}^a for every pixel a_{xy} of the input image a, given that it is a grayscale image. Mathematically, the partial derivative for a pixel a_{xy} is given by Equation 2.22.

G_{xy}^a = \frac{\partial \, logit_{max}}{\partial a_{xy}}   (2.22)

The absolute value of this partial derivative, |G_{xy}^a|, is a good approximation for how much the respective pixel influences the largest logit. Together, all |G_{xy}^a| make up the saliency map. An intuitive interpretation for why these values correspond to saliency attributions is the following: changing the value of a pixel whose corresponding partial derivative is large will affect the largest logit more significantly than adapting a pixel with a lower partial derivative by the same amount. Therefore, it is more important for the classification decision. To summarize, a saliency map consists of the absolute values of the partial derivatives of the largest logit with respect to the pixels of the input image. In practice, these values are rescaled to a different range, such that they can be interpreted as a heatmap by a human.

When computing the gradients based on the highest class score, it is important to use the raw class logit before the subsequent softmax layer. This is because increasing the softmax score of a certain class can also be achieved by reducing the logits of other classes. This is undesirable in this context because we are interested in which parts of the image cause a large logit for one specific class, as opposed to which parts decrease the logits of other classes. For our biased wolf and husky example, a typical saliency map for a wolf classification would highlight the pixels showing the snow and disregard the regions of the image showing the actual wolf.
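A compact sketch of this computation is given below. It is a PyTorch-style illustration, not the implementation used in this work: the model is assumed to output raw logits, the input is a single image, and the channel dimension is collapsed with a maximum, which is one common but not the only choice.

import torch

def vanilla_saliency(model, image):
    # image: a single input of shape (1, channels, height, width).
    # The map consists of |d logit_max / d a_xy| for every input pixel (Equation 2.22).
    image = image.clone().requires_grad_(True)
    logits = model(image)                      # raw class scores, before the softmax
    logits[0, logits.argmax()].backward()      # back-propagate only the largest logit
    saliency = image.grad.abs().max(dim=1)[0]  # combine channels by taking the maximum
    return saliency.squeeze(0)                 # (height, width) heatmap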
2.3.3 Gradient-weighted Class Activation Mapping (Grad-CAM)

While a vanilla saliency map is a great first step towards understanding which features in a given image a neural network reacts to, it still does not reveal much about what is going on within the hidden layers of the network. Selvaraju et al. [SCD+17] have introduced an approach called Grad-CAM which tackles this problem. Grad-CAM can be viewed as an extension or generalization of vanilla saliency maps. Instead of only looking at the partial derivatives G_{xy}^a at the input image a, the gradients G_{uv}^l of a hidden layer A^l are considered. In this context, l refers to the l-th layer of the network and u and v specify the spatial position within the layer. Notice that any G_{uv}^l represents a gradient as opposed to a partial derivative because there are several feature maps per layer. This is why all activations of a convolutional layer together are also known as a feature volume, whose width U_l and height V_l decrease for deeper layers of the network. Grad-CAM applies smoothing by averaging the partial derivatives of each feature map to retrieve a weighting vector w^l for each layer l, as illustrated in Equation 2.23.

w^l = \frac{1}{U_l \cdot V_l} \sum_{u=1}^{U_l} \sum_{v=1}^{V_l} \frac{\partial \, logit_{max}}{\partial A_{uv}^l}   (2.23)

Consequently, each feature map can be interpreted as either having a positive or a negative influence on the largest logit. This smoothing is particularly reasonable for deeper layers of the network, where features become increasingly complex and are likely to be detected by entire feature maps as opposed to individual feature-wise convolutions. Finally, to retrieve a single 2D saliency map M^l for each layer l, the dot product between the respective activations and the weight vector is computed for every spatial position (u, v), as in Equation 2.24.

M_{uv}^l = ReLU(A_{uv}^l \cdot w^l)   (2.24)

Looking at this equation, one notices two things. Firstly, Grad-CAM does not construct the heatmap solely based on the gradients, but additionally makes use of the activations A^l. Secondly, negative pixel values of the resulting heatmap are truncated by the Rectified Linear Unit (ReLU) operation. For a given image, the activations within a feature volume represent how important certain image regions are for making a classification decision and not what the image looks like at these regions. Notice that for vanilla saliency maps the input pixels have the same role as the activations for Grad-CAM. Using these activations for constructing a Grad-CAM saliency map is the biggest difference compared to computing the respective vanilla saliency map. For the latter, the values of the input pixels are disregarded because they could potentially be misleading.
Consider a scenario where the partial derivative for a very bright input pixel is small. If this partial derivative was multiplied by the pixel’s intensity, the resulting saliency attribution would falsely become quite large. The opposite argument can also be made for a dark pixel with a large partial derivative. In a nutshell, the problem with using pixel intensities directly for vanilla saliency maps is that dark pixels can be part of important features whereas bright pixels might not be. 29 2. State of the Art For hidden layers, especially those at the end of the network, the situation is different, as large activations correspond to detected features. The larger an activation and its partial derivative, the more important the respective feature is for a classification decision. If an activation is positive but its partial derivative is negative, this indicates that the detected feature is important for a different class than the one with the largest logit. As all feature volumes Al are assumed to be the result of a preceding ReLU activation, the other two possible cases where an activation is negative do not need to be considered. In practice, the Grad-CAM saliency map is only computed for the last convolutional layer because it represents the most high level features which have been detected. As a consequence, the resolution of the heat map is significantly smaller than for a vanilla saliency map, which can make it appear very pixelated. 2.3.4 Other Saliency Map Techniques It must be noted that there are many other algorithms for computing saliency maps based on gradients, such as Integrated Gradients by Sundararajan et al. [STY17]. However, recent research by Adebayo et al. [AGM+18] has shown that many of these methods are actually not very reliable. In their experiments, they have performed sanity checks by computing saliency maps for both completely untrained classifiers and partially corrupted classifiers, for which a subset of the trained weights have been exchanged by random variables. Furthermore, they have investigated saliency maps which have been created using a network trained on randomly labeled data. Their results demonstrate that many saliency techniques such as Integrated Gradients show little change under such corruptions compared to the reference saliency maps created from a properly trained network. They conclude that many of the algorithms which fail their sanity checks share similarities with regular edge detectors. Only Grad-CAM and vanilla saliency maps pass their sanity checks. This is also one of the reasons why these techniques have been chosen for this work. For the sake of completion, it must be said that saliency techniques which are not based on gradients exist too. Examples for these are Layer-wise Relevance Propagation by Bach et al. [BBM+15] and Deep Learning Important Features by Shrikumar et al. [SGK17]. The major disadvantage of such methods is that they require additional algorithms for back-propagating their relevance signals to the input. In contrast, the gradient back-propagation required for gradient-based saliency techniques is always available, as it is needed for training anyway. The biggest benefit of such methods is that they allow to back-propagate relevance even if the respective gradients are zero. However, these techniques are not used within this work and therefore not further discussed. The interested reader is encouraged to read the papers mentioned within this paragraph for more information. 
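Since Grad-CAM and vanilla saliency maps are the two techniques used in this work, the following sketch summarizes the Grad-CAM computation of Equations 2.23 and 2.24. It is a PyTorch-style illustration under the assumption that the feature volume of the last convolutional layer has been recorded with a forward hook and is still part of the autograd graph; all names are illustrative.

import torch
import torch.nn.functional as F

def grad_cam(activations, logits):
    # activations: feature volume A^l of shape (1, channels, U_l, V_l), recorded via a hook.
    # logits:      raw class scores of shape (1, N) from the same forward pass.
    grads = torch.autograd.grad(logits[0, logits.argmax()], activations)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)     # Equation 2.23: one weight per feature map
    cam = F.relu((weights * activations).sum(dim=1))   # Equation 2.24: weighted sum, then ReLU
    return cam.squeeze(0)                               # (U_l, V_l) low-resolution heatmap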
2.3.5 Feature Visualization While saliency techniques reveal which image regions are considered for a classification decision, they do not show which images maximally activate a specific neuron or com- 30 2.3. Saliency Maps binations of neurons. An entire subbranch called feature visualization tries to answer this question. A good introduction is given by Olah et al. [OMS17], for example. The basic idea of feature visualization is to use gradient ascent on the input image instead of gradient descent on the network weights, to maximize a specific neuron activation of the network. After convergence, the resulting input image shows which features the neuron of interest reacts to most strongly. More recent work on this topic by Carter et al. [CAS+19] focuses on creating 2D embed- dings for frequently occurring activations and maps objects of each class to their own region within this embedding space. For the mapped features, the input images which create the according activations are created and shown in the 2D embedding space in a so called activation atlas. This technique allows to identify how similar objects are positioned relative to each other in the 2D embedding space. It must be noted that the embeddings do not directly correspond to an activation within the network. Instead, they are computed using t-SNE [MH08], a non-linear dimensionality reduction technique, on the last convolutional layer. 31 CHAPTER 3 Methodology While Chapter 2 summarizes recent state-of-the-art methods related to this work, this chapter discusses other techniques which have been used. The majority of this chapter deals with classification on an open set using embedding spaces in general and is not specific to LPTR. First, various training aspects such as architectural choices for the neural network and hyper-parameter tuning are discussed in Section 3.1. Afterwards, different options for performing classification within a learned embedding space are elaborated in Section 3.2. Then, the used evaluation metrics are described in Section 3.3. These include both performance benchmarks on the test dataset and metrics for evaluating the embedding space. Finally, an extension for gradient-based saliency maps, which allows to also compute them for TECs, is introduced in Section 3.4. 3.1 Training the Embedding Space In Section 2.2 many ways for training an embedding space effectively have been introduced. However, nothing has been said about the underlying network which transforms the inputs to feature vectors of the embedding space. The first part of this section discusses different architectural options for this purpose. Then, the selection of the training algorithm for this work is justified. Finally, the heuristic which has been used for choosing good hyper-parameters for network training is discussed. 3.1.1 Neural Network Architecture Over the past ten years a lot of innovations have happened in the design of CNN archi- tectures. A great overview of many recent developments is given by Gu et al. [GWK+18]. Two popular and commonly used types of model architectures are the Visual Geometry Group Network (VGG) by Simonyan et al. [SZ14] and the ResNet by He et al. [HZRS16a]. For this work, both types of architectures have been evaluated and compared. The main 33 3. Methodology reason for testing two different CNN types is to test if the obtained results regarding the properties of a learned embedding space strongly depend on the choice of architecture type. 
3.1.1.1 VGG Type

The VGG's main building block consists of a few convolutional layers followed by a max pooling layer. Many such blocks are concatenated, where each one reduces the feature map size to a quarter, due to the pooling. Typically, the number of feature maps per convolutional layer is doubled each time the feature map size is decreased. This is done to counterbalance the decrease of the feature vector's size after the pooling. At the end of the network, some fully connected layers are used, which finally lead to the individual class outputs. Between the convolutional layers, ReLUs are used as non-linearities. Additionally, we use Batch Normalization (BN) [IS15] layers after the ReLUs for normalization.

The final VGG architecture used in this thesis is summarized in Table 3.1. Many different options with a varying number of convolutional layers, and therefore learnable parameters, have been evaluated. Empirical experiments have shown that this architecture is a good compromise between accuracy and complexity for the used LPTR dataset. Each network input is an image crop of an LP with a height of 64 pixels and a width of 128 pixels. The output is an embedding of dimensionality 32. The tuning of this value is further discussed in Section 3.1.2.2.

layer name   output shape     layer shape        parameters
conv0x       64 × 128 × 32    [3 × 3, 32] × 2    9568
maxpool0     32 × 64 × 32     2 × 2              0
conv1x       32 × 64 × 32     [3 × 3, 32] × 2    18496
maxpool1     16 × 32 × 32     2 × 2              0
conv2x       16 × 32 × 32     [3 × 3, 32] × 2    18496
maxpool2     8 × 16 × 32      2 × 2              0
conv3x       8 × 16 × 64      [3 × 3, 64] × 2    55424
maxpool3     4 × 8 × 64       2 × 2              0
conv4x       4 × 8 × 64       [3 × 3, 64] × 2    73856
maxpool4     2 × 4 × 64       2 × 2              0
fc0          128              512 × 128          65664
fc1          32               128 × 32           4128

Table 3.1: Structure of the used VGG type architecture. An entry in the layer shape column with content [a × a, b] × c can be understood as follows: a convolutional block of c individual convolutional layers, where each layer has b feature maps and a kernel width and height of a. In total, the network has 245632 weight parameters. The outputs are not individual class scores, but form a feature vector in the embedding space.
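The following PyTorch-style sketch mirrors the layer sizes of Table 3.1; everything else (module names, grayscale input, the exact position of BN relative to the ReLU) is an assumption for illustration and not taken from the thesis. Whether BN parameters are included in the table's counts is not specified; the counts in the table match the convolution and fully connected weights.

import torch.nn as nn

def vgg_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU and Batch Normalization, followed by 2x2 max pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(out_ch),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2))

embedding_net = nn.Sequential(                 # input: 1 x 64 x 128 LP crop (grayscale assumed)
    vgg_block(1, 32), vgg_block(32, 32), vgg_block(32, 32),
    vgg_block(32, 64), vgg_block(64, 64),      # output feature volume: 64 x 2 x 4
    nn.Flatten(), nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 32))  # 32-dimensional embedding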
3.1.1.2 ResNet Type

The ResNet has been designed for building very deep networks, with up to several hundred convolutional layers. The main idea behind ResNet is to use a building block which adds an identity (shortcut) connection in parallel to the regular convolutional layers, as a second path. The output of a block is the sum of both paths. This way, gradients can be back-propagated to the early blocks of the network effectively and do not vanish. This is because of the multivariable chain rule, which states that gradients reaching the same neuron from different paths are summed up. As a gradient remains unchanged along the identity function towards any preceding neuron of an earlier block, it can hardly vanish completely.

Let x_l be the activation after the l-th block. The activation after the (l + 1)-th block is then given by x_{l+1} = x_l + f(x_l), where f(x_l) represents the function learned by the convolutional layers. Reordering the equation yields f(x_l) = x_{l+1} − x_l, which is equivalent to the difference (the residual) between the output and the input of the respective block. The shortcut connection makes the learning task for the network easier, as it can learn f(x) = 0, which effectively ignores the entire block.

ResNet uses a ReLU activation function and a BN layer after every convolutional layer. In the original paper, the authors applied the ReLU and BN at the end of a block, after the addition of the identity. This breaks the unaltered gradient flow between two consecutive shortcut connections mentioned above, because the ReLU and BN are applied in between. In a later paper [HZRS16b], the same authors investigated an alternative placement of the according ReLU and BN units. The ResNet building block resulting from the empirically established best placement is also used in this work and is visualized in Figure 3.1.

Figure 3.1: A schematic drawing of a single ResNet block. The upper path applies two convolutional layers. The lower path is the identity function. As only the upper path contains BN and ReLU units, gradients are propagated over the identity unaltered.

To reduce the size of a feature map within a block, the identity function can no longer be used. To overcome this problem, a 1×1 convolutional layer with a stride of 2 is used instead, which effectively reduces the output size to a quarter. Notice that this in fact alters the gradients over the shortcut connection, but only by a linear transformation. In the convolutional path of a block, the feature map size is reduced by applying the convolution using a stride of 2 in one of the layers. In large networks, several consecutive residual building blocks are used before the feature map size is reduced again. However, for our LPTR dataset, the architecture in Table 3.2, which reduces the output size within every residual block, has empirically shown to be sufficient.

layer name   output shape     layer shape                     parameters
conv0        64 × 128 × 32    3 × 3, 32                       320
conv1x       64 × 128 × 32    [3 × 3, 32; 3 × 3, 32] × 1      18496
conv2x       32 × 64 × 32     [3 × 3, 32; 3 × 3, 32] × 1      19552
conv3x       16 × 32 × 32     [3 × 3, 32; 3 × 3, 32] × 1      19552
conv4x       8 × 16 × 64      [3 × 3, 64; 3 × 3, 64] × 1      57536
conv5x       4 × 8 × 64       [3 × 3, 64; 3 × 3, 64] × 1      78016
conv6        4 × 8 × 128      1 × 1, 128                      8320
avgpool      2 × 4 × 128      2 × 2                           0
fc           32               1024 × 32                       32800

Table 3.2: Structure of the used ResNet [HZRS16a] architecture. The entries of the form [a × a, b; a × a, b] × c within the layer shape column represent c consecutive residual blocks. The conceptual shape of one such block is shown in Figure 3.1. a defines the width and height of each convolutional kernel and b the number of feature maps per convolutional layer. In total, the network has 234592 weight parameters. The outputs are not individual class scores, but form a feature vector in the embedding space.

3.1.2 Training Procedure

3.1.2.1 Choice of Training Algorithm

Considering all the algorithms discussed in Section 2.2, Arcface has been chosen for this work because of its elegance, simplicity and effectiveness. The biggest advantage of Arcface and its other softmax based cousins over other methods is their ability to compute the loss function from a single decoupled input. In contrast, triplet loss always requires two auxiliary reference inputs in addition to the anchor input, i.e. a triplet. The additional computational effort which is required to select good triplets makes the algorithm less attractive. The supervised auto-encoder also needs the auxiliary reference input of the target class for loss computation. The reason why Arcface works so well without requiring any additional inputs is that the centers of all classes are implicitly stored in the weights of the last fully connected layer.
Because the class centers are just parameters of the network, they adapt automatically during training and are always available as reference for loss computation. As a baseline for comparisons regarding performance and the shape of the embedding space, a traditional softmax classifier has been used. Notice that the traditional softmax classifier enforces no constraints on the embedding space, except for class wise separability. Therefore, distinct classes can have arbitrarily small gaps in between them. For this reason, such an embedding space serves as a good basis for comparisons. When the shape of an embedding space is investigated, it is important to consider how large the enforced marginal gap between distinct classes was for training. This is because this parameter effectively controls the compactness of each class’s distribution and thus its shape. For Arcface, the margin m is a single hyper-parameter. To guarantee fair conditions for the embedding space analysis, it is the only hyper-parameter which has been varied for the results discussed in Chapter 4. To summarize, embedding spaces which are analyzed are obtained by using the Arcface algorithm with different margin settings. Additionally, the embedding space resulting from a traditional softmax classifier training is investigated for comparison. 3.1.2.2 Hyper-parameter Tuning There are several training hyper-parameters which need to be tuned before conducting more careful experiments. This is done to improve performance and to avoid spending too much time on the analysis of settings which are later discarded. Hyper-parameters for training an embedding space that are independent of the training algorithm are the embedding space’s dimensionality d, the network architecture and the optimizer. Specific to Arcface are the scale s and the margin m. The numerical hyper-parameters have been adjusted by independent exhaustive searches, each one with the other hyper-parameters fixed. The metric used for measuring the quality of a specific configuration is its resulting mean accuracy score on the test set, further explained in Section 3.3.1. A parameter wise exhaustive search has no guarantees for yielding the optimal configuration, but heavily reduces complexity compared to a full exhaustive search. The complexity of an independent exhaustive search which iterates over x values for each of the n hyper-parameters is O(nx). In contrast, the full exhaustive search, which enumerates all possible inter-parameter combinations, has a complexity of O(xn), due to the curse of dimensionality. This makes the full exhaustive search computationally intractable, even for small n. This is because trying just a single configuration is very expensive, as it requires training and evaluating the underlying 37 3. Methodology neural network. For hyper-parameters whose scale is bounded by a specific range, the x sampled values are picked equally spaced apart on a linear scale. An example for this is the margin m, which must lie within the range [0, π]. However, if a hyper-parameter can become arbitrarily large or if there is great uncertainty about the upper or lower bound of its range, choosing the n samples equally spaced apart on a logarithmic scale is a better option. Consider a scenario, where it is uncertain if a good candidate value lies within the range [1, 10], [1, 100] or [1, 1000]. 
If samples are picked from a linear scale, the search will not sample two out of the three ranges effectively, independently of the chosen distance between two consecutive samples. For example, a sampling distance of 10 would give 10 samples for the range [1, 100], but only 1 sample for the range [1, 10] and a total of 100 samples within the range [1, 1000]. Thus, a logarithmic scale with base 10 is required to sample each interval equally, in this scenario. On the logarithmic scale, the samples are picked equally spaced apart and are then converted back to the linear scale via exponentiation. Once an approximately good value has been found with this method, an additional search on the linear scale can be carried out on the surrounding sub-interval, to fine tune the hyper-parameter. Logarithmic hyper-parameter sampling has been used for tuning the scale s and embedding space dimensionality d. Many more sophisticated approaches for hyper-parameter tuning exist. One such tech- nique is Bayesian optimization, which is explored visually by Agnihotri et al. [AB20]. However, such techniques are not a key aspect for this work, which is why they are not further discussed. 3.2 Classification in the Embedding Space The purpose of training the embedding space is that the subsequent classification can also be applied on the TECs and unknown classes. The natural question which arises is which kind of classifier should be used. Independently of the choice, the goal is to build the respective model, given training samples xj for each class j. Notice that in this context training sample does not refer to the training of the embedding space. Instead, it solely refers to the computation of an input’s respective embedding, which is then used for computing the parameters of the classifier. One straight forward approach for classification in the embedding space is the k-Nearest Neighbors (kNN) classifier. However, kNN scales poorly for an increasing number of classes or samples, as it requires to store every single sample in a database for reference. For this reason, it is not a serious option for LPTR because performance and scalability are important. Another method is classification based on the nearest class center. Here, the distance to each center is measured according to a metric which does not consider variance, e.g. the Euclidean distance. This approach has been commonly used by recent work, such as Arcface. Other examples are given in Section 2.2. This type of classifier only guarantees to yield the correct result under the open set constraint if all classes share the same variance along all dimensions, as explained in Section 1.5.2. To test if this condition holds and to see if performance benefits from a more complicated classifier, 38 3.2. Classification in the Embedding Space a multivariate Gaussian classifier is evaluated in addition to the nearest center classifier in this work. The intuition behind this is that a Gaussian is more expressive and allows to model a distribution in the shape of a hyper-ellipsoid. One Gaussian is used for each class. Each class should receive a normalized confidence score pj as output, as opposed to a binary decision output. This is also necessary for defining a threshold value for unknown classes. A classifier should also be capable of measuring similarity sim and distance dist, as both can be used for thresholding. Converting from distance to similarity and vice versa is simple. 
It is convenient to define the final class confidences in terms of a softmax, as many similarity measures represent unnormalized log likelihoods. However, as the range of similarity values can vary for different classifiers, it is practical to introduce a scale parameter s, as discussed in Section 2.2.1.3. The final score pj for each class j out of N classes is then given by Equation 3.1.

p_j = \frac{e^{s \cdot sim_j}}{\sum_{k=1}^{N} e^{s \cdot sim_k}}   (3.1)

3.2.1 Cosine Classifier

For Arcface, the angular distance is the metric of choice for the nearest center based classifier. This is because points lie on the manifold of a hyper-sphere during training. The angular distance essentially captures the same information as the cosine similarity, and converting between these two metrics is straightforward. The classifier's only parameter is the center of each class, which is calculated as the mean µj of its respective training samples xj. The cosine similarity sim_{ϕj} for a test input x is then given by Equation 3.2.

sim_{\varphi_j} = \frac{\mu_j \cdot x}{\lVert \mu_j \rVert \, \lVert x \rVert}   (3.2)

Converting this value to the according angular distance can be done with Equation 3.3. Notice that it is almost equivalent to Equation 2.9, which is used for training. The only difference is that instead of the weights Wj, which are explicitly stored for the TRCs during training, the class centers µj are used. This allows to compute class scores for the TECs as well.

dist_{\varphi_j} = \arccos(sim_{\varphi_j}) = \arccos\left(\frac{\mu_j \cdot x}{\lVert \mu_j \rVert \, \lVert x \rVert}\right)   (3.3)

A good value for the scaling factor s is one which is similar to the value used during training with Arcface. This is because the underlying computation of the confidence scores using the cosine classifier is the same as during training.
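A minimal NumPy sketch of this classifier is given below; the class centers are fitted from embedding-space samples, and Equations 3.1 to 3.3 are applied to a test embedding. Function names and the scale value are illustrative only.

import numpy as np

def fit_centers(embeddings, labels):
    # One center per class: the mean of its training embeddings (Section 3.2.1).
    return {j: embeddings[labels == j].mean(axis=0) for j in np.unique(labels)}

def cosine_scores(x, centers, s=30.0):
    # Cosine similarity to every center (Equation 3.2), the corresponding angular
    # distances (Equation 3.3), and the softmax confidences (Equation 3.1).
    sims = np.array([np.dot(mu, x) / (np.linalg.norm(mu) * np.linalg.norm(x))
                     for mu in centers.values()])
    dists = np.arccos(np.clip(sims, -1.0, 1.0))
    conf = np.exp(s * sims) / np.exp(s * sims).sum()
    return conf, dists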
3.2.2 Gaussian Classifier

Geometrically, the iso-contours of a multivariate Gaussian are equivalent to hyper-ellipsoids. Essentially, a Gaussian models different amounts of variance along orthogonal axes or directions. This means that every direction is its own linearly independent dimension. Although the axes are all orthogonal, they do not necessarily have to be aligned with a Cartesian reference coordinate system, but can be arbitrarily rotated, as in Figure 3.2c. Two special cases of Gaussians are the same variance along all dimensions and arbitrary variances which are aligned with the reference coordinate system. These cases are shown in Figure 3.2a and Figure 3.2b, respectively.

Figure 3.2: Three different bivariate Gaussians obtained from the same samples: (a) spherical Gaussian, (b) diagonal Gaussian, (c) full Gaussian. The orange dot shows the mean while the orange circle represents the covariance matrix of the Gaussian. The yellow points show the samples from which the Gaussian is obtained. In (a), the variance is constrained to be equal in all directions. In (b), the axes of the largest variances are aligned with the axes of the reference coordinate system. (c) shows the general case without constraints.

3.2.2.1 Geometric Interpretation of the Mahalanobis Distance

The PDF of the multivariate Gaussian X ∼ N(µ, Σ) is given by Equation 3.4, where Σ represents the covariance matrix, d the dimensionality and µ the mean vector.

p(x) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)   (3.4)

As it is important to understand the meaning of this equation for the analysis of the embedding space, an intuitive geometric interpretation is given in this section. Let us start by considering the PDF of the univariate Gaussian X ∼ N(µ, σ²), which is given by Equation 3.5.

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)   (3.5)

The part of the exponent given by Equation 3.6 corresponds to a transformation to the standard normal distribution X ∼ N(0, 1) and thus measures how many standard deviations x lies away from µ. The value dist(x) is also known as the z-score.

dist(x)^2 = \left(\frac{x - \mu}{\sigma}\right)^2   (3.6)

For the multivariate Gaussian, this distance computation must be generalized to multiple dimensions. To do so, the Euclidean distance metric is used. Let us consider the diagonal covariance matrix (Figure 3.2b) first and generalize to the full Gaussian afterwards. For a diagonal Gaussian, the Euclidean distance from the mean is given by Equation 3.7, where Λ represents a diagonal matrix with the variances σᵢ² as its entries.

dist(x)^2 = \sum_{i=1}^{d} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2 = (x - \mu)^T \Lambda^{-1} (x - \mu)   (3.7)

This equation already looks quite similar to part of the exponent in Equation 3.4. At this point, the only thing missing for a full Gaussian is an arbitrary orientation of its axes relative to a fixed reference coordinate system. This can be implemented by rotating x, using the orthonormal rotation matrix Q. More precisely, Q defines the orientation of the full Gaussian's axes in the reference coordinate system. Therefore, y = Q^{-1}(x − µ) effectively transforms x to the Gaussian's local coordinate system. This transformation simplifies the full Gaussian X ∼ N(µ, Σ) to a diagonal Gaussian Y ∼ N(0, Λ). If this rotation is incorporated into Equation 3.7, it can be simplified to Equation 3.8, given that Q^T = Q^{-1} for an orthonormal matrix.

dist(x)^2 = (Q^T(x - \mu))^T \Lambda^{-1} (Q^T(x - \mu)) = (x - \mu)^T (Q \Lambda^{-1} Q^T)(x - \mu)   (3.8)

This distance is also known as the Mahalanobis distance dist_M and is the multivariate equivalent of the z-score. Thus, it also answers the question how many standard deviations away a point lies from the mean. The expression QΛ^{-1}Q^T can be summarized as a single matrix Σ^{-1}. Its inverse Σ = (QΛ^{-1}Q^T)^{-1} = QΛQ^T corresponds to the Eigendecomposition of a symmetric matrix. In fact, Σ is not only symmetric, but positive definite, as all variances must be positive in the non-degenerate case of a Gaussian. From the Eigendecomposition it follows that Λ represents both the variances and the eigenvalues. A formal proof that Σ is equivalent to the (sample) covariance matrix and not any other positive definite symmetric matrix is omitted here. It can be proven by a maximum likelihood estimator for Σ, as done by Magnus and Neudecker [MN19], for example. Intuitively it is not surprising; after all, the covariance matrix measures all pair-wise linear correlations between the variables. These can also be described by a single rotation from the reference variables' original coordinate system. Substituting Σ^{-1} into Equation 3.8 yields Equation 3.9, the general formulation for the Mahalanobis distance, as it occurs in the exponent of Equation 3.4.

dist_M(x)^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)   (3.9)

The purpose of the normalization divisor \sqrt{(2\pi)^d \, |\Sigma|} in Equation 3.4 is to ensure that the PDF integrates to 1 over R^d. This is a formal requirement for all PDFs. It can be intuitively understood by a Riemann sum approximation of the integral, which ensures that the sum of the probabilities of all infinitesimally small intervals equals 1.
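The following short NumPy sketch illustrates Equation 3.9 on synthetic data (the numbers are purely illustrative); at the estimated mean the Mahalanobis distance is close to zero, and for the identity covariance it coincides with the Euclidean distance, i.e. the multivariate z-score.

import numpy as np

def mahalanobis(x, mu, cov):
    # Equation 3.9: how many standard deviations x lies away from mu under N(mu, cov).
    diff = x - mu
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=[1.0, 2.0], cov=[[2.0, 0.6], [0.6, 1.0]], size=1000)
mu, cov = samples.mean(axis=0), np.cov(samples, rowvar=False)
print(mahalanobis(np.array([1.0, 2.0]), mu, cov))   # close to 0 at the center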
3.2.2.2 Gaussians for Classification

To use Gaussians for classification in the embedding space, each class must be represented by its own multivariate Gaussian distribution. Its PDF effectively measures similarity, as a larger value corresponds to being closer to the center. In practice, the natural logarithm of the PDF is nicer to work with because it simplifies the calculation. This is illustrated in Equation 3.10.

\ln(p(x)) = -\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu) - \frac{1}{2} \ln|\Sigma| - \frac{d}{2} \ln(2\pi)   (3.10)

This equation could be used directly for computing the similarity score of each class. However, the last term is constant for the Gaussians of all classes and can therefore be omitted, yielding Equation 3.11 for the similarity score of class j.

sim_j(x) = -\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1}(x - \mu_j) - \frac{1}{2} \ln|\Sigma_j|   (3.11)

Using the log likelihood also allows to multiply the similarity score with the scale factor s before applying the softmax. The scaling is particularly important for high dimensional Gaussians because the log PDF values approach negative infinity with increasing dimensionality. To counterbalance this, s needs to take rather small values s ≪ 1 for Gaussians.

The maximum likelihood estimator for the covariance matrix Σ of a full Gaussian is given by Equation 3.12. It measures all unnormalized pair-wise correlations based on M samples xᵢ, where µ is the sample mean of all xᵢ.

\Sigma = \frac{1}{M - 1} \sum_{i=1}^{M} (x_i - \mu)(x_i - \mu)^T   (3.12)

This general case is the most promising choice because class variances can neither be expected to be equal in all directions nor to be aligned with the embedding space's reference coordinate system. However, the former, i.e. a spherical Gaussian, is still useful for two reasons. Firstly, it captures the average variance of a class and thus allows to compute how strongly the average size varies between different classes. Secondly, it serves as a baseline for performance comparisons with the full Gaussian. The spherical Gaussian's performance is bounded by the performance of the full Gaussian if the latter fits the true data distribution of a class more accurately. The maximum likelihood estimator for the scalar variance σ of the spherical Gaussian is defined by Equation 3.13. This formula effectively computes the variance for each dimension and averages over all dimensions d and samples M.

\sigma = \frac{1}{(M - 1) \, d} \sum_{i=1}^{M} (x_i - \mu)^T (x_i - \mu)   (3.13)

To define a non-degenerate Gaussian of dimensionality d, at least d + 1 linearly independent samples are required. To understand this intuitively, imagine a straight line in 2D space. It can be described by any point on the line plus another point for specifying its direction. Notice that the number of points which are required to define the line does not change, independently of the dimensionality D ≥ 2 of the space it lies in. A line will still be a line in 3D or any higher dimension. The only thing that changes is the number of coordinates the two points consist of. A 2D plane in 3D can be defined by a third additional point, which fixes the orientation of the plane. This observation can be generalized to hyper-planes by induction, showing that a d-dimensional hyper-plane can be described by d + 1 independent points. This property is independent of the dimensionality D > d of the space in which the hyper-plane is placed. A Gaussian behaves the same way as the hyper-plane in this scenario. For example, if only 3 independent samples are available to specify a 3D Gaussian, there will only be variances in 2 different directions instead of all 3 if the orientation of the Gaussian is chosen accordingly.
In general, if M < d + 1 independent samples are used to define a d dimensional Gaussian, there will only be M − 1 distinct eigenvalues or variances. In this case, Σ does not have full rank and the PDF does not exist, as there is no Σ−1. This property makes it more challenging to use the full Gaussian for classification in high dimensional embedding spaces. As the required dimensionality of the embedding space grows with the number of distinct classes, this problem becomes more troublesome for large datasets. A way to mitigate it, is to use data augmentation to compute several embedding space samples from a single input. The spherical Gaussian does not suffer from this scaling problem, as it is well defined by two independent samples, regardless of its dimensionality. 3.2.2.3 Gaussians and Metric Spaces Notice that the embeddings are interpreted in angular space during training using Arcface, but a multivariate Gaussian operates in Euclidean space by definition. Therefore, the distance of a point x from the embedding space center matters for a Gaussian. For this reason, the unnormalized values x are used as inputs for computing its parameters instead of the normalized ones. An alternative option would be to transform the normalized embedding space coordinates of x into hyper-spherical coordinates. For a d dimensional vector, this would yield d − 1 angles plus the radius r, which is constant due to the normalization. Computing a Gaussian based on these angular values would eliminate the curvature, introduced by defining points on the hyper-sphere’s surface in Euclidean coordinates. However, the transformation would cause distortions around the poles of the hyper-sphere. To illustrate this, consider the map of our globe on a flat piece of paper, as the result of a Mercator 43 3. Methodology projection. It will make the north and south pole appear huge, while the size of Africa is heavily underrepresented. In fact, there is no projection of a sphere onto the flat plane which does not cause distortions on the global level. Therefore, interpreting the unnormalized embedding space values directly as Euclidean coordinates is an effective option. This is not a restriction, as a classification problem which is separable solely based on angular distances must also be solvable using the respective Euclidean distances. In fact, the classification can even become easier when interpreting the unnormalized embeddings in Euclidean space, as it effectively uses the radius as additional parameter. 3.3 Evaluation Metrics There are two types of evaluation metrics which need to be distinguished in this work. Performance benchmarks measure results in terms of accuracy on the dataset, whereas embedding space metrics deal with the structure and properties of the embedding space after training with the dataset. This section describes the theory and motivation behind various metrics, which can be applied independently of the dataset. Specific results for these metrics are discussed in Chapter 4 and are based on the used LPTR dataset. 3.3.1 Performance Benchmarks All performance benchmarks measure which data is classified correctly under which conditions in some way. How exactly this is done depends on the specific metric. In this work, each performance metric is not just applied once on the entire dataset, but four times on different subsets. This is done to distinguish between performance on the TRCs and TECs more carefully. The easiest category is performance solely on the closed set TRCs. 
The second category measures performance only on the TECs. The third subset measures performance on both TRCs and TECs. The final category only takes TEC instances as inputs, but allows each input to be classified as either a TRC or a TEC. This may seem bizarre at first glance. Why would one want to classify an input as an object which is known not to be part of the dataset? The purpose of this is to see how well TRCs and TECs can be distinguished in the embedding space. If their class distributions do not share any overlaps, the performance of classifying TECs as TECs only and TECs as TRCs or TECs should be the same.

Accuracy is one of the most important metrics and measures the percentage of correctly classified data, independently of the confidence in the classification. In its traditional formulation, every sample of the dataset receives equal weight. This can lead to biases in cases where the number of samples for different classes varies strongly. For example, consider a ternary classification problem where a single class covers 90% of all samples. A classifier which always outputs this class will have 90% accuracy, although it might not compute anything. Mean accuracy overcomes this problem by computing the accuracy of each class individually and averaging over all classes afterwards. This way, each class receives the same total weight, independently of its number of samples. In the example above, the mean accuracy would equal 1/3 · (90% + 0% + 0%) = 30% because the classifier never predicts the other two classes. As the number of samples per class varies quite heavily for the LPTR dataset used in this work, mean accuracy is always evaluated instead of accuracy.

Confusion Matrix is the natural multi-class generalization of the distinction between true positives, false positives, true negatives and false negatives. For each class, the confusion matrix captures how many instances of another class were classified as that class, hence the matrix. In the context of TRCs and TECs it is particularly interesting because it allows to inspect the proportions of mis-classifications between TRCs and TECs. For example, if the training has not generalized optimally, TRCs can be expected to be confused with TECs more often than with other TRCs. This is because the network has been explicitly trained to distinguish TRCs from TRCs, but not TRCs from TECs. One major disadvantage of the confusion matrix is that it scales quite poorly because N distinct classes result in N² entries. This makes it intractable for human inspection for thousands of classes or more.

Precision Recall Curve is a plot which shows the relationship between precision and recall for different confidence thresholds. For low thresholds, the recall is highest and the precision lowest, as no predictions are discarded. The opposite is true for very high thresholds because only predictions which receive a high confidence are believed to be correct. For multi-class problems, each class has its own Precision Recall Curve (PRC). However, a single PRC can be computed by averaging over all classes.
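To make the weighting behind mean accuracy explicit, the following NumPy sketch reproduces the ternary example from above (the 90/5/5 split is an illustrative choice; only the 90% share of the dominant class is taken from the text).

import numpy as np

def mean_accuracy(y_true, y_pred):
    # Accuracy per class, averaged over classes, so every class receives the same weight.
    classes = np.unique(y_true)
    return np.mean([(y_pred[y_true == c] == c).mean() for c in classes])

y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)   # one class covers 90% of the samples
y_pred = np.zeros_like(y_true)                    # degenerate classifier: always class 0
print((y_pred == y_true).mean())                  # 0.9  (plain accuracy)
print(mean_accuracy(y_true, y_pred))              # 0.3  (mean accuracy)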
Mean Average Precision (mAP) is one of the most popular scalar performance benchmarks for object classification. It can be interpreted as an aggregation of several PRCs because it effectively computes their average area under the curve. More precisely, the Average Precision (AP) of a specific class is calculated by integrating the class's precision over all respective recall values. mAP then computes the mean of the APs over all N classes, as given by Equation 3.14, where pj and rj represent precision and recall of class j, respectively. In practice, the AP is computed numerically, e.g. by sampling the precision for a number of recall values which are equally spaced apart.

mAP = \frac{1}{N} \sum_{j=1}^{N} AP_j = \frac{1}{N} \sum_{j=1}^{N} \int_0^1 p_j(r_j) \, dr_j   (3.14)

Distinction between Known and Unknown Classes is about measuring how many percent of the instances of known classes and unknown classes can be identified as known and unknown, respectively. Essentially, this OSR problem corresponds to a binary classification task. The accuracy depends on the confidence threshold below which inputs are considered as belonging to an unknown class. A low threshold will result in a high recall for known classes, but a small recall for unknown classes. The opposite is true for a high threshold. The optimal threshold value maximizes the mean accuracy of the underlying binary classification problem. The thresholding can also be performed based on distances [GHC20]. The advantage of a distance threshold is that it is independent of the distances to all classes but the closest. As the confidence is indirectly based on the distances to all known classes, its value can be misleading. In a nutshell, a larger minimal distance can score a higher confidence than a smaller minimal distance. This is because the confidence solely depends on the ratio between the computed similarity scores of all classes. Consider the following binary classification example which illustrates this. In scenario A, the distances have the values dist_1^A = 5 and dist_2^A = 10. If similarity is computed as the reciprocal of the distance, sim = dist^{-1}, the larger confidence score equals p_1^A = 5^{-1} / (5^{-1} + 10^{-1}) = 2/3. In scenario B, the distances equal dist_1^B = 30 and dist_2^B = 90, yielding p_1^B = 30^{-1} / (30^{-1} + 90^{-1}) = 3/4 as the largest confidence. This example illustrates that max p^B > max p^A, although min dist^B > min dist^A. This alone does not necessarily imply a distortion if distinct classes are expected to have different distance distributions. For example, the larger distance values in scenario B could simply be more likely to occur than the smaller distance values in scenario A. However, if all classes share the same distance distribution, this can be problematic. Therefore, distance thresholds are sometimes preferable over confidence thresholds. This is the case for the Mahalanobis distance of the multivariate Gaussian, which is further discussed in Section 3.3.2.

Note that in this work the OSR performance is only evaluated using the entire dataset. This means that instances of TRCs, TECs and unknown classes are used as inputs for evaluation. This is done because it mimics the LPTR task in the real world most closely. Besides, the mean accuracy of the OSR is evaluated independently of the other metrics. This means that it is the only benchmark which also uses instances of unknown classes for evaluation. For all other metrics, the inputs are restricted to images of TRCs and TECs only.
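The two scenarios from the distance-versus-confidence discussion above can be checked with a few lines of plain Python; the distance values are the illustrative ones from the text, and similarity is defined as the reciprocal distance as assumed there.

def confidence(distances):
    # Largest softmax-free confidence when similarity is the reciprocal of the distance.
    sims = [1.0 / d for d in distances]
    return max(sims) / sum(sims)

print(confidence([5, 10]))    # scenario A: 2/3, although the closest class is only 5 away
print(confidence([30, 90]))   # scenario B: 3/4, although the closest class is 30 away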
If the network has learned to generalize the features of TECs well enough, there should be no significant difference between the TRC and TEC measurements. The metrics are compared for different values of the margin hyper-parameter to get a deeper understanding of its effects.

Simple Angular Distance Statistics are used to compare the TRCs' and TECs' approximate average sizes in the embedding space based on the margin hyper-parameter. We introduce two benchmarks, the Mean Average Angular Distance (MAAD) and the Standard Deviation of the Average Angular Distance (StdAAD). To compute them, the Average Angular Distance (AAD) to the class center is first computed for each class. Then, the MAAD is calculated by Equation 3.15, where N equals the number of classes, M_j the number of available samples of class j and x_ij the i-th sample of the j-th class.

MAAD = \frac{1}{N} \sum_{j=1}^{N} AAD_j = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{M_j} \sum_{i=1}^{M_j} dist_{\phi_j}(x_{ij})    (3.15)

The StdAAD measures the average deviation from the MAAD and is given by Equation 3.16. In other words, it can be used to see how strongly the sizes of different classes vary.

StdAAD = \sqrt{\frac{1}{N-1} \sum_{j=1}^{N} \left(AAD_j - MAAD\right)^2}    (3.16)

Plotting the MAAD and StdAAD over different margins, separately for TRCs and TECs, allows to inspect how the average size of a class is affected by the margin. It is expected that the MAAD decreases with an increasing margin, as a larger margin makes the classes more compact. For a particular dataset, it is unclear in advance how strongly the MAAD differs between TRCs and TECs and how the StdAAD changes for an increasing margin. Notice that we have defined the MAAD and StdAAD specifically for angular distances. This is because Arcface learns based on angles. Defining both metrics based on angles therefore allows to evaluate the underlying distributions as seen by the training. Defining the MAAD and StdAAD for other metric spaces is straightforward.

Mean Distance PDF extends the angular distance statistics by considering not only the mean and standard deviation but the entire distribution. This distance distribution is computed as a continuous PDF, which is obtained from a limited number of samples using a Gaussian Kernel Density Estimation (KDE), given by Equation 3.17, where M represents the number of samples, x_i the individual samples and h the standard deviation of the Gaussian kernel, which controls the amount of smoothing.

p(x) = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{\sqrt{2\pi h^2}} \exp\left(-\frac{1}{2} \frac{(x - x_i)^2}{h^2}\right)    (3.17)

The mean distance PDF is calculated as the average of the individual distance PDFs obtained from the test set of all TRCs or TECs. It can be computed not only for the angular distance, but also for the Euclidean distance or even the Mahalanobis distance if Gaussians are used.

The Mahalanobis distance PDF of a multivariate Gaussian is static and its properties are very counter-intuitive at first glance. This is because the PDF's mode is not located at the center like in the univariate case. Instead, the mode moves further and further away from the center with increasing dimensionality. An intuitive explanation for this is that the number of possibilities for a point to be a certain distance away from the center grows with the size of the manifold at that distance. For example, for a 2D spherical Gaussian, there is an entire circle of infinitely many points which share the same distance to the center. For a 3D spherical Gaussian, there is a whole sphere of points for every distance.
In contrast, at the center there is only one point which has zero distance to the center, namely the center itself. The probability of sampling a point which is a certain distance away from the center is the Gaussian's PDF value of any point at that distance, multiplied by the size of the manifold at that distance.

To formulate this mathematically, let us consider it from a different perspective. We will start with a standard multivariate Gaussian and generalize to arbitrary Gaussians afterwards. A d-dimensional standard multivariate Gaussian X ~ N(0, I) has no correlations and therefore its PDF can also be described as the product of the PDFs of d orthogonal standard univariate Gaussians Y ~ N(0, 1). The distribution over the Euclidean distances from the origin of X is then given by Equation 3.18, which is equivalent to the definition of the χ distribution.

\chi = \sqrt{X^T X} = \sqrt{\sum_{i=1}^{d} Y_i^2}    (3.18)

This means that the distance PDF of a standard multivariate Gaussian is given by the corresponding PDF of the χ distribution. A similar derivation which uses the χ² distribution instead of the χ distribution is given by Wang et al. [WSM15]. The χ distribution's PDF for k degrees of freedom and values x ≥ 0 is given by Equation 3.19, where Γ(z) represents the gamma function, given by Γ(z) = \int_0^\infty x^{z-1} e^{-x} \, dx. The gamma function can be understood as the generalization of the factorial function over R. The interested reader can find more detailed information about these functions in a book by Forbes et al. [FEHP11], which is specifically about probability distributions.

p(x) = \frac{x^{k-1} e^{-x^2/2}}{2^{k/2 - 1} \, \Gamma(k/2)}    (3.19)

If the χ distribution's PDF is used as Euclidean distance PDF, x corresponds to a point's distance from the center and k to the dimensionality d of the standard multivariate Gaussian X. To generalize this to an arbitrary multivariate Gaussian distribution, the Mahalanobis distance of a point needs to be used as input for the PDF instead of the point's Euclidean distance. This is because the Mahalanobis distance distribution of any multivariate Gaussian is equivalent to the Euclidean distance distribution of the standard multivariate Gaussian of the same dimensionality. As explained in Section 3.2.2.1, for a Gaussian, the Mahalanobis distance measures how many standard deviations away a point is from its center. Furthermore, the Mahalanobis distance's iso-contours have the shape of hyper-ellipsoids. Therefore, its underlying computations can also be viewed as a transformation from any multivariate Gaussian distribution to X. Some examples of the Mahalanobis distance PDF for different dimensionality values are plotted in Figure 3.3. Notice that due to this transformation, the PDF looks exactly the same for all possible Gaussians which share the same dimensionality.

Figure 3.3: The Mahalanobis distance PDFs for multivariate Gaussian distributions with different dimensionality d. As the dimensionality increases, the distribution approaches a univariate Gaussian whose mode lies further and further away from zero.

PCA is closely related to a multivariate Gaussian distribution by definition. The Gaussian's major axes are aligned with the principal components of the data from which the Gaussian has been estimated. One way to view principal components is that they capture the directions of maximal variance, which is exactly what a Gaussian does.
The alternative viewpoint on PCA describes the coordinate system of a linear transformation, which is defined by the principal components and leads to the minimal linear reconstruction error if a number of components are discarded. In this work, there is no interest in reducing dimensionality using PCA. If this was necessary or desired, it could be done directly during training of the embedding space. This would leave even greater flexibility, as the reduction would not be restricted to be linear. Instead, the subject of interest is solely the distribution of the standard deviations along the principal components. If viewed as the major axes of a multivariate Gaussian, the principal components measure how elongated or spherical its corresponding hyper-ellipsoid is.

Calculating the principal components is equivalent to computing the major axes of a Gaussian and their lengths. More mathematically, the variances diag(Λ) along all major axes Q need to be calculated. As explained in Section 3.2.2.1, these can be obtained from the Eigendecomposition of the covariance matrix Σ = QΛQ^T. The respective standard deviations are then computed as diag(\sqrt{Λ}). After sorting, these values can be plotted against their rank. The resulting figure can then be interpreted in two ways. The first viewpoint simply shows the lengths of the true distribution's principal components. Alternatively, the elongation of the hyper-ellipsoid corresponding to the Gaussian estimate of the true distribution is visualized. To retrieve a single plot which consolidates this for all TRCs or TECs, the sorted standard deviations of each distribution can be normalized and then averaged over all classes.

Similarity of Estimated Class Distributions to Gaussians determines how well the test data distributions fit the respective Gaussian estimates, which are based on the training data. For this purpose, the mean Mahalanobis distance PDF estimated from a test set is particularly interesting because the PDF it is expected to match is known, as explained above. Therefore, it is possible to determine whether there is a statistically significant difference between the distribution given by the test data and the expected one. It must be noted that even if there is no statistically significant difference, this does not imply that the test data follows a Gaussian distribution. In this case, a test for multivariate normality has to be applied in addition. The first test alone is not enough for verifying that the test set data follows the estimated Gaussians because the Mahalanobis distance PDF loses representative power. This is due to its reduction from the Gaussian's real dimensionality d to a single dimension. However, if there is a statistically significant difference, it implies that the test data does not follow a perfect multivariate Gaussian distribution.

Notice that even if a statistically significant difference has been confirmed, this does not imply that this difference is important. In this context, it merely states that it is extremely unlikely that the computed Mahalanobis distances of the test set originate from the estimated distribution. Statistical significance makes no statement about how different two distributions are from each other, only that they are unlikely to be identical. For example, if there is zero uncertainty about the shape of two almost identical distributions, there will still be a statistically significant difference between them.
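The two diagnostics just described, i.e. the spectrum of standard deviations along the principal axes and the comparison of the test set's Mahalanobis distances against the expected χ PDF, can be summarized in a compact sketch. This is an illustration only, not the exact implementation used in this work; the variable names are assumptions, and a Kolmogorov-Smirnov goodness-of-fit test serves as an example significance test:

```python
import numpy as np
from scipy import stats

def gaussian_fit_diagnostics(train_emb, test_emb):
    """Embedding space diagnostics for one class.

    train_emb, test_emb : arrays of shape (num_samples, d) holding the class's
    embeddings. The Gaussian is estimated from the training embeddings (this
    requires at least d + 1 samples), the diagnostics use the test embeddings.
    """
    d = train_emb.shape[1]
    mu = train_emb.mean(axis=0)
    cov = np.cov(train_emb, rowvar=False)

    # principal components: standard deviations along the Gaussian's major axes
    eigvals = np.linalg.eigvalsh(cov)                  # Sigma = Q Lambda Q^T
    stds = np.sqrt(np.clip(np.sort(eigvals)[::-1], 0, None))
    relative_stds = stds / stds.sum()                  # normalized, sorted by rank

    # Mahalanobis distances of the test samples w.r.t. the estimated Gaussian
    diff = test_emb - mu
    maha = np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff))

    # if the class were truly Gaussian, these distances would follow a
    # chi distribution with d degrees of freedom (Equation 3.19)
    ks_stat, p_value = stats.ks_1samp(maha, stats.chi(df=d).cdf)
    return relative_stds, maha, p_value
```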
In cases where a statistically significant difference has been confirmed, it is still interesting to see how different the respective distributions are. There are many different ways to measure similarity between two distributions. Cha [Cha07] gives a comprehensive overview of different options for measuring similarity using distances. In this work, the Jensen-Shannon Distance (JSD), which is a statistical distance metric, has been used. For a more rigorous description of the JSD than provided here, please refer to Briet et al. [BH09]. Two nice properties of the JSD are that it is always well defined and that its base-2 variant lies within the range [0, 1]. The latter allows to directly convert it into a similarity score by computing sim_JSD = 1 − JSD. The JSD is an extension of the Kullback-Leibler Divergence (KLD) and enforces symmetry to obtain a mathematically valid metric space.

The KLD essentially describes the average amount of redundant information contained within the encodings that define a message, compared to the theoretical minimum of the average amount of information required for each encoding. This minimum is also known as entropy. For this reason, the KLD is commonly referred to as relative entropy, as it describes the additionally required information due to the non-optimality of the encoding. In our context, there are no discrete encodings, but continuous distributions instead. Defining the KLD in terms of distributions is a generalization, as the length of each encoding represents a probability. Typically, specific encodings are derived from the underlying probability distribution. The KLD is given by Equation 3.20, where P and Q represent the continuous random variables of the true PDF and the estimated PDF, respectively.

KLD(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log\frac{p(x)}{q(x)} \, dx    (3.20)

Using this, the JSD is then defined by Equation 3.21 and can be understood as a symmetric, smoothed and averaged version of the KLD.

JSD(P \,\|\, Q) = \sqrt{\frac{KLD(P \,\|\, M) + KLD(Q \,\|\, M)}{2}}, \quad M = \frac{P + Q}{2}    (3.21)

The similarity score computed based on the JSD is only an approximation of the true similarity between the test set's distribution and the estimated Gaussians. Again, this is because the similarity is computed based on the 1D Mahalanobis distance distribution as opposed to the full d-dimensional Gaussian distribution. The major advantage of this dimensionality reduction is that the resulting 1D PDF can be visually interpreted by a human. This allows a human to estimate similarity visually without computing a similarity score. Furthermore, the overall shape of the distance distribution reveals insight about the overall structure of the true distribution. For example, if several modes exist, it implies that the true distribution consists of more than one compact cluster.

Separability of Distinct Classes assesses whether or not any overlaps exist between the distributions of different classes in the embedding space. Assuming that each class can be described by a Gaussian, this question is equivalent to asking whether the classes' respective hyper-ellipsoids overlap. Theoretically, all Gaussians spread infinitely far in all dimensions they cover. However, a confidence ellipsoid which contains p < 100% of a class's data can be computed. This is also known as the Standard Deviational Hyper-Ellipsoid (SDHE) and is described in great detail by Wang et al. [WSM15]. In a nutshell, the SDHE can be derived by considering the χ distribution of the Mahalanobis distance, as explained above.
More precisely, the inverse CDF of the χ distribution can be used to compute the Mahalanobis distance below which p% of the Gaussian's data is contained. Multiplying this value by the standard deviations diag(\sqrt{Λ}) of the Gaussian's major axes yields the lengths of the p% confidence SDHE axes.

Calculating the smallest distance between two hyper-ellipsoids is a non-linear problem. An iterative algorithm based on local approximations via hyper-spheres is given by Lin and Han [LH02], for example. For N classes, O(N^2) pairwise shortest distances exist. This makes the problem intractable for large datasets, as the smallest distance computation needs to be done for each pair of Gaussians. One way to mitigate the complexity is to approximate each hyper-ellipsoid by a hyper-sphere whose radius is equivalent to the length of the hyper-ellipsoid's axis of median length. The median is used instead of the mean because it is robust to extreme values. Another advantage of this method is that it captures an approximation of the true distribution which is not directly based on the assumption that the true distribution follows a Gaussian. Calculating the distance between the centers of two hyper-spheres A and B is as trivial as D_{AB} = \|\mu_A - \mu_B\|, where \mu denotes the center of a hyper-sphere. All pairwise center-to-center distances can be summarized in a symmetric matrix D. Notice that diag(D) = 0, as each hyper-sphere has zero distance to itself.

We propose a metric called Gap/Radius Ratio (GRR) (Equation 3.22), which measures how filled or empty the embedding space is with respect to the distributions of a number of classes. r_j refers to the radius of the hyper-sphere approximation of the j-th class. Essentially, the GRR measures the average ratio between the distance between the centers of two classes and the sum of their respective radii. A GRR of 1 states that the average gap between two classes is zero. In other words, the hyper-sphere approximations of the classes on average just barely touch each other. GRR values below 1 indicate that the hyper-sphere approximations on average are more likely to overlap with each other than to lie apart with a gap in between them. For GRR values above 1, the opposite is the case. Notice that the GRR is bounded by the range [0, ∞).

GRR = \frac{1}{\binom{N}{2}} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{D_{ij}}{r_i + r_j}    (3.22)

It must be noted that the GRR is not a precise measure of embedding space emptiness, as it approximates the classes' true distributions by hyper-spheres. The GRR as introduced in Equation 3.22 only yields an approximation of the true value. This is because an overlap between two distinct classes cannot be captured if it occurs outside of the bounds given by the respective hyper-sphere approximations. Similarly, an overlap between two classes which does not really exist is captured if the two classes do not stretch out as widely as their hyper-sphere approximations in the overlapping region.

3.4 Saliency Map Extensions

Vanilla saliency maps and Grad-CAM have been used for investigating the classification decisions of the trained network, as explained in Section 2.3. In this section, we first introduce an extension which also allows to compute saliency maps for TECs. Then, we argue why an average saliency map per LPT is preferable over an individual saliency map per image.

3.4.1 TEC Saliency Maps

All saliency map techniques described in Section 2.3 require the ability to compute the logits of the classes of interest.
Gradient-based methods additionally need to be differentiable because gradients need to be back-propagated to create a saliency map. Another advantage of training an embedding space with a softmax extension like Arcface is that saliency maps for TRCs can be computed out of the box. This is because the necessary outputs for the logits already exist, as they are required for training. Notice that this is not the case for techniques like the triplet loss.

We propose a simple method for computing gradient-based saliency maps for TECs. As explained in Section 2.2.1.1, each Arcface row vector W_j of the last fully connected layer can be interpreted as the class center of the j-th TRC in the embedding space. One way to compute saliency maps for TECs as well is to use such a weight vector for every TEC too. The class center of every TEC can be computed empirically based on the available training data. In this context, empirically refers to computing the respective embeddings of all samples of a TEC and averaging the obtained values to retrieve the center. Notice that these calculations only require the forward pass of the network. The weights of the network are not altered in any way by this procedure. The computed centers can then directly be used as weights of the last fully connected layer in order to create TEC saliency maps. Both extending the original fully connected layer, which encodes the TRC centers, and replacing it by the TEC centers are possible. Notice that the proposed method is specifically designed for Arcface in combination with the cosine classifier. Theoretically, it is also possible to compute the logits using the Gaussian classifier and Equation 3.11, as it is also differentiable. However, in this case the logit calculations can no longer be represented by a single fully connected layer after the embedding layer, because each logit computation is more complex than a single dot product for the Gaussian classifier.

3.4.2 Average Saliency Maps

In general, saliency maps of different images need to be considered individually and independently of one another. This is because one class can have very distinct appearances. For example, an image of a cat could show it from the front or from the side. Depending on the view, a cat looks completely different, but nevertheless both images belong to the same class. Thus, combining two such images by an overlay does not make sense in general, as pixels at corresponding locations can represent very different features. However, in the world of LPTs there is not a lot of diversity because different instances of the same LPT are almost identical by definition, except for their identifying characters and other negligible personalizations. To determine the general behavior of the network regarding a specific LPT, it is a good approximation to overlay the saliency maps of all its available samples and compute the average for every pixel. Obviously, this requires that all images of an LPT have the same size. If the saliency map of a single image is considered individually, there is a danger of over-fitting to occlusions or noise, for example. The average saliency map smooths out such variances to some degree. Besides, it heavily reduces the number of saliency maps which need to be inspected by a human. Instead of having to look at one saliency map per image, there is only one per LPT.
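Both extensions can be outlined in a short sketch. The following PyTorch-style code is an illustration only, not the exact implementation of this work; the names backbone, loader_per_class and the use of plain vanilla gradients are assumptions, and Grad-CAM would additionally hook an intermediate convolutional layer instead of the input image.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def empirical_class_centers(backbone, loader_per_class):
    """Estimate one embedding-space center per TEC from its training images
    (forward passes only; the network weights are never modified)."""
    centers = []
    for images in loader_per_class:               # one batch of images per TEC
        emb = F.normalize(backbone(images), dim=1)
        centers.append(F.normalize(emb.mean(dim=0), dim=0))
    return torch.stack(centers)                   # shape: (num_tecs, d)

def average_saliency_map(backbone, centers, images, class_idx):
    """Average vanilla saliency map of one LPT over its sample images,
    keeping only samples whose prediction matches the target class."""
    images = images.clone().requires_grad_(True)
    emb = F.normalize(backbone(images), dim=1)
    logits = emb @ centers.t()                    # cosine logits, one column per class
    keep = logits.argmax(dim=1) == class_idx      # correctly classified samples only
    logits[keep, class_idx].sum().backward()
    saliency = images.grad.abs().max(dim=1).values    # max over color channels
    return saliency[keep].mean(dim=0)                 # one averaged map per LPT
```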
CHAPTER 4
Results

This chapter discusses the results which have been achieved on the used LPTR dataset. First, the LPTR dataset on which the experiments have been conducted and the experimental setup are described in Section 4.1. Afterwards, a solution for handling the identifying characters of the LPs during training is proposed in Section 4.2. Then, the results of various performance benchmarks and the associated embedding space metrics are given in Sections 4.3 and 4.4, respectively. The theory behind all used benchmarks is described in Section 3.3. Next, observations about what the network has learned during training are described based on the saliency maps of various LPTs in Section 4.5. Finally, the results are summarized and discussed in Section 4.6 and Section 4.7, respectively.

4.1 Experimental Setup

The results described in this chapter are all based on the ResNet architecture shown in Table 3.2. The overall conclusions regarding embedding space properties which have been obtained with the VGG architecture from Table 3.1 are equivalent. However, the performance of this architecture is slightly worse compared to the ResNet, which is why its results are not discussed further. For training, the Adam optimizer introduced by Kingma and Ba [KB14] has been used with a learning rate of 0.001. The other hyper-parameters have been tuned as described in Section 3.1.2.2. A good configuration for the used LPTR dataset is an embedding space dimensionality of 32 and a scale factor of 16. The margin hyper-parameter is kept variable for the benchmarks.

The LPTR dataset consists of 7500 annotated images of 128 distinct LPTs from the USA. The TRCs contain 69 of these LPTs and a total of 5800 images, while the TECs cover the remaining 59 LPTs and 1700 images. Notice that although the number of samples of the TRCs is much greater, the number of classes is not. This is because the LPTs of which most samples are available have been used for network training. This does not introduce a more significant bias in the TRC selection than a random split between TRCs and TECs would. Careful human inspection has shown that the LPTs of which most samples are available are distributed almost uniformly over both the different states and the LPTs' main visual features, which a human would use for distinguishing between them.

Additionally, 700 images of LPTs which are neither part of the TRCs nor the TECs have been used as unknown class instances. It must be noted that the LPTs which occur in the unknown classes belong to the same states of origin as the LPTs of the known classes. Therefore, many images within the unknown classes look quite similar to the images within the TRCs and TECs. Images of other arbitrary objects are not contained within the unknown classes. This setup has been chosen because the resulting prior closely mimics the problem during real-world operation. In a full LPR system, inputs are only passed to the LPTR subsystem if there is very little uncertainty about whether the input shows an LP at all. The first task of the LPTR subsystem then is to decide whether or not the image shows a known LPT. Also, for a given region, unknown LPTs from that region are the most likely inputs for the respective system.

The used train/test set ratio is 70/30 for all classes. The dataset statistics are summarized in Table 4.1.

                     Total    TRC    TEC   Unknown Class
#Classes               128     69     59        -
#Images               8200   5800   1700      700
#Training Images      5740   4060   1190      490
#Test Images          2460   1740    510      210

Table 4.1: Statistics of the used LPTR dataset.
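The per-class 70/30 split can be reproduced with a standard stratified split; a minimal sketch, with illustrative variable names:

```python
from sklearn.model_selection import train_test_split

def split_per_class(image_paths, lpt_labels, test_fraction=0.3, seed=0):
    """Train/test split that preserves the 70/30 ratio within every LPT class."""
    return train_test_split(image_paths, lpt_labels,
                            test_size=test_fraction,
                            stratify=lpt_labels,   # stratify per class label
                            random_state=seed)
```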
The LPT images which are used as neural network inputs have a size of 128 × 64 pixels. This resolution is a good trade-off between preserving image detail and computational speed. As the image quality of the used dataset is not the best, a much larger resolution would require upsampling anyway. More precisely, the original resolution of the average LP before resampling is about 20% larger.

4.2 Data Augmentation and Character Masking

4.2.1 Traditional Data Augmentation Techniques

Each data augmentation operation which is mentioned in this section is performed according to a random value within a given interval. Within these bounds, all values are equally likely. First of all, contrast and brightness modifications of ±20% are used. Next, a shift along the x and y axes by ±10% is applied. Furthermore, rescaling is performed by zooming in or out of the image by ±15%. To avoid artificially computing pixels when zooming out of an image, for example by mirroring the edge pixels, the LPTs are stored at a zoomed-out scale. This way, all rescaling operations can be implemented as inward zooms and no artificial data has to be generated for pixels outside of the original bounds. As a consequence, the mean zoom value within the augmentation interval is not 0, but a positive number which depends on the maximal zoom-out and shift factors.

Other traditional augmentation techniques, such as different variants of noise or mirroring, are not used. The former has been tried, but has not led to an increase in performance, presumably because the dataset already contains a considerable amount of noise. The latter does not make any sense, as no mirrored LPs can be found in the real world without legal consequences.

Data augmentation is also used to create several embeddings per training image in order to obtain more samples for computing the embedding space classifiers. This is particularly important for the Gaussian classifier, as it requires at least d + 1 independent samples. In our setup this means that 33 samples are needed per class. Notice that the training set for some of the TECs only consists of a little more than 10 images. Consequently, estimating a Gaussian distribution for these classes would not be possible without data augmentation. In our experiments we create 10 embeddings per training image for computing the embedding space classifiers' parameters.

4.2.2 Character Masking

Another augmentation strategy which needs to be considered is specific to LPTR and deals with the masking of the identifying characters of LPs. The idea behind masking is to avoid learning the presence of specific characters as decision indicators, as explained in Section 1.5.1. In this work, masking out a character refers to putting a rectangular bounding box of uniform color over it, such that the character is no longer visible. The used color is fixed and is set to approximately match the average pixel intensity over the entire dataset. The width, height and position of each bounding box are part of the annotated ground truth. The bounding boxes' sizes and positions are also varied randomly by a small amount during augmentation to further improve the generalization capabilities of the network.

The simplest strategy is to mask out all characters of every image during training. This gives quite good results on the test set, as long as the characters are also masked out for these images.
In contrast, if the characters are not masked out for the test set, performance drops enormously. The opposite effects on the test set can be observed if no characters are masked out at all during training. This indicates that the masks or characters, depending on what has been used for training, are important features for the network and that they are required for many of its decisions. It is not surprising that the network learns the masks as reliable indicators, as they are very primitive, yet effective features. After all, a bounding box is perfectly described solely by its edges. For the used dataset, exploiting the number of bounding boxes in an image and their relative positions towards each other is enough to identify the underlying LPT in some cases. Note that the used TRCs only include 69 distinct LPTs. For many more classes than this, it becomes increasingly harder to distinguish between different LPTs based only on their respective numbers of identifying characters and their positioning. This in turn forces the neural network to learn more complex features in order to successfully distinguish between the LPTs. However, the goal is to avoid learning the shapes of specific masks altogether, independently of the dataset size. Therefore, it should be verified that neither specific characters nor specific masks are the main features which are used to classify input images. One way to confirm this to some degree is to ensure that performance on the test set is equally good whether or not an LP's identifying characters are masked out. Notice that even if this is the case, it does not mean that the contents of the image regions containing the identifying characters or masks are not relevant at all. Instead, it merely implies that the network is able to handle both cases and therefore must have learned to generalize to different features in such regions to some degree. Using the presence of some abstract object at a certain position as a feature is actually reasonable. For example, a human may use the information that any character is present at a specific location as an indicator for classification. In cases where two LPTs only differ by their numbers of characters, this is the only way to distinguish between them.

A simple strategy which has proven to be effective and efficient is to decide independently and randomly for each character of an LP whether masking is applied during training. The probability of applying the masking has been set to 50%, such that keeping the original character and masking it out are equally likely. To generate even more data and boost generalization further, this procedure is applied every time an image is used as input for training, as opposed to once per image in total. Consequently, the same original image can be subject to many non-identical maskings, yielding different training inputs. The number of characters which are masked out for an LP follows a binomial distribution. This means that it is most likely that half of an LP's characters are masked out and least likely that all or none of them are masked out. This strategy is very effective because the network cannot rely on the presence of a specific character or mask at a certain position due to the randomization of the masking. Besides, it is very inefficient to learn the relative positioning between different masked characters, as there are too many different configurations. For example, there are 2^7 = 128 possible masking options for an LP that contains 7 characters.
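A minimal sketch of this randomized masking is given below; the function and parameter names as well as the fill intensity of 127 are illustrative assumptions, and the annotated character boxes are assumed to be given in pixel coordinates. The function is applied anew every time an image is drawn for training:

```python
import numpy as np

def mask_characters(image, char_boxes, mask_prob=0.5, jitter=2, fill=127):
    """Randomly mask the identifying characters of one LP image.

    image      : H x W (or H x W x C) uint8 array
    char_boxes : list of annotated (x, y, w, h) character bounding boxes
    Each character is masked independently with probability mask_prob, so the
    number of masked characters per plate follows a binomial distribution.
    Box positions and sizes are jittered by a few pixels; `fill` approximates
    the dataset's mean intensity.
    """
    out = image.copy()
    img_h, img_w = out.shape[:2]
    for (x, y, w, h) in char_boxes:
        if np.random.rand() >= mask_prob:
            continue                                   # keep this character visible
        dx, dy = np.random.randint(-jitter, jitter + 1, size=2)
        dw, dh = np.random.randint(-jitter, jitter + 1, size=2)
        x0, y0 = max(0, x + dx), max(0, y + dy)
        x1, y1 = min(img_w, x + dx + w + dw), min(img_h, y + dy + h + dh)
        out[y0:y1, x0:x1] = fill                       # uniform rectangular mask
    return out
```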
Because of the random process, it is easier for the network to learn other features which are shared amongst all inputs and thus are more robust. As desired, with this strategy there is no significant difference in the accuracy on the test set whether or not all identifying characters of the LPs are masked out. To verify this, the McNemar test statistic has been computed, for which the null hypothesis could not be rejected for an alpha of 0.05. A brief introduction to this test is given by Adedokun et al. [AB12]. In a nutshell, it can be used to compare the accuracies of two paired classification results based on a contingency table. Each sample can either be classified correctly by one of the classifiers, by none of them, or by both of them. These four possibilities make up the entries of the 2×2 contingency table. For the null hypothesis, the McNemar test uses the assumption of marginal homogeneity for the table, which implies that both classifiers have equal accuracy. The p-value can then be computed based on a chi-square test statistic. For example, the p-value for using the Gaussian classifier on the test set of the TECs equals 0.58. Consequently, the null hypothesis could not be discarded, as desired.

4.3 Performance Benchmarks

4.3.1 Mean Accuracy

Mean accuracy summarizes performance intuitively, using just a single number. This allows to compare different settings for Arcface's margin hyper-parameter and different dataset subsets efficiently. Figure 4.1a and Figure 4.1b show the results for the cosine and the Gaussian classifier, respectively. The left-most bar of each subset category represents the accuracy obtained by training a traditional softmax classifier.

For the cosine classifier, accuracy is highest at a margin of 0 for all subsets and decreases monotonically with increasing margin afterwards. In contrast, the Gaussian classifier has its peak at a margin of 0.4 or 0.6, depending on the subset which is evaluated. Surprisingly, the Gaussian classifier's performance is not better than the cosine classifier's. In fact, considering the best margin setting for each classifier individually, the latter slightly outperforms the former. For small margins (m ≲ 0.6), the cosine classifier's performance is superior to the Gaussian's.

The worse performance of the Gaussian classifier suggests that the TECs' true distributions either are not clean Gaussians or that their parameters have not been estimated accurately enough. The latter can occur if not enough samples are available for estimating the Gaussian, which can lead to overfitting. The incorrectly captured TEC distributions become especially apparent when considering the evaluation on the third subset, where the Gaussian classifier's performance is poorest compared to the cosine classifier's. However, it can also be observed that the TECs do not seem to follow hyper-spherical distributions with similar variances either. This is because the cosine classifier's performance drops with an increasing margin, whereas the Gaussian's increases up to a certain margin. This suggests that the TECs become less round with increasing margin, as the Gaussian classifier is able to capture their true distributions more accurately than the cosine classifier.

Surprisingly, the traditional softmax classifier performs almost as well as the more sophisticated Arcface classifier with fine-tuned parameters. However, studies by Deng et al. [DGXZ19] show that Arcface outperforms a traditional softmax classifier by a large margin on big face recognition datasets such as MegaFace [KSSMB16].
MegaFace has 690 thousand classes, whereas the LPTR dataset only has slightly more than a hundred. This suggests that large differences in performance only start to become apparent for much bigger datasets. Sadly, creating a larger LPTR dataset for this work is not possible, as human annotation is expensive. For this work, this leaves no way for empirically confirming that the performance of Arcface would also become more superior to a traditional softmax classifier for a larger LPTR dataset. 59 4. Results (a) Cosine classifier (b) Gaussian classifier Figure 4.1: The mean accuracies obtained by different embedding space classifiers and by training with different margin values. There are a total of four different dataset subsets, for which each parametrization is evaluated. TRC and TEC refer to only evaluating on the TRCs and TECs, respectively. TEC on TRC&TEC means that only TEC inputs are used which can be classified as both TRC or TEC instances. TRC&TEC represent the entire dataset. 60 4.3. Performance Benchmarks 4.3.2 Confusion Matrix A more detailed view on the essentially same information captured by the mean accuracy is given by the confusion matrix. For the used dataset with slightly more than a hundred classes, it is still accessible for in depth inspection by a human. The confusion matrices for the Cosine and the Gaussian classifier are shown in Figure 4.2a and Figure 4.2b, respectively. Both images show that there are almost no confusions between different TRCs. This is to be expected, as the network has been directly optimized for this purpose. The Gaussian classifier confuses TECs as TRCs more often than TECs as other TECs. However, it almost never confuses TRCs as TECs. Both is surprising, as the confusions of TECs as other TECs, TECs as TRCs and TRCs as TECs are expected to be distributed almost uniformly, as it is the case for the cosine classifier. A closer look at Figure 4.2a reveals that for the cosine classifier the percentage of confusion of TRCs as TECs is slightly lower than in the other categories. This is because the TECs have far fewer samples than the TRCs, which can lead to a higher confusion even if just a few samples are classified incorrectly. This is not unlikely to happen by chance a few times for 59 TECs. For the Gaussian classifier, the disproportionally large number of confusions of TECs as TRCs (Figure 4.2b) could be explained by incorrect distribution estimates for the TECs. More precisely, samples which belong to a certain TEC falsely achieve just a small confidence score for the respective class compared to some other TRC, which leads to a false classification. The true distributions of the TRCs on the other hand are estimated more precisely, which leads to almost no confusions of TRCs as TECs. If this hypothesis was true, it would mean that the TRCs are described more precisely by Gaussians than the TECs. This is further investigated using the embedding space metrics in Section 4.4. For both classifiers, the confusions are not distributed uniformly in each quadrant, but form clusters in the proximity of the diagonals (Figure 4.2). This is because the order of the LPTs is not arbitrary, but sorted by state. In general, LPTs of the same state are more similar to each other than to LPTs of other states, as they often share many features. Evaluating the accuracy on state correctness rather than LPT correctness supports this observation. 
For example, the state accuracy on the TEC test set using the cosine classifier is 98.9%, whereas the LPT correctness only equals 92.2%. 4.3.3 Precision Recall Curve and mAP For the used dataset, the computed mAP values are very similar to the respective mean accuracy scores. For this reason, a specific plot illustrating mAP under various settings is omitted. The PRCs for the optimal configurations for the cosine classifier and the Gaussian classifier are shown in Figure 4.3a and Figure 4.3b, respectively. The area under each curve of Figure 4.3 defines the respective mAP on the dataset. The cosine classifier achieves a mAP of 90.2% on all classes. In contrast, the Gaussian classifier only reaches a mAP of 89.3%. Just as for the mean accuracy, the mAP is worst on the subset which allows to classify each TEC input as any TRC or TEC. 61 4. Results (a) Cosine classifier (b) Gaussian classifier Figure 4.2: The confusion matrices of different embedding space classifiers resulting from their respective optimal margin configurations for training. The entries indicate percentages as opposed to an absolute number of samples. The greener a specific entry is, the more percent of a class’s total test samples it represents. The x-axis represents the predicted classes and the y-axis the target classes. 62 4.3. Performance Benchmarks (a) Cosine classifier (b) Gaussian classifier Figure 4.3: The PRCs of different embedding space classifiers. The best performance is reached on the TRCs only, where the curves of both classifiers are close to optimal. The plots highlight the best recall which is actually achieved by the trained LPTR system. Therefore, artificial sample points for a recall of 100% with precision values close to 0% are omitted. 63 4. Results 4.3.4 Distinction between Known and Unknown Classes The underlying OSR problem for distinguishing between known and unknown classes is based on a threshold value. The variable used for thresholding is obtained from the embedding space classifier and is either based on distance or confidence. For distance thresholding, the smallest distance to all known classes is used, whereas the largest confidence score is used for confidence thresholding. The advantages of the distance threshold are that it does not introduce any additional parameters and that it is transparent for human inspection. This is because for both the Mahalanobis distance and the angular distance the expected range of values are familiar for known classes. The confidence threshold on the other hand requires the additional scale parameter s from Equation 3.1 for computing confidence. Confidence cannot be interpreted as measure of similarity directly, as it depends on the ratios between the similarities of all known classes. The benefit of the confidence threshold is that it can give better results than the distance threshold for an optimal configuration of the scale parameter. This is the case for the cosine classifier, as illustrated in Figure 4.4. This diagram shows the mean accuracy for the underlying binary classification problem for various margin values and threshold configurations. Figure 4.4: Mean accuracy for distinguishing between known and unknown classes based on a threshold value. The thresholding is done either on a distance value or confidence score, for both the Gaussian classifier and the cosine classifier. For the Gaussian classifier, better results are achieved by the distance threshold, whereas 64 4.3. 
Performance Benchmarks the opposite is the case for the cosine classifier. If only the peak of each threshold configuration is considered, all configurations, except for the Gaussian classifier which uses the confidence threshold, perform similarly well. This worse performance can be explained by the relative nature of the confidence threshold, which can falsely boost a large minimal distance if the distances to the other known classes are even greater, as discussed in Section 3.3.1. This is not such a big problem for the cosine classifier because the maximum angular distance is limited to 180 degrees. In contrast, the Gaussian does not have a theoretical upper bound, as it stretches to infinity in all directions. For a specific margin configuration, a more detailed view on the underlying distributions on which thresholding is performed can be obtained by plotting the respective PDFs for the known and unknown classes, as in Figures 4.5, 4.6 and 4.7. These PDFs are obtained from a limited number of samples using a Gaussian KDE. If mean accuracy is evaluated, the optimal threshold value is located at the intersection point of the known and unknown class distributions. This is because both distributions have an equivalent prior probability in this case. The area in which the two distributions overlap represents the error caused by the thresholding. More precisely, the size of this area relative to the sum of the full areas underneath both curves is equal to the error in percent. Figure 4.5 shows the PDFs for distance thresholding and Figure 4.6 the PDFs for confidence thresholding under optimal scale parameter settings. (a) Cosine classifier (b) Gaussian classifier Figure 4.5: The distribution of the smallest distance to any known class plotted based on samples from known and unknown classes. If the scale parameter is not configured properly, the confidence distributions get distorted. The further away the value of the scale parameter is from its optimum, the worse this distortion becomes and the distributions of the known and unknown classes start to overlap more and more. This is illustrated in Figure 4.7. 65 4. Results (a) Cosine classifier (b) Gaussian classifier Figure 4.6: The distribution of the largest confidence score of any known class plotted based on samples from known and unknown classes. (a) Scale value too small (b) Scale value too large Figure 4.7: These PDFs show how the confidence distributions for the cosine classifier are affected if the used scale parameter is too small or too large. 66 4.4. Embedding Space Metrics 4.4 Embedding Space Metrics 4.4.1 Simple Angular Distance Statistics The MAAD and StdAAD for TRCs and TECs and various margin values are shown in Figure 4.8a and Figure 4.8b, respectively. For the TRCs, the MAAD decreases for an increasing margin up to a value of 0.6. This shows that larger angular margins make the TRCs’ embedding space distributions more compact, as expected. For the TECs, the MAAD also decreases for larger margins, which is desirable. However, Figure 4.8a illustrates that the MAAD is significantly larger for TECs than for TRCs, independently of the configuration. Ideally, the MAAD should be homogeneous for both class groups, as this would mean that the features of TRCs and TECs are equally well represented in the embedding space. The larger MAAD values for the TECs indicate that the uncertainty for recognizing TECs in greater than for TRCs. Figure 4.8b shows that the StdAAD drops for the TRCs for an increasing margin. 
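For reference, the two statistics plotted in Figure 4.8 follow directly from Equations 3.15 and 3.16; the following is a minimal numpy sketch with illustrative container names, assuming the test embeddings and unit-norm class centers are available as arrays:

```python
import numpy as np

def angular_distance_stats(embeddings_per_class, centers_per_class):
    """MAAD and StdAAD as defined in Equations 3.15 and 3.16.

    embeddings_per_class : dict mapping class id -> array (M_j, d) of embeddings
    centers_per_class    : dict mapping class id -> unit-norm class center (d,)
    """
    aads = []
    for j, emb in embeddings_per_class.items():
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        cos = np.clip(emb @ centers_per_class[j], -1.0, 1.0)
        aads.append(np.mean(np.arccos(cos)))       # AAD of class j (in radians)
    aads = np.asarray(aads)
    return aads.mean(), aads.std(ddof=1)            # MAAD, StdAAD (1/(N-1) norm.)
```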
In this context, a small StdAAD means that the distributions of different TRCs have similar variance. On the other hand, the StdAAD grows almost linearly with an increasing margin for the TECs. These comparably large StdAAD values imply that both TECs with small and TECs with large AADs exist. Features of TECs with a small AAD are represented better in the embedding space than features of TECs with a large AAD. This suggests that some LPTs of the TECs are more similar to the LPTs covered by the TRCs than others, which is the case. The gap between the StdAAD of the TRCs and TECs grows with an increasing margin. This means that the network starts to overfit to the TRC features as the margin gets larger.

Overall, Figure 4.8 suggests that the underlying features of the TECs have not been learned well enough by training with the TRCs. In other words, the network has not been able to fully generalize to features which are similar but not equivalent to those given by the TRCs. This can easily happen if the learning task is not sufficiently rich in complexity or if shortcuts to a solution, which a human would not take, are exploited. This is discussed in more detail in Section 4.5. Both problems can be mitigated by training with more data of more distinct classes.

(a) The MAAD drops for an increasing margin and is significantly smaller for TRCs. (b) The StdAAD of TECs rises rapidly for larger margin values and increases the gap between the StdAAD values of TRCs and TECs.
Figure 4.8: Simple angular distance statistics for different margin values for both TRCs and TECs.

4.4.2 PCA

Figure 4.9 illustrates the average relationships between the principal components of the TRCs' and TECs' distributions. Assuming that the TRCs and TECs are distributed like multivariate Gaussians, Figure 4.9a and Figure 4.9b can be interpreted as showing the mean relative standard deviations along the Gaussians' major axes, respectively. Figure 4.9 illustrates that the ratios between the principal components are quite similar for TRCs and TECs. Furthermore, it shows that on average TRCs are rounder than TECs for margin values smaller than ∼0.4. The opposite is true for margins larger than this. In general, the distributions become less round for an increasing margin and do not have hyper-spherical shapes.

(a) TRCs (b) TECs
Figure 4.9: The principal components' contributions to the total standard deviation in percent. Both plots show the average contribution over all respective classes. Values are shown in percent rather than as absolute values, such that averaging over distributions which have standard deviations at different scales is possible. A perfectly round distribution would be visualized as a straight line in this plot, where each principal component would have a contribution of 1/32.

4.4.3 Separability of Distinct Classes

The GRR is computed for only TRCs, only TECs and all classes and is shown for various margins in Figure 4.10. The calculated values are based on 99.9% confidence SDHEs. This value has been chosen because it is similar to the confidence interval given by the 6σ rule, which is commonly used for univariate Gaussian distributions.

Figure 4.10: The GRR for different margin values and class subsets. As the margin increases, the relative gaps between distinct classes on average grow larger.
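To outline how the values in Figure 4.10 can be obtained, the following is a minimal sketch of the GRR under the hyper-sphere approximation of Section 3.3.2; the function and argument names are illustrative, the per-class radius uses the median SDHE axis length, and the SDHE axes follow from the inverse CDF of the χ distribution:

```python
import numpy as np
from scipy import stats

def grr(centers, covariances, confidence=0.999):
    """Gap/Radius Ratio (Equation 3.22) under the hyper-sphere approximation.

    centers     : array (N, d) of class centers
    covariances : array (N, d, d) of the classes' estimated covariance matrices
    """
    n, d = centers.shape
    # Mahalanobis radius containing `confidence` of a d-dimensional Gaussian
    maha_radius = stats.chi(df=d).ppf(confidence)
    radii = np.empty(n)
    for j in range(n):
        eigvals = np.linalg.eigvalsh(covariances[j])       # variances on the major axes
        axis_lengths = maha_radius * np.sqrt(np.clip(eigvals, 0, None))
        radii[j] = np.median(axis_lengths)                  # robust hyper-sphere radius
    ratios = []
    for i in range(n):
        for j in range(i + 1, n):
            center_dist = np.linalg.norm(centers[i] - centers[j])
            ratios.append(center_dist / (radii[i] + radii[j]))
    return float(np.mean(ratios))
```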
At first glance, it may seem like a contradiction that the GRR grows for larger margin values while the performance for the cosine classifier decreases, as illustrated in Figure 4.1a. However, the GRR does not consider the true distributions of the classes. It simply approximates the classes as hyper-spheres. This assumption is almost equivalent to the prior used by the cosine classifier, except that the latter additionally assumes that the hyper-sphere approximations of all classes have the same radius. Figure 4.9 shows that the true distributions become less spherical for larger margin values, which explains why the performance of the cosine classifier gets worse. The GRR on the other hand increases because the distributions become more compact, as displayed in Figure 4.8a. To summarize, although the gaps between distinct classes grow larger for an increasing margin, the distributions become less spherical. This in turn leads to worse performance of the cosine classifier due to stronger violations of the prior. In contrast, the Gaus- sian classifier demonstrates that larger GRR values can also have a positive effect on the performance if the true underlying distributions are estimated well enough. This is visualized in Figure 4.1b, where the performance increases up to a certain margin. However, as mentioned in Section 4.3.1, the estimated Gaussians do not describe the true 70 4.4. Embedding Space Metrics distributions of TECs accurately, which suggests further performance improvements for better estimates of the true distributions. These results show that analyzing the quality of an embedding space solely based on the GRR can be misleading because the true shapes of the distributions are neglected to a large degree. For this reason, we propose to use the GRR in combination with PCA, as these two metrics complement each other well. PCA evaluates the variances of the distributions while the GRR computes how densely distinct distributions are packed together. Experiments for computing a lower bound for the GRR instead of an approximation have also been carried out. To retrieve a lower bound each class is also approximated by a hyper-sphere, but the radius of a specific hyper-sphere is based on the length of the respective class’s largest principal component. In contrast, the approximated GRR shown in Figure 4.10 is computed using the length of the principal component with median length. The computed lower bounds were quite poor and yielded values < 1 for the GRR because there were more overlaps than gaps between distinct classes. This is because the longest major axis is about 3 − 10 times longer than each of the major axes in the lower half of the distribution shown in Figure 4.9. As a consequence, the exaggeration for the radius yields many overlaps which do not actually exist. 4.4.4 Mean Distance PDF The mean angular distance PDFs for the TRCs’ and TECs’ test sets are shown in Figure 4.11. On one hand, the image shows the distribution of the distances which are relevant for the cosine classifier’s decisions. At the same time, it illustrates the final distribution of the classes’ angular distances after Arcface training. The TRC PDF looks similar to a univariate Gaussian. However, its right tail, which corresponds to larger angles, stretches out much further than its left tail. This is because variance is not constant in all directions, as shown in Figure 4.9a. The TEC distribution is shifted to the right compared to the TRC PDF, i.e. its distances on average are larger. 
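The curves in Figures 4.11 to 4.13 are mean distance PDFs in the sense of Equation 3.17. A minimal sketch of how such a mean PDF can be obtained is given below; the names are illustrative, and scipy's default bandwidth selection stands in for a manually chosen smoothing parameter h:

```python
import numpy as np
from scipy.stats import gaussian_kde

def mean_distance_pdf(distances_per_class, grid):
    """Average of the per-class distance PDFs, as plotted in Figure 4.11.

    distances_per_class : list of 1D arrays, the distances of each class's
                          test samples to their respective class center
    grid                : 1D array of distance values at which to evaluate
    """
    pdfs = [gaussian_kde(dists)(grid) for dists in distances_per_class]
    return np.mean(pdfs, axis=0)      # one averaged PDF over all TRCs or TECs
```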
Figure 4.11 also reveals the full distributions from which the MAAD and StdAAD values shown in Figure 4.8 have been computed for a margin of 0. Figure 4.12 visualizes the mean Mahalanobis distance PDF for the TRCs and TECs. Additionally, it shows the distribution both of them are expected to match. Notice that the expected distribution is known because the visualized TRC and TEC distributions are based on their respective Gaussian estimates. The similarities to the expected distribution are discussed in Section 4.4.5. Finally, let us consider the mean Euclidean PDFs for the TRCs and TECs, which are illustrated in Figure 4.13. The overall shape of both PDFs looks very similar to the shape of the mean angular distance PDF of the TRCs shown in Figure 4.11. This is expected, as the angular distances are measured based on the projection of the true distribution onto the manifold of a unit hyper-sphere. In contrast, the Euclidean distances and Mahalanobis distances are measured using the true distribution. Figure 4.13 also shows that both PDFs are very similar, which seems to contradict 71 4. Results Figure 4.11: The mean angular distance distributions obtained from the test sets of the TRCs and TECs. Both distributions are plotted for the cosine classifier’s optimal margin configuration of 0. the insights obtained from Figure 4.11 at first sight. Although the true distributions of the TRCs and TECs have a very similar mean Euclidean distance distribution, the TEC’s mean angular distance distribution on average has larger values than the TRC’s respective distribution. To explain this, a different view point has to be considered. As stated above, the angular distances are based on a projection, which effectively drops one dimension of information which represents the radius. The further away a true class distribution is from the embedding space’s origin, the smaller the resulting projection on the unit hyper-sphere’s manifold will be. To understand this intuitively, imagine the sun was closer to the earth, which would lead to more sunlight hitting the earth. This is due to the larger projection of the sun onto the earth’s surface. Knowing this, the mean angular distance distribution mismatch between TRCs and TECs could be explained if the TECs’ true distributions on average were closer to the origin of the embedding space. This is because the TEC distributions resulting from the projection would be larger than those of the projected TRC distributions. In fact, this is the case and is visualized in Figure 4.14. 72 4.4. Embedding Space Metrics Figure 4.12: The expected mean Mahalanobis distance distribution and the actual PDFs estimated from the TRC’s and TEC’s test sets. Both are based on a training using the best margin of 0.4 for a Gaussian classifier and do not match the expected curve perfectly. Figure 4.13: The mean Euclidean distance distributions for the test data of the TRCs and TECs. The PDFs are also based on a 0 margin setting to make it comparable to the angular distance distribution in Figure 4.11. Both PDFs look very similar. 73 4. Results (a) Margin 0 (b) Margin 0.4 Figure 4.14: The estimated distributions for the Euclidean distances from the origin of the embedding space to the centers of the TRC and TEC distributions. The TECs on average are closer to the origin. As the margin increases, the difference between the TRC and TEC distributions becomes larger. 74 4.5. 
Interpreting the Saliency Maps 4.4.5 Similarity of Estimated Class Distributions to Gaussians Although the TRC PDF shown in Figure 4.12 visually matches the expected distribution quite well, the difference is statistically significant and not due to random chance. This has been verified using the Kolmogorov-Smirnov [MJ51] goodness of fit test. As null hypothesis, this statistic assumes that the empirical CDF given by the samples matches the CDF of the expected distribution. For the TRC distribution, this hypothesis has been discarded for an alpha of 0.01. The TEC PDF in Figure 4.12 deviates from the expected distribution heavier than the TRC’s. For this reason, it is unsurprising that the Kolmogorov-Smirnov test also yields a statistically significant difference between the TEC distribution and the expected Mahalanobis distance distribution for the same value of alpha. Therefore, it follows that both test set distributions are not perfectly distributed accord- ing to their Gaussian estimates. Based on the JSD, the TRC and TEC PDFs have a similarity of 0.72 and 0.48 to the expected distribution, respectively. The low TEC score confirms the hypothesis that the Gaussian estimates for the TECs are not a good fit for the true distributions. This observation explains the poorer performance results of the Gaussian classifier compared to the cosine classifier, which is shown in Figure 4.1. The TRC PDF is not a close match either. However, Figure 4.12 shows that in contrast to the TEC PDF, its deviations are spread more uniformly on both sides of the mode, indicating an overall better match to the estimated Gaussians. 4.5 Interpreting the Saliency Maps The average vanilla saliency map and the average Grad-CAM saliency map have been calculated for each TRC and TEC. Additionally, the average input image has been computed for every LPT. It must be noted that a saliency map has only been used for averaging if the class represented by the largest logit matched the target class, i.e. argmax(logitj) = y. The saliency maps which failed to fulfill this condition have been discarded. This has been done to avoid using incorrectly classified instances of an LPT when creating its average saliency map. Most results regarding which features the network reacts to when making a classification decision appear reasonable and convincing. Sadly, there is no metric by which the integrity of the network can be measured, which makes this statement fundamentally subjective. Such a metric does not exist because there is no single right choice of features the network should use to make its decisions for a given LPT. This is because different humans may consider distinct aspects of an LPT when identifying it. Figure 4.15 visualizes some convincing TRC saliency map examples and shows different types of features the network reacts to. The greener the overlay at a pixel is, the more important the pixel is for the network. In Figure 4.15, the vanilla saliency maps are a lot noisier than the respective Grad-CAM saliency maps. However, both techniques largely agree on the highlighted features. After 75 4. Results careful visual inspection of the average saliency maps, it appears that altogether special symbols like in Figure 4.15a and textures as in Figure 4.15b are the most attractive features for the network to learn. This seems reasonable, as both feature types typically are very discriminative for different LPTs. 
Slightly less popular feature types are text, as in Figure 4.15c, and the positioning of the identifying characters, as in Figure 4.15d. The reason why they are not used as frequently is that several LPTs share the same text or the same character positioning in many cases. For example, "VIRGINIA" is written in the same font and at the same spot for the two LPTs in Figures 4.15a and 4.15e. Therefore, this feature alone cannot be used by the network to determine the correct LPT. Consequently, the network either has to learn an additional indicator or a completely different feature in the first place. In this concrete example, the network seems to have chosen the latter option. Another example is the star in the top left corner of Figure 4.15b, which is disregarded by the network, as it also appears in other LPTs of Texas.

In some cases, the network uses the absence of any specific feature, i.e. a static background color, as a strong indicator for choosing a class. Examples of this are the highlighted black spot near the bottom left corner of the Grad-CAM saliency map in Figure 4.15d and the gap between the identifying characters in Figure 4.15e. It is attractive for the network to learn this type of feature as an indicator for a certain TRC if there is no other TRC which has a static background color at the exact same spot, because a static background color is the most primitive feature. The network should not even be blamed for learning such features as shortcuts. After all, it solves the task it is trained for, which is distinguishing between TRCs, perfectly well. However, in the context of classification on open sets this sort of behavior is undesirable and problematic because it generalizes poorly to TECs. Imagine a scenario in which the network has learned to use a region of a TRC's LPT that shows only static background as the sole indicator for identifying the respective class. This scenario is similar to the saliency maps shown in Figure 4.15e. Now, if one of the TECs happens to have the same static background at the exact same spot, the network will not be able to distinguish between these two LPTs.

To solve this problem, many more LPTs must be used for training, and the more diverse these LPTs are, the better. More TRCs decrease the likelihood of the network learning such undesired behavior, as shortcuts which exploit static background are less likely to exist. In general, more feature-rich TRCs force the network to learn more specific features. This is very desirable, as more complex features also improve the generalization capabilities needed to correctly classify TECs.

The average saliency maps of correctly classified TEC images appear just as convincing as those of the TRCs. The TEC saliency maps of misclassified images sometimes reveal further insight about why certain confusions occur.

Figure 4.15 ((a) Symbol, (b) Texture, (c) Text, (d) Character positioning, (e) Gap, (f) Mixed): Some examples for different types of indicators which the network uses to make a classification decision. The caption of each sub-figure specifies the most important indicator for the respective LPT. The columns from left to right show the average input image, the average vanilla saliency map and the average Grad-CAM saliency map for each LPT.
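Before turning to the misclassified examples, the averaging procedure described at the beginning of this section can be sketched in Python. The listing is an illustration with assumed names (model, images, targets; image tensors of shape (C, H, W)) and covers only the straightforward case in which a class has its own output logit, i.e. the TRCs; it is not the exact procedure used in this work, and the adapted computation for the TECs is not shown here.

import torch

def vanilla_saliency(model, image, target):
    # Gradient of the target logit with respect to the input image,
    # reduced to one value per pixel by taking the maximum over the color channels.
    image = image.clone().requires_grad_(True)
    logits = model(image.unsqueeze(0))            # shape (1, num_classes)
    logits[0, target].backward()
    return image.grad.abs().max(dim=0).values     # shape (H, W)

def average_saliency(model, images, targets, lpt_class):
    # Average vanilla saliency map of one LPT, using only its correctly classified images.
    maps = []
    for image, target in zip(images, targets):
        if target != lpt_class:
            continue
        with torch.no_grad():
            prediction = model(image.unsqueeze(0)).argmax(dim=1).item()
        if prediction != target:                  # discard misclassified instances
            continue
        maps.append(vanilla_saliency(model, image, target))
    return torch.stack(maps).mean(dim=0) if maps else None

The Grad-CAM maps shown in Figure 4.15 are averaged analogously; only the computation of the per-image map differs.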
A few examples of misclassified images are shown in Figure 4.16. The respective saliency maps of each image highlight why the network believes that its false prediction should be correct. Figure 4.16a illustrates the problem of mostly using static background color and character positions as features for identifying a certain LPT. The image regions highlighted by the Grad-CAM saliency map are identical for the predicted LPT and the target LPT. In order to correctly distinguish between these two LPTs, the text at the top would have to be considered more heavily by the network. Figure 4.16c shows a confusion which happens because the special symbol identifying the target LPT is misinterpreted. This suggests that the network has not learned to generalize features of TEC symbols very well. In this particular case, the bear on the target LPT is, for a human, easily distinguishable from the coat of arms on the predicted LPT. A similar example caused by a symbol confusion is demonstrated in Figure 4.16b. Figure 4.16d shows a confusion between two LPTs where the only difference is the spatial location of the "GEORGIA" text at the top.

Figure 4.16 ((a) Background and character positioning, (b) Symbol misinterpretation, (c) Another symbol misinterpretation, (d) Undesired position invariance): Some examples of misclassified LPT images. The columns from left to right show the input image, the vanilla saliency map, the Grad-CAM saliency map and a representative image of the falsely predicted class, for each LPT.

4.6 Summary

All insights highlighted within this section refer to the LPTR dataset introduced in Section 4.1. This dataset only consists of 5740 training images and is therefore considered rather small.

Analyzing the class distributions in the embedding space has shown that increasing the margin hyper-parameter for training decreases the average reach or size of the classes' distributions up to a certain point. This insight can be obtained from the MAAD. Consequently, the gaps between distinct classes grow larger, which is expressed by the GRR. However, at the same time each class distribution also becomes more elongated, which can be observed by conducting PCA. This is similar to the effect of squashing a balloon; in this analogy, the external forces on a class's distribution are applied by the other classes. The StdAAD shows that the variance of the average class size increases with a growing margin. Consequently, the performance of the cosine classifier decreases for an increasing margin, as it does not account for variance in any way. The cosine classifier operates on the embeddings and uses the nearest class center as decision criterion. This type of classifier has been heavily used by recent research, as explained in Section 3.2.

A multivariate Gaussian classifier has been evaluated to test how similar the shapes of the class distributions are to hyper-spheres and hyper-ellipsoids in the embedding space. It also tests whether the overall performance can benefit from a more expressive classifier, given that the embedding space is fixed. The results have revealed that for both TRCs and TECs the respective distribution estimates, which are based on the test sets, show statistically significant differences from a true Gaussian distribution. However, the overall shapes of the estimated TRC distributions come close to the shape of a true Gaussian. This is also noticeable in the mean accuracy on the TRC test set, where the Gaussian classifier performs better than the cosine classifier for a large margin value. For the TECs, the distribution estimates deviate more heavily from a true Gaussian distribution. As a result, the Gaussian classifier at its optimal margin configuration performs worse on the TECs than the optimal cosine classifier. For large margins, however, the Gaussian classifier performs better than the cosine classifier on the TECs. This suggests that the assumption of a Gaussian distribution for the classes is not fundamentally incorrect, but also not expressive enough. Both decision rules are sketched below.
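The following Python sketch contrasts the two decision rules; it uses assumed names (class_centers, means, covs) rather than the actual implementation of this work. The cosine classifier assigns an embedding to the nearest class center on the unit hyper-sphere, whereas the Gaussian classifier assigns it to the class with the highest log-likelihood under the estimated multivariate Gaussian.

import numpy as np

def cosine_classify(embedding, class_centers):
    # class_centers: array of shape (num_classes, d); returns the index of the nearest center.
    x = embedding / np.linalg.norm(embedding)
    c = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    return int(np.argmax(c @ x))                   # largest cosine similarity = smallest angle

def gaussian_classify(embedding, means, covs):
    # means: list of (d,) mean vectors; covs: list of (d, d) covariance matrices.
    scores = []
    for mean, cov in zip(means, covs):
        diff = embedding - mean
        squared_mahalanobis = diff @ np.linalg.inv(cov) @ diff
        _, log_det = np.linalg.slogdet(cov)
        # Log-likelihood up to a constant; the term -d/2 * log(2*pi) is identical
        # for all classes and therefore omitted.
        scores.append(-0.5 * (squared_mahalanobis + log_det))
    return int(np.argmax(scores))

The cosine rule implicitly treats every class as having the same isotropic spread, which the StdAAD analysis above shows does not hold as the margin grows; the Gaussian rule additionally accounts for per-class covariance, at the price of having to estimate it reliably.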
The types of features of LPTs which the network considers when making a classification decision have been analyzed using gradient-based saliency maps. The manual inspection of the saliency maps has revealed that especially symbols and textures commonly are strong indicators for a network's decision, as they often are unique to a certain LPT. In some cases, we observed that the network has learned primitive features such as static backgrounds as strong indicators. This is possible in situations where such features suffice to distinguish a certain TRC from all other TRCs. These primitive features generalize poorly, as they either capture the absence of meaningful content or a combination of simple recurring patterns such as character placements. When the network training exploits such shortcuts towards a good solution, generalization capabilities are reduced.

4.7 Discussion

While Section 4.6 summarized the insights obtained from the experiments, this section is about the quality of the achieved results. A mean accuracy of 99.8% has been reached on the TRCs. This demonstrates the capabilities and potential of using deep learning for the LPTR task. However, the experiments have also shown that the performance on the TECs is a lot worse than on the TRCs. This implies that the generalization capabilities of the trained model are limited and not optimal. But considering that only 4060 images have been used for training the network, we believe that the achieved mean accuracy of 92.2% on the TECs is a good result. The same argument applies to the reached mean accuracy of 88.2% for the OSR of unknown classes.

In the beginning, we were quite confident that accuracy would benefit from modeling the class distributions in the embedding space as multivariate Gaussians. We were surprised to see that this is not the case. It is not clear to what degree this is due to the limited number of available samples for estimating the Gaussians. For some of the TECs, the Gaussians have been estimated based on 10 to 30 images. Using data augmentation to create several embeddings per image and thereby obtain more samples for parameter estimation only slightly mitigates the problem, because embeddings which are based on the same original image are not independent of each other. Unfortunately, we could not test how the performance of the Gaussian classifier would be affected by a significantly larger dataset. Due to the lack of samples, the estimated Gaussians are prone to overfitting. Therefore, it is possible that the true class distributions are closer to Gaussians than our results indicate.

CHAPTER 5
Conclusion and Future Work

5.1 Conclusion

Learned embedding spaces are commonly used for classification problems on an open set of classes. In this work, the class distributions in such an embedding space have been analyzed. For this purpose, the Gap Radius Ratio (GRR), which measures how filled a given embedding space is, has been introduced. For embedding spaces which are trained based on angular distances, two more metrics have been proposed. Both evaluate the distributions as they are seen by the training procedure for the embedding space. The Mean Average Angular Distance (MAAD) measures the mean intra-class compactness. The Standard Deviation of the Average Angular Distance (StdAAD) quantifies the variance of the intra-class compactness as a complementary measure. Adapting these two metrics to work with different metric spaces is straightforward. A minimal sketch of how they can be computed follows below.
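The sketch below illustrates one way these metrics can be computed in Python. It assumes that the per-class Average Angular Distance (AAD) is the mean angle between a class's embeddings and its class center, and that MAAD and StdAAD are the mean and standard deviation of the AAD over all classes; the exact definitions used earlier in this work (for instance, whether the class center is the learned weight vector or the empirical mean of the embeddings) may differ in detail.

import numpy as np

def average_angular_distance(embeddings, center):
    # Mean angle, in radians, between a class's embeddings and its class center.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = center / np.linalg.norm(center)
    cosines = np.clip(x @ c, -1.0, 1.0)
    return np.arccos(cosines).mean()

def maad_and_stdaad(embeddings_per_class, centers):
    # embeddings_per_class: list of (n_i, d) arrays; centers: array of shape (num_classes, d).
    aads = np.array([average_angular_distance(e, c)
                     for e, c in zip(embeddings_per_class, centers)])
    return aads.mean(), aads.std()    # MAAD and StdAAD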
Our analysis has revealed that the class distributions in our trained embedding spaces do not follow multivariate Gaussians accurately for the used license plate type dataset.

Another contribution is the analysis of the effects of masking out the identifying characters of a license plate. Empirical experiments have shown that training with only unmasked data encourages the model to learn specific characters as features. Likewise, the model tends towards learning the masks when it is only trained with masked license plates. A solution which addresses this problem by randomizing the masking process during training has been proposed. Using this technique, the results on the test set show no significant difference regardless of whether or not the identifying characters of a license plate are masked out.

On the used license plate type dataset, mean accuracies of 99.7% and 92.2% are achieved on the classes used for training the network and on the test classes which are added in retrospect, respectively. For distinguishing between known and unknown license plate types, a mean accuracy of 88.2% is reached.

A method for computing gradient-based saliency maps for test classes has been introduced. Furthermore, a mean saliency map has been computed for each known license plate type. Overall, we find that our model has learned suitable features for recognizing the license plate types it has been trained with. However, due to the size of the used dataset its generalization capabilities are limited. Our saliency maps also show that features which generalize poorly are used for recognizing a few of the license plate types. This explains the significantly worse performance on the test classes compared to the ones used for embedding space training.

5.2 Future Work

5.2.1 Feature Visualization

One interesting next step would be a more rigorous analysis of what a trained network has learned using feature visualization. This would allow the visual properties of the features the network is looking for to be identified more carefully. Notice that the saliency map techniques used in this work only answer which parts of an image are important to the network, but not why they are. For this reason, saliency maps and feature visualization complement each other quite well. One possible use case of feature visualization would be an investigation of which features the model has learned to extract from special symbols of LPTs. This in turn would reveal further insights about why the network makes confusions caused by misinterpreting such symbols.

5.2.2 Larger Dataset

Being able to make use of a much larger dataset which has both more classes and more samples per class would allow further interesting experiments to be conducted. First of all, it could be investigated how the network's generalization capabilities develop depending on the amount of training data used. This can be measured both by traditional benchmarks such as mean accuracy and by some of the introduced embedding space metrics like the MAAD.
Again, in the optimal case there should be little to no difference between the results for TRCs and TECs. Furthermore, adding carefully selected classes to and removing them from the training data would make it possible to study the network's behavior regarding the learning of shortcuts. For example, if a shortcut which exploits static background is learned for a specific class of a smaller dataset, additional classes could be added to the training data. Considered individually, each of these additional classes would allow the same exploit. Considered together, however, they could not all rely on the same exploit, so the network would have to learn different features for each one in order to recognize all classes correctly. The experiment would be to investigate how the learned features adapt in such a scenario.

In general, training on a much larger LPTR dataset is a key aspect for the future. This will enhance generalization capabilities and will discourage the network from taking undesired shortcuts during training. It is expected that the performance gap between TECs and TRCs will decrease significantly, as the network learns to generalize better and better by using more training data.

5.2.3 Automatic Inference for the State of Origin

Another idea is to automatically infer the state of origin if an image of an unknown LPT class is given as input. In theory, this should be possible in many cases because the LPTs of a certain state often share a lot of similarities with each other. Automatic state inference would be possible by showing that certain features present in an unknown LPT can also be found in known LPTs of some state. One way to achieve this could be to train a network which has additional outputs for predicting the state of origin of an LPT. The total loss of such a network would then be a linear combination of the state and LPT classification losses. Learning to predict the correct state would encourage the network to learn commonly shared features of LPTs with the same state of origin. The hypothesis is that in many cases these learned features would also work reliably for unknown LPTs of the same state.

5.2.4 Automatic Incremental Growth

Going even one step further, the system could be configured to add new LPTs to its database automatically after its initial deployment. If very similar embeddings which are unlikely to be instances of any known LPT occurred over and over again, this would suggest that they are instances of the same LPT which was unknown to the system at the time. If enough statistical evidence supporting this hypothesis was collected, a new LPT could be added to the known classes automatically based on the collected data. Additionally or alternatively, all automatically collected data of new LPTs could be validated by a human after some time. Then, it could be used for building a new model together with the original training data. Applying this technique iteratively would allow the system to be improved significantly over time, assuming that enough data of unknown classes could be collected.

List of Figures

1.1 Some example LPs and their corresponding LPTs from the US state of Virginia. Images by Kustermann [Kus20].
1.2 Different images of the same LPT from the real world dataset used in this work. The LPT shown in these images is the same as the one at the bottom left in Figure 1.1. The images show some of the variances which occur in the dataset.
1.3 Points within two 3D embedding spaces trained with the same CNN and dataset, using a different loss function. Each colored point represents the embedding of one image and each color one class. The used dataset is a subset of the license plate type dataset described in Section 4.1 and consists of 9 classes.
1.4 The image shows two distinct classes, where each class for itself has the same variance in all directions. However, the variance of the left class is larger. The yellow dots represent the samples based on which the parameters (mean and variance) of each class have been calculated. The blue circle symbolizes a new sample which should be classified as either belonging to the left or to the right class. The point's Euclidean distance Y to the right class is slightly smaller than its distance X to the left class. However, because the left class has higher variance, it is more likely that the blue point belongs to this class. If only the class means are considered without the variances, the blue circle will be incorrectly assigned to the right class.
2.1 A schematic drawing of a typical architecture for a neural network classifier. The input a is transformed to a feature vector x over many layers. This vector is then multiplied by the vector W_j to obtain the respective output logit for class j.
2.2 A 3D embedding space obtained by training a classifier using Arcface [DGXZ19]. Each point represents the normalized feature vector x of an input a and each class is illustrated by one color. Due to the normalization, all points lie on the surface of a sphere. During training, the angular distance of each x to its respective class center W_j is minimized.
2.3 Decision regions for a binary classification problem for the traditional softmax classifier and its extensions which make use of a margin parameter. The blue regions represent the combinations of angle values which classify the respective input as A and the red regions illustrate the equivalent for B. The algorithm and margin value used are written above each respective graph.
2.4 The maximal achievable target confidence score p_y for a classifier, assuming the scale s and the number of classes N are parameters. This graph represents the best case scenario, yielding the theoretical maximal values for p_y, as defined in Equation 2.17. Notice that the N axis shows values on a log scale.
2.5 A schematic drawing of the principle behind triplet loss. The inputs, anchor a_a, positive a_p and negative a_n, are transformed to their respective embedding representations x_a, x_p and x_n, using the same neural network. The green arrow indicates that the network aims to pull x_p closer to x_a, whereas the red arrow symbolizes that x_n is being pushed away from x_a during training.
2.6 A schematic drawing of an auto-encoder. The input a is compressed into x, which in turn is used for reconstructing o. The aim of training is to make o as similar to a as possible.
3.1 A schematic drawing of a single ResNet block. The upper path applies two convolutional layers. The lower path is the identity function. As only the upper path contains BN and ReLU units, gradients are propagated over the identity unaltered.
3.2 Three different bivariate Gaussians obtained from the same samples. The orange dot shows the mean while the orange circle represents the covariance matrix of the Gaussian. The yellow points show the samples from which the Gaussian is obtained. In (a), the variance is constrained to be equal in all directions. In (b), the axes of the largest variances are aligned with the axes of the reference coordinate system. (c) shows the general case without constraints.
3.3 The Mahalanobis distance PDFs for multivariate Gaussian distributions with a different dimensionality d. As the dimensionality increases, the distribution approaches a univariate Gaussian with its center further and further away from the origin.
4.1 The mean accuracies obtained by different embedding space classifiers and by training with different margin values. There are a total of four different dataset subsets, for which each parametrization is evaluated. TRC and TEC refer to only evaluating on the TRCs and TECs, respectively. TEC on TRC&TEC means that only TEC inputs are used, which can be classified as either TRC or TEC instances. TRC&TEC represents the entire dataset.
4.2 The confusion matrices of different embedding space classifiers resulting from their respective optimal margin configurations for training. The entries indicate percentages as opposed to an absolute number of samples. The greener a specific entry is, the more percent of a class's total test samples it represents. The x-axis represents the predicted classes and the y-axis the target classes.
4.3 The PRCs of different embedding space classifiers. The best performance is reached on the TRCs only, where the curves of both classifiers are close to optimal. The plots highlight the best recall which is actually achieved by the trained LPTR system. Therefore, artificial sample points for a recall of 100% with precision values close to 0% are omitted.
4.4 Mean accuracy for distinguishing between known and unknown classes based on a threshold value. The thresholding is done either on a distance value or a confidence score, for both the Gaussian classifier and the cosine classifier.
4.5 The distribution of the smallest distance to any known class, plotted based on samples from known and unknown classes.
4.6 The distribution of the largest confidence score of any known class, plotted based on samples from known and unknown classes.
4.7 These PDFs show how the confidence distributions for the cosine classifier are affected if the used scale parameter is too small or too large.
4.8 Simple angular distance statistics for different margin values for both TRCs and TECs.
4.9 The principal components' amounts of contribution to the total standard deviation in percent. Both plots show the average contribution over all respective classes. Values are shown in percent rather than absolute values, such that averaging over distributions which have standard deviations at different scales is possible. A perfectly round distribution would be visualized as a straight line in this plot, where each principal component would have a contribution of 1/32.
4.10 The GRR for different margin values and class subsets. As the margin increases, the relative gaps between distinct classes on average grow larger.
4.11 The mean angular distance distributions obtained from the test sets of the TRCs and TECs. Both distributions are plotted for the cosine classifier's optimal margin configuration of 0.
4.12 The expected mean Mahalanobis distance distribution and the actual PDFs estimated from the TRCs' and TECs' test sets. Both are based on a training using the best margin of 0.4 for a Gaussian classifier and do not match the expected curve perfectly.
4.13 The mean Euclidean distance distributions for the test data of the TRCs and TECs. The PDFs are also based on a 0 margin setting to make them comparable to the angular distance distribution in Figure 4.11. Both PDFs look very similar.
4.14 The estimated distributions for the Euclidean distances from the origin of the embedding space to the centers of the TRC and TEC distributions. The TECs on average are closer to the origin. As the margin increases, the difference between the TRC and TEC distributions becomes larger.
4.15 Some examples for different types of indicators which the network uses to make a classification decision. The caption of each sub-figure specifies the most important indicator for the respective LPT. The columns from left to right show the average input image, the average vanilla saliency map and the average Grad-CAM saliency map for each LPT.
4.16 Some examples of misclassified LPT images. The columns from left to right show the input image, the vanilla saliency map, the Grad-CAM saliency map and a representative image of the falsely predicted class, for each LPT.

List of Tables

3.1 Structure of the used VGG type architecture. An entry in the layer shape column with content [a × a, b] × c can be understood as follows: a convolutional block of c individual convolutional layers, where each layer has b feature maps and a kernel width and height of a. In total, the network has 245632 weight parameters. The outputs are not individual class scores, but form a feature vector in the embedding space.
3.2 Structure of the used ResNet [HZRS16a] architecture. The entries of shape [a × a, b; a × a, b] × c within the layer shape column represent c consecutive residual blocks. The conceptual shape of one such block is shown in Figure 3.1. a defines the width and height of each convolutional kernel and b the number of feature maps per convolutional layer. In total, the network has 234592 weight parameters. The outputs are not individual class scores, but form a feature vector in the embedding space.
4.1 Statistics of the used LPTR dataset.

Acronyms

AAD Average Angular Distance.
AP Average Precision.
BN Batch Normalization.
CDF Cumulative Distribution Function.
CNN Convolutional Neural Network.
FSL Few-Shot Learning.
Grad-CAM Gradient-weighted Class Activation Mapping.
GRR Gap/Radius Ratio.
JSD Jensen-Shannon Distance.
KDE Kernel Density Estimation.
KLD Kullback-Leibler Divergence.
kNN k-Nearest Neighbors.
LP License Plate.
LPD License Plate Detection.
LPR License Plate Recognition.
LPT License Plate Type.
LPTR License Plate Type Recognition.
MAAD Mean Average Angular Distance.
mAP Mean Average Precision.
NCC Normalized Cross Correlation.
OCR Optical Character Recognition.
OSR Open Set Recognition.
PCA Principal Component Analysis.
PDF Probability Density Function.
PRC Precision Recall Curve.
ReLU Rectified Linear Unit.
SDHE Standard Deviational Hyper-Ellipsoid.
StdAAD Standard Deviation of the Average Angular Distance.
TEC Test Class.
TRC Train Class.
VGG Visual Geometry Group Network.
YOLO You Only Look Once.

Bibliography

[AAP+08] Christos-Nikolaos E Anagnostopoulos, Ioannis E Anagnostopoulos, Ioannis D Psoroulas, Vassili Loumos, and Eleftherios Kayafas. License plate recognition from still images and video sequences: A survey. IEEE Transactions on intelligent transportation systems, 9(3):377–391, 2008.
[AB12] Omolola A Adedokun and Wilella D Burgess. Analysis of paired dichotomous data: A gentle introduction to the McNemar test in SPSS. Journal of MultiDisciplinary Evaluation, 8(17):125–131, 2012.
[AB20] Apoorv Agnihotri and Nipun Batra. Exploring Bayesian optimization. Distill, 5(5):e26, 2020.
[AGM+18] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. Advances in neural information processing systems, 31:9505–9515, 2018.
[BBM+15] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
[BH09] Jop Briët and Peter Harremoës. Properties of classical and quantum Jensen-Shannon divergence. Physical review A, 79(5):052311, 2009.
[Bis06] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
[CAS+19] Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. Activation atlas. Distill, 4(3):e15, 2019.
[Cha07] Sung-Hyuk Cha. Comprehensive survey on distance/similarity measures between probability density functions. City, 1(2):1, 2007.
[DGXZ19] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[DISB12] Shan Du, Mahmoud Ibrahim, Mohamed Shehata, and Wael Badawy. Automatic license plate recognition (ALPR): A state-of-the-art review. IEEE Transactions on circuits and systems for video technology, 23(2):311–325, 2012.
[FEHP11] Catherine Forbes, Merran Evans, Nicholas Hastings, and Brian Peacock. Statistical distributions. John Wiley & Sons, 2011.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT press Cambridge, 2016.
[GHC20] Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[GWK+18] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 77:354–377, 2018.
[GZJ+15] Shenghua Gao, Yuting Zhang, Kui Jia, Jiwen Lu, and Yingying Zhang. Single sample face recognition via learning deep supervised autoencoders. IEEE transactions on information forensics and security, 10(10):2108–2118, 2015.
[HBL17] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[HH15] John H Hubbard and Barbara Burke Hubbard. Vector calculus, linear algebra, and differential forms: A unified approach. 2015.
[HZRS16a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[HZRS16b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.
[Jør17] Hogne Jørgensen. Automatic license plate recognition using deep learning techniques. Master's thesis, NTNU, 2017.
[KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[KSSMB16] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4873–4882, 2016.
[Kus20] Michael Kustermann. World license plates. www.worldlicenseplates.com, 2020.
[LH02] Anhua Lin and Shih-Ping Han. On the distance between two ellipsoids. SIAM Journal on Optimization, 13(1):298–308, 2002.
[LSZ+18] Rayson Laroca, Evair Severo, Luiz A Zanlorensi, Luiz S Oliveira, Gabriel Resende Gonçalves, William Robson Schwartz, and David Menotti. A robust real-time automatic license plate recognition based on the YOLO detector. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2018.
[LWY+17] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.
[LZG+19] Rayson Laroca, Luiz A Zanlorensi, Gabriel R Gonçalves, Eduardo Todt, William Robson Schwartz, and David Menotti. An efficient and layout-independent automatic license plate recognition system based on the YOLO detector. arXiv preprint arXiv:1909.01754, 2019.
[MD18] Wang Mei and Weihong Deng. Deep face recognition: A survey. arXiv preprint arXiv:1804.06655, 1, 2018.
[MH08] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
[MJ51] Frank J Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.
[MN19] Jan R Magnus and Heinz Neudecker. Matrix differential calculus with applications in statistics and econometrics. John Wiley & Sons, 2019.
[OMS17] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
[RDGF16] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[RSG16] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.
[SCD+17] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[SCWT14] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
[SGK17] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pages 3145–3153. PMLR, 2017.
[SJ17] Sergio Montazzolli Silva and Claudio Rosito Jung. Real-time brazilian license plate detection and recognition using deep convolutional neural networks. In 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pages 55–62. IEEE, 2017.
[SKP15] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[Str16] Gilbert Strang. Introduction to linear algebra, volume 5. Wellesley-Cambridge Press, 2016.
[STY17] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
[SVZ13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[TYRW14] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
[VLL+10] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.
[WSM15] Bin Wang, Wenzhong Shi, and Zelang Miao. Confidence analysis of standard deviational ellipse and its extension into higher dimensional Euclidean space. PloS one, 10(3):e0118537, 2015.
[WWZ+18] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
[WYKN20] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR), 53(3):1–34, 2020.