Deep Learning of Humor from Gary Larson's Cartoons

DIPLOMA THESIS

submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Artificial Intelligence and Game Engineering

by Robert Fischer, BSc.
Registration Number 01425684

to the Faculty of Informatics at the TU Wien
Advisor: Dr. Horst Eidenberger, Ao.Univ.Prof.

Vienna, 14th May, 2018

Robert Fischer    Horst Eidenberger

Technische Universität Wien, A-1040 Wien, Karlsplatz 13, Tel. +43-1-58801-0, www.tuwien.ac.at

Declaration of Authorship

Robert Fischer, BSc.
Stauraczgasse 8/13, 1050 Wien, Österreich

I hereby declare that I have written this thesis independently, that I have fully listed all sources and aids used, and that I have in all cases marked as borrowed, with an indication of the source, all passages of this work (including tables, maps and figures) that are taken from other works or from the Internet, whether verbatim or in substance.

Vienna, 14th May, 2018

Robert Fischer
Acknowledgments

I owe my advisor, Professor Horst Eidenberger, a sincere thank you. After my bachelor's thesis I already knew what a helpful, constructive and forward-thinking advisor he is; even so, he exceeded that expectation.

My thanks go to TU Wien and its entire teaching staff for the best computer science education in Austria and beyond. It was the perfect place to learn not only the technical side, but also the social skills and the mindset necessary to succeed. TU Wien also supported me greatly through scholarships and through my employment as a tutor for several courses.

I was very lucky that many friends from school continued the adventure with me at TU Wien. We helped and motivated each other wherever we could. I especially want to thank Simon Fraiss and Alfred Soare for the time we spent together.

Finally, I want to thank my family. They have always supported me and helped wherever they could. I thank my mother Doina, my father Maximilian, and my four siblings Verena, Kathrin, Maximilian and Alexander. Thank you!
Abstract

The aim of this thesis is to model humor using deep learning, based on Gary Larson's cartoons. The recent success of deep learning in computer vision and natural language processing suggests that similar techniques can be applied in the field of computational humor. Training deep learning models requires a dataset with many training samples, which is why I created a novel dataset containing several thousand of Gary Larson's cartoons, their punchlines and corresponding funniness annotations. The dataset was annotated by a single person using a custom labelling tool; it therefore captures the humor of that one person. With this dataset it is possible to quantitatively compare this humor with the results of the deep learning models or with other people.

After an extensive dataset analysis, I designed and trained several deep neural architectures. First, I focused on the visual domain (cartoons) using convolutional neural networks, transfer learning and object detection techniques. Afterwards, I focused on the text domain (punchlines) using Long Short-Term Memory networks, several word embeddings (deep learning based and classical) and an automated machine learning approach. Finally, I tried to combine all the findings into a unified two-stage architecture.

Unfortunately, the evaluation revealed that this task is not yet tractable by the deep learning techniques applied. I chose two performance metrics (mean absolute error and accuracy) and several baseline models (most frequent class, mean class, etc.), and no model improved significantly on the baselines. On the test set, a transfer learning based approach scored the best accuracy of 26.10%, while the most frequent class scored 24.50%. Both a deep learning approach and the mean class reached a mean absolute error of 1.57. These results show that the semantic gap between computers and humans is too large for current deep learning based approaches to successfully model the humor of a single person.
It seems another breakthrough besides deep learning is required for this task.

Contents

Abstract
1 Introduction
  1.1 Aim
  1.2 Motivation
  1.3 Methodology
  1.4 Overview of the thesis
2 Background
  2.1 Machine Learning and Artificial Intelligence
  2.2 Artificial Neural Networks
  2.3 Natural Language Processing
  2.4 Transfer Learning
  2.5 AutoML
  2.6 Computational Humor
  2.7 Related Work
3 Design
  3.1 Dataset design
  3.2 Visual domain model design
  3.3 Text domain model design
  3.4 Visual and text domain combined model design
4 Implementation
  4.1 Technical Foundation
  4.2 Labelling tool
  4.3 Debugging of models
  4.4 Architecture
5 Evaluation
  5.1 Architectures
  5.2 Results
  5.3 Label Reproducibility
  5.4 Discussion
6 Conclusion
  6.1 Future Work
List of Figures
Bibliography
Cartoon by Gary Larson

CHAPTER 1
Introduction

1.1 Aim

This diploma thesis aims to research whether computers are capable of understanding humor by learning from Gary Larson's cartoons. This was done by training several deep neural networks whose task was to predict the funniness of a cartoon better than simple baselines. Even though humor is intrinsically subjective, the available data allowed quantitative comparisons using the chosen metrics (accuracy and mean absolute error). The goal was to find the best neural network architecture for predicting the funniness of a cartoon. Furthermore, based on Gary Larson's cartoons, a new dataset for computational humor was created, which could help further research in this area.

The questions this work tries to answer are the following:

1. Which neural network architecture is best suited for predicting the associated funniness of each cartoon?

2. Do the trained models outperform previously defined baselines? These baselines include most frequent funniness, average funniness, uniform random funniness and stratified random funniness.

3. If a model outperforms the baselines, does it generalize and understand humor? The model could instead exploit spurious statistical cues in the dataset, similar to Niven et al. [74].

4. Which feature domain is best suited for predicting the funniness: does using both visual and text input together perform better than each domain individually?

On a higher level, this thesis tries to bring subjectivity considerations into computational humor. Previous work in this field did not model the humor of a single person, but instead focused on what humans collectively consider humorous. The novel dataset and deep learning enable this new perspective.

1.2 Motivation

Based on the recent success of deep learning in fields such as self-driving cars [45], speech recognition [116] and machine translation [115], one could expect similar success in computational humor. Especially natural language processing and image-related tasks have seen significant improvements of the state of the art through deep neural architectures.
Initially, the idea of this thesis was that the trained model might be able to extract certain patterns of humor. Gary Larson's cartoons can be grouped into several themes: for example, stone age, scientists or cats versus dogs are common themes he uses. A competent model could distinguish these themes and might find patterns of humor among these subjects. Someone might find stone age themed cartoons funnier than cats versus dogs cartoons. It is important to note that this preference is very subjective. The model must be trained on data from the same person, because otherwise the personal preference might be lost.

If trained successfully, it would open a whole set of new research possibilities: the study of the humor of people through the lens of deep neural networks. This is especially interesting since it has been shown that there are key similarities between how deep neural networks work and how the human brain works [19]. For example, if such a model were possible, it could allow a more formal comparison of humor between different authors or different people. The same model might be transferable to the work of other authors; depending on how well this transfer works, one could argue whether the humor of these authors is similar or not, given the annotator's sense of humor. The trained models themselves could also contain interesting features: a functioning model could possibly develop a cat-detecting neuron, which could reveal insights into human humor.

Another key motivation was the creation of the underlying dataset. This thesis uses a novel dataset, where each cartoon was paired with a funniness annotation by the same person. As the dataset is comparatively large, this allows the humor of a single person to be evaluated quantitatively. The scope of the dataset is also unique, as there is no other dataset with similar cartoons available. In the future, the same dataset could be used for other tasks. A very hard but very interesting task would be using a generative architecture, such as a generative adversarial network (GAN), to generate a funny cartoon image from a punchline, or vice versa: generating a funny punchline from a cartoon image. And finally, generating completely new funny cartoons with fitting punchlines [38][98].

In recent years, more and more deep learning techniques have been applied in the field of computational humor. One problem of these approaches is that they do not take the subjectivity of this problem into account. Humor differs from person to person and also changes significantly over time. For example, in Chiruzzo et al. [18] the corpus was extracted from Twitter, where each tweet was classified as humorous or not humorous. The ground truth classification was annotated using crowdsourcing. The problem with such datasets is that the funniness is determined by some form of average: some individuals might find the same text very humorous, while other individuals might find it not humorous at all.
Crowdsourcing hides this fact by averaging. Therefore, these datasets and models usually do not learn the humor of a single person, but instead what most people generally find funny (or not funny). This is very important and also reveals interesting insights into humor, but this thesis' motivation is to focus more on the subjectivity of the topic.

This work ultimately shows that humor is still a hard problem and that current deep learning architectures are not yet capable of generalizing humor in the chosen context of this thesis. Further research is still needed, especially since success would enable many new research possibilities.

1.3 Methodology

First, I performed a detailed literature survey. The goal was to get a grasp of the state of the art of deep learning in the fields of computational humor, image classification, natural language processing and humor in general.

Next came the data acquisition, in which the ground truth was created: a dataset of Gary Larson's cartoons with funniness annotations. The cartoons were prepared such that they could be used effectively for training the neural networks. This included cropping, resizing and filtering the cartoons, extracting the punchlines, and annotating the funniness.

I performed the following steps iteratively, which allowed me to adapt to the challenges that occurred during the course of this research:

1. I picked an architecture candidate and designed it accordingly. This was based on prior knowledge and previous literature research.

2. Then I implemented and trained the architecture.

3. Afterwards I evaluated the resulting model. Problems were determined and possible solutions analyzed.

At first the focus was on visual-only models. Then the focus shifted towards text-only models. Finally, I evaluated several approaches that combine both text and visual features into a single model. The architectures also went from simple to more complex with each iteration.

Finally, I performed a detailed evaluation. The most interesting models were picked and analyzed thoroughly. The goal of this phase was to find answers to the questions listed in section 1.1.

1.4 Overview of the thesis

This section gives a short overview of the contents of this thesis.

Chapter 2 (Background) contains the background necessary for this thesis. Topics covered are computational humor, machine learning, artificial intelligence, neural networks, natural language processing, automated machine learning and an overview of related work.

Chapter 3 (Design) covers the iterative design process of this thesis. This process began with the visual component of Gary Larson's cartoons using convolutional neural networks. Next, the design shifted towards deep neural networks for punchlines using LSTMs, AutoML, feed forward neural networks and word embeddings. Finally, this chapter outlines the combination of both domains in a single architecture.
Chapter 4 (Implementation) focuses on the implementation aspects of the previously described neural network architectures. The reader gets an overview of the chosen technology and of how I implemented the architectures.

Chapter 5 (Evaluation) gives the reader an insight into the final results obtained by selected interesting architectures, compared to the baselines. I examine the inner workings of the models, outlining their mistakes using confusion matrices and other techniques.

Chapter 6 (Conclusion) finalizes this thesis with lessons learned during the course of the development of this work.

CHAPTER 2
Background

As this work combines the field of computational humor with deep learning, an understanding of both fields is required. This chapter contains the required background for this thesis. First, an overview of artificial intelligence and machine learning is given. Section 2.3 lays out techniques of natural language processing. Section 2.4 gives an overview of a technique called transfer learning. The next section (2.5) outlines automated machine learning. Based on the previous sections, I summarize the research area of computational humor in section 2.6. This chapter ends with related work on traditional and deep learning based computational humor in section 2.7.

2.1 Machine Learning and Artificial Intelligence

The field of Artificial Intelligence (AI) tries to reproduce (human) intelligence using computers [73]. It is as yet impossible to develop a truly intelligent computer program, also known as strong AI [64]. AI researchers typically pick certain narrow subtasks of what is considered to require intelligence. Such a system is also known as weak AI [64]. These tasks usually entail some utility for other application domains and range from image classification and self-driving cars to machine translation.

2.1.1 History of Artificial Intelligence

In the year 1950, Alan Turing proposed the idea of a test which could determine whether machines have human-level intelligence. In this test, a human judge chats with a human and with a computer; both try to convince the judge that they are a real human being. If the judge is not reliably able to tell them apart, the computer passes the Turing test [111].

In the 1960s and 70s, AI researchers greatly overestimated the possibilities of their systems.
For example, Marvin Minsky (a key researcher back then) said in 1970: "In from three to eight years we will have a machine with the general intelligence of an average human being." The goal of artificial intelligence initially seemed tractable: the computational power of computers increased exponentially and new algorithms and methods were developed [100][73]. The first chatbot (ELIZA) by Joseph Weizenbaum, which showed how communication between humans and machines could be implemented using relatively simple rules, seemed promising [114].

There were many schools of thought, among others the symbolic approach and the connectionist approach [63]. Symbolic systems reached early success by applying logical reasoning to AI [63]. The idea of connectionism was to model AI systems using the human brain as a blueprint [63]. At the end of the 60s, many AI researchers lost interest in connectionism, as the limitations of the perceptron were revealed [68]. This insight halted the research in this area for many years [73].

The symbolic approach initially showed much success in several tasks (for example, theorem proving [97]) but also reached a limit. The general approach was to model reasoning as a form of search [97]: proceed step by step towards the goal and backtrack if the steps lead to a dead end. The problem with this approach is the sheer size of the search space: the combinatorial explosion is intractable and can only be handled by heuristics, which limit the power of the system [97]. The limited capabilities of perceptrons and the lack of progress in key areas resulted in the first AI winter, in which funding was cut to a minimum [22].

Expert systems dominated the 1980s [63]. These were systems which were crafted by human experts and encoded their knowledge of the specific domain to be modelled as logical rules [73]. One important contribution of expert systems was their practical usefulness: even early experiments managed to diagnose infectious blood diseases [73].

Research in the field of connectionism continued as well. Backpropagation solved issues of previous attempts [73][97], and training larger neural networks became tractable. One of the most popular architectures of the time is the Hopfield network, which models an associative memory [40]. Still, the expectations were not met and the second AI winter emerged during the late 1980s [63]. This time it hit not only the scientific community, but also many AI companies [73].

During the 1990s, AI had a slightly negative connotation [61]: AI-based systems were working in the background and many problems were largely solved (for example speech recognition and information retrieval [61][99]). A big milestone of AI research was achieved when Deep Blue managed to win against the best chess player of the time, Garry Kasparov [73]. This was largely due to increased computational power and clever engineering, rather than some new paradigm [73].

During this period, collaboration with other research fields began [97]. For example, knowledge from mathematics and economics was applied in AI. Especially decision theory
and probability theory brought in very important techniques, such as hidden Markov models and Bayesian networks [82][97].

The next boom in the field of AI was fueled by deep learning and the success it had in many problems of artificial intelligence [45][115][116], beginning with the convolutional neural network AlexNet, which was the first approach to reach a top-5 error of 15.3% on the ImageNet Large Scale Visual Recognition Challenge in 2012 [52]. The boom continued in 2016 with AlphaGo winning against one of the best human players of the board game Go [106]. Compared to chess, Go is a much more difficult game for computers to play, as the combinatorial explosion is much larger and the available heuristics are much weaker [106].

The next frontier of artificial intelligence is not clearly established yet, as the progress of deep learning is very fast. For example, the team behind AlphaGo successfully applied similar techniques to real-time strategy games [5]. It seems that especially tasks requiring long-term planning or a deep understanding of the underlying structure remain a problem [69]. For example, this thesis shows that understanding humor is still very challenging.

2.1.2 Machine Learning

Machine Learning (ML) is a domain of AI [97]. Machine learning tries to build a model which generalizes as well as possible to previously unseen data by learning automatically from a training dataset [11]. Learning means extracting patterns from the dataset by applying a machine learning algorithm [11].

Models can be parametric or non-parametric. A parametric model means that the general function is fixed and can only be adjusted by adapting the parameters [97]. For example, neural networks are parametric, as gradient descent adjusts the weights and biases of each neuron during training, while the general structure of the network stays fixed. In contrast, for a non-parametric model the machine learning algorithm learns the function itself [97]; an example of this type of model is k-Nearest Neighbors. Since the focus of this work is neural networks, we will discuss parametric models in more detail.

More formally, a model is a function y(X, θ) where X is the training data and θ are the parameters of the model [11]. Through some optimization algorithm (for example gradient descent) the model is adjusted such that it generalizes ("learns") the underlying structure [40]. This model should make good predictions for previously unseen samples. The distinction between model parameters (θ) and model hyperparameters is important [97]: model hyperparameters are parameters which cannot be estimated from the available data and are typically set heuristically by a human [21]. A parameter, on the other hand, can be estimated from the data, given the previously set hyperparameters [40].

The data X is a matrix where each row is a sample and each column a feature. A feature is an individual measurable property or characteristic of a phenomenon being observed [11]. For example, when trying to apply a machine learning task on an image, one could use the red, green and blue channel values of each pixel as features. Choosing appropriate features is a difficult task and can make or break the performance of the trained model.
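As a small illustration of this sample-by-feature layout (the image dimensions below are hypothetical), a batch of RGB images can be flattened into such a matrix:

    import numpy as np

    # A hypothetical batch of 100 RGB images, 32x32 pixels each.
    images = np.random.randint(0, 256, size=(100, 32, 32, 3))

    # Flatten every image into one row: each red, green and blue channel
    # value of each pixel becomes one feature column.
    X = images.reshape(len(images), -1)   # shape (100, 3072): samples x features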
Machine learning tries to approximate a function from the selected dataset. Depending on the available information, different strategies must be applied. Typically, the most convenient and preferred mode is supervised machine learning [97]. Supervised means that not only the data X is available, but also the target values t. This means that the problem can be solved relatively easily using an optimization algorithm: more formally, we try to find a θ such that y(X, θ) = t. If the labels t are not available, then unsupervised machine learning is typically applied [97]; for example, clustering methods allow unsupervised ML. If a mixture of labelled and unlabelled data is available, machine learning practitioners apply semi-supervised machine learning [120].

Additionally, there is a distinction between classification and regression tasks [11]. If t is a continuous variable, machine learning practitioners call it a regression task and the model is a regressor. Otherwise, if t is categorical and can only take finitely many values, they call it a classification task and the model is a classifier. An example of a regression task is stock market prediction, where the model predicts the price of some stock. An example of a classification task is spam detection: such a model predicts whether an e-mail is spam or not (binary classification).

Often a regression task can be modelled as a classification task by discretizing the target values, and vice versa [30]. Instead of predicting the stock price directly, one can predict classes and use a representative value of each bucket as the predicted stock price. Refer to table 2.1 for an example. The right approach is highly task dependent [97].

Table 2.1: Example for the conversion between regression and classification tasks

    Regression value | Classification class
    -----------------+---------------------
    < 50             | CLASS 50
    51 − 100         | CLASS 75
    101 − 200        | CLASS 150
    > 200            | CLASS 250
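A minimal numpy sketch of this conversion, using the hypothetical bucket boundaries of table 2.1:

    import numpy as np

    # Continuous regression targets, e.g. stock prices.
    prices = np.array([23.0, 87.5, 150.0, 310.0])

    # Bucket boundaries from table 2.1: <50, 51-100, 101-200, >200.
    bins = np.array([50, 100, 200])
    # Representative value of each bucket, used when converting back.
    bucket_values = np.array([50, 75, 150, 250])

    # Regression -> classification: np.digitize returns the bucket index.
    classes = np.digitize(prices, bins)          # [0, 1, 2, 3]

    # Classification -> regression: predict a class, then use the
    # representative value of that bucket as the predicted price.
    predicted_prices = bucket_values[classes]    # [50, 75, 150, 250]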
2.1.3 Training of Machine Learning models

The most important property of a model y is its generalization power for unseen data [11]. Training a model y which predicts the training data X perfectly is trivial: the machine learning algorithm could always build a lookup table which memorizes a mapping from the training samples to the target labels t [25]. If a model fits the training data well but fails to generalize, this is called overfitting [25].

The holdout method facilitates approximating the performance of a model y on previously unseen data [94]. There is often some bias towards the original dataset [110]: the better the dataset characterizes the underlying problem, the better the final model may perform in practice [93]. For example, if one trains a vehicle detection model, but the dataset does not include any images of motorcycles, then the model will perform badly in the real world. Regardless of the method applied, the model would not be able to detect motorcycles, despite the performance metrics possibly indicating otherwise.

The holdout method first splits the data into three distinct subsets: training, validation and test set [94]. The practitioner incrementally adjusts the hyperparameters and trains on the training set, based on the performance of the model on the validation set [94]. Finally, to obtain the true performance, the practitioner evaluates on the test set [94]. The following algorithm illustrates the holdout method using pseudo code:

Algorithm 2.1: Holdout method [94]
1  Take the dataset X and split randomly into three distinct subsets: training, validation and test set.
2  while ¬some halting criterion do
3      Train the model on the train split.
4      Check performance of the trained model on the validation set.
5      Manually adjust hyperparameters such that the model generalizes better.
6  end
7  Check final performance of the best trained model on the test set.

This variant is also commonly referred to as the three-way holdout method [94]. The validation/test split is only needed if there are hyperparameters which are being optimized iteratively [94]. Due to the lack of standardization, sometimes the validation set is actually called the test set and the test set the validation set [95][40]. Commonly used halting criteria include:

• Maximum number of iterations [49]
• Plateauing performance: no improvement after a previously defined number of iterations [89]
• Maximum performance: the model reaches a certain performance threshold [49]
• A combination of multiple halting criteria [49]

If there is no clear separation between the test and training set, we call it a data leakage [103]. The actual performance of the trained model y could then be much worse for unseen data than the metrics suggest. Selecting the best performing model, instead of the last model, is called early stopping or the pocket algorithm.
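A minimal numpy sketch of the three-way split from algorithm 2.1 (the 80/10/10 ratio is a hypothetical choice, not part of the method):

    import numpy as np

    def three_way_holdout(X, t, val_frac=0.1, test_frac=0.1, seed=0):
        """Randomly split a dataset into training, validation and test subsets."""
        idx = np.random.RandomState(seed).permutation(len(X))  # shuffle once
        n_test = int(len(X) * test_frac)
        n_val = int(len(X) * val_frac)
        test = idx[:n_test]
        val = idx[n_test:n_test + n_val]
        train = idx[n_test + n_val:]
        return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])

    # Hyperparameters are tuned against the validation split only; the test
    # split is used exactly once, for the final performance estimate.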
Another method for evaluating the performance of a machine learning model is called cross-validation [94]: this method randomly splits the data into n partitions. Then n models are trained, one per partition, where the partition is the test set and the remaining data is the training set. In general, cross-validation approximates the performance better than the holdout method [97]. However, as many models are trained (depending on n), it can be computationally very expensive [11].

The following pseudo code illustrates cross-validation, which first splits the data into a TRAIN and a TEST set [94]. Then the TRAIN set is incrementally split into k partitions P. For each (TRAIN \ P, P) pair, a model is trained and evaluated [11]. Based on the overall performance, the practitioner adjusts the hyperparameters accordingly. Finally, the TEST set evaluation returns the performance on unseen data [94].

Algorithm 2.2: Cross-validation method [94]
1   Take the dataset X and split randomly into two distinct subsets: training (TRAIN) and test set (TEST).
2   while ¬some halting criterion do
3       Take the train set TRAIN and split into k partitions
4       foreach partition P do
5           Train the model on TRAIN \ P
6           Calculate performance on P
7       end
8       Calculate average performance
9       Manually adjust hyperparameters such that the model generalizes better.
10  end
11  Check final performance of the best trained model on the test set TEST.

Depending on the machine learning task and the applied methodology, there may be no distinct test set [94]. Figure 2.1 compares the splits of the holdout method and cross-validation.

Figure 2.1: Comparison of the splits between holdout and 5-fold cross validation
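The inner loop of algorithm 2.2 can be sketched in a few lines of numpy; `train_and_score` is a hypothetical callback standing in for whatever model training and scoring is used:

    import numpy as np

    def cross_validate(X, t, train_and_score, k=5, seed=0):
        """k-fold cross-validation: train one model per fold, return the mean score."""
        idx = np.random.RandomState(seed).permutation(len(X))
        folds = np.array_split(idx, k)                        # the k partitions P
        scores = []
        for i, p in enumerate(folds):
            rest = np.concatenate(folds[:i] + folds[i + 1:])  # TRAIN \ P
            scores.append(train_and_score(X[rest], t[rest], X[p], t[p]))
        return np.mean(scores)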
2.1.4 Evaluation of Machine Learning models

Measuring the performance of a model y is an important aspect of proper machine learning [32], and choosing the performance measure is highly task dependent [32]. For this thesis, the mean absolute error (MAE) and the accuracy were selected [34]. Other metrics considered were F1 and the mean squared error (MSE) [34][87].

Mean absolute error and mean squared error

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{X}_i - X_i \rvert \tag{2.1} \]

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{X}_i - X_i \right)^2 \tag{2.2} \]

where \hat{X}_i is the predicted value and X_i the ground truth.

MAE is interesting because a near miss is penalized less [34]. For example, if a model predicts a funniness of 5 while the real funniness is 6, the error term is |5 − 6| = 1. If the predicted funniness were a 2, the term would be |2 − 6| = 4. The lower the MAE, the better. MSE works similarly to MAE, but big errors are penalized more [34]: the calculations for the above examples are (5 − 6)² = 1 and (2 − 6)² = 16, which is a significant difference. For a very noisy dataset, MAE is usually better suited than MSE.

Accuracy and F1

\[ \text{Accuracy} = \frac{\lvert \text{True Positives} \rvert + \lvert \text{True Negatives} \rvert}{\lvert N \rvert} \tag{2.3} \]

where |True Positives| is the number of correctly classified samples of the positive class, |True Negatives| is the number of correctly classified samples of the negative class, and |N| is the total number of samples.

Accuracy is easy to understand and is commonly applied in classification tasks [34]; the higher the accuracy, the better [87]. When applied to an imbalanced dataset, however, the accuracy metric might not sufficiently describe the performance of the model, as it is trivial to increase the accuracy by returning the most frequent class of the dataset [34]. Consider the following example: the task is evaluating a model which predicts whether a patient has cancer or not. The test dataset is highly imbalanced and contains 99 samples of patients without cancer and 1 sample of a patient with cancer. If the model always predicts that a patient has no cancer, the resulting accuracy is 99%.
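These metrics translate directly into numpy; a minimal sketch, reusing the funniness example from above:

    import numpy as np

    def mae(t, y):
        """Mean absolute error (equation 2.1): near misses are penalized linearly."""
        return np.mean(np.abs(t - y))

    def mse(t, y):
        """Mean squared error (equation 2.2): large errors are penalized quadratically."""
        return np.mean((t - y) ** 2)

    def accuracy(t, y):
        """Fraction of correctly classified samples (equation 2.3)."""
        return np.mean(t == y)

    # The funniness example from above: a near miss versus a large miss.
    t = np.array([6.0])
    print(mae(t, np.array([5.0])), mse(t, np.array([2.0])))   # 1.0 16.0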
The F1 score handles such cases better, by emphasizing false negatives and false positives more [87]. It is the harmonic mean of the precision and recall scores [87]:

\[ \text{Precision} = \frac{\lvert \text{True Positives} \rvert}{\lvert \text{True Positives} \rvert + \lvert \text{False Positives} \rvert} \tag{2.4} \]

\[ \text{Recall} = \frac{\lvert \text{True Positives} \rvert}{\lvert \text{True Positives} \rvert + \lvert \text{False Negatives} \rvert} \tag{2.5} \]

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.6} \]

where |False Negatives| is the number of positive samples incorrectly classified as negative, and |False Positives| is the number of negative samples incorrectly classified as positive.

For multi-class classification (more than two possible classes), a confusion matrix helps to visualize the predictions of a model [34]. The rows of a confusion matrix are usually the predicted classes and the columns the ground truth. A perfect model would show 100% along the diagonal of the matrix, as there would be no wrong classifications. A confusion matrix usually reveals common mistakes of a model [34]: for example, if a model never predicts a class, this can easily be detected using this visualization. For an example of a confusion matrix, please refer to figure 5.3.
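A corresponding sketch for equations 2.4 to 2.6 and the confusion matrix convention used here (rows: predicted, columns: ground truth); labels are assumed to be 0/1 integer arrays, and guards against empty classes are omitted:

    import numpy as np

    def precision_recall_f1(t, y):
        """Precision, recall and F1 (equations 2.4 to 2.6) for 0/1 labels."""
        tp = np.sum((y == 1) & (t == 1))            # true positives
        fp = np.sum((y == 1) & (t == 0))            # false positives
        fn = np.sum((y == 0) & (t == 1))            # false negatives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    def confusion_matrix(t, y, num_classes):
        """Rows are predicted classes, columns the ground truth, as used here."""
        m = np.zeros((num_classes, num_classes), dtype=int)
        for truth, pred in zip(t, y):
            m[pred, truth] += 1
        return m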
2.2 Artificial Neural Networks

One of the first artificial neural networks was the perceptron [73]. It is a binary classifier and uses a linear function to perform classification [68]. Binary refers to the fact that a simple perceptron can only distinguish between two states. The learning process starts with a random decision boundary, which is iteratively adapted such that the number of misclassifications is minimized [97]. Due to the linear decision boundary, the perceptron is only able to learn decision boundaries that are linearly separable [73]. For example, a simple XOR cannot be learned by a basic perceptron. This motivated the introduction of multi layer perceptrons (MLPs), since they are in theory able to approximate any function, given a nonlinear activation function [24]. An MLP consists of multiple perceptrons grouped in layers, where each layer connects with the previous layer [11]. MLPs are feed forward neural networks [11].

Part of the reason why artificial neural networks are so powerful, especially deep neural networks, stems from their flexibility [45][116][115]: they can adapt to many different domains very easily. Figure 2.2 illustrates a fully connected feed forward neural network with 3 hidden layers. Neural networks can have many more layers (refer to section 2.2.5) or even contain loops (section 2.2.7).

Figure 2.2: Fully Connected Feed Forward Neural Network (input layer, hidden layers, output layer)

A neural network generally consists of multiple neurons, layers and connections [97]. The general inspiration for them is the human brain, which also consists of similar building blocks [40]. In the human brain, neurons receive their signals through dendrites (connections in neural networks), and the signals are then sent through the axon to other neurons [40]. The signal strength is adjusted (analogous to activation functions in neural networks) such that other neurons in the brain are more stimulated [40]. Note that key parts of the human brain are not simulated in a neural network: during training, the weights and biases are adjusted using gradient descent (for more information refer to section 2.2.2) [23]. Nevertheless, artificial neural networks (especially convolutional neural networks) produce structures similar to those observed in the human brain [19].

2.2.1 Basics of artificial neural networks

From a different perspective, neural networks can also be seen as a series of functional transformations of some input vector into an output vector [78]. Given the input x_1, ..., x_D and M connections, the input layer can be described as:

\[ a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \tag{2.7} \]

where j = 1, ..., M. The w_{ji}^{(n)} are usually referred to as weights and the w_{j0}^{(n)} as biases [11]. a_j is called an activation or sometimes a potential, which is then transformed using a nonlinear, differentiable activation function h(·) [11]:

\[ z_j = h(a_j) \tag{2.8} \]

Depending on the problem, different activation functions are applied. Common functions include:

• Tanh function [76]: zero-centered function between -1 and 1 [76]. Suffers from the vanishing gradient problem [76].

\[ h(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \tag{2.9} \]

• Sigmoid function [40]: "squashes" the input value between 0 and 1 [40].

\[ h(x) = \frac{1}{1 + e^{-x}} \tag{2.10} \]

• ReLU (Rectified Linear Unit) [37]: very commonly used in deep learning, as it performs very well on a wide range of different tasks [76]. It can be calculated very quickly, as it does not contain any division or exponential [76].

\[ h(x) = \max(0, x) \tag{2.11} \]

• SoftMax [37]: used to transform any vector into a probability distribution [76]. This is commonly used for the final layer of classification tasks with multiple classes (multi-class classification) [76].

\[ h(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{M} e^{x_k}} \tag{2.12} \]

• Logistic sigmoid [37]: similar to SoftMax, but used for binary classification [37].

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.13} \]
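A minimal numpy sketch of these activation functions; subtracting the maximum in the softmax is a common numerical-stability trick and not part of equation 2.12:

    import numpy as np

    def tanh(x):
        return np.tanh(x)                           # equation 2.9

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))             # equations 2.10 and 2.13

    def relu(x):
        return np.maximum(0.0, x)                   # equation 2.11

    def softmax(x):
        e = np.exp(x - np.max(x))                   # shift by max for stability
        return e / np.sum(e)                        # equation 2.12, sums to 1

    print(softmax(np.array([1.0, 2.0, 3.0])))       # a probability distribution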
For the first hidden layer of a neural network, the following values are linearly combined [11]:

\[ a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} \tag{2.14} \]

where k = 1, ..., K, with K being the total number of outputs of the first hidden layer. The same procedure is repeated for the remaining hidden layers [11]. The final result for classification tasks is usually obtained by applying a softmax activation function to the last layer's output [11].

Combining these transformations for a relatively simple two layer network results in the following equation:

\[ y_k(x, w) = \sigma \left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h \left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right) \tag{2.15} \]

This term can be simplified by assuming that the input vector x always has a constant component with the value 1 [11], as this removes the separate addition of the bias in each layer:

\[ y_k(x, w) = \sigma \left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h \left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right) \tag{2.16} \]

Evaluating 2.16 is called forward propagation [78].
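A minimal numpy sketch of forward propagation through the two layer network of equation 2.16, with hypothetical layer sizes, h(·) chosen as ReLU (equation 2.11) and a softmax output:

    import numpy as np

    rng = np.random.RandomState(0)
    D, M, K = 4, 8, 3                       # hypothetical input, hidden, output sizes

    W1 = rng.randn(M, D) * 0.1              # first layer weights (biases absorbed)
    W2 = rng.randn(K, M) * 0.1              # second layer weights

    def forward(x):
        """Forward propagation through the two layer network of equation 2.16."""
        a = W1 @ x                          # activations a_j of the hidden layer
        z = np.maximum(0.0, a)              # z_j = h(a_j), here h = ReLU
        out = W2 @ z                        # activations a_k of the output layer
        e = np.exp(out - out.max())
        return e / e.sum()                  # softmax: class probabilities

    x = np.append(rng.randn(D - 1), 1.0)    # constant 1 component absorbs the bias
    print(forward(x))                       # three probabilities summing to 1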
Obtaining the gradient is itself a computationally demanding step. The forward differentiation method commonly taught in school does not scale well to many parameters [78]. Therefore a technique called backpropagation is commonly used [78]; for more information, refer to section 2.2.4.

In general, overfitting is a problem of neural networks (and of other machine learning algorithms) [97]. To counteract this phenomenon we can use a loss function with regularization, synthetically increase the dataset size using data augmentation, or randomly drop certain neurons with dropout layers [37]. Note that these techniques are not exclusive and are usually applied in combination [37].

The idea of data augmentation is to synthetically increase the size of the dataset by applying transformations to the original data [105]. No transformation may alter the original label of the sample. Image data in particular allows many different kinds of filters and transformations to be applied [105]. Common operations are random crops, random rotations and adding synthetic noise [105]. In general, this technique allows for better generalization of the trained model [105]. For example, a model trained on a cats-versus-dogs image dataset with a rotation transformation applied might distinguish rotated cats and dogs better than a model trained without this augmentation. On the other hand, a transformation could also force a model to generalize unnecessarily [105]: applying a skew transformation to the cats-versus-dogs dataset might make the performance even worse.

Regularization tries to avoid overfitting by penalizing high neuron weights [97]. This prevents certain inputs from dominating the outputs and nudges the network to make use of all inputs [88]. It can easily be achieved by adapting the loss function slightly [11][88]:

L_{reg}(w) = \frac{\delta}{2} \|w\|^2 + L(w) \qquad (2.21)

where δ is called the weight decay.

Dropout layers can be used as well [37]. The idea is similar to regularization: the network should not rely too heavily on certain inputs [37]. A dropout layer turns off neurons with a certain probability [88], which forces the network to generalize better [37]. In CNNs, dropout layers are often placed at the end of the network, before the last fully connected layer [37].
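The following sketch shows how these three countermeasures might look in practice (my own illustration using PyTorch and torchvision as one possible toolchain; all layer sizes and hyperparameter values are arbitrary). The weight_decay argument corresponds to δ in equation 2.21:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

# Data augmentation: label-preserving random transformations.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomCrop(size=224, padding=8),
    transforms.ToTensor(),
])

# Dropout: randomly disable half of the neurons before the last layer.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(224 * 224 * 3, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 2),
)

# Regularization as in equation 2.21 via the weight_decay parameter.
optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=1e-4)
```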
Due to the sheer number of hyperparameters and possible approaches, successfully designing and training a neural network can be a challenging task [40]. There are rules of thumb for many hyperparameters, but in the end it is in the hands of the developer to use a good mixture of clever tricks as well as trial and error to find a good neural network model [37]. Automated machine learning is also a very promising research field; more information is available in section 2.5.

The architecture of the neural network also influences how much it tends to overfit: as a rule of thumb, the more neurons a neural network has, the bigger its learning capacity, and it therefore overfits more easily if the dataset is not sufficiently large [37]. Empirically, deep networks (with many layers) have been shown to generalize better than shallower networks with a similar number of neurons [37].

2.2.4 Backpropagation

Backpropagation is an alternative way of determining the gradient of a function [37]. Backpropagation makes training big neural networks tractable on modern computers [11]. Due to its useful properties, backpropagation has been reinvented independently in several fields [39]. It is also known as reverse-mode differentiation [78].

Another view of neural networks is that of computational graphs [37]. Neural networks essentially define a graph of simple computations consisting of differentiable operations [37]. In a computational graph, derivatives are placed on the edges between nodes. Figure 2.4 illustrates such a graph.

Figure 2.4: A computational graph with derivatives on the edges. [78]

The derivative of e with respect to b in the graph of figure 2.4 can be calculated by summing over all paths from b to e in the computational graph [78]:

\frac{\partial e}{\partial b} = 1 \cdot 2 + 1 \cdot 3 = 5 \qquad (2.22)

Figure 2.5: A simple computational graph [78]

Naively summing over all paths quickly explodes combinatorially, as the number of paths may increase exponentially with the size of the graph [78]. For example, summing over every path from X to Z in the computational graph of figure 2.5 results in the following derivative [78]:

\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta \qquad (2.23)

which backpropagation greatly simplifies by factoring the sum instead:

\frac{\partial Z}{\partial X} = (\alpha + \beta + \gamma)(\delta + \epsilon + \zeta) \qquad (2.24)

Backpropagation starts at the output node and works towards the input nodes, calculating at every node the derivative of the output with respect to that node [78]. Figure 2.6 outlines backpropagation on a computational graph. As a by-product, this process yields the derivative of the output with respect to every node in the graph [37], which is a very useful property for training neural networks [37]. Backpropagation can be summarized using the pseudo code in algorithm 2.4.

Algorithm 2.4: Backpropagation algorithm [88]
    foreach node n, visited from the output towards the inputs, do
        Compute the local gradient l_c = ∂n/∂c for each child c
        Compute m_c = l_c · ∂e/∂n for each child c
        Compute ∂e/∂c by summing over all m_c
    end

Figure 2.6: Backpropagation applied to the computational graph of figure 2.4 [78]
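To make the procedure tangible, here is a minimal sketch (my own illustration) of reverse-mode differentiation on the example from [78] behind figure 2.4, which computes e = (a + b) · (b + 1) at a = 2, b = 1:

```python
# Forward pass through the graph c = a + b, d = b + 1, e = c * d.
a, b = 2.0, 1.0
c = a + b
d = b + 1.0
e = c * d

# Backward pass: start with de/de = 1 and walk towards the inputs.
de_dc = 1.0 * d                     # local derivative of e = c * d w.r.t. c
de_dd = 1.0 * c                     # ... and w.r.t. d
de_da = de_dc * 1.0                 # c = a + b  =>  dc/da = 1
de_db = de_dc * 1.0 + de_dd * 1.0   # b feeds both c and d
print(de_db)                        # 5.0, matching equation 2.22
```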
2.2.5 Deep Learning

A neural network with multiple layers, where each layer performs increasingly higher-level feature extraction, is a deep neural network [101]; the corresponding research field is called deep learning [101]. An advantage of deep neural networks is that they scale very well with the size of the dataset and with the amount of available computational power [52]: as computational resources and dataset sizes increase, so does the performance of deep learning models.

The technique of using multiple layers in neural networks has been around for many years [101]. Relatively recently (2012), the field gained a lot of attention with the introduction of AlexNet [52], one of the first practical deep neural networks to outperform more traditional methods [101]. Although the basic techniques of deep learning have existed for a long time, their exorbitant computational requirements made them intractable for many years [101]. Deep learning on a conventional central processing unit is infeasible for large models [101]. The emergence of general-purpose programming of graphics cards solved this problem: the training of deep neural networks can be parallelized very effectively, and graphics cards excel at such tasks compared to CPUs [37].

Training a neural network requires computing the gradient of a large function [37]; this gradient can be calculated efficiently using backpropagation [97]. Despite backpropagation and large datasets, controlling the number of weights and biases is still an important factor in successful deep learning models [101]. There are many types of layers, which essentially all serve to reduce the number of parameters [37]. For example, convolutional layers are especially suited for image-related tasks, as they have relatively few parameters yet perform significantly better than a comparable fully connected layer [101].

A general problem of neural networks, and of deep neural networks in particular, is their lack of interpretability [36]. It is very difficult for humans to make well-founded assertions about what a neural network actually bases its predictions on [15]. Neural networks are essentially black boxes, with input on one side and some prediction returned on the other [36]. This is especially problematic for safety-critical systems such as self-driving cars [45]: in this domain it is critical for humans to know on what basis the system decided and whether that basis is sensible. Natural language processing is also prone to such "clever Hans" behaviour, where the neural network exploits statistical cues in the underlying data instead of really understanding the task [74]. Interpretable deep learning is an active field of research [36].
2.2.6 Convolutional Neural Networks

Figure 2.7: Visual representation of a convolutional neural network

Convolutional neural networks were developed based on feed-forward neural networks [11]. A convolutional neural network (CNN) is a deep neural network [52]. The basic layer types of CNNs are fully connected hidden layers, convolutional layers and max-pooling layers [37]. Figure 2.7 gives an overview of the architecture of a convolutional neural network.

Convolutional layers apply a convolution to each pixel [11]. A convolution is a weighted sum between two functions [37]: the first function is the input signal and the second is a filter kernel [11]. Instead of each neuron being connected to all neurons of the previous layer, it is connected to a relatively small local subset of them, which allows the layer to recognize local patterns (for example edges) [37]. The kernel is shared among the input neurons (parameter sharing) [11]. This has two main advantages: it introduces translation invariance, meaning that it does not matter whether a feature occurs at the center or near the edges of an image [37], and it significantly reduces the number of learned parameters [11]. Additionally, to increase capacity, multiple convolutions are applied per layer [37]; their number is called the depth of a convolutional layer [37]. Due to this design, and in contrast to fully connected layers, convolutional layers do not require a fixed input size, which means they can be applied to images of different sizes [88].

The kernel is a quadratic matrix of size n [52]. To avoid the size reduction caused by standard convolutions, one can set the padding parameter of a convolutional layer [37]: this synthetically enlarges the input (for example by introducing a border with a constant value). There is also a stride, which defines how the filter moves along the input [37]. If the stride is fractional (1/s), the operation is also called a deconvolution or transposed convolution [88]; it increases the output size [37].

Figure 2.8: 3x3 convolution with stride=1 and no padding.

Pooling layers are applied after convolutional layers [52]. Their goal is to reduce the dimensionality of the data (and sometimes to add translational invariance) [88]. Local pooling layers are connected to a local subset of the neurons in the previous layer [37], whose input is transformed by some function [88]. A typical pooling layer is max-pooling, where the result is the maximum value of all incoming connections of the current local subset (for example 2x2) [52]. Figure 2.9 illustrates a max-pooling operation.

Figure 2.9: 2x2 max-pooling of a simple 4x4 input layer into a 2x2 output layer

Another common type of pooling is average pooling, which is often combined with global pooling [56]. Global pooling means that the entire data is pooled at once; this operation is usually applied at the end of the convolutional part of a CNN [56]. Pooling does not involve learning weights, as the applied function is fixed before training [37]. There are also problems associated with pooling, as it essentially makes the model forget information [41]. This is one reason why many modern convolutional neural network architectures mostly avoid max-pooling [41].
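For illustration, a small NumPy sketch (my own, with toy sizes) of the two operations shown in figures 2.8 and 2.9:

```python
import numpy as np

def conv2d(x, k):
    """Convolution with stride 1 and no padding, as in figure 2.8."""
    h = x.shape[0] - k.shape[0] + 1
    w = x.shape[1] - k.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # Weighted sum of the local patch with the shared kernel.
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def max_pool2x2(x):
    """2x2 max-pooling, as in figure 2.9."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 input image
k = np.array([[1.0, 0.0, -1.0]] * 3)          # simple vertical-edge kernel
print(max_pool2x2(conv2d(x, k)))              # 6x6 -> 4x4 -> 2x2
```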
Fully connected layers are usually the last layers of a CNN; their task is to perform the classification based on the features extracted by the previous layers [52]. Interestingly, the convolutional part of a CNN can be reused for feature extraction in different problem domains (see section 2.4) [101].

2.2.7 Recurrent Neural Networks

Figure 2.10: Unrolling of a simple RNN cell over time [79]

Figure 2.11: A schematic overview of an LSTM with memory cell input and output, input gate, output gate and forget gate [44]

Recurrent neural networks (RNNs) are suited for sequences of arbitrary length, in contrast to the fixed-length input vectors of conventional feed-forward neural networks [44]. Recurrent neural networks are recurrent in the sense that they contain cycles in the topology of the network [101]. In theory, RNNs should be capable of learning long-term dependencies [44]. In practice, naive RNNs with simple loops are not suited for longer sequences because of the vanishing/exploding gradient problem [44]. To train a recurrent neural network, it must be converted into an equivalent feed-forward network [37]; figure 2.10 visualizes this unrolling. The process converts an RNN applied to a sequence of length n into a feed-forward network with n layers. During backpropagation we need to calculate the partial derivative of each weight w with respect to the loss function [44]. This partial derivative gets exponentially smaller with an increasing number of layers, which causes training to slow down significantly or even stop entirely [44].

Replacing the naive approach with Long Short-Term Memory (LSTM) cells has proven successful for many tasks [115][44][101], as the LSTM architecture keeps the vanishing gradient locked inside the cell instead of propagating it through the entire network [44]. LSTM networks are capable of learning long-term dependencies, which makes them well suited for problems in the domains of text and audio [115][107]. An LSTM cell is unrolled over time similarly to an RNN cell: for a given sequence, the LSTM cell is transformed into a multi-layer feed-forward network [79]. There are many different variations of LSTMs, but all share common building blocks [79]: memory cell input and output, an input gate, an output gate and a forget gate [79]. The memory of an LSTM cell enables the network to remember information over time [44]. We will now examine a specific implementation of an LSTM cell [79].

The forget gate f_t removes information from the memory [79]:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (2.25)

where
    W_f are the trained weights of the forget gate
    [·, ·] is the concatenation of two vectors
    h_{t-1} is the previous LSTM unit's output
    x_t is the current input
    b_f is the learned bias of the forget gate

The σ ensures that the result lies between 0 and 1, where 0 means forget the state and 1 means keep the state.
The input gate i_t has a similar function to the forget gate, but instead decides which information of the cell's memory should be updated [44]. The LSTM then calculates a new candidate memory vector C̃_t [79]:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \qquad (2.26)

C̃_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \qquad (2.27)

where
    W_i are the trained weights of the input gate
    W_C are the trained weights for calculating the new memory
    b_i is the learned bias of the input gate
    b_C is the learned bias for calculating the new memory

Next, the forget gate and the input gate are combined to calculate the new cell memory C_t [37]:

C_t = f_t * C_{t-1} + i_t * C̃_t \qquad (2.28)

where
    C_t is the memory of the current LSTM unit
    * is element-wise multiplication

For the output gate o_t, we calculate the LSTM's output h_t by combining the input and the memory [79]:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad (2.29)

h_t = o_t * \tanh(C_t) \qquad (2.30)

where
    W_o are the trained weights of the output gate
    b_o is the learned bias of the output gate

To increase the learning capacity of LSTMs, multiple LSTM cells can be applied sequentially, where the output of the first LSTM is the input of the second, similar to how convolutional layers are stacked [37]. Alternatively, the learning capacity can be increased by increasing the size of the LSTM's memory C [37]. Empirically, LSTMs have shown a tendency to overfit the training data easily, which makes finding the right hyperparameters all the more important [118].

A newer alternative to LSTMs are gated recurrent units (GRUs) [28]. Empirically, GRUs train faster and are faster during inference, but in general no significant difference in performance has been observed [28]. Even more recently, transformer architectures have also proven very successful at learning long-term dependencies [112]; transformer architectures lack recurrent connections entirely [112].
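Equations 2.25 to 2.30 translate almost directly into code. The following NumPy sketch (my own illustration with random, untrained weights and toy dimensions) performs one LSTM step per element of a sequence:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step implementing equations 2.25 to 2.30."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate (2.25)
    i_t = sigmoid(W_i @ z + b_i)          # input gate (2.26)
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate memory (2.27)
    C_t = f_t * C_prev + i_t * C_tilde    # new cell memory (2.28)
    o_t = sigmoid(W_o @ z + b_o)          # output gate (2.29)
    h_t = o_t * np.tanh(C_t)              # cell output (2.30)
    return h_t, C_t

rng = np.random.default_rng(0)            # input size 3, memory size 4
W_f, W_i, W_C, W_o = [rng.normal(size=(4, 7)) for _ in range(4)]
b_f = b_i = b_C = b_o = np.zeros(4)
h, C = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):       # unroll over a length-5 sequence
    h, C = lstm_step(x_t, h, C, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o)
```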
2.2.8 Autoencoder

An autoencoder is a deep neural network which learns a compact representation of data without supervision [37]. This is similar to principal component analysis in traditional machine learning [11]. Application areas of autoencoders include feature extraction, image processing and anomaly detection [16][12][90][65]. An autoencoder consists of two parts:

• Encoder: performs a dimensionality reduction from high-dimensional data to a low-dimensional representation [37]
• Decoder: tries to reconstruct the original data from the low-dimensional representation [37]

The intermediate low-dimensional representation is also called the code [37]. The intuition is that this bottleneck forces the neural network to extract more general features, which the autoencoder then uses to reconstruct the original signal. Still, a big problem of autoencoders is their tendency to overfit and learn a trivial identity mapping [37]. Several advanced autoencoder architectures try to mitigate this issue by applying some kind of regularization during training [37]. Denoising autoencoders receive corrupted input data while their loss is still measured by how well they reconstruct the uncorrupted data [113]. The regularization of sparse autoencoders activates only a certain number of neurons per layer, which causes the network to react especially to unique statistical features of the dataset [72]. As an application example, an autoencoder architecture can be used to reduce the dimensionality of images while still preserving important semantic information [119]. Figure 2.12 shows a simple fully connected autoencoder.

As CNNs perform very well in the image domain, they are a natural fit for autoencoders on images [52]. The encoder consists of standard convolutional blocks (convolution + max-pooling), while the decoder has to restore the original image and therefore usually consists of layers performing transposed convolutions [37]. Transposed convolutions are used for up-sampling from a low-resolution input to a high-resolution output (refer to section 2.2.6) [88].

Figure 2.12: A fully connected autoencoder with 2 encoding and 2 decoding layers
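A fully connected autoencoder in the spirit of figure 2.12 can be sketched in a few lines (my own illustration using PyTorch; the layer sizes are arbitrary):

```python
import torch.nn as nn

# Two encoding and two decoding layers, as in figure 2.12.
encoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),    # the 32-dimensional code
)
decoder = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)
autoencoder = nn.Sequential(encoder, decoder)

# Training would minimize the reconstruction error between input and
# output, e.g. loss = nn.MSELoss()(autoencoder(x), x) for image batches x.
```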
2.3 Natural Language Processing

Natural Language Processing (NLP) combines the disciplines of artificial intelligence and linguistics [60][48]. NLP researchers try to build computer programs capable of understanding human language [48]. Common NLP tasks include part-of-speech tagging, paraphrase identification, machine translation and speech recognition [86][115][107][116][54]. Early NLP tried to achieve its goal by building traditional parsers, similar to what compilers do for programming languages [48]. Unfortunately, these techniques are inadequate for human languages due to their ambiguity [60]. Another common approach has been to apply traditional machine learning to NLP [60]: the general recipe is to extract certain features from a text corpus and use these features to train a classical machine learning model [55]. Empirically, the best performing traditional algorithm in this context is the support vector machine [55]. Deep learning currently achieves many state-of-the-art results in NLP [115][54].

An advantage of many text problems is the abundance of large corpora, partly due to the rise of the internet [27]. This is an important advantage especially for unsupervised methods, since labelling large corpora may be very expensive [85][27].

For both traditional machine learning and deep learning based natural language processing, it is important to extract the right features from the dataset [60]. High-level features (such as sentence length) are usually not sufficient, so the sentences must be converted into more useful features [117]. A trivial approach would be to give each word a unique identification number [33]. This approach, however, does not capture important semantic information about the words [33]. For example, the words king and queen are more similar than king and house. If the featurization assigns ID=2 to king and ID=42 to queen, the machine learning model has no information that these words denote similar concepts. Keeping this information is crucial for successful natural language processing [115][85]. Vector space models avoid this problem by representing each word in a high-dimensional vector space in which a certain similarity measure holds [83].

There are different methods for applying natural language processing successfully; a great challenge is to incorporate semantic context into the trained models [60]. Because text is a sequence of words, we can apply recurrent neural networks in this context [57]. Especially recurrent neural networks that handle long-term dependencies reach state-of-the-art results in NLP (for more information, refer to section 2.2.7) [44]. Alternatively, a weaker form of context is achieved by a technique called n-grams [59]. Here we split the sentences into a set of tuples of size n, where each tuple contains a word and the next n − 1 words of the sentence [59]. The intuition is that the context of a word can be represented by its surrounding words [59]. For example, the 2-grams (also known as bigrams) of the sentence I like machine learning are {(I, like), (like, machine), (machine, learning)}. This method allows bag-of-words techniques to also capture context [4].

Another way to build a context-aware model is to apply hidden Markov models (HMMs) [11]. This is a statistical approach in which we assume that sentences can be described by a Markov process with latent (or hidden) variables [11]. HMMs are much simpler than RNNs, as they rely on common statistical techniques [60]. If the underlying process actually is a Markov process, they perform very well; otherwise other techniques may perform better [60].
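Extracting n-grams is nearly a one-liner; the following sketch (my own) reproduces the bigram example above:

```python
def ngrams(sentence, n=2):
    """Split a sentence into overlapping n-grams (tuples of n words)."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I like machine learning"))
# [('I', 'like'), ('like', 'machine'), ('machine', 'learning')]
```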
Often the available corpus does not contain sufficient labelling. In such cases we can apply clustering methods, which search for patterns in the underlying data without any additional labels [97]. Gaussian mixture models are a very powerful clustering method which assumes that the data is a mixture of multiple Gaussians [97]. Through an algorithm called expectation-maximization we obtain a set of Gaussian distributions, which allows us to compute for each sample the probability that it originated from a specific Gaussian, effectively a classification of this sample [11].

In the following subsections I outline several techniques for extracting word embeddings from a sentence. I begin with TFIDF, a technique which ranks words by term frequency and inverse document frequency [53]. I then continue with GloVe, an unsupervised statistical approach [83], and conclude this section with a state-of-the-art deep learning based word embedding called ELMo [85].

2.3.1 TFIDF

TFIDF stands for term frequency inverse document frequency and is a simple but powerful vector space model [53]. The idea originated in information retrieval, where the goal is to weight words by their importance for a document and for a query [99]. TFIDF is the product of two terms: the term frequency and the inverse document frequency.

The term frequency (TF) describes the total number of times a word occurs in a document [77]. The idea is that the more often a word occurs in a given document, the more important it is in describing that document [77]. The problem is that certain words (stop words) occur very frequently in almost every document [53]. For example, the word the has a very high term frequency in almost every text corpus, but is rarely important in describing a document. This is why the inverse document frequency (IDF) is considered as well [53]. This weight describes the importance of a word relative to all other documents [99]. For example, a document about the Enigma machine would mention the word Enigma frequently, so its IDF would be relatively high compared to other words. In contrast, the word the usually occurs in all documents, so its IDF would be very low.

There exist many different formulas that capture this idea [53]. In this thesis the following TFIDF equations were used [53]:

idf(t) = \log\left(\frac{n}{df(t)}\right) + 1 \qquad (2.31)

tfidf(t, D) = tf(t, D) \cdot idf(t) \qquad (2.32)

where t is some term, D some document, n the total number of documents, tf(t, D) the number of occurrences of term t in document D, and df(t) the number of documents containing term t.

The advantage of TFIDF is that it is relatively easy to understand and does not need any time-consuming training, as it only counts word occurrences [77]. This makes it feasible to compute locally, without the need for any pretrained model. The main disadvantage is that TFIDF does not capture the context of a word well: it is a bag-of-words approach which ignores the word order completely [77].
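Equations 2.31 and 2.32 are straightforward to implement. A pure-Python sketch (my own; libraries such as scikit-learn provide closely related variants):

```python
import math
from collections import Counter

def tfidf(corpus):
    """TFIDF weights per document, following equations 2.31 and 2.32."""
    n = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    df = Counter()                 # number of documents containing each term
    for words in docs:
        df.update(set(words))
    weights = []
    for words in docs:
        tf = Counter(words)        # raw term counts within this document
        weights.append({t: tf[t] * (math.log(n / df[t]) + 1) for t in tf})
    return weights

corpus = ["the enigma machine", "the cat sat on the mat"]
print(tfidf(corpus)[0]["enigma"])  # rare term: weight ~1.69
print(tfidf(corpus)[0]["the"])     # stop word: weight 1.0
```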
2.3.2 GloVe

The basis for training Global Vectors for Word Representation (GloVe) is a word-to-word matrix which denotes the frequency with which words occur together in the training data [84]. The intuition behind this word embedding is that the ratios of word-word co-occurrence probabilities may encode the underlying meaning of the words [83]. During training, the objective is to learn word vectors such that their dot product equals the logarithm of the probability of co-occurrence with another word [83]. GloVe is a bag-of-words approach, as it uses a word-to-word matrix without any information about sequence [83].

Probability and ratio        k = solid     k = gas       k = water    k = fashion
P(k|ice)                     1.9 × 10⁻⁴    6.6 × 10⁻⁵    3.0 × 10⁻³   1.7 × 10⁻⁵
P(k|steam)                   2.2 × 10⁻⁵    7.8 × 10⁻⁴    2.2 × 10⁻³   1.8 × 10⁻⁵
P(k|ice)/P(k|steam)          8.9           8.5 × 10⁻²    1.36         0.96

Table 2.2: Example co-occurrence probabilities for the words ice and steam given solid, gas, water and fashion [83]

Table 2.2 shows the conditional co-occurrence probabilities for the words ice and steam given the words solid, gas, water and fashion. As expected, steam co-occurs more frequently with gas, while ice co-occurs more frequently with solid. The unrelated word fashion co-occurs rarely with both steam and ice. The ratio P(k|ice)/P(k|steam) captures some basic semantic meaning: if the ratio is much greater than 1, the word k correlates with properties specific to ice (such as solid); if it is significantly below 1, the word correlates with properties specific to steam (such as gas); if it is close to 1, the word is related to both (water) or to neither (fashion).

2.3.3 ELMo

The previously mentioned models have a very limited sense of context; in particular, each word is assigned the same word vector regardless of its context [83][53]. ELMo (Embeddings from Language Models) assigns word embeddings depending on context [85]. The basis for ELMo is a language model trained on a large corpus of text [85]. A language model predicts the next word given a sequence of prior words [60]. For example, given the sequence of words Yesterday I was in New York. I visited the statue of ..., a good language model could predict that the next word is Liberty. ELMo models language using a 2-layer bidirectional LSTM with residual connections, which allows deep models to be trained more efficiently. However, the words still must first be converted into a vector representation [85]. This is achieved by converting the words into a character embedding and then applying a convolutional layer with max-pooling [85]. The character embedding has the advantage that the model handles words that are not in the vocabulary better [85].

The novel idea of ELMo is not how the language model is trained, but how the trained model is applied [85]: ELMo calculates the final embedding of a word by combining the hidden states of the LSTMs and the output of the convolutional part into a unified vector representation [85]. The combination is a weighted sum whose weights are also learned [85].
2.4 Transfer Learning

Usually, the more complex a given task is, the more labelled data deep learning needs to achieve good results [37]. This is not only expensive, but sometimes impossible: Gary Larson, for example, has only drawn so many cartoons in his lifetime. Labelling a big dataset is also a big challenge and not always feasible. Transfer learning tries to solve this problem: instead of training a completely new model for each task, the idea is to reuse existing models and only slightly adapt them to new tasks [37]. The original model is trained on an already existing big dataset, while the dataset used for fine-tuning can be orders of magnitude smaller [88]. In the context of deep neural networks, this tuning is usually done by freezing the weights of certain (typically lower) layers during training [37]: the lower layers usually perform more general feature extraction, which transfers to other domains more easily [88].

For image classification tasks, for example, very big datasets already exist [52], and training a well-performing classifier on them has been done numerous times [41][52]. CNNs in particular have an architecture well suited for transfer learning: as mentioned in section 2.2.6, there is the convolutional part and the fully connected part. The convolutional part usually acts as a feature extractor, and the fully connected part makes the final prediction. Empirically, the feature extractor has been shown to generalize very well to new domains; the more similar the domains, the less training is needed for the new one. This thesis, for example, takes a pretrained ResNet8 model, which performs very well on traditional image classification tasks, and applies it to Gary Larson's cartoons [41]. Natural language processing works similarly, by transferring word embeddings between different corpora [27]. Importantly, these word embeddings transfer well to a broad range of tasks while still achieving state-of-the-art performance [85].
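A typical transfer learning recipe, sketched here with PyTorch and torchvision as one possible toolchain (torchvision's ResNet-18 serves as a stand-in for whatever pretrained backbone is actually used), freezes the convolutional feature extractor and retrains only a new final layer:

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on a large image classification dataset.
model = models.resnet18(pretrained=True)

# Freeze the feature extractor so its weights stay fixed during training.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the new, much smaller task,
# e.g. a binary "funny vs. not funny" prediction on cartoons.
model.fc = nn.Linear(model.fc.in_features, 2)
```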
2.5 AutoML

Machine learning models have increasingly many hyperparameters [52][41]. Traditionally these have been set manually by the machine learning practitioner, which does not scale well to large numbers of hyperparameters [97]. Trivially, grid search or random search can be applied. Grid search samples all possible combinations of hyperparameters [42]. Consider the following example: if a machine learning algorithm has four hyperparameters and each hyperparameter has six possible values, then 6⁴ = 1296 models have to be trained and evaluated. There are two significant problems with this approach: it scales exponentially with the number of hyperparameters, and for continuous hyperparameters only a predefined subset of possible values can be tested [7]. Random search assigns random hyperparameter values until some halting criterion is reached [7]. This technique performs very well for many hyperparameter optimization tasks, as it does not waste time on optimizing irrelevant hyperparameters [7].

Automated machine learning (AutoML) is the process of finding hyperparameters, or even the entire machine learning pipeline, automatically, without (or with much less) human intervention, using machine learning itself [50]. An ideal AutoML system could perform the following tasks automatically [42]:

• Data pre-processing
• Feature engineering
• Feature selection
• Algorithm selection
• Hyperparameter selection

In practice, completely autonomous AutoML is not yet possible, as the resulting search space is intractable for currently available computer hardware [42]. Nevertheless, AutoML may still save time and yield better models, as the practitioner can focus on more important questions than hyperparameter tuning [42].

This thesis uses the automated machine learning algorithm implemented in the work of Komer et al. [50]. Their philosophy is that the selection of the entire machine learning pipeline (including data pre-processing) can be seen as a single large hyperparameter optimization problem [50]. Using Hyperopt-sklearn requires the machine learning practitioner to define the following parameters, also known as hyper-hyperparameters (because they are parameters for finding the hyperparameters) [50]; a usage sketch follows this list:

• Objective function: maps a configuration to a scalar value which the algorithm tries to minimize [50]
• Search domain: random variables whose distribution should match the most likely best configuration [50]
• Optimization algorithm: an algorithm which finds the best configuration in the search domain given the objective function [50]
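A minimal Hyperopt-sklearn setup might look as follows (a sketch based on the library's documented interface; the dataset and the evaluation budget are placeholders):

```python
from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
from hyperopt import tpe
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estim = HyperoptEstimator(
    classifier=any_classifier("clf"),        # search domain: any classifier
    preprocessing=any_preprocessing("pre"),  # ... and any pre-processing
    algo=tpe.suggest,                        # optimization algorithm (TPE)
    max_evals=50,                            # halting criterion
    trial_timeout=120)
estim.fit(X_train, y_train)                  # objective: validation loss
print(estim.score(X_test, y_test))
print(estim.best_model())
```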
Now that I have covered the necessary foundations of machine learning, artificial neural networks, natural language processing, transfer learning and automated machine learning, I will next outline the context in which I applied them.

2.6 Computational Humor

According to the Oxford dictionary, humor is "the quality in something that makes it funny or amusing; the ability to laugh at things that are amusing" [3]. The exact reasons why humans developed humor are still under debate. It seems that humor contributes to physical and psychological well-being [62][71]. Evolutionary psychologists have proposed that humor may be a result of sexual selection: women may have preferred men with humor, as it may have been a sign of intelligence, adaptability and a desire to please others [70].

Over the centuries, many theories of humor have been proposed:

• Superiority theory: one of the first theories of humor; it proposes that laughter expresses feelings of superiority over other people (or over a former self) [71]. This theory was popular until the 20th century.

• Relief theory: another early theory of humor; it proposes that humor can be seen as a valve which releases pressure [71]. This pressure was initially thought to be some kind of gas or energy, but was later considered a form of psychic energy [70][71].

• Incongruity theory: this theory proposes that humor results from the perception of something incongruous [14], in particular when the mind observes something that violates its mental model and expectations [14]. This is in line with many of Gary Larson's cartoons: the cartoon sets up a scene that resolves contrary to the reader's expectations, for example scientists behaving like toddlers in usually serious situations.

• Play theory: a more modern view sees humor as a form of play [70]. It has been observed that many animals learn important skills during play [70], and humor seems to serve a similar purpose for human social behaviour [58]. Usually humor shows what should be avoided during communication or in other (social) situations [70]. Laughing and smiling mark such situations and can therefore be considered a signal to play [58].

Computational humor combines the field of artificial intelligence with humor research [96]. As the interface between humans and computers becomes more natural, it is important that computers can understand humor as well [6]. The rise of chatbots and virtual assistants shows that humor is still a very difficult task for these applications [6]. An important challenge of computational humor is that humor is very subjective: there is no universal humor, as every person finds different things funny [71]. This may be the reason why there is currently no unified computational model of humor; instead, models for certain subtasks of computational humor have been constructed [117][8].

Common tasks of computational humor are either generative or discriminative [6][117]. Examples of generative tasks are joke generation and pun generation, where the computer has to synthesize a joke or a pun. A discriminative task would be, for example, joke recognition, where the computer should detect whether a given text is a joke or not. Computational humor and language understanding go hand in hand [117]: to really understand a joke, the interpreter (be it human or computer) has to understand the language. Proper language understanding is still an important research topic [85][27]. Many early approaches to computational humor focused on the incongruity theory, where humor is an unexpected difference between what is expected and what is observed [10].
The general idea was to find or produce contradictions between what a human would expect and what is actually communicated. These approaches had only limited success, most likely because the underlying language understanding was not sophisticated enough [10]: the computer did not have a proper model of what the human would understand and therefore could not apply the incongruity theory successfully. Another problem of early computational humor models was their heavy reliance on ontologies such as WordNet and VerbNet [10][67]. These are human-crafted databases of concepts and the relationships among them. Challenges for these databases are the ambiguity of many concepts as well as their limited size. For example, VerbNet contains only around 4500 verbs, which results in a very limited understanding of sentences [102].

2.7 Related Work

In this section I outline several works in the field of computational humor related to this thesis. I start by discussing several traditional approaches which do not use deep learning, and then continue with deep learning based works.

2.7.1 Computational Humor using Traditional Methods

The following sections discuss several works in the field of computational humor that rely on traditional machine learning methods.

Humor Recognition and Humor Anchor Extraction

Yang et al. formalize humor recognition as a binary classification task between humorous and non-humorous sentences [117]. The authors perform their experiments on two datasets, using the Pun of the Day and One-liner datasets as positive examples. Samples for the negative class come from various other sources (Yahoo! Answers, New York Times, AP News and Proverb). The negative samples are filtered such that the sentences are approximately of the same length and contain the same words as the positive class.

Additionally, their work performs humor anchor extraction. This technique extracts the key words that enable the humor of a sentence; removing these words would render the sentence no longer humorous. The paper lists the example sentence I used to be a watchmaker; it is a great job and I made my own hours, with a sensible selection of anchors being watchmaker, made and hours.

The authors extract several features from the input sentences, each describing some aspect of humor:
• Incongruity structure: disconnection (maximum meaning distance of word pairs) and repetition (minimum meaning distance of word pairs)

• Ambiguity theory: sense combination (the higher, the more possible interpretations a sentence has), sense farmost (largest path similarity of any sense of a word) and sense closest (smallest path similarity of any sense of a word)

• Interpersonal effect: negative and positive polarity (number of occurrences of negative or positive words) and weak and strong subjectivity (number of occurrences of weakly or strongly subjectivity-oriented words)

• Phonetic style: alliteration (number and maximum length of alliteration chains) and rhyme (number and maximum length of rhyme chains)

The work describes the training of a classifier based on these features. Important for humor anchor extraction is the fact that the features are independent of word order and can also be computed on partial sentences. In addition to these human-centric features, the best performing model also uses Word2Vec features [66].

The procedure for determining the humor anchors is surprisingly simple. Since the features can be computed on any (partial) sentence, the algorithm can freely remove words and still obtain a reasonable prediction. Only certain word types are considered (verbs, nouns, adverbs and adjectives). Algorithm 2.5 describes the idea on a high level; a schematic code sketch follows below.

Algorithm 2.5: Humor anchor extraction
    Take only the verbs, nouns, adverbs and adjectives of the original sentence
    Calculate the humor score of the full sentence
    foreach possible anchor subset of the sentence do
        Calculate the humor score of the sentence without the subset
    end
    Return the subset causing the maximum decrement of the humor score

From a performance point of view, the humor recognition works well: on the Pun of the Day dataset the best model reaches an F1 score of 0.859, and on the One-liner dataset an F1 score of 0.805.

Humor anchor extraction also performs quite well. The authors conduct a quantitative evaluation on 200 random samples annotated by three annotators. On the Pun of the Day dataset, the strict variant, in which the anchors must match exactly, reached an F1 score of 0.438, while the relaxed variant, in which at least one word has to match, reached an F1 score of 0.756. On the One-liner dataset the performance was worse: the strict version reaches an F1 score of 0.288, the relaxed version an F1 score of 0.616.
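The anchor search of algorithm 2.5 can be transcribed schematically as follows (my own sketch; humor_score stands in for the trained classifier's prediction, and the input words are assumed to be pre-filtered to the considered word types):

```python
from itertools import combinations

def extract_anchors(words, humor_score, max_size=3):
    """Return the word subset whose removal lowers the humor score the most
    (schematic version of algorithm 2.5; humor_score is a hypothetical
    callable mapping a list of words to a humor prediction)."""
    base = humor_score(words)
    best_subset, best_drop = None, 0.0
    for size in range(1, max_size + 1):
        for subset in combinations(words, size):
            remaining = [w for w in words if w not in subset]
            drop = base - humor_score(remaining)  # decrement from removal
            if drop > best_drop:
                best_subset, best_drop = subset, drop
    return best_subset
```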
Humorist Bot: Bringing Computational Humour in a Chat-Bot System

Augello et al. implement humor detection and joke generation in the context of a chatbot [6]. A chatbot is a software agent able to hold a conversation with a human. Humorist Bot is able to generate humorous anecdotes and to react to the user's humorous messages: a funny message not only affects the answer of the bot, but also switches its avatar to a laughing one.

The chatbot uses a rather traditional knowledge base with different answer sets:

• a set for general conversations
• a set for humorous sentence generation
• a set for humorous sentence recognition

The AIML knowledge base is an XML format describing a set of possible answers; an incoming user message (such as How are you?) is matched against these entries using pattern matching.
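To illustrate the flavor of such pattern matching (a toy sketch in Python, not the actual AIML engine, which stores its categories as XML):

```python
# A miniature AIML-style knowledge base: input patterns map to answers.
knowledge_base = {
    "HOW ARE YOU": "I am functioning within normal parameters, thanks!",
    "TELL ME A JOKE": "I would, but my humor module is still training.",
}

def respond(user_input):
    # Normalize the input the way pattern matchers typically do.
    pattern = user_input.upper().strip("?!. ")
    return knowledge_base.get(pattern, "Interesting, tell me more.")

print(respond("How are you?"))
```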