Analyse politischer Meinungen und Erkennung von Stilfiguren Klassifizierung basierend auf Themen und Meinungstypen im politischen Kontext; Erkennung von Alliterationen und Hyperbeln DIPLOMARBEIT zur Erlangung des akademischen Grades Diplom-Ingenieur im Rahmen des Studiums Data Science eingereicht von Christopher Deringer, BSc Matrikelnummer 01529026 an der Fakultät für Informatik der Technischen Universität Wien Betreuung: Ao.Univ.Prof. Ing. Mag. Dr. Horst Eidenberger Wien, 29. Februar 2024 Christopher Deringer Horst Eidenberger Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Political Opinion Analysis and Figure of Speech detection Topic and opinion type classification in the political context; alliteration and hyperbole detection DIPLOMA THESIS submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Data Science by Christopher Deringer, BSc Registration Number 01529026 to the Faculty of Informatics at the TU Wien Advisor: Ao.Univ.Prof. Ing. Mag. Dr. Horst Eidenberger Vienna, 29th February, 2024 Christopher Deringer Horst Eidenberger Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Erklärung zur Verfassung der Arbeit Christopher Deringer, BSc Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwen- deten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe. Wien, 29. Februar 2024 Christopher Deringer v Danksagung Diese Stelle möchte ich gerne nutzen um mich bei den Personen zu bedanken, die es mir ermöglicht haben, an diesen Punkt zu kommen. Ich möchte meine tiefste Dankbarkeit gegenüber meinem Betreuer Ao.Univ.Prof. Ing. Mag. Dr. Horst Eidenberger ausdrücken, der mir während dieser Arbeit kontinuierliche Unterstützung zukommen ließ und immer sehr zeitnah geantwortet hat wenn ich eine Frage hatte. Ich bin sehr dankbar für das konstruktive Feedback das ich erhalten habe und für die Freiheit die mir gewährt wurde, ohne ihn wäre diese Diplomarbeit nicht realisiert worden. Ein besonderer Dank gilt auch meiner Familie und ganz besonders meinen Eltern, die mich während meines Studiums fortwährend unterstützt haben und immer ein offenes Ohr für mich hatten. Danke, dass ihr mir diese Möglichkeit gegeben habt und stets hinter mir gestanden seid. Abschließend möchte ich mich bei meiner Partnerin bedanken, die stets eine Quelle der Motivation für mich war und es immer schaffte mich aufzuheitern. Danke, dass du immer für mich da bist. vii Acknowledgements Now is the moment to express my appreciation towards the people who made it possible for me to come to this point. I would like to express my deepest gratitude to my supervisor Ao.Univ.Prof. Ing. Mag. Dr. Horst Eidenberger who supported me throughout this thesis and always responded as quickly as possible. I am very thankful for the constructive feedback that I received and the freedom I was granted, it would not have been possible to write this thesis without him. I would like to extend my gratefulness to my family and especially my parents who always supported me during my studies and have been there for me. Thank you for giving me this opportunity and for having my back during all this time. I wish to expand my thankfulness to my partner who was able to motivate me and cheer me up on all occasions. Thank you for always being there for me and for the constant support. ix Kurzfassung Im Rahmen dieser Arbeit wurden vier verschiedene Probleme im Kontext von deutsch- sprachigen politischen Aussagen und der deutschen Sprache im Allgemeinen behandelt. Diese sind die Klassifizierung von Themen und Meinungstypen und die Erkennung von Alliterationen und Hyperbeln. Die meisten Experimente wurden mit einem Datensatz durchgeführt, der basierend auf den Protokollen des österreichischen Nationalrates erstellt wurde und rund 65.000 politische Aussagen enthält. Der gewählte Ansatz für die Themenerkennung baut auf der Extrahierung relevanter Begriffe aus dem Wikipedia Artikel zu einem Thema auf. Die Resultate wurden händisch evaluiert, die Genauigkeit war im Falle von den Themen "Feminismus"(36,39%) und "Flüchtlingskrise in Europa"(19,04%) sehr niedrig, im Fall von dem Thema "Klimawan- del"(89,02%) jedoch gut. Der Ansatz, der für die Klassifizierung von Meinungstypen verfolgt wurde, stammt von Othman et al. [Using NLP Approach for Opinion Types Classifier, Othman et al., 2015] und wurde für die englische Sprache entwickelt. Das Ziel der Experimente war es herauszufinden, ob der Ansatz auch für die deutsche Sprache funktioniert. Die Evaluierung zeigte, dass die Genauigkeit im Falle von Meinungen im Positiv vergleichbar ist (76,60% vs 71,00%), das war jedoch nicht der Fall für Meinungen, die den Komparative (78,30% vs 44,00%) oder Superlativ (82,10% vs 44,00%) verwenden. Die Erkennung von Alliterationen war erfolgreich bei der Verwendung eines Datensatzes der 605 Alliterationen enthält, eine Genauigkeit von 99,33% wurde erreicht. Es wurden noch drei zusätzliche Experimente auf freien Text durchgeführt, hier wurde eine durch- schnittliche Genauigkeit von 53,83% erreicht, das Minimum war 30,00%. Der Ansatz verwendet den Kölner Phonetik Algorithmus der von Postel [Die Kölner Phonetik - Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse, Postel, 196] entwickelt wurde und kombiniert ihn mit zusätzlichen Regeln. Für die Erkennung der Hyperbeln wurde ein existierender Ansatz für die englische Sprache, der von Troiano et al. [A computational exploration of exaggeration, Troiano et al., 2018] entwickelt wurde und auf semantischen Eigenschaften basiert, für die deutsche Sprache implementiert. An das Problem wurde mit überwachtem Lernen herangegangen, es wurde als ein binäres Klassifikationsproblem definiert. Die Resultate wurden im Hinblick auf Genauigkeit (76,00% vs 52,23%), Trefferquote (76,00% vs 38,52%), Treffergenauigkeit (72,00% vs 68,90%) und F1-Score (76,00% vs 41,11%) verglichen. xi Abstract In the scope of this work, four different problems have been studied in the context of German political statements and the German language in general, namely topic classification, opinion type classification, alliteration detection, and hyperbole detection. Most of the experiments were conducted using a dataset that was created based on protocols of the Austrian national council containing around 65000 political statements. The topic classification was performed by extracting topic related terms from the Wikipedia article on a certain topic. It was manually evaluated and led to results that leave room for improvement, as the precision regarding the topics feminism (36.39%) and European migrant crisis (19.04%) showed. In the case of climate change, a precision of 89.02% was achieved. The approach that was implemented for opinion type classification is based on part-of- speech tagging and was proposed and implemented for the English language by Othman et al. [Using NLP Approach for Opinion Types Classifier, Othman et al., 2015]. The goal of the experiments was to show whether the approach is applicable to the German language as well when using a part-of-speech tagger for the German language and the respective tags. The evaluation showed that the performance of this approach is comparable in terms of precision in the case of the opinionated statements (76.60% vs 71.00%). It was not the case for comparative (78.30% vs 44.00%) and superlative opinionated statements (82.10% vs 44.00%). In the case of alliteration detection, a precision of 99.33% was achieved on an alliteration dataset containing 605 alliterations. Three additional experiments were performed on free text, where an average precision of 53.83% was achieved, with 30.00% being the worst case. The approach utilizes the Cologne Phonetics algorithm by Postel [Die Kölner Phonetik - Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse, Postel, 1969] and combines it with additional rules. For hyperbole detection, an existing approach for the English language by Troiano et al. [A computational exploration of exaggeration, Troiano et al., 2018] based on the computation of semantic features has been implemented for the German language. It was defined as a supervised machine learning problem; a binary classification task. The results were compared in terms of precision (76.00% vs 52.23%), recall (76.00% vs 38.52%), accuracy (72.00% vs 68.90%) and F1-score (76.00% vs 41.11%). The performance was only comparable in terms of accuracy. xiii Contents Kurzfassung xi Abstract xiii Contents xv 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 7 2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Natural Language Processing (NLP) . . . . . . . . . . . . . . . . . . . 16 2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3 Design 31 3.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Methodological Approach . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4 Experiments 39 4.1 Extracting technical terms from Wikipedia . . . . . . . . . . . . . . . 39 4.2 Opinion Type Classification . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Alliteration Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4 Hyperbole Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5 Evaluation 61 5.1 Extracting technical terms from Wikipedia . . . . . . . . . . . . . . . 61 5.2 Opinion Type Classification . . . . . . . . . . . . . . . . . . . . . . . . 70 5.3 Alliteration Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.4 Hyperbole Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 xv 6 Conclusion 93 List of Figures 97 List of Tables 99 List of Algorithms 101 Bibliography 103 CHAPTER 1 Introduction 1.1 Motivation In recent years, the landscape of political discourse has undergone a profound transfor- mation, propelled by the unprecedented surge of information and the growing influence of digital platforms. This dynamic environment has given rise to critical challenges, notably the erosion of public trust in government institutions and the pervasive spread of misinformation and disinformation. As we navigate this complex terrain, there is an urgent need for robust tools and frameworks that can dissect political speech, providing valuable insights into sentiments, opinions, and the underlying narratives shaping our socio-political landscape. The OECD Trust Survey [OEC], which was conducted in 2021 for the first time, is a cross-national survey which aims to measure the trust in government and public insti- tutions. In the first edition, 22 countries including Austria participated. In the second edition, 30 countries participate. However, the second edition of the survey does not include Austria. The survey collected opinions on a range of questions, some of them relating to the responsiveness of governments to public feedback. In Austria, 41,36% of the participants said that they think it is unlikely that a public service would be improved in response to public feedback, 37,23% said it is likely, and 18,84% were neutral on the issue, while the remaining 2,57% said they do not know. The OECD report presented the results with the headline "Governments seen by many as unresponsive to public feedback", which might be true in the case of Austria. To work on an issue like this, it is worthwhile to investigate which topics are being discussed in the national council. To solve this problem, I suggest an approach utilizing topic-based information retrieval. The proliferation of misinformation and disinformation further exacerbates the challenges 1 1. Introduction faced by citizens seeking to make informed decisions. By developing a comprehensive framework for political sentiment analysis, I aim to implement an approach to effectively discern between factual information and opinions of varying effects. Two linguistic elements that hold particular significance in this realm are hyperboles and alliteration, and this thesis also aims to implement a method for reliably detecting them. Hyperboles, characterized by exaggerated statements or claims not meant to be taken literally, are frequently employed by political figures to emphasize a point, elicit an emotional response, or underscore the urgency of an issue. The detection of hyperbolic expressions is crucial as it unveils the rhetorical strategies at play, enabling a more nuanced understanding of the speaker’s intent and the potential impact on public sentiment. Simultaneously, alliteration, the repetition of consonant sounds at the beginning of adjacent or closely connected words, serves as a rhetorical device that enhances the rhythm and resonance of language. In political speech, alliteration can be strategically utilized to amplify key messages, create memorable slogans, and reinforce a sense of unity or urgency. By incorporating the detection of alliterative patterns into my frame- work, I aim to unveil the stylistic choices made by politicians to craft persuasive narratives. This thesis draws inspiration from the protocols of the Austrian National Council, recognizing the importance of tailoring natural language processing techniques to the specific nuances of political discourse in the Austrian context. The intersection between political opinion analysis and figure of speech detection holds the potential to strengthen the understanding of political communication in the sense that rhetoric details are highlighted, which gives interesting insights into how politicians are expressing their ideas and views. Alliterations can be used to draw attention to certain parts of a statement, hyperboles might be used to evoke an emotional response. Both of these rhetorical instruments could be used to influence how a certain issue is perceived and how opinions are formed. One example is the following sentence: "Statt Mut impft ihr den Leuten Angst ein". This sentence is a quotation of representative Wolfgang Zanger, the statement was included in his speech in the 58. sitting of the national council [nat]. It can be interpreted as stating that the addressed politicians try to scare the public instead of giving them hope. The interesting part in this context is the word "impft". The word "impft" is the inflected form of "impfen", "impfen" stands for "to vaccinate" and could potentially lead to emotional responses as the topic of vaccination played an important role during the COVID-19 pandemic. This type of analysis can lead to a deeper understanding of political speeches. 1.2 Problem Statement This experimental study aims to propose further approaches for recognizing opinions and the way they are expressed in written text with the help of various natural language processing techniques. During an election a lot of different politicians and parties are 2 1.3. Research Questions trying to win the favor of the voters. One important factor for the individual voting decisions might be the opinions which are communicated by the politicians whom are associated with a certain party. It is hard to be aware of everything that has been said by members of a certain party, generally it is hard to identify whether the individual politicians of a party are pulling in the same direction. One subproblem of this larger problem is opinion type classification, where one wants to distinguish between non-opinionated statements, opinions, comparative opinions and superlative opinions, where the opinions are expressed with either the positive, compara- tive or superlative form. Figures of speech are deviating from ordinary language and have the goal to produce a certain rhetorical effect. Two popular figures of speech are hyperbole and alliteration. It is known that these figures of speech are sometimes used in marketing as an attempt to win the attention of a potential customer. The hyperbole is used in the political context as well. An automatic detection of these figures of speech would make it easier to detect their usage and pave the way for a better understanding. The paragraphs prior to this one state the two problems which will be the main fo- cus of this thesis. The first problem is the opinion type classification, the second problem is alliteration and hyperbole detection. In addition a rule-based approach is going to be implemented which aims at providing the capabilities to search for political statements based on their topic. 1.3 Research Questions The following four research questions were defined to tackle the problems that were described in the problem statement: RQ1: How high is the precision of a rule-based approach for topic classification in political statements using corpora extracted from Wikipedia? This research question is answered by implementing a simple form of information retrieval, an inverted index is created which contains political statements from the Austrian national council protocols. The idea is to extract topic-related terms from a Wikipedia article, which are then used to search for statements containing the topic-related terms. The precision is calculated based on the amount of relevant and irrelevant statements. RQ2: How does an approach using a German tag set and a tagger for the German language, instead of an English tag set and a tagger for the English language, for opinion type classification in German sentences, hold up against the approach developed by Othman et al. regarding opinion type classification in English sentences? This question will be evaluated using precision 3 1. Introduction Othman et al. [OHMI15] defined four different classes, namely opinionated, com- parative opinionated, superlative opinionated and non-opinionated. The different classes are assigned based on certain part of speech tags (POS-tags). This research question is answered by implementing the same approach for the German language, the precision is calculated based on whether the statement received the correct class from the implemented algorithm. RQ3: The Cologne phonetics algorithm can be used to transform a word into a numerical representation depicting the underlying phonetics. Which additional constraints need to be added so that the numerical representation resulting from the Cologne phonetics algorithm is feasible for alliteration detection with 95 percent precision? The third research question is answered by implementing an algorithm for allit- eration detection which uses the Cologne phonetics algorithm by Postel [Pos69] and additional constraints which are worked out by performing experiments on an alliteration dataset containing 605 alliterations. The precision is calculated based on whether the alliterations are detected or not. RQ4: Troiano et al. developed an approach for classifying English sentences as either hyperbolic or not. Is this approach applicable to the German language as well? This question will be evaluated using accuracy, precision, recall and the F1-Score. Troiano et al. [TSÖT18] developed an approach for hyperbole detection for the English language, which uses semantic features that are computed by using pre- trained models or specific datasets. The goal of this question is to implement the same approach for the German language. It requires research regarding datasets and models which are suitable for calculating the same semantic features for German statements. It is a classification task with two classes, multiple supervised machine learning algorithms are used and the metrics accuracy, precision, recall and F1- score are used to evaluate the results, which are then compared to the best results achieved by Troiano et al. [TSÖT18]. 1.4 Outline This section gives an overview regarding the remaining chapters. Chapter 2 is the literature chapter of the thesis and contains sections on machine learning 2.1, natural language processing 2.2 and the related work 2.3. The machine learning section 2.1 contains subsections on supervised learning 2.1.1 and deep learning 2.1.2. The supervised learning subsection 2.1.1 is further split into by different topics, namely regression 2.1.1.1, classification 2.1.1.2 and transformers 2.1.2.1. The section on natural language processing contains subsections on tokenization 2.2.1, phonetic algorithms 2.2.2, part-of-speech tagging 2.2.3, sentiment analysis 2.2.4 and BERT 2.2.5. 4 1.4. Outline The related work can be found in section 2.3, this section has a subsection on allit- erations and hyperboles in marketing an politics (2.3.1), a subsection on hyperbole detection (2.3.2) and a subsection on opinion type classification (2.3.3). In the third chapter (3) the research questions are defined and the methodological approach and experiment design are described. The first section (3.1) shows the requirements that need to be fulfilled so that each research question can be answered, section 3.2 shows the scientific methodology that is followed throughout the thesis and 3.3 describes how the experiments are designed. Section 3.4 provides information regarding the datasets that were created in the scope of this thesis. The fourth chapter (4) describes the experiments that were conducted in the scope of this thesis. Section 4.1 contains the description for the Wikipedia-based topic classification approach that was implemented to find political statements based on the topic. It describes the overall process that the algorithm follows and the datasets that were used The second section (4.2) of chapter 4 describes the opinion type classification approach that was implemented by Othman et al. [OHMI15] and shows how it can be applied to the German language, by using a part-of-speech tagger and a tagset for the German language. In addition it describes a rule-based approach and a dictionary-based approach, which was implemented in addition so that the results can be compared to the results achieved by using the approach which is based on the part-of-speech tags. Section 4.3 describes the algorithm that was implemented for alliteration detection and section 4.4 describes how the hyperbole detection approach by Troiano et al. [TSÖT18] was re-implemented for the German language. Chapter 5 contains the evaluation of all the experiments that were conducted in the scope of this thesis. The sections in the chapter are again separated by topic. Section 5.1 shows the evaluation of the experiments regarding the Wikipedia-based approach for topic classification, section 5.2 shows the evaluation of the three different approaches that were implemented for opinion type classification. The third section (5.3) of chapter 5 shows the results that were achieved using the alliteration detection algorithm and section 5.4 shows the results of the hyperbole detection algorithm for the German language. The final chapter (6) summarizes the thesis by giving answers to the research questions and by making suggestions for future work. 5 CHAPTER 2 Background 2.1 Machine Learning 2.1.1 Supervised Learning In machine learning, a supervised learning problem is a problem that is tackled by providing a dataset that already contains input and output variables. According to the book "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman [HTFF09] it is called "supervised" as there already is an outcome variable available to guide the learning process. One can distinguish between two types of supervised learning tasks, one being regression and the other one being classification. The distinction is made based on the output [HTFF09], where regression is used in tasks with quantitative output and classification is used in tasks with qualitative output. Both tasks can be seen as a task in function approximation. 2.1.1.1 Regression Regression is one of the two types of approaches that are available in supervised learning. It is used if the input variables of a dataset are used to compute a quantitative output variable. One popular teaching example for a regression task uses a dataset [Wic19] about dia- monds, which contains the four C’s of diamond quality, being carat, cut, color and clarity and five physical measurements. The dataset was created by Hadley Wickham [Wic19] based on data from a diamond search engine and it contains more than 50000 entries. The dataset is included in the 7 2. Background ggplot2 R package and it is described in the book "ggplot2: Elegant Graphics for Data Analysis" by Hadley Wickham [WW16]. The output variable is the price of a diamond, the goal is to train a model that can successfully calculate the price of a diamond based on the given input variables or a subset of them. Linear Regression The linear regression model, as shown in Hastie et al. [HTFF09], has the form f(X) = β0 + p j=1 Xjβj (2.1) with an input vector XT = (X1, X2, ..., Xn) and an output Y . The assumption which is made by the model is that the regression function E(Y |X) is linear or that a linear model can be seen as an approximation. The idea is to estimate the coefficients β by using the available training data. The most popular method for estimating β is least squares, where the coefficients β = (β0, β1, ..., βp)T are picked with the intention of minimizing the residual sum of squares (RSS) as described by Filzmoser [Fil20]: RSS(β) = N i=1 (yi − f(xi))2 RSS(β) = N i=1 (yi − β0 − p j=1 xijβj)2 (2.2) 2.1.1.2 Classification Classification is the second type of approach that is available in supervised learning, next to regression. It is used if the input variables of a dataset are used to compute a qualitative output variable. A popular example for a classification task, which is described in Hastie et al. [HTFF09], is an email spam filter which is used to decide whether an email is spam or a normal email that a person actually wants to read. There are two qualitative categories, being C = {spam, email} and the assumption is that an input email can either be one or the other. Naive Bayes Naive Bayes classifiers all share the property that they assume that the value of an input variable is independent of the value of any other input variable in regard 8 2.1. Machine Learning of the output variable. Li and Jain [LJ98] explain it in the context of a text classification system where C = (c1, ..., cm) are the m document classes. An unlabeled document, or input text, D is transformed into a list of words W = (w1, ..., wd), which is then used in the naive Bayes approach to assign a class to document D as follows: c∗ NB = argmaxcj∈CP (cj) d i=1 P (wi|cj) (2.3) with P (cj) being the a priori probability of class cj and P (wi|cj) being the conditional probability of word wi given class cj . In this case the underlying assumption is that each word in the document occurs independent of every other word in the document. This approach is yet quite simple and does not perform very well with small train- ing sets, as the relative frequency estimate of a word will be zero if it does not appear in any document in the training set [LJ98]. Li and Jain [LJ98] applied Laplace law of succession, which is explained by Ristad [Ris95] in "A Natural Law of Succession", to tackle this problem. Logistic Regression Logistic Regression is a model where the probability of certain outputs is computed based on the input which can be used to build a classifier. Hastie et al. [HTFF09] explain it as a model that was created for the purpose of modeling posterior probabilities of multiple classes by using linear functions under the consideration that the posterior probabilities sum to one and are in the interval [0, 1]. The logistic regression model has the following form [HTFF09]: log Pr(G = 1|X = x) Pr(G = K|X = x) = β10 + βT 1 x log Pr(G = 2|X = x) Pr(G = K|X = x) = β20 + βT 2 x ... log Pr(G = K − 1|X = x) Pr(G = K|X = x) = β(K−1)0 + βT K−1x (2.4) where Pr(G = k|X = x) and Pr(G = K|X = x) are defined as follows: Pr(G = k|X = x) = exp(βk0 + βT k x) 1 + K−1 i=1 exp(βi0 + βT i x) , k = 1, ..., K − 1 (2.5) Pr(G = K|X = x) = 1 1 + K−1 i=1 exp(βi0 + βT i x) . (2.6) 9 2. Background Logistic regression models are most of the time fit by maximum likelihood, utilizing the conditional likelihood of G given X [HTFF09]. The log-likelihood for N observations looks like the following [HTFF09]: l(θ) = N i=1 log pgi(xi; θ) (2.7) with pk(xi; θ) = Pr(G = k|X = xi; θ). K-Nearest Neighbors The k-nearest neighbors (KNN) algorithm is a supervised learning method that can be used for regression and classification. For this thesis the focus will only be on its use for classification. An input is classified based on a majority vote among the classes of its k nearest neighbors [HTFF09], where k is an integer. The nearest neighbors are found based on a distance metric, a common choice is the Euclidean distance, but there are many other options like cosine distance and Manhattan distance. In Hastie et al. [HTFF09] the k-nearest neighbor fit for Ŷ , with Ŷ being the prediction of the output, is defined as follows: Ŷ = 1 k xi∈Nk(x) yi (2.8) Linear Discriminant Analysis Linear Discriminant Analysis (LDA) is another ap- proach which can be used if the different classes or categories are known beforehand. The goal is to find a linear combination, which is based on the input variables, that separates two or more categories from each other. This explanation is taken from section 4.3 of Hastie et al. [HTFF09] where fk(x) is the class-conditional density of X in class G = k and πk is the prior probability of class k, with K k=1 πk = 1. Applying Bayes theorem leads to Pr(G = k|X = x) = fk(x)πk K i=1 fi(x)πi . (2.9) Linear Discriminant Analysis uses Gaussian densities as class densities. In this case the class densities are all modeled as multivariate Gaussian, the assumption is made that all classes have a common covariance matrix Σk = Σ ∀k: fk(x) = 1 (2π)p|Σk|e − 1 2 (x−µk)T Σ−1 k (x−µk) (2.10) 10 2.1. Machine Learning Support Vector Machine Support Vector Machines were developed by Cortes and Vapnik [CV95]. When used for classification it is an optimization algorithm for finding a hyperplane that separates data points based on the class they belong to. In the case of two dimensional space the hyperplane would be a line. In this para- graph, which is based on the explanation of Support Vector Machines by Berwick [Ber03] and the initial paper on the subject by Cortes and Vapnik [CV95], the focus will mainly be on the two dimensional case. The support vectors are the data points that are closest to the hyperplane, they are the data points influencing the optimal location of the hyperplane [Ber03]. The optimal hyperplane has the form wT x + b = 0 (2.11) where w is a weight vector, x is an input vector and b is a bias. The method of optimal hyperplanes by Vapnik [Vap82], which is described in the paper by Cortes and Vapnik [CV95], shows that the problem of constructing an optimal hyperplane is a quadratic programming problem. Decision Tree Tree-based methods can be applied to regression and classification tasks, in this case the focus lies on decision trees being applied to a classification task. In the context of classification, the features are distributed into different groups by applying a recursive algorithm, as stated by Strobl et al.[SMT09, p. 5], based on the different groups the class labels are added. The idea is to create a hierarchical structure which separates the different classes in the optimal way, based on decision rules. The following explanation is based on the description by Filzmoser [Fil20]. In the case of classification one needs the proportion of class k observations in node m, which is defined as follows [Fil20]: p̂mk = 1 Nm xi∈Rm I(yi = k) (2.12) where m is the current node, representing a region Rm with Nm observations. The observations in node m are classified as the majority class in node m. There are different measures of node impurity ((Qm(T )) including the misclassification error and the Gini index, which are used to determine how the features of a dataset are split. They are defined as follows [Fil20]: Misclassification Error: 1 Nm i∈Rm I(yi ̸= k(m)) = 1 − p̂mk(m) (2.13) 11 2. Background Gini index: k ̸=k ′ p̂mkp̂mk′ = K k=1 p̂mk(1 − p̂mk) (2.14) Random Forest Bagging or bootstrap aggregation is a technique that was developed by Leo Breiman [Bre96] which is designed to improve the performance of machine learning algorithms. It is the foundation for the random forest approach. This explanation is based on the chapter about random forests by Hastie et al. [HTFF09, p. 587]. The main idea is to average many models so that the variance can be reduced. In the case of classification a group of trees each propose a vote for a predicted class. The random forests approach, which was developed by Leo Breiman as well [Bre01], is a modification of bagging where a large collection of de-correlated trees is constructed and averaged, as described by Hastie et al. [HTFF09]. The prediction of a new point x is made based on a majority vote, which is defined as follows: ĈB rf (x) = majority vote {Ĉb(x)}B 1 (2.15) where Ĉb(x) is the class prediction of the bth random-forest tree. Neural Network The explanations in this section are based on the book "Neural Net- works and Deep Learning" by Aggarwal [A+18]. Neural networks are nonlinear statistical models which are based on biological principles. They are used to solve problems in different domains like image classification or weather forecasting and consist of neurons that are connected to other neurons. A neuron can have multiple inputs, each input has a weight that is associated to it. Let x1, x2, ..., xn be the inputs, the corresponding weights for the input variables are w1, w2, ..., wn, so the output can be defined as follows [A+18]: y = σ( i=n i=0 wixi) (2.16) where σ is the activation function. The classical activation functions according to Aggarwal [A+18] are listed below: • σ(x) = sign(x) - sign function • σ(x) = 1 1+e−v - sigmoid function • σ(x) = e2v−1 e2v+1 - tanh function 12 2.1. Machine Learning The neurons are then organised in some type of network, one version is the feed-forward neural network, which was first published by Ivakhnenko and Lapa [IL+65] in 1965. The feed-forward neural network is based on the idea of the perceptron, which was invented by Warren McCulloch and Walter Pitts [MP43] in 1943, and the idea of a layered network of perceptrons which was introduced by Frank Rosenblatt [Ros58] in 1958. This type of network consists of multiple layers, one of them being the input layer, one output layer and potentially several layers in between that are called hidden layers. From the input layer each input is sent to every neuron in the first hidden layer, the outputs are then again sent to the neurons in the next hidden layer. This process continues until the last layer is reached. One method that is often used to train a neural network is back-propagation which was proposed by Rumelhart et al. [RHW86]. The backpropagation algorithm consists of the following steps: 1. The pattern of the input is created and propagated forward through the network 2. The output of the network is compared to the actual expected output. The difference between the two values is considered as an error 3. The error is propagated back from the output layer to the input layer. The weights of the different connections between the neurons are changed based on their influence on the error. 2.1.2 Deep Learning Deep learning is a form of machine learning which can employ either supervised, semi- supervised or unsupervised techniques. LeCun et al. [LBH15] formulated it as a method that allows computational models, which are composed of multiple processing layers, to learn representations of data with multiple levels of abstraction. Deep learning approaches were able to improve the state-of-the-art in many different fields like speech recognition [HDY+12], image recognition [KSH12] and natural language understanding tasks [CWB+11]. Most of the modern deep learning models are based on multi-layered neural networks. 2.1.2.1 Transformers The transformer is a deep learning architecture which is based on the attention mechanism. It was proposed by Vaswani et al. [VSP+17] in their paper "Attention Is All You Need", this explanation is based on this paper. It formed the basis of current state-of-the-art models like BERT and GPT-4 and follows the encoder-decoder structure. The encoder is built out of 6 identical layers, where each layer consists of two sub- layers. The first sub-layer implements a multi-head self-attention mechanism and the 13 2. Background second sub-layer is a position wise fully connected feed-forward network. The decoder is built out of 6 identical layers as well, each of them consisting of three sub-layers, being the two sub-layers in the encoder layer and an additional layer which performs multi-head attention on the output of the encoder stack. [VSP+17] Figure 2.1, which is taken from Vaswani et al. [VSP+17], shows their proposed ar- chitecture for the transformer. The basic building blocks of the transformer are the attention blocks. The goal of an attention block is to add context related to all other tokens to each embedded token of an input sequence. This enables the network to learn which relationships between tokens are more relevant than others. Figure 2.1 shows the encoder on the left side and the decoder on the right side. When considering machine translation from English to German as an example, the encoder receives the input, in this case the English sentence, and the decoder receives the desired output, which would be the same sentence but in the German language in this case. Figure 2.2 which is also taken from Vaswani et al. [VSP+17] shows a multi-head attention block on the right side and a scaled dot-product attention block, which is a part of the multi-head attention block, on the left. The graph shows that the scaled dot-product attention block is utilized h times, each of the scaled dot-product attention blocks can fulfill a different task, based on the provided weight matrices. [VSP+17] The input of the scaled dot-product attention block subsists of queries (Q), keys (K) and values (V ). The queries and the keys are both (n × dk)-dimensional matrices and the values are a (n × dv)-dimensional matrix where n is the amount of elements in the input sequence (x1...xn). The attention is defined as follows: Attention(Q, K, V ) = softmax(QKT √ dk )V (2.17) Vaswani et al. [VSP+17] found it to be beneficial if they linearly project the queries, keys and values h times with different learned linear projections to dk or dv dimensions, depending on the case. They called this approach multi-head attention and defined it as follows: MultiHeadAttention(Q, K, V ) = concat(head1, ..., headn)W O where headi = Attention(QW Q i , KW K i , V W V i ) (2.18) with the following definitions: 14 2.1. Machine Learning Figure 2.1: Transformer architecture by Vaswani et al. [VSP+17] W Q i ∈ Rdmodel×dk W K i ∈ Rdmodel×dk W V i ∈ Rdmodel×dv W O ∈ Rhdv×dmodel 15 2. Background Figure 2.2: Scaled Dot-Product Attention and Multi-Head Attention by Vaswani et al. [VSP+17] 2.2 Natural Language Processing (NLP) 2.2.1 Tokenization Tokenization is the process of separating a text into smaller chunks. In this subsection the concept of tokenization will be explained based on the explanation by Jurafsky and Martin in their book "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" [JM]. The two most common variations are sentence tokenization and word tokenization. The goal of sentence tokenization is the separation of a text with multiple sentences into separate sentences, the goal of word tokenization is the separation of a text into single words. One might start with looking at a very simple example, consider the following text: This is a simple example text. It contains two sentences. Now, when sentence tokenization and word tokenization is applied one after the other, the result might look like the following: 1. This is a simple example text. It contains two sentences. 16 2.2. Natural Language Processing (NLP) 2. This is a simple example text . It contains two sentences . In the first step the two sentences are separated from each other, in the second step each word is an entity, and the dot at the end of each sentence as well. This might be different according to the implementation, this is just an example. To tokenize a text like this a few very simple rules are already sufficient, one only needs to be able to separate multiple sentences based on the ending dot and words based on white space. In addition the tokenizer has to be aware of the dot as a sentence ending token and therefore consider each dot as a separate token. This is a very trivial example, it can get way more complicated rather quickly. One might consider the following sentence: This sentence isn’t as simple anymore, the rules that were applied so far would not be sufficient in this case. The tokenizer needs to be able to consider additional special characters like comma and apostrophe and it needs to be able to deal with the "isn’t" construct. Here it needs to be decided how the tokenizer should handle it, theoretically there are a lot of options like the following: 1. isn ’ t 2. is n’t 3. isn’ t 4. is not There are a lot of further examples where a simple white space based separation is not sufficient, for example any sentence that contains the city "New York", here it might lead to bad search results if "York" is considered as a separate token, therefore it might be a good idea to consider "New York" as a single token. One approach for handling names of all kinds, for example person names, cities and companies but also years and dates in general, is named entity recognition. Named entity recognition has the goal to assign meaning to named entities in any kind of text. Different implementations exist, for example rule-based systems and ma- chine learning based systems. The problem of named entity recognition is not solved yet, 17 2. Background the best systems achieve a F-Score of around 93% [MP98]. The following example shows a simple case where the named entity recognition worked as expected: in: Frank moved to New York in 2006 out: [Frank]Person moved to [New York]City in [2006]Time In general one needs to consider that a lot of languages have a set of specific challenges. In Chinese for example each character has a meaning on its own, in combination with other characters the meaning can change. Chen et al. [CSQH17] used the following sentence as an example and showed, that it could be seen as consisting of three words, five words or even seven individual characters. In the following examples the language code "zh" is used for Chinese and the language code "en" is used for English. zh: 姚明进入总决赛 en: Yao Ming reaches the finals (Initial sentence) The first version splits the sentence into three parts: zh: 姚明 进入 总决赛 en: YaoMing reaches finals (Tokenized using Chinese Treebank) The second version splits the sentence into five different parts: zh: 姚 明 进入 总 决赛 en: Yao Ming reaches overall finals (Tokenized using "Peking University" segmen- tation) The last version splits the sentence into each of the individual characters: zh: 姚 明 进 入 总 决 赛 18 2.2. Natural Language Processing (NLP) en: Yao Ming enter enter overall decision game (Tokenized using character-wise segmentation) It turns out that the character-wise segmentation is sufficient for most NLP tasks regarding the Chinese language [LMS+19]. 2.2.2 Phonetic Algorithms Phonetic algorithms are used to encode a word into a representation that contains some information regarding its phonetics. One phonetic algorithm for the English language is Soundex [Rob18] which was developed by Russell and Odell and patented in 1918. It is a phonetic algorithm for indexing names by sound and it is used in many popular database systems like Oracle [Soua] and PostgreSQL [Soub]. The basic idea is that a word is turned into a representation consisting of one letter at the beginning, followed by three numbers, the conversion table can be seen in Table 2.1. The Soundex algorithm performs the following steps: 1. All letters are turned into upper case letters, punctuation marks are removed 2. The first letter of the word is kept and ignored in the following transformations, it is used in the end as a prefix. 3. Vowels (A, E, I, O, U), semivowels (W, Y) and the letter H are replaced with 0, as they should be ignored in the end 4. The remaining letters are transformed based on the rules in the conversion table which is shown in Table 2.1 5. All the occurrences of the same number next to each other, for example "11" are replaced with just one occurrence of the number (from "11" to "1" in this case). The number 0 is ignored, so it is still possible that there are two zeros next to each other 6. All the zeros are removed from the result 7. The final code should have a length of four, one letter at the beginning and three numbers following the letter. If there are not enough numbers, the remaining numbers are filled up with zeros. If there are more than three numbers all the numbers after the third number are removed An example is presented in the following list: 19 2. Background 1. Consider the surname Mueller-Luedenscheidt as an example 2. In step 1 the "-" is removed, all letters are turned into upper case: MUELLER- LUEDENSCHEIDT 3. In step 2 the first letter (M) is stored, it is not affected by the transformations in the next steps 4. In this case there are no semivowels the letter H and all the vowels are replaced with 0: M00LL0RL00D0NSC000DT 5. The remaining letters are replaced based on the conversion table (Table 2.1), the result is the following: M0044064003052200033 6. Now all occurrences of the same number next to each other (except zeros) are reduced to just one occurrence of the number: M0040640030520003 7. All zeros are removed: M4643523 8. In the last step the code is shortened to length 4: M464 Letter Code B, F, P, V 1 C, G, J, K, Q, S, X, Z 2 D, T 3 L 4 M, N 5 R 6 Table 2.1: Conversion Table from the Soundex algorithm by Russell and Odell [Rob18] For the German language, as well as for the English language, exist different kinds of phonetic algorithms, Wilz [Wil05] lists multiple different phonetic algorithms, some for the English language and some for the German language. Two of the phonetic algorithms for the German language are cologne phonetics (Kölner Phonetik) which was developed by Postel [Pos69] and Phonet, which has two different approaches which are both described in this article by Michael [Mic88]. The cologne phonetics algorithm transforms a word into a numeric representation. A central part of the algorithm is a transformation table, which is shown in Table 2.2. The following three steps are performed in the algorithm: 1. The letters are transformed into numbers based on the rules which are specified in the transformation table 20 2.2. Natural Language Processing (NLP) 2. For all the numbers that appear multiple times next to each other, all the appear- ances, besides the first one, are removed 3. All zeros, besides one being at the start, are removed To show a more concrete example, one can look at the transformation of the name Müller- Lüdenscheidt. One has to consider that this name counts as one word in this algorithm, as both names are connected by hyphen. This example is taken from Wikipedia [wike]: 1. Based on the transformation table, the letters are turned into the following repre- sentation: 60550750206880022 2. Now all the numbers that are shown multiple times next to each other are reduced to the number just showing once: 6050750206802 3. In the last step all zeros that are not the first number in the representation are removed, in this case all zeros are removed: 65752682 Letter Context Code A, E, I, J, O, U, Y 0 H - B 1 P if not in front of H 1 D, T if not in front of C, S or Z 2 F, V, W 3 P if in front of H 3 G, K, Q C if in the initial sound before A, H, K, L, O, Q, R, U or X 4 C if in front of A, H, K, O, Q, U or X, if not in front of S or Z X if not in front of C, K or Q 48 L 5 M, N 6 R 7 S, Z if after S or Z C if initial sound and not in front of A, H, K, L, O, Q, R, U, X if not in front of A, H, K, O, Q, U or X 8 D, T if in front of C, S or Z X if after C, K or Q Table 2.2: The transformation table of the cologne phonetics algorithm by Postel [Pos69] 21 2. Background 2.2.3 Part-of-speech Tagging A part-of-speech tagger or a POS-tagger is a tool which is used to add certain tags to words in a text based on grammatical rules and context or in other terms determine the part of speech. The goal is to analyze and understand the syntactic structure of a sentence by categorizing each word into its appropriate grammatical class. A few examples for parts of speech in the English language are noun, adjective, verb, pronoun, and adverb. There are different tag sets, a popular one for the English language is the Penn tag set, which was developed in the Penn Treebank project [MSM93, Pen]. For the German language the STTS [STT95, WSJB17] and the Tiger tag set [BDH+02] are popular. Figure 2.3 shows a simple example where POS-tagging has been applied to a German sentence, the STTS tag set has been used. The tags that have been used in the example are listed and described in Table 2.3, the STTS tag set contains 54 tags. STTS-Tag Part of Speech in German Example NE Eigenname (named entity) Hans, Berlin VVFIN Infinitiv voll (full infinitive) gehen, ankommen bestimmter/unbestimmter Artikel ART (definite/indefinite article) der, die, das, ein, eine attributives Adjektiv ADJA (attributive adjective) [den] langen [Roman] NN normale Nomina (common nouns) Tisch, Herr, Buch Table 2.3: POS-Tags from the STTS tagset [WSJB17] with explanation, related to the example The RFTagger is a POS-tagger which was created by Schmid and Laws [SL08]. It supports multiple languages like German, Hungarian and Russian. It is a Hidden-Markov-Model (HMM) tagger which computes the POS tag sequence which is the most likely one for a given word sequence. 22 2.2. Natural Language Processing (NLP) Figure 2.3: A simple POS-tagging example in German using the STTS tag set [WSJB17] 2.2.4 Sentiment Analysis Sentiment analysis is a text categorization task which aims at detecting the sentiment which is expressed in a given text. In the simplest case the goal is to have a distinction between positive and negative sentiment, some sentiment analysis tools provide a neutral sentiment as a third option. Common scenarios where sentiment analysis is used are product review analysis or movie review analysis. Wankhade et al. [WRK22] published a book called "A survey on sentiment analysis methods, applications, and challenges" in 2022, where they give an interesting overview of the field of sentiment analysis. Figure 2.4, which is taken from Wankhade et al. [WRK22] gives an overview of the different approaches, they separate them into lexicon based approaches, machine learning approaches, hybrid approaches and other approaches. The following explanations are based on the before mentioned book by Wankhade et al. [WRK22] Lexicon based approaches are based on the idea of a collection of tokens where each token has a certain score which indicates whether the token is neutral, positive or negative. This collection of tokens is a lexicon. There are different approaches for assigning a score to a token, one might for example assign the values [-1, 0, 1] for [negative, neutral, positive] tokens respectively or use a value range, for example the interval [-1,1]. Once a score has been assigned to each token of a text the different scores, according to their class, are summed up separately. After that the overall sentiment is assigned based on the highest score among the scores. The advantage of a lexicon based approach is that there is no training data required, a lexicon needs to be created though. A disadvantage is that it does not work very well across different domains as the vocabulary or even the 23 2. Background sentiment of certain words might be different. Machine learning approaches use different machine learning algorithms to assign a certain sentiment to a text. Supervised learning methods require training on a training set, common algorithms that are used in the domain of sentiment analysis are Naive Bayes, Support Vector Machine and Decision Tree. A hybrid approach is a combination of machine learning approaches and lexicon based approaches, there have been some successful implementations where a hybrid approach was able to perform better than the separated models, but according to Wankhade et al. [WRK22] it is still a promising research field where a lot of additional research should be conducted. Aspect based approaches aim at detecting different aspects of a text, so that the senti- ment regarding each aspect can be assigned separately. In the final step the different sentiments across the aspects are aggregated. These approaches are popular when an- alyzing reviews, for example product reviews or hotel reviews. For a simple example consider different aspects of a smartphone A = [camera, battery life, display, storage] and a review R ="the camera is amazing but the battery is draining fast". This reviews contains two different aspects, the camera and the battery life. The camera receives very positive feedback therefore the sentiment of the "camera" aspect is positive, the battery life receives negative feedback, therefore the "battery life" aspect has a negative sentiment. Transfer learning is a technique where a pre-trained model is used to train a new model. This approach is popular as it turned out that features that have been trained for one domain are often useful for other domains as well. One model that is often used for transfer learning is BERT, which has been developed by Devlin et al. [DCLT18] at Google. 2.2.5 BERT BERT (Bidirectional Encoder Representations from Transformers) is a language model that was developed by Devlin et al. [DCLT18] The training of BERT consists of two steps, pre-training and fine-tuning. Both steps are shown in Figure 2.5, which is taken from Devlin et al. [DCLT18] The architecture of the two steps is identical besides the output layer. It is a multi-layer bidirectional Transformer encoder based on the original Transformer that was proposed by Vaswani et al. [VSP+17] A good example for the difference between a unidirectional and a bidirectional approach is shown in the BERT Github repository by Devlin [ber], where the sentence "I made a bank deposit" is discussed, with the focus being on the word "bank". In the case of a unidirectional approach the model would either pay attention to the context on the left side ("I made a") or the context on the right side ("deposit"). Using the 24 2.2. Natural Language Processing (NLP) Figure 2.4: Different approaches for sentiment analysis, taken from Wankhade et al. [WRK22] bidirectional approach both sides are considered, the word "bank" is therefore represented by the context "I made a ... deposit" [ber]. BERT was pre-trained on two unsupervised tasks, namely "Masked LM" and "Next Sentence Prediction". [DCLT18]. During the training for the "Masked LM" task some of the input tokens were randomly replaced with either a predefined token (80%), a random token (10%) or the unchanged token (10%). The model is trained with the tar- get of predicting the tokens which have been replaced just based on the context [DCLT18]. The "Next Sentence Prediction" task is based on the connection of two sentences. Devlin et al. [DCLT18] model this connection with a simple data representation that contains three fields. Two of the fields are the sentences, the third field is the label that states whether the second sentence is the next sentence after the first sentence or not. A simple example for this is shown in the listing below: • Case 1: – Sentence A: I was going for a walk with my dog. – Sentence B: It saw a bird and started running after it. – label: IsNext • Case 2: – Sentence A: I was going for a walk with my dog. – Sentence B: Vienna is the capital of Austria 25 2. Background – label: NotNext While the pre-training of BERT is expensive [ber] the fine-tuning is relatively inexpensive. The input can be either single text or pairs of text, as shown above. This property allows one to model many different tasks using the same pre-trained model. BERT was able to outperform state-of-the-art models on eleven tasks at the time it was released, in October 2019 a blog post by Google was published which stated that BERT was incorporated into Google Search. [blo] Figure 2.5: Overall pre-training and fine-tuning procedures for BERT by Devlin et al. [DCLT18] 2.3 Related Work This section aims at giving a brief overview about a selection of related work. 2.3.1 Alliteration and hyperbole in marketing and politics Alliteration and Hyperbole are two figures of speech that find use in marketing. Davis et al. [DBB16] show that alliterations might have a positive effect on the chance of a product being purchased. Fox et al. [FND19] measured the impact of an ad by tracking the eye movement of participants. Their results demonstrate the positive impact of the use of alliteration on consumer attention. These references are included to demonstrate that alliteration is researched in marketing and that there might be an impact on consumer behavior. Stuckey [Stu17] cited Walter Dean Burnham, saying that critical elections are char- acterized by abnormally high intensity. This turns a hyperbole into a natural tactic according to Burnham’s opinion. A hyperbole intentionally overstates the case. Even if the audience is not fully captured by it, it might be moved a little bit more by a hyperbolic statement in comparison to a more balanced and rational statement. They 26 2.3. Related Work describe hyperbolic arguments as excessive and overwhelming and claim that audiences persuaded by a hyperbole are carried along and do not inch carefully towards a conclusion. Swartz [Swa76] conducted research regarding hyperbole in politics. He analyzed two Bena baraza cases, hyperbole was used in both. He concluded that the hyperbole did not work in the favor of its users. The examples used are very hyperbolic, one of two brothers stated that he doesn’t know the man who accuses him of owing money, while the man actually was his brother. This was interpreted as the first man not knowing his brother anymore, as he is not behaving like a brother should, in his opinion. In the second case the damage done to someones property was highly over-exaggerated. Considering this, Swartz [Swa76] made some conclusions about the use of hyperbole in politics, which will be stated in this paragraph. He suggested that the rare appearance of hyperbole is tied to it being best-suited for situations in which there are several aspects of reality bearing on the same value, yet affecting differently that value’s implications in the situation. What this effectively implies, is that hyperbole is used in situations where its user feels the need to structure reality such that certain aspects overshadow others. In addition, it is argued that speakers might almost always feel that they must structure reality for their listeners, if they want to gain their support, and that exaggeration is not the only way to alter a statement for that purpose. The discussion continues with the statement that individuals might exaggerate most in the areas of their greatest weaknesses, speaking of their greatest areas of weaknesses according to their own perception. Furthermore it is suggested that a hyperbole is more likely to occur if a speaker believes that the audience has certain ethical values, which might lead to the speaker trying to appeal to these ethical values using over exaggeration. 2.3.2 Hyperbole detection using Natural Language Processing Troiano et al. [TS18] worked on the computational exploration of exaggeration. They built a corpus called HYPO, containing overstatements collected on the web, validated via crowd-sourcing. The chosen approach is classification, the set of used algorithms includes Logistic Regression, Naive Bayes, k-Nearest Neighbors, Decision Trees, Support Vector Machine and Linear Discriminant Analysis. Their experimental results lead to the conclusion, that automatic hyperbole detection could be successfully executed based on semantic features. The semantic features that they extracted are imageability, subjectivity, unexpectedness, polarity and emotional intensity. Tian et al. [TkSP21] published a model called HypoGen for generating hyperboles at the clause or sentence level. The program expects an input prompt A and amis at generating a subject B and a predicate C. The example which is listed in their work is the following: Consider "the party is lit" as input prompt A, they would want to generate a sentence like "the party is so lit that even the wardrobe is dancing" where "the wardrobe" would be the subject B and "is dancing" would be the predicate C. 27 2. Background Zhang et al. [ZW21] propose an unsupervised approach for hyperbole generation, which uses a fine-tuned version of BART [LLG+19] and a BERT-based ranker for selecting the best candidate. They created an open source dataset containing hyperboles in the English language called HYPO-L, this dataset was used in this thesis as well. Vorakitphan et al. [Vor21] published a tool called PROTECT (PROpaganda Text dEteCTion) which focuses on propaganda technique classification. They used 14 differ- ent propaganda techniques like flag-waving, whataboutism and exaggeration. First they are classifying the text on the token level, where each token is either propagandist or not. After that the propagandist tokens are classified according to the 14 propaganda categories. Schneidermann et al. [SHP23] performed edge and minimal description length probing experiments on pre-trained language models in the context of hyperbolic information being encoded in the pre-trained language models. Their results showed, that hyperbolic information is encoded in pre-trained language models to a limited extent. 2.3.3 Opinion Type Classification Othman et al. [OHMI15] classified sentences based on whether they are opinionated statements, comparative opinionated statements, superlative opinionated statements or non-opinionated statements. The main method used for this task is Part-of-Speech (POS) Tagging using the Stanford Part-Of-Speech Tagger. The tag set used is the Penn Treebank P.O.S Tags tag set. They worked with text written in the English language. Yu et al. [YKD08] explored the characteristics of political opinion expression and found that recognizing the sentiment alone is not sufficient for political opinion classification. They compared the average sentiment level of Congressional debates to the ones found in neutral news articles and movie reviews, where movie reviews showed higher sentiment levels and neutral news articles showed lower sentiment levels on average. Furthermore they analyzed how sentiment is expressed in different kinds of texts and found out, that the choice of topics plays a very important role in the case of political opinion expression, which is reflected in nouns, while the sentiment expression in movie reviews is more adjective-centered. They compared the results of manual sentiment annotation and found out, that a significant number of political opinions are expressed in a neutral tone. Afzaal et al. [AUFF19] proposed a model called "enhanced multiaspect-based opin- ion classification" which extracts explicit and implicit aspects from tourist reviews and classifies the multiaspect opinions into different polarity classes. The model consists of three different methods, namely a probabilistic co-occurrence-based method, a hierarchy- based implicit aspect extraction method and a multiaspect opinion classification method. The probabilistic co-occurrence-based method captures the similarities between aspects using the co-occurrence of sentiment words with aspects, after that the similar aspects 28 2.3. Related Work are merged into a group. The hierarchy-based implicit aspect extraction method utilizes grammatical relationships between opinion words and aspects to create an aspect-specific lexicon, after that a hierarchy for extracting the implicit existence of aspects in reviews is built. An example can be seen in Figure 2.6. The multiaspect opinion classification approach is used to generate features from reviews. Once these three steps are completed some multilabel classification algorithms are applied to classify the multiaspect opinions into different polarity classes. Figure 2.6: Example of an aspect-sentiment hierarchy as described by Afzaal et al. [AUFF19] Liu [L+11] published a book on web data mining which includes a chapter on opinion mining. In this chapter they defined the three basic components of an opinion as follows: • Opinion holder: The person who communicates an opinion on something • Entity: The entity on which the opinion is expressed by the opinion holder • Opinion: The view, sentiment or appraisal which is communicated by the opinion holder on the object This chapter gave an overview over the algorithms, approaches and models that were used in the experiments that were conducted in the scope of this work. Furthermore it 29 2. Background presented a selection of related work. The following chapters will focus on describing the experiments and the results that were achieved. 30 CHAPTER 3 Design 3.1 Research Questions The first section of this chapter focuses on describing the requirements that are derived from the research questions. The requirements need to be fulfilled so that the research question can be answered. RQ1: How high is the precision of a rule-based approach for topic classification in political statements using corpora extracted from Wikipedia? To successfully answer this research question a method for computing the precision needs to be defined. The statements which are retrieved using the implemented method are manually evaluated as either being relevant to the topic or not. Based on this binary classification the precision can be calculated. RQ2: How does an approach using a German tag set and a tagger for the German language, instead of an English tag set and a tagger for the English language, for opinion type classification in German sentences, hold up against the approach developed by Othman et al. regarding opinion type classification in English sentences? This question will be evaluated using precision This question is answered by comparing the achieved precision to the precision that was achieved in the experiments by Othman et al. [OHMI15]. The results of the experiments are manually evaluated, the precision is calculated based on the results of the manual evaluation. RQ3: The Cologne phonetics algorithm can be used to transform a word into a numerical representation depicting the underlying phonetics. Which additional constraints need to be added so that the numerical representation resulting from the Cologne phonetics algorithm is feasible for alliteration detection with 95 percent precision? 31 3. Design The third question is answered by computing the precision on a dataset containing 605 alliterations, which was downloaded from the website of Ulrich Mehner [meh]. The results can be automatically evaluated, as it is known that all the entries are alliterations. The precision is calculated based on the amount of detected alliterations. RQ4: Troiano et al. developed an approach for classifying English sentences as either hyperbolic or not. Is this approach applicable to the German language as well? This question will be evaluated using accuracy, precision, recall and the F1-Score. This question is evaluated using a dataset containing hyperboles, therefore it can be evaluated automatically. The HYPO-L dataset by Zhang et al. [ZW21] will be used as a basis, as it is an openly available. The hyperboles need to be translated as they are currently only available in the English language. It contains around 3200 sentences where around 1000 are hyperboles, 200 hyperboles and 200 literal sentences are manually translated, in addition all the sentences are machine-translated using Google Translate. The results using the 400 manually translated sentences are compared to the results using the same 400 sentences in their machine-translated version and to the results which are achieved on the full machine-translated dataset. The accuracy, precision, recall and F1-Score can be computed as we know the class of each sentence. 3.2 Methodological Approach This section describes the methodological approach that will be followed in this thesis. The before mentioned research questions are answered by applying one of two different approaches. RQ2 and RQ4 will be answered by implementing an approach that has been developed for the English language so that it can be applied to the German language. RQ1 and RQ3 will be answered by applying evolutionary prototyping, an algorithm will be outlined, implemented, tested and improved until it can be used to answer the research question. The following listing contains the methodology that is followed throughout this work: 1. Literature Research: A literature research will be conducted which includes the following keywords among others: natural language processing, opinion mining, sentiment analysis, opinion type classification, topic classification, hyperbole detection, hyperbole, alliteration, alliteration detection. The master thesis by Stefan Zaruba [Zar21] will be the first entry point. 2. Creating datasets and working with existing datasets: Many existing datasets will be used during this work, if not available new datasets will be created. The first dataset that will be created is a dataset containing 32 3.3. Experiment Design political statements extracted from the Austrian national council protocols which are available on the website [aus]. Besides that a partly translated version of the HYPO- L datasets, which is a dataset containing English hyperboles, will be created. 400 sentences are manually translated and every sentence of the dataset (around 3200) will be machine-translated as well. Among the existing datasets that will be used are the DeWaC German corpus by Baroni et al. [BBFZ09, deWa, deWb, FE13], the German SentiWords dataset by Gatti et al. [RQH10, GGT15] and the alliteration dataset which is published on the website of Ulrich Mehner [meh]. 3. Using Natural Language Processing methods and Machine Learning models: Various different natural language processing methods will be used to tackle the problem of hyperbole detection, alliteration detection, opinion type classification and topic classification in the context of political statements. In addition pre-trained machine learning models will be used if available and new models will be trained if required. 4. Implementation: The code for the experiments will be written in the Python 3 programming language [VRD09], Jupyter Notebooks [KRKP+16] will be used for the experiments and the evaluation. Libraries and existing frameworks will be used if available. The implementation will be containerized using Docker [Mer14], so that it is easily possible to reproduce the results. 5. Evaluation: The evaluation of the experiments will be done automatically if possible and manually if required. The automatic evaluation will be done using existing Python libraries, common metrics like precision, recall, accuracy and F1- score will be used if possible. The manual evaluation will be done using doccano [NKK+18], which is an open source data labeling tool developed by Nakayama et al. and freely available on Github. 6. Data Analysis and Data Visualization: The results of the experiments will be analyzed so that conclusions can be drawn and potential improvements on the implemented approaches can be derived. Insights gathered from the data analysis will be visualized using libraries for drawing plots, for example Matplotlib [Hun07]. 3.3 Experiment Design This section is used to describe the design of the experiments that were conducted to answer the research questions. To answer the first research question it was necessary to find topics which might be discussed in the Austrian national council. Three topics were chosen, namely climate change, feminism and the European migrant crisis. The implementation of the topic classification approach could then be used to search for statements regarding these three 33 3. Design topics. The retrieved statements were then manually evaluated as being either relevant or irrelevant to the topic. Based on these two classes it was possible to compute the precision of the approach. The data analysis was conducted based on the topic-related terms that were extracted from the Wikipedia article, so that it is possible to see whether the retrieved terms are generally relevant to the topic or not. The experiments for the second question were focused on rebuilding the approach by Othman et al. [OHMI15], they used an approach based on part-of-speech tagging. First it was necessary to research part-of-speech tagging for the German language, find an appropriate tag set and a tagger that supports it. After that the tags from the tag set were matched to the tags used by Othman et al. [OHMI15] and the political statements were tagged. The statements are then put into four different categories based on the part-of-speech tags defined earlier, three of them being opinion types (opinion, comparative opinion, superlative opinion) and a fourth category for statements that are not opinionated. A small subset of the statements that were classified as containing one of the three opinion types were manually evaluated based on whether they contain the opinion type or not. The binary results were used to calculate the precision. In addition two different approaches were implemented, one being rule-based and one being based on a dictionary. The dictionary contains positive, comparative and superlative forms of around 13.000 adjectives in the German language, it was downloaded from Wiktionary [Wikf]. The rules that were implemented are grammatical rules for comparative and superlative forms. The results of the two additional approaches were evaluated in the same way as the main approach, a subset was manually evaluated. The third research question was answered by implementing an algorithm that is able to detect alliterations in the alliteration dataset by Ulrich Mehner [meh]. Different settings and approaches were tried until the algorithm was able to detect more than 95% of the alliterations in the dataset. The precison was computed automatically in this case as it was given that the dataset contains alliterations only. In addition to that an experiment was conducted on the political statements. A subset of the retrieved alliterations was then manually evaluated to check how well the current algorithm would perform on any free text. These results were used to determine problematic cases that can show up in free text and derive potential improvements to the algorithm. The first step to answering the fourth research question was the computation of the semantic features which are used by Troiano et al. [TSÖT18] in their approach for hyperbole detection. As they used specific datasets and libraries for the English language it was necessary to research how the same semantic features can be computed for the German language. Once the semantic features could be computed they were used in supervised machine learning experiments using the same algorithms as Troiano et al. [TSÖT18] and the Random Forest algorithm in addition. It was a classification task with two classes, literal and hyperbolic. Three different dataset were used in the experiments, they are all based on the English HYPO-L dataset by Zhang et al. [ZW21], where one of 34 3.4. Datasets them is a full machine-translated version of the dataset, one a manually translated subset containing 400 sentences (200 for each of the two classes) and the same subset using the machine-translated version so that the potential impact of the machine translation could be seen as well. The machine-translated dataset was created using Google Translate. The experiments were conducted using 10-fold cross validation. As the dataset already contained the classes it was possible to evaluate the experiments automatically. Precision, recall, accuracy and the F1-score were computed and compared to the results achieved by Troiano et al. [TSÖT18] in their experiments. In addition an experiment on statements from a single protocol of the Austrian national council was conducted. The statements which were classified as being hyperbolic were manually evaluated, so that the effect on free text could be evaluated as well using precision. 3.4 Datasets This section describes the datasets that were created in the scope of this master thesis. The most essential dataset for this thesis contains around 65.000 political statements that were extracted from the protocols of the Austrian national council, which are available in the HTML format on the website [aus]. Next to the statements the dataset contains the speaker and the party of the speaker. In some cases a speaker has been written in different ways, therefore a dataset containing every form of every speaker was created, these forms were then all mapped to one unique form which was then used to map all speakers to their unique form. The party was added to each unique speaker as well. This dataset is essential for this thesis as it provides the foundation for working on figure of speech detection, topic classification and opinion type classification in the context of political speech, which is the overall topic of this thesis. Some analysis was necessary to extract the data in a clean manner as the HTML follows a unique schema. The columns of the political statement dataset are the following: • speaker: The speaker of the political statement • speech: The political statement • file: The protocol name in the format NRSITZ_{number}_PARSED, for example NRSITZ_00087_PARSED for the parsed version of the 87. sitting. • party: The party that the speaker belongs to • unique_speaker: The unique speaker, this column was added for the cases where the same politician was mentioned in slightly different ways The following listing presents one example of the political statements dataset: 35 3. Design • speaker: Abgeordneter Michael Bernhard • speech: Natürlich können wir sagen, zwischen den Achtziger- und den 2020er- Jahren ist sehr viel passiert, aber wir dürfen keinen Moment darauf stolz sein, denn das ist in Wahrheit viel zu wenig für die vielen Jahrzehnte, die inzwischen verflossen sind. • file: NRSITZ_00087_PARSED • party: NEOS • unique_speaker: Abgeordneter Michael Bernhard As a result of the experiments of the first research question three subsets of the political statement dataset were created that contain the political statements that were retrieved based on the topic classification approach and a classification for each of the statements as either being relevant or irrelevant regarding one of the three topics. The topics are climate change, feminism and the European migrant crisis. To prepare the political statements dataset for the experiments regarding the second research question, the RFTagger by Schmid and Laws [SL08] was used to annotate all the statements with part-of-speech tags. The political statement dataset was extended by one additional column called "tagged", this column contains the political statement in its POS-tagged form. In the case of the example that was listed above, the "tagged" column would look similar to the item presented below, the only difference being that each tag is preceded by a \t and no white space in between (for example Natürlich\tADV). In this listing a color and white space was added and the \t was removed to enhance readability: • tagged: Natürlich ADV können VFIN.Mod.1.Pl.Pres.Ind wir PRO.Pers.Subst.1.Nom.Pl.* sagen VINF.Full.- , SYM.Pun.Comma zwischen APPR.Zwischen den ART.Def.Dat.Pl.Masc Achtziger- TRUNC.Noun und CONJ.Coord.- den ART.Def.Dat.Pl.Fem 2020er-Jahren N.Reg.Dat.Pl.Fem ist VFIN.Sein.3.Sg.Pres.Ind sehr ADV viel ADV passiert VPP.Full.Psp , SYM.Pun.Comma aber CONJ.Coord.Aber wir PRO.Pers.Subst.1.Nom.Pl.* dürfen VFIN.Mod.1.Pl.Pres.Ind keinen PRO.Indef.Attr.-.Acc.Sg.Masc Moment N.Reg.Acc.Sg.Masc darauf PROADV.Dem stolz ADJD.Pos sein VINF.Sein.- , SYM.Pun.Comma denn CONJ.Coord.Denn das PRO.Dem.Subst.-.Nom.Sg.Neut ist VFIN.Sein.3.Sg.Pres.Ind in APPR.In Wahrheit N.Reg.Dat.Sg.Fem viel ADV zu PART.Deg wenig PRO.Indef.Subst.- .*.*.* für APPR.Acc die ART.Def.Acc.Pl.Neut vielen PRO.Indef.Attr.- .Acc.Pl.Neut Jahrzehnte N.Reg.Acc.Pl.Neut , SYM.Pun.Comma die PRO.Rel.Subst.-.Nom.Pl.Neut inzwischen ADV verflossen ADJD.Pos sind VFIN.Sein.3.Pl.Pres.Ind . SYM.Pun.Sent 36 3.4. Datasets An additional dataset was created which contains 13.000 German adjectives in their positive, comparative and superlative forms if available. The data was extracted from Wiktionary [Wikf]. This dataset was used for an additional approach that was imple- mented in the scope of research question 2. It is a simple data-driven approach where the same opinion types (opinionated, comparative opinionated, superlative opinionated) as in the approach by Othman et al. [OHMI15] are computed based on the occurrence of an adjective in one of the three forms available in the dataset. If there was no comparative or superlative form available it was replaced by "-", two example entries of the dataset can be seen in Table 3.1: positiv komparativ superlativ hoch höher am höchsten portofrei - - Table 3.1: Two examples from the German adjectives dataset As a result of the experiments for the second research question, there exist subsets of the statements that were annotated by the algorithm as containing one of the three opinion types and a binary manual evaluation. For the fourth research question a machine-translated German version of the HYPO-L dataset by Zhang et al. [ZW21] was created using Google Translate. In addition 400 sentences (200 for each class), were manually translated, to increase the quality of the translation. The label 0 stands for the class "literal" and the label 1 is used for the class "hyperbolic". One problem is that the machine translation was lacking in quality in some cases, as the first example will show. A second problem is that the original dataset contains a lot of idioms which cannot be translated word by word, which also is a problem when using machine translation. There is still a need for a large manually created or translated German hyperbole dataset, the HYPO-L dataset by Zhang et al. [ZW21] could still be very useful as a basis. The list below will show four examples from the dataset: • Example 1: – label: 0 – english: We were pelted with rotten tomatoes – manually translated: Wir wurden mit faulen Tomaten beworfen – machine translated: Wir waren mit faulen Tomaten ausgestoßen • Example 2: – label: 0 – english: She bit the thread in two 37 3. Design – manually translated: Sie hat den Faden durchgebissen – machine translated: Sie biss den Faden in zwei Teile • Example 3: – label: 1 – english: He was boiling over with indignation – manually translated: Er kochte über vor Empörung – machine translated: Er kochte mit Empörung • Example 4: – label: 1 – english: Her steadfast belief never left her for one moment – manually translated: Ihr festgefahrener Glaube verließ sie nicht mal für einen Moment – machine translated: Ihr unerschütterlicher Glaube hat sie nie für einen Moment verlassen 38 CHAPTER 4 Experiments 4.1 Extracting technical terms from Wikipedia The idea, which led to the series of experiments described in this section, is the extraction of technical terms from a Wikipedia article about a certain topic. The technical terms are then used to search for political statements concerning a certain topic. The political statements are extracted from the protocols of the Austrian national council, which are available to the public. They are then put into an inverted index [ZM06] so that it is possible to search for statements containing a certain term. The pseudo code below (algorithm 4.1) shows the algorithm that was implemented for creating the inverted index. Each document has two properties, id and text. An empty key/value store R is created, where each word of each document’s text is inserted as a key. Dictionaries are the key/value stores implemented in Python. The value which is mapped to each word is an empty list. If a word is part of a document’s text, the id of the document is added to the list. This allows for a basic full text search, as one gets the id’s of all the documents containing a certain word, if the word is used as the input key to the dictionary. In this approach, the terms are extracted from the Wikipedia article based on a large frequency list extracted from the deWaC German corpus [FE13, deWa]. The frequency list from the source [deWb] was pre-processed and a few obvious outliers were excluded, for example words that contained only one letter, like aaaaa, or words that contained special characters, which lead to a frequency list containing 1.121.589 words with their respective amounts of appearance in the deWaC German corpus [FE13, deWa]. Based on these amounts and the sum of all the amounts a percentage of appearance was calculated. 39 4. Experiments Algorithm 4.1: Creation of a simple inverted index R Input: Set of documents D with size(D) > 0 1 i ← 0; 2 R ← {}; 3 while i < size(D) do 4 t ← tokenize(D[i].text) List of words in the text of document D[i] ; 5 k ← 0; 6 while k < size(t) do 7 w ← t[k]; 8 if w not in R then 9 R[w] ← [] Empty list; 10 end 11 R[w] = R[w] + (D[i].id) Append tuple with document id of D[i] ; 12 k ← k + 1; 13 end 14 k ← 0; 15 i ← i + 1; 16 end The percentage of appearance is then used to specify whether a word is appearing too frequently. The value specified for these experiments is the mean of all the calculated percentages. So if the value of a specific word is below the mean, it is considered as a technical term. Based on this approach the technical terms are extracted from the Wikipedia arti- cle about a specific topic. The extracted technical terms are then used to search for relevant speeches. 4.1.1 Process Description The following paragraphs are going to explain the whole process with the help of an activity diagram [JR99]Figure 4.1. In the first step, which is called "Provide a topic as input", any topic can be entered as input to the program. The topic needs to be the exact name of an article on Wikipedia. If the topic exists on Wikipedia, the whole text from the Wikipedia article is downloaded using the Wikipedia API [wikd]. In case there is no article about the topic, having exactly the name provided as input, an error is displayed. Before the technical terms are extracted from the article, one is able to define ad- 40 4.1. Extracting technical terms from Wikipedia ditional terms that should definitely be considered, even if they would not be considered according to the frequency list. It is written in a way so that every other word starting with a word from the custom terms is also considered. One example for this is the word "Klima" which would definitely not be considered per default as it is used too frequently. If this word is defined as a custom term, not only the word "Klima" is preserved but words like "Klimawandel" and "Klimakrise" as well, even if they would not be considered according to the frequency list. This can be of great help in cases where a lot of technical terms that are relevant to a topic would be ignored because they are used too frequently. After that optional step the technical terms are extracted from the article. The words are put into a set, so that one is left with all the unique words of the article, and afterwards every word in the set is then used as input to the frequency list. If the frequency of the usage of a word is below the overall mean frequency value, so at least slightly below average, the word is considered as a technical term. All of the words that were selected above are then used as input on the previously defined inverted index, so that all speeches containing a certain word are then retrieved. In the best case one would only find speeches that are relevant to the topic that was defined in the first step, but this is not realistic. The evaluation section will describe the results of the experiments in detail. 41 4. Experiments Provide a topic as input Exists on Wikipedia? Display error No Download text from Wikipedia page Yes Consider custom terms? Define additional terms that should be considered Extract technical terms based on frequency list No Yes Extract technical terms based on frequency list + custom terms Use terms as input to search texts in inverted index Texts found? No Return texts that contain at least one of the terms Yes Figure 4.1: Activity diagram visualizing the process that was implemented to extract the technical terms from Wikipedia 4.1.2 Dataset Description The dataset which was used to create the frequency list is the deWaC German corpus introduced by Baroni et al. in 2009 [BBFZ09, FE13, deWa]. In this case deWaC stands 42 4.1. Extracting technical terms from Wikipedia for German web corpus. It is made up of texts which were extracted from the internet. The corpus contains more than 1.34 billion words and was created following the standards which were defined by Kilgarriff et al. in this work [KRPA10]. Some additional datasets were derived from this corpus, one being a frequency list. This frequency list [deWb] was the basis for the frequency list that was used to extract the technical terms. Only a few pre-processing steps were performed, so that a few obvious outliers were excluded from the frequency list. These outliers contain words that consist of one letter only, like "aaaaa", words containing special characters, like "ab&" and words containing more than 42 characters. After all the pre-processing steps the frequency list contains 1.121.589 words with their respective amounts of appearance in the deWaC German corpus. The paper by Baroni et al. [BBFZ09] introduces two additional corpora, namely ukWaC and itWaC, which stand for English web corpus and Italian web corpus respectively. The first two letters of all three corpora describe the domain. The text that were used for the deWaC were extracted from websites under the .de domain, texts for the ukWaC were extracted from websites under the .uk domain and texts for the itWaC were extracted from websites under the .it domain. 43 4. Experiments 4.2 Opinion Type Classification This part of the thesis focuses on classifying the type of opinion communicated in a certain text. In this experiment the approach by Othman et al. [OHMI15] is followed. In their example, they focused on the English language and used the Stanford POS-Tagger [TM00, TKMS03, KT04]. A POS-Tagger, or Part-Of-Speech-Tagger, is a tool that is used to add certain tags to a part-of-speech in a text based on the definition and the context. A part-of-speech describes certain classes of words where each word in that class has similar grammatical properties. Common parts-of-speech are adjectives, adverbs and nouns. The approach by Othman et al. uses these tags to classify texts as either non-opinionated, opinionated, comparative opinionated or superlative opinionated. Table 4.1 shows the POS-Tags of the Stanford POS-Tagger which are used in this approach to mark a text as one of the four classes. Table 4.2 contains a description of each tag. The Stanford POS-Tagger uses the Penn Treebank Tag Set [MSM93, Pen] for the English language [TKMS03, KT04] Sentimental Category POS-Tags (Stanford) Non-Opinionated - Opinionated JJ Comparative Opinionated JJR, RBR Superlative Opinionated JJS, RBS Table 4.1: The four different sentimental categories defined by Othman et al. [OHMI15] The goal of this experiment is to use the approach by Othman et al. but for the German language instead of the English language. According to Schmid and Laws [SL08], a more fine-grained tag set is often considered more appropriate for a language with a rich morphology, like German. POS-Tag (Stanford) Description JJ Adjective JJR Adjective Comparative RBR Adverb Comparative JJS Adjective Superlative RBS Adverb Superlative Table 4.2: Used POS-Tags by Othman et al. [OHMI15] with description 44 4.2. Opinion Type Classification For this experiment the RFTagger by Schmid and Laws [SL08] is used. It uses a tag set, which is similar to the STTS [STT95, WSJB17] or Tiger tag set [BDH+02]. Table 4.3 shows the tags which are used in this experiment and Table 4.4 shows the description of the tags. Sentimental Category POS-Tags Non-Opinionated - Opinionated ADJA, ADJD Comparative Opinionated ADJA.Comp, ADJD.Comp Superlative Opinionated ADJA.Sup, ADJD.Sup Table 4.3: Used POS-Tags in this experiment There are different sub-forms of the tags mentioned in Table 4.4. In this experiment, they are all treated as the form depicted in the table. The experiment will show, however, that this method does not produce the results we expect. Instead, a few adjustments have to be applied. This leads to two additional approaches for finding comparative and superlative forms. POS-Tag Description ADJA Attributives Adjektiv ADJD Adverbiales or prädikatives Adjektiv ADJA.Comp Comparative form of ADJA ADJD.Comp Comparative form of ADJD ADJA.Sup Superlative form of ADJA ADJD.Sup Superlative form of ADJD Table 4.4: Description of the used POS-Tags in this experiment The first additional approach is based on grammatical rules and regular expressions. In this method, the idea is to formulate the grammatical rules as regular expressions so that the regular expressions can be used to search for these patterns in the text. Table 4.5 shows the different rules which are implemented and used in the experiments. The second additional approach is a data-driven approach. For this method, a dataset was created by processing data from Wiktionary [Wikf]. It contains around 13.000 adjectives, 45 4. Experiments along with their comparative and superlative forms, if available. The sets of comparative and superlative forms are then used as corpora; if one of the words appears in a text, the text is marked as a text containing a comparative or superlative opinion, respectively. The major problem with the original POS-Tagging approach is a set of words which are superlative forms, but do not - usually - express an opinion. One example that showed up particularly often in this case is the word ’nächsten’. It is true, that this word is a superlative form, the three forms are ’nah’, ’näher’, ’am nächsten’. In our case it is misleading, as we are searching for opinionated statements, but the word ’nächsten’ is used to refer to the next sitting, one example is ’In der nächsten Sitzung’. To solve this problem, "trivial" superlatives to be ignored were gathered in a set. The dataset used for these experiments is a dataset that had been extracted from the Austrian parliament protocols (national council). It contains 63909 statements. Rule Example Form So ... wie so groß wie comparative Nicht so ... wie nicht so groß wie comparative Immer ...er immer größer comparative ...er als größer als comparative Je ... desto Je größer desto comparative Je ... umso Je größer umso comparative am ...sten am stärksten superlative am ...ßten am größten superlative Table 4.5: Rules which are used in the rule-based approach 4.3 Alliteration Detection This part of the thesis focuses on the detection of alliterations in texts. An alliteration is a figure of speech that describes the situation in which the first letter of multiple words that follow each other have the same sound. The words do not have to follow each other directly, steps in between them are allowed. Two examples for an alliteration in the German language are "Der frühe Vogel fängt den Wurm" and "Flora und Fauna". The first example has three words with the same initial sound right after each other, being "frühe", "Vogel" and "fängt". This example shows that it is not enough to just focus on the first letter of a word, as different letters still can have the same sound in the German language. The second example shows another valid example, where the first letters of the words "Flora" and "Fauna" have the same sound and the first letter of the word in between ("und") does not have the same sound. So if one wants to write an algorithm for detecting 46 4.3. Alliteration Detection alliterations, one might want to include a way to detect cases where different letters produce the same sound (as in the first example) and where there are steps of a certain size between two or more words, where the first letter has the same sound (as in the second example). 4.3.1 Phonetic Algorithms To handle the first case, which is described above, an algorithm for encoding a word into some kind of representation that gives information about its phonetics might be helpful. One phonetic algorithm for the English language is Soundex [Rob18] which was developed by Russell and Odell and patented in 1918. It is a phonetic algorithm for indexing names by sound and it is used in many popular database systems like Oracle [Soua] and PostgreSQL [Soub]. The basic idea is that a word is turned into a representation consisting of one letter at the beginning, followed by three numbers. Letters are turned into numbers based on certain rules, which are not explained in detail in this section. To show one concrete example, the name "Britney" would be transformed into "BRTN" (as the letters a, e, i, o, u, y, h, and w are removed if they are not the first letter) and then the remaining letters, after the first letter, will be transformed into a numeric representation, which would lead to "B635" in this case. This is a simplification as there are additional rules to follow in the algorithm. For the German language there are different kinds of phonetic algorithms, Wilz [Wil05] lists multiple different phonetic algorithms, some for the English language and some for the German language. Two phonetic algorithms for the German language are cologne phonetics (Kölner Phonetik) which was developed by Postel [Pos69] and Phonet, which has two different approaches which are both described in this article by Michael [Mic88]. In the algorithm that was developed to execute the experiments for this thesis, the cologne phonetics algorithm by Postel was used [Pos69]. For Python 3 there is an im- plementation of the algorithm by Nouvertné [nou], this one was used in these experiments. The cologne phonetics algorithm transforms a word into a numeric representation. A central part of the algorithm is a transformation table, which is shown in Table 4.6. The following three steps are performed in the algorithm: 1. The letters are transformed into numbers based on the rules which are specified in the transformation table 2. For all the numbers that appear multiple times next to each other, all the appear- ances, besides the first one, are removed 3. All zeros, besides one being at the start, are removed 47 4. Experiments To show a more concrete example, one can look at the transformation of the name Müller- Lüdenscheidt. One has to consider that this name counts as one word in this algorithm, as both names are connected by hyphen. This example is taken from Wikipedia [wike]: 1. Based on the transformation table, the letters are turned into the following repre- sentation: 60550750206880022 2. Now all the numbers that are shown multiple times next to each other are reduced to the number just showing once: 6050750206802 3. In the last step all zeros that are not the first number in the representation are removed, in this case all zeros are removed: 65752682 4.3.2 Algorithm which is used in the experiments This subsection will describe the algorithm that is used in the experiments. In general there are three parameters that the algorithm can be configured with. The first parameter gives information on whether the cologne phonetics algorithm should be used to find the alliteration or not. If the cologne phonetics algorithm is not used, the alliterations are just searched based on their initial letter. The second parameters name is "size" and specifies the size of the sublists of a text that are considered. When setting the size to three, having the sentence "Das hier ist ein kurzer Beispielsatz" the sublists would look like the following: (Das hier ist), (hier ist ein), (ist ein kurzer), (ein kurzer Beispielsatz) The procedure for generating the sublists is defined in a way where each sublist has to have exactly the same size. The sublists are used to find an alliteration of a certain size. If all the words in a sublist would have the same letter, or the same number if encoded using cologne phonetics, it would be considered as an alliteration. This is restrictive, as it would not find alliterations like "Flora und Fauna", which is a valid example. To make the algorithm less restrictive a third parameter is added. The third parameters name is "steps", it is a list that could theoretically contain integers from 1 to the value of size - 1. In this example the maximum would be two, so the list cannot have more than two elements. It defines the amount of steps that is allowed between the first word of an alliteration and the second word of an alliteration. An example can be shown with the following sentence: "Die Flora und Fauna beschreiben zusammen die Natur". If the size is set to three the sublists would look like this: (Die Flora und), (Flora und Fauna), (und Fauna beschreiben), (Fauna beschreiben zusammen) ... 48 4.3. Alliteration Detection If there wouldn’t be any steps specified, the algorithm would not find an alliteration, as it is only looking for sublists where each word starts with the same letter, or in the case of cologne phonetics with the same number. But if the steps are set to a list containing two, like this [2], a step of two is allowed between the words as well. In that case the sublist "(Flora und Fauna)" would contain an alliteration. If one would set the size of the sublists to four, the list which is provided as input to steps could contain one, two, three or any combination of the three values. If the list is empty the stricter approach is taken, where each word needs to have the first letter or number. The pseudocode in algorithm 4.2 shows the procedure which is used to create the sublists of a text. It is a simple approach which expects a list of words as its input. The list of words is then splitted into multiple sublists containing exactly the amount of words that was specified using the the size parameter. Algorithm 4.2: Returns a list of sublists, each sublist has size s Input: List of words w with size(w) > 0 Input: Integer s where s ≤ size(w) 1 Function getSublists(w, s): 2 i ← 0; 3 res ← [] Empty list ; 4 while i < (size(w) − s + 1) do 5 res ← w[i : i + s] Words from index i until index i + s are put in a sublist and added to the result ; 6 i ← i + 1; 7 end 8 return res List of sublists of size s ; algorithm 4.3 shows the short preprocessing routine that is applied to each sentence or text. The text is transformed into its lowercase version, afterwards it is tokenized so that all the words are a single entry in a list and at the end all the numbers are replaced with strings. That is interesting for sentences that contain constructs like "Er hatte vorher 40 volle Fässer" which would be converted to "er hatte vorher vierzig volle fässer" (it would be separated into a list, here it is just displayed as a single string for visual purposes), which would be detected as an alliteration using the rule which is defined for code 3 in the cologne phonetics algorithm as seen in Table 4.6. To convert numbers into strings the python package num2words [num] is used, which is maintained by Virgil Dupras. Its predecessor pynum2word was developed in 2003 by Taro Ogawa, this new package is a continuation of pynum2word. It supports a lot of 49 4. Experiments Algorithm 4.3: Returns preprocessed words from sentence Input: String sent with size(sent) > 0 1 Function getPreprocessedWords(sent): 2 lowercase ← lower(sent); 3 words ← tokenize(lowercase); 4 nonnumeric ← replace_numbers_with_strings(words); 5 return nonnumeric; different languages, for example, English, German, Spanish and Arabic. algorithm 4.4 shows the whole alliteration detection algorithm. It takes four input parameters: • sent: A sentence or text (String) • s: The minimum size of the detected alliterations (Integer) • st: A list of integers describing the steps that are allowed between words in the alliteration • useCP: A boolean that is used for controlling whether cologne phonetics should be used or not In the first step the input text is preprocessed using the logic that has been defined earlier (algorithm 4.3), all the words are then stored in the list with the name "words". If the "useCP" boolean has been set to true, the cologne phonetics algorithm is applied to each word in the list and the first number is taken and put into a list, which is called "letters" in this example. If the "useCP" boolean is set to false a similar approach is taken, but the cologne phonetics part is left out and just the first letter of each word is stored in the list "letters". The resulting list ("letters") and the minimum size of the detected alliteration (Integer "s") are then used as an input to the getSublists(words, size) function (algorithm 4.2), the results are stored in a list called "sublists". In the next step the results from the previous step are used as input to the getFilteredSublists(sublists, useCP) function (algorithm 4.5). This step is just used if cologne phonetics was used before. It removes cases where each number is equal to zero, or in other words, each of the words of a part of a sentence started with the letters a, e, i, j, o, u or y, as these letters are handled by the rule which is defined for code 0 (Table 4.6). To show one example illustrating this process, consider the sentence at the beginning as the following: "Ein Ai ist eine Art von Faultier". After preprocessing the sentence 50 4.3. Alliteration Detection Algorithm 4.4: Returns true if a sentence contains an alliteration, false other- wise Input: String sent with size(sent) > 0 Input: Integer s Input: List of Integers st Input: Boolean useCP 1 Function containsAlliteration(sent, s, st, useCP): 2 result ← [] 3 words ← getPreprocessedWords(sent) 4 if useCP == True then 5 letters = firstColognePhoneticsNumberOfEach(words) 6 else 7 letters = firstLetterOfEach(words) 8 if size(letters) < 2 then 9 return False 10 end 11 sublists = getSublists(words, s) 12 sublists = getFilteredSublists(sublists, useCP) 13 foreach sublist ∈ sublists do 14 if the first character of each string in sublist is equal then 15 result.add(True) 16 end 17 end 18 foreach step ∈ st do 19 if hasAlliterationAtStep(sublists, step) then 20 result.add(True) 21 end 22 end 23 return True in result 51 4. Experiments Algorithm 4.5: Returns filtered list of sublists Input: List of lists sublists with size(sublists) > 0 Input: Boolean useCP 1 Function getFilteredSublists(sublists, useCP): 2 if useCP == True then 3 cologne_sublists ← []; 4 foreach sublist ∈ sublists do 5 if all cologne phonetics encoded words ∈ sublist start with 0 then 6 continue; 7 else 8 cologne_sublists.add(sublist) 9 end 10 end 11 return cologne_sublists 12 else 13 return sublists 14 end with cologne phonetics and taking the first number of each word the "letters" list would look like this: [0,0,0,0,0,3,3]. The getSublists(words, size) method would be called with the "letters" list as input and a size, for this example the size is defined as three. The resulting list of sublists is the following: [(0,0,0),(0,0,0),(0,0,0),(0,0,3),(0,3,3)]. If the getFilteredSublists(sublists, useCP) function is now applied to this list, with the "useCP" parameter being set correctly to true, the first three sublists would be removed and the list would look like the following: [(0,0,3),(0,3,3)]. This step is included as the letters a, e, i, j, o, u or y could all be the reason for the value being 0, therefore one cannot decide whether it is an alliteration or not based only on the number 0. In the next step the first for-each loop is used (line 13) to handle each sublist in the resulting list from the previous step. If all the elements in the sublist are equal, true is added to the list of results which is defined in line 2 above. This is important for the final step (line 23) which only checks whether true is included in the list of results and returns true or false based on the result of this check. The second for-each loop which starts in line 18 goes over the steps that are defined in the input list "st". In the default case the list is empty. Generally it is describing whether there are steps allowed between the same sounds in an alliteration or not. If yes, the size of the steps can be defined, so if the list is equal to [1,2] steps of the size one and two are allowed. For each of the steps the hasAlliterationAtStep(sublists,step) function is called (al- gorithm 4.6). This algorithm checks whether there is the same character after the step, so it is possible to skip characters. The sublists still have the specified size, but the 52 4.3. Alliteration Detection Letter Context Code A, E, I, J, O, U, Y 0 H - B 1 P if not in front of H 1 D, T if not in front of C, S or Z 2 F, V, W 3 P if in front of H 3 G, K, Q C if in the initial sound before A, H, K, L, O, Q, R, U or X 4 C if in front of A, H, K, O, Q, U or X, if not in front of S or Z X if not in front of C, K or Q 48 L 5 M, N 6 R 7 S, Z if after S or Z C if initial sound and not in front of A, H, K, L, O, Q, R, U, X if not in front of A, H, K, O, Q, U or X 8 D, T if in front of C, S or Z X if after C, K or Q Table 4.6: The transformation table of the cologne phonetics algorithm by Postel [Pos69] alliteration could be shorter, with size - max(st) being the minimum size. True is added to the list of results which is defined in line 2, if the criterion is fulfilled. At the end (line 23) a check is performed on whether there is at least one occurrence of true in the list of results. If there is at least one occurrence, the algorithm returns true, which would mean that there is an alliteration in the text. In any other case false would be returned. 4.3.3 Experiment setup In this subsection the experiments that were conducted are described in more detail. In total two different datasets were used. The first dataset consists of 605 alliterations which were extracted from the website of Ulrich Mehner [meh]. If the website is visited it automatically redirects to another website after a few seconds, but it is still possi- ble to look at the alliterations for a short period of time. At the time of this writing there might be even more alliterations than 605, as the dataset was created in August 2022. This dataset is helpful for evaluating whether the algorithm is detecting correct al- literations, but it does not help with evaluating the performance of the algorithm on any 53 4. Experiments Algorithm 4.6: Returns true if one sublist has an alliteration with step st, false otherwise Input: List of sublists ls with size(ls) > 0 all sublists have same size Input: Integer st where st ≤ size(ls[0]) − 2 1 Function hasAlliterationAtStep(ls, st): 2 ind_list ← 0; 3 res ← [] Empty list; 4 while ind_list < (size(ls) do 5 ind_sublist ← 0; 6 sublist ← ls[ind_list]; 7 size_sublist ← size(sublist); 8 while ind_sublist < size_sublist do 9 if ind_sublist + st ≥ size_sublist then 10 ind_sublist ← ind_sublist + 1; 11 else 12 if sublist[ind_sublist] == sublist[ind_sublist + st] then 13 res.add(True); 14 end 15 ind_sublist ← ind_sublist + 1; 16 end 17 ind_list ← ind_list + 1; 18 end 19 return True in res ; 20 True if there was an alliteration considering step st text input. For this case a dataset which contains more than 65000 statements from the Austrian national council was used, as in other experiments in this thesis. The dataset was created based on html protocols which are available on the website of the national council [aus]. To get the best possible results, the results of different setups of algorithm 4.4 were combined, so that multiple cases can be handled. The setup for the best result on the alliteration dataset combined the results of the four different setups shown in Table 4.7. As some of the alliterations in the dataset are not continuous it is important to consider different step sizes as well as different sizes. For the dataset which is consisting of political statements it was important to de- fine stricter rules, as the rules for the alliteration dataset were way to permissive, which was shown in the amount of statements that were marked as containing an alliteration, 54 4.4. Hyperbole Detection which was way above 90%. The rules can be found in Table 4.8, in this case no steps in between are allowed and the minimum size is set to four, which means that an alliteration needs to consist of at least four consecutive words with the same first letter, or the same first number in the case of cologne phonetics, to be counted as a valid alliteration. Setup Uses Cologne Phonetics Size Steps Setup 1 No 2 [1] Setup 2 No 3 [1,2] Setup 3 Yes 2 [1] Setup 4 Yes 3 [1,2] Table 4.7: Different setups of the algorithm which were combined in the best attempt on the alliteration dataset Setup Uses Cologne Phonetics Size Steps Setup 1 No 4 - Setup 2 Yes 4 - Table 4.8: Different setups of the algorithm which were combined to search for alliterations in the dataset of political statements 4.4 Hyperbole Detection This part of the thesis focuses on the detection of hyperboles in texts. More precisely it aims at reproducing part of the results that were achieved by Troiano et al. [TSÖT18] when implementing an approach for hyperbole detection for the English language. In these experiments the same approach is implemented for the German language. Troiano et al. [TSÖT18] worked on the computational exploration of exaggeration. They built a corpus called HYPO, containing overstatements collected on the web, vali- dated via crowd-sourcing. The chosen approach is classification, the set of used algorithms includes Logistic Regression, Naive Bayes, k-Nearest Neighbors, Decision Trees, Support Vector Machine and Linear Discriminant Analysis. Their experimental results lead to the conclusion, that automatic hyperbole detection could be successfully executed based on semantic features. 4.4.1 Dataset The dataset that was created by Troiano et al. [TSÖT18] is called HYPO and contains three different types of sentences, being hyperboles, paraphrases and minimal units. The previously collected hyperboles were evaluated by five annotators, if three out of the five annotators decided that the sentence contains an exaggeration, the sentence was kept as a hyperbole in the final dataset. In total 854 sentences were judged, 709 of them were hyperboles in the final dataset. The annotators also collected information 55 4. Experiments about the words that are hyperbolic in their opinion. A paraphrase in this context is a sentence that conveys the same message as a hy- perbolic sentence but without the exaggeration. It is practically very similar to a hyperbolic sentence but with a few slight changes in the syntax and semantic so that it does not contain an exaggeration anymore. The final dataset contained 709 paraphrases, one for each hyperbole. A minimal unit is a sentence that is not hyperbolic but contains hyperbolic words, which were collected during the annotating process. For each hyperbole the tokens which were selected by the majority of the annotators were chosen. These tokens were then used by Troiano et al. to extract sentences containing these tokens from corpora like the WaCKy corpus, which is a dump of the English Wikipedia. They selected the sentences from the WaCKy corpus [BBFZ09] based on the editorial criteria of Wikipedia which states that entries need to be neutral and verifiable, which leads to the conclusion that these sentences should not be hyperbolic. If they were not successful using the WaCKy corpus a Google search was performed. They collected 698 sentences of this category. In general the HYPO dataset is an English dataset, so it would not fit for the pur- pose of the experiments described in this section, as they are focusing on the German language. The first idea was to translate the HYPO dataset to German, but the dataset is not publicly available. Zhang et al. [ZW21] worked on similar datasets, namely HYPO-L and HYPO-XL, while developing MOVER. HYPO-L is a manually annotated dataset containing English sentences which are either hyperbolic or literal. The dataset is not as complex as HYPO, as it does not contain paraphrases and minimal units. The sentences in HYPO-L were annotated by students with proficiency in English. Two students annotated each sentence, only if both students chose the same class, either hyperbolic or literal, the sentence was classified as the chosen class and kept in the dataset. In this experiment a German version of the HYPO-L dataset is used. 400 of the sentences, 200 of each class, were manually translated. In addition all the sentences were automatically translated using the Google Translate function in a Google Sheet document. In the end the experiments were conducted with three different versions of the dataset: The whole machine translated dataset, the 400 manually translated sentences and the 400 sentences that were manually translated before but with their machine translated version. To validate the performance on a dataset that is not related to HYPO-L at all, a dataset containing sentences from the 58. sitting of the national council of Austria (legis- lature XXVII) was used. The model was trained on the full machine translated version of HYPO-L. The sentences that were classified as hyperbole were manually evaluated, so that the precision can be calculated. 56 4.4. Hyperbole Detection 4.4.2 Semantic Features The set of semantic features that were computed by Troiano et al. consists of imageability, unexpectedness, polarity, subjectivity and emotional intensity. According to the definition in the paper of Troiano et al. [TSÖT18] imageability refers to the extent to which a word has the capacity to conjure a mental picture. Their assumption is that speakers who hyperbolize might use words with a high value in imageability. To compute the imageability they used a resource by Tsvetkov et al. [TBG+14] which contains imageability ratings for 150.114 terms. For each sentence the imageability values of all the words were averaged. In this experiment the dataset which was developed by Köper et al. [KIW16] was used to compute the imageability for German sentences. This dataset contains 350.000 German lemmatised words which are rated on four psycholinguistic attributes, one of them being imageability. All ratings were calculated using a supervised learning algorithm . During the feature engineering procedure the imageability values of each word in a sentence were averaged. Unexpectedness is used under the assumption that hyperboles are less predictable expressions than literals. The idea is that words which are unexpected in a certain context might give a hint towards them being used in a hyperbolic way. Troiano et al. used pre-trained vectors by Mikolov et al. [MCCD13](from the Skip-gram model) and GloVe vectors by Pennington et al [PSM14]. The idea is that the words in a hyperbolic sentence have less similar meanings, which would mean that their vectorial representations are more distant from each other. The words are mapped to the vector representation of the two pre-trained vector sets men- tioned above and then the cosine distance between all possible word pairs in a sentence are computed. The resulting features are then found in two ways, one being the average similarity among all of the word pairs and the second one being the lowest value of the pair similarities. This results in four values in total as they are computed for both vector sets. In this experiment the GloVe embeddings [gita] and the Word2Vec embeddings [gitb] published on deepset.ai [emb] were used to compute the four unexpectedness values for the German language. Polarity refers to the sentiment of a statement. Troiano et al. [TSÖT18] computed it using TextBlob [LKH+14] by Loria et al. and SentiWords [GGT15] by Gatti et al. TextBlob is a library that can be used for multiple NLP tasks, like the computation of sentiment. SentiWords is a dataset which contains the polarity values of 155.000 POS-tagged lemmas. The SentiWords dataset is used to compute the polarity of each 57 4. Experiments word, afterwards they are averaged. In total there are two polarity scores: the TextBlob score and the average of the polarity values which were gathered from the SentiWords. As TextBlob and SentiWords are resources for the English language they needed to be substituted. There is a library called textblob-de [kil] for the German language which is developed by Markus Killer and there is a German version of the SentiWords dataset [RQH10] which was published by Remus et al. These two resource where used in the same way as the English resources were used by Troiano et al. Subjectivity is a value that is used to describe whether a statement expresses an opinion or objective information. Troiano et al. used TextBlob to calculate the subjectivity. As textblob-de does not support subjectivity yet, an alternative approach using the mdebertav3-subjectivity-german model [hug] developed by a team of the University of Groningen was chosen to calculate subjectivity. It was published by Folkert Leistra and Wietse de Vries on huggingface.co and it is a fine-tuned mDeBERTa V3 model [HGC21] which was developed for the second task of the CLEF 2023 CheckThat! Lab [cle] which was focused on subjectivity in news articles. Emotional intensity gives information about the strength of the sentiment. Troiano et al. compute the sentiment using VADER, which was developed by Hutto and Gilbert [HG14]. VADER returns four values regarding the sentiment, being pos, neu, neg and compound, where pos stands for positive, neu for neutral and neg for negative. The compound score gives information about the overall sentiment, and the other three values show the level of the respective sentiment in the sentence. Unfortunately it was not clear which of the four values was used by Troiano et al., an assumption is made that all four of them are used. As VADER is only applicable to sentences written in the English language, an al- ternative approach needs to be chosen. There is a library called GerVADER, which was developed by Tymann et al. [TLPG19] that aims at implementing VADER for the German language. GerVADER returns the same values as VADER, all four of them were used as a features when training the model. 4.4.3 Experiment Setup Troiano et al. [TSÖT18] defined the task of hyperbole detection as a supervised learning problem, more specifically as a classification task which is based on sentences. It is a binary classification task with the classes hyperbolic and literal. Various algorithms were used by Troiano et al. to conduct experiments using the semantic features described above, namely Logistic Regression (LR), Naive Bayes (NB), Linear Discriminate Analysis (LDA), k-Nearest Neighbors (KNN), Decision Trees (DT) 58 4.4. Hyperbole Detection and Support Vector Machine (SVM). In the experiments for this thesis Random Forest (RF) was used in addition. The implementations of all the mentioned machine learn- ing algorithms in the scikit-learn package , which was developed by Pedregosa et al. [PVG+11], were used to train, test and evaluate the models. A 10-fold cross validation was performed on the experiments which were conducted with the translated HYPO-L dataset, the results were evaluated using accuracy, precision, recall and the F1-score. The experiment with the dataset containing the political statements was manually evaluated and the precision was computed based on the results of the manual evaluation. For this experiment random forest was used as a classifier. 59 CHAPTER 5 Evaluation 5.1 Extracting technical terms from Wikipedia The following section presents the results that were achieved by using the frequency-based approach for finding technical terms in a Wikipedia article. Three different topics were chosen, namely the European migrant crisis, climate change and feminism. The target was finding as many relevant speeches as possible, by searching the collection using the technical terms extracted from the Wikipedia article. The results are visualized using bar charts. The bar charts are used to visualize the amount of relevant and irrelevant speeches that were returned and to show the amount of texts containing a certain technical term. The latter are ordered by amount and the ten terms with the highest amount of texts are shown. The plots visualize the data for three different sets of texts, namely for all texts, for all relevant texts and for all irrelevant texts. 5.1.1 Topic: European migrant crisis The following subsection gives an evaluation of the results which were achieved using the frequency-based approach when trying to find technical terms which are then used to find speeches that are relevant to the European migrant crisis. The Wikipedia article used can be found here [wikc]. The first plot Figure 5.1 shows the amount of relevant and irrelevant speeches retrieved from the corpus. The speeches were manually evaluated on whether they are relevant to the topic or not. In this case it is notable, that a lot of speeches were found, in total more than 3000 speeches. Only around one sixth of the speeches were actually found to be relevant to the 61 5. Evaluation R e le v a n t Ir re le v a n t 0 500 1000 1500 2000 2500 Amount of relevant and irrelevant speeches found (Topic = Asyl) Figure 5.1: Value counts of relevant and irrelevant speeches (Topic = European Migrant Crisis) topic, which is a bad result for this approach, as only every sixth speech that is presented to the end user would actually be relevant. To get a better picture of why the irrelevant speeches were found, a data analysis on the technical terms retrieved from the Wikipedia article was conducted. The Wikipedia article that had been used was not read before, such that the results are not based on a particularly well-suited article. In this case the article about the European migration crisis includes some information about the pandemic caused by COVID-19, which leads to a problem in that specific case, as many speeches in the corpus revolve around this topic. The second bar plot Figure 5.2 shows the amount of speeches containing a specific technical term, ordered by amount. The first ten entries are displayed in the plot. It becomes evident here that "Pandemie" - a word which is mostly included in speeches that are irrelevant to the topic - is included in over 1000 speeches. In other words, the technical term is present in around every third speech which was retrieved. This is already more than five times as frequent as the second technical term, which is also mostly included in irrelevant speeches. This strengthens the conclusion, that this word has a big impact on the result on this specific corpus, as it includes a lot of speeches revolving around this topic. The third bar plot Figure 5.3 shows the amount of relevant speeches containing a specific technical term, ordered by amount. The first ten entries are displayed in the plot. Here 62 5.1. Extracting technical terms from Wikipedia P a n d e m ie C o ro n a v ir u s A s y l M in d e s ts ic h e ru n g L a n d - E U -E b e n e E U -K o m m is s io n S o b o tk a G e s e tz e s p a k e t F lü c h tl in g e term 0 200 400 600 800 1000 Amount of texts containing a certain word (Top 10) (Topic = Asyl) amount Figure 5.2: Amount of texts containing a specific word (Top 10) (Topic = European Migrant Crisis) one can see, that they are all relevant to the topic and that the frequency is generally very low when compared to the the frequency of the word "Pandemie". The fourth bar plot Figure 5.4 shows the amount of irrelevant speeches containing a specific technical term, ordered by amount. The first ten entries are displayed in the plot. Here one can again see the strong impact of the word "Pandemie". 5.1.2 Topic: Climate Change The following subsection gives an evaluation of the results which were achieved using the frequency-based approach when trying to find technical terms which are then used to find speeches that are relevant to climate change. The Wikipedia article used can be found here [wika]. The first plot Figure 5.5 shows the amount of relevant and irrelevant speeches retrieved from the corpus. The speeches were manually evaluated on whether or not they are relevant to the topic. In this case the method performed very well, as 89,03 percent of all retrieved speeches are relevant to the chosen topic. The following plots will give some insight to the technical terms that were extracted from the Wikipedia article. The second bar plot Figure 5.6 shows the amount of speeches containing a specific technical term, ordered by amount. The first ten entries are displayed in the plot. Here one can see, that the first eight technical terms are relevant to climate change in many 63 5. Evaluation A s y l F lü c h tl in g e A s y l- A s y lw e rb e r A s y lv e rf a h re n F lü c h tl in g e n A s y lp o li ti k A s y la n tr ä g e A s y ls y s te m F lü c h tl in g s k o n v e n ti o n term 0 20 40 60 80 100 Amount of relevant texts containing a word (Top 10) (Topic = Asyl) amount Figure 5.3: Amount of relevant texts containing a specific word (Top 10) (Topic = European Migrant Crisis) P a n d e m ie C o ro n a v ir u s M in d e s ts ic h e ru n g L a n d - E U -E b e n e S o b o tk a G e s e tz e s p a k e t E U -K o m m is s io n E U -R ic h tl in ie L o p a tk a term 0 200 400 600 800 1000 Amount of irrelevant texts containing a word (Top 10) (Topic = Asyl) amount Figure 5.4: Amount of irrelevant texts containing a specific word (Top 10) (Topic = European Migrant Crisis) 64 5.1. Extracting technical terms from Wikipedia R e le v a n t Ir re le v a n t 0 100 200 300 400 500 600 700 800 Amount of relevant and irrelevant speeches found (Topic = Climate) Figure 5.5: Value counts of relevant and irrelevant speeches (Topic = Climate Change) K li m a s c h u tz K li m a k ri s e K li m a K li m a w a n d e l K li m a w a n d e ls C O 2 -E m is s io n e n K li m a s c h u tz m a ß n a h m e n C O 2 -A u s s to ß O tt o F a k te n la g e term 0 100 200 300 400 Amount of texts containing a certain word (Top 10) (Topic = Climate) amount Figure 5.6: Amount of texts containing a specific word (Top 10) (Topic = Climate Change) cases, "Klimakrise", "Klima" and "Klimawandel" were used in the social climate context as well. The third bar plot Figure 5.7 shows the amount of relevant speeches containing a specific technical term, ordered by amount. The first ten entries are displayed in the plot. The technical term "Klimaschutz" performed best, as the amount of relevant speeches 65 5. Evaluation K li m a s c h u tz K li m a k ri s e K li m a w a n d e l K li m a K li m a w a n d e ls C O 2 -E m is s io n e n K li m a s c h u tz m a ß n a h m e n C O 2 -A u s s to ß Tr e ib h a u s g a s e K li m a s term 0 100 200 300 400 Amount of relevant texts containing a word (Top 10) (Topic = Climate) amount Figure 5.7: Amount of relevant texts containing a specific word (Top 10) (Topic = Climate Change) containing it is higher as for any other technical term and because the total amount of speeches containing the term is nearly equal to the amount of relevant speeches containing it. The third bar plot Figure 5.8 shows the amount of irrelevant speeches containing a specific technical term, ordered by amount. The first ten entries are displayed in the plot. This plot shows the note-worthy result, that even words like "Klimaschutzmaßnahmen" and "Klimaschutz" are used in a social climate context, as the speeches containing them were irrelevant to climate change. In this experiment, it was possible to get good results by focusing on single words only, but even in such a case, it still shows that context plays an important role. 5.1.3 Topic: Feminism The following subsection gives an evaluation of the results which were obtained using the frequency-based approach when applied to feminism. The Wikipedia article used can be found here [wikb]. The first plot Figure 5.9 shows the amount of relevant and irrelevant speeches retrieved from the corpus. The speeches were manually evaluated on whether they are relevant to the topic or not. This plot shows that the total amount of retrieved speeches is way lower than the amount of speeches retrieved for the two other topics. A possible reason is the amount of words related to feminism that are very commonly found in the speeches of the national 66 5.1. Extracting technical terms from Wikipedia K li m a O tt o G e o d y n a m ik Z e n tr a la n s ta lt F a k te n la g e Z e it v e rl a u f K li m a s c h u tz K a u s a lz u s a m m e n h a n g U S -a m e ri k a n is c h e n A e ro s o le term 0 5 10 15 20 Amount of irrelevant texts containing a word (Top 10) (Topic = Climate) amount Figure 5.8: Amount of irrelevant texts containing a specific word (Top 10) (Topic = Climate Change) council, such as "Frau", which appears in sentences like "Wie Frau Muster bereits erwähnt hat" that may appear in any context. These words are excluded from the technical terms as they are above the cut-off frequency threshold. Still, they could be included in highly-relevant speeches as well, if they contain passages like "Rechte der Frau". The general result is not as bad as for the topic "European migrant crisis" but there are still close to twice as many speeches that are irrelevant. The following plots show which technical terms were used to retrieve the speeches. Figure 5.10 shows the amount of speeches containing a specific technical term, Figure 5.11 shows the amount of relevant speeches containing a specific technical term and Figure 5.12 shows the amount of irrelevant speeches containing a specific technical term. The results are always ordered descending by amount and the first ten entries are displayed in each plot. 67 5. Evaluation R e le v a n t Ir re le v a n t 0 50 100 150 200 Amount of relevant and irrelevant speeches found (Topic = Feminism) Figure 5.9: Value counts of relevant and irrelevant speeches (Topic = Feminism) U m w e lt - L e b e n s re a li tä t F ra u e n - To m a s e ll i K u lt u r- In it ia to ri n n e n F ra u e n g e s u n d h e it N u s s b a u m A k ti v is ti n n e n M e n s c h e n - term 0 5 10 15 20 25 30 35 40 Amount of texts containing a certain word (Top 10) (Topic = Feminism) amount Figure 5.10: Amount of texts containing a specific word (Top 10) (Topic = Feminism) 68 5.1. Extracting technical terms from Wikipedia F ra u e n - F ra u e n g e s u n d h e it L e b e n s re a li tä t B a c k la s h A k ti v is ti n n e n N u s s b a u m F ra u e n p e rs p e k ti v e H o m o p h o b ie G e s c h le c h ts m e rk m a le In it ia to ri n n e n term 0 5 10 15 20 25 30 Amount of relevant texts containing a word (Top 10) (Topic = Feminism) amount Figure 5.11: Amount of relevant texts containing a specific word (Top 10) (Topic = Feminism) U m w e lt - To m a s e ll i K u lt u r- L e b e n s re a li tä t In it ia to ri n n e n N u s s b a u m A k ti v is ti n n e n M e n s c h e n - S k a n d a li s ie ru n g Tr a g is c h e term 0 5 10 15 20 25 30 35 40 Amount of irrelevant texts containing a word (Top 10) (Topic = Feminism) amount Figure 5.12: Amount of irrelevant texts containing a specific word (Top 10) (Topic = Feminism) 69 5. Evaluation 5.2 Opinion Type Classification N o n -O p in io n a te d O p in io n a te d C o m p a ra ti v e O p in io n a te d S u p e rl a ti v e O p in io n a te d Categories 0 10000 20000 30000 40000 A m o u n t Comparision of the two POS-Tag based approaches POS-Tags POS-Tags + Filter Figure 5.13: Comparison of the results of the two POS-Tag based approaches This section presents the results that were gathered while implementing the approaches for opinion type classification, which are described in the experiment section above. Two approaches were implemented for detecting opinionated texts and three different approaches were implemented for the detection of comparative and superlative opinionated texts. To be able to compare the results of the different approaches a manual evaluation was conducted. The goal was to classify the statements that were tagged as opinionated, as either opinionated or not. As there were way to many statements, which were classified as opinionated, only a subset of 200 statements per method was evaluated. The 200 statements were selected randomly. It needs to be said that the manual evaluation introduces a bias as it was only conducted by one person. Table 5.1 gives a short overview of the precision of the different methods, the results are explained in more detail in the following subsections. 5.2.1 Approach 1: POS-Tagging using the RFTagger The POS-Tagging approach, which was implemented following the approach by Othman et al. [OHMI15], was implemented for all three opinion types. Figure 5.13 shows the results (category = "POS-Tags") that were achieved by naively searching for the respective tags. After analyzing the words that were tagged the most often, some words showed up, that would give the impression, that they are only used in greetings. As all of the texts are extracted from parliament protocols, this often is the 70 5.2. Opinion Type Classification Approach Precision POS-Tagging Opinionated 71% Data-Driven Opinionated 67% POS-Tagging Comparative Opinionated 44% Data-Driven Comparative Opinionated 36% Rule-Based Comparative Opinionated 33.5% POS-Tagging Superlative Opinionated 44% Data-Driven Superlative Opinionated 65% Rule-Based Superlative Opinionated 78% Table 5.1: Results of the approaches that were manually evaluated case, as many of the texts start with a greeting. To handle this case, another set of words was created. This time, the words were the ones that showed up most often and gave the impression that they are part of a greeting. After that, a filter was implemented with the task of filtering out statements were each tagged word belongs to the aforementioned set. This step was performed for each tag and led to a significant reduction of superlative opinionated statements, which is shown in Figure 5.13 (category = "POS-Tags + Filter"). POS-Tagging: Manual Data Evaluation Figure 5.14 shows the amount of opin- ionated and non-opinionated statements in the sample containing 200 statements that were classified as opinionated based on the POS-Tagging approach. Around 140 out of 200 statements are truly opinionated, which leads to a precision of around 0.7 - One has to consider that 44578 out of 63909 statements were classified as opinionated by this approach, so 200 statements is a tiny sample in this case. Opinionated Non Opinionated Categories 0 20 40 60 80 100 120 140 A m o u n t Opinionated vs Non Opinionated Samples (Positive - POS) Figure 5.14: Opinionated and Non-Opinionated Statements (POS-Tagging) Figure 5.15 shows the amount of comparative opinionated statements and non-opinionated 71 5. Evaluation statements based on the manual evaluation. Here one can see that there are around 55% non-opinionated statements. Opinionated Non Opinionated Categories 0 20 40 60 80 100 A m o u n t Opinionated vs Non Opinionated Samples (Comparative - POS) Figure 5.15: Comparative Opinionated and Non-Opinionated Statements (POS-Tagging) Figure 5.16 which shows the results of the evaluation of the POS-Tagging approach when applied to finding superlative opinionated statements, surprisingly shows exactly the same result as Figure 5.15, where around 55% of the manually evaluated statements are non-opinionated. Opinionated Non Opinionated Categories 0 20 40 60 80 100 A m o u n t Opinionated vs Non Opinionated Samples (Superlative - POS) Figure 5.16: Superlative Opinionated and Non-Opinionated Statements (POS-Tagging) 72 5.2. Opinion Type Classification 5.2.2 Approach 2: Data-Driven Approach For this approach, a dataset containing adjectives in three forms (positives, comparatives, superlatives) ,if available, was created using the data from Wiktionary [Wikf]. After that, five sets were created. Set P contains all the positive forms from the dataset, Set PD contains all the positive forms and their declensions. Set C contains all comparative forms from the dataset and Set CD contains the comparative forms and their declensions. Set S contains all superlative forms from the dataset. All the sets and their respective amount of words are shown in Table 5.2. The sets of words to ignore, which were defined in approach 1, were considered and the words were removed from the respective sets. Figure 5.17 shows the amount of statements that were marked as Set Name Amount of Words Set P (Positives) 13702 Set S (Superlatives) 5717 Set C (Comparatives) 5705 Set PD (Positives and Declensions) 82170 Set CD (Comparatives and Declensions) 34229 Table 5.2: Sizes of the sets which are used in the data-driven approach opinionated statements using different setups. All of the setups return a list of booleans, therefore it is possible to combine them using logical operators. If only one setup is mentioned (for example Set P) only the amount of statements which were marked as opinionated using Set P are considered. The logical operators used are the AND operator ∧ and the OR operator ∨. The first observation one can make is that the set-based methods find way more sentences containing an adjective in positive form than the tag-based methods. When combining one of the sets with the results from the tag-based method (without filter) using the logical AND operator the amount is only slightly less than when using the tag-based method (without filter) alone. On the other hand, when combined using a logical OR operator, the amount of statement increases significantly. Figure 5.18 shows the amount of statements that were marked as comparative opinionated statements using the same methods as described above but with the tags ADJD.Comp and ADJA.Comp and using the two sets Set C and Set CD which were generated using the Wiktionary dataset as well. In this case the set-based methods find way more statements than the tag-based methods, but once they are combined with an AND operator they return less statements although both sets contain a significant amount of comparative forms. This is a similar result to the one that was observed in Figure 5.17. 73 5. Evaluation P O S -T a g s S e t P P O S -T a g s S e t P D P O S -T a g s + F il te r P O S -T a g s S e t P S e t P D P O S -T a g s S e t P P O S -T a g s S e t P D Approaches 0 10000 20000 30000 40000 50000 60000 A m o u n t Data-driven approaches vs POS-Tag approach (Positive) Figure 5.17: Results of the data-driven approach (positives) Figure 5.19 shows the results that were achieved when applying the same approach to find superlative words. Here, the set-based method finds way less statements than the tag-based method. This may be the case because the tag-based method marked a lot of words as superlatives where the context reveals that it is not used to express an opinion. Examples are ’nächster’, ’nächstes’ and ’Hochgeschätzter’. All of these words are excluded from Set S. Data-Driven: Manual Data Evaluation The results shown in Figure 5.20 are very similar to the ones shown in Figure 5.14, so in this regard the POS-Tagging approach and the data-driven approach seem to perform quite similar. Figure 5.21 shows similar results to Figure 5.15 with a slight increment in non-opinionated statements. One can see in Figure 5.22 that this approach performs better than the POS-Tagging approach (Figure 5.16). This might be related to the pre-filtering that has already been done, where some obvious words like "nächsten" were removed from the set before classifying the statements. 74 5.2. Opinion Type Classification P O S -T a g s S e t C P O S -T a g s + F il te r P O S -T a g s S e t C D P O S -T a g s S e t C S e t C D P O S -T a g s S e t C P O S -T a g s S e t C D Approaches 0 5000 10000 15000 20000 A m o u n t Data-driven approaches vs POS-Tag approach (Comparative) Figure 5.18: Results of the data-driven approach (comparative) P O S -T a g s S e t S S e t S P O S -T a g s + F il te r P O S -T a g s P O S -T a g s S e t S Approaches 0 1000 2000 3000 4000 5000 A m o u n t Data-driven approaches vs POS-Tag approach (Superlative) Figure 5.19: Results of the data-driven approach (superlative) 75 5. Evaluation Opinionated Non Opinionated Categories 0 20 40 60 80 100 120 140 A m o u n t Opinionated vs Non Opinionated Samples (Positive - Data) Figure 5.20: Opinionated and Non-Opinionated Statements (Data-Driven) Opinionated Non Opinionated Categories 0 20 40 60 80 100 120 A m o u n t Opinionated vs Non Opinionated Samples (Comparative - Data) Figure 5.21: Comparative Opinionated and Non-Opinionated Statements (Data-Driven) 76 5.2. Opinion Type Classification Opinionated Non Opinionated Categories 0 20 40 60 80 100 120 A m o u n t Opinionated vs Non Opinionated Samples (Superlative - Data) Figure 5.22: Superlative Opinionated and Non-Opinionated Statements (Data-Driven) 5.2.3 Approach 3: Rule-Based Approach with regular expressions The rule-based approach was only applied for finding comparative or superlative state- ments. All the rules, which can be found in Table 4.5, were implemented using regular expressions. One important constraint was added while experimenting with the rule-based approach. There was a problem with the rules considering "immer ...er" and "...er als", as this rule also gets triggered in cases like "immer wieder" and "wieder als" where it would not count as comparative. To prevent this from happening, the procedure was extended so that one of the sets containing comparatives (Set C or Set CD) can be added. The String which is found using the regular expression is then separated and a lookup is performed to find out whether the word is part of the set. If that is the case it is marked as a comparative, else it is ignored. They are again compared to the tag-based method. Figure 5.23 shows the results of the different rule-based approaches that were executed to find comparative statements. The discrepancy between the different approaches is noteworthy; the tag-based approach finds around 8600 statements, while the rule-based approach with the set-based con- straints finds below 2000 statements in both cases. When both approaches are combined with an AND operator, below 800 statements are found in both cases. Figure 5.24 shows the results of the rule-based approach applied to finding superlative statements. The tag-based approach again finds significantly more superlative statements 77 5. Evaluation P O S -T a g s ( R u le s e t C S e t C ) P O S -T a g s ( R u le s e t C S e t C D ) P O S -T a g s R u le s e t C R u le s e t C S e t C R u le s e t C S e t C D R u le s e t C P O S -T a g s + F il te r P O S -T a g s P O S -T a g s ( R u le s e t C S e t C ) P O S -T a g s ( R u le s e t C S e t C D ) P O S -T a g s R u le s e t C Approaches 0 2000 4000 6000 8000 10000 A m o u n t Rule-based approaches vs POS-Tag approach (Comparative) Figure 5.23: Results of the rule-based approach (comparative) than the rule-based approach. Curiously, the rule-based approach finds superlative statements that were not found using the tag-based approach, as there are 477 in total but only 383 when they are combined with the results of the tag-based approach using the AND operator. P O S -T a g s R u le s e t S R u le s e t S P O S -T a g s + F il te r P O S -T a g s P O S -T a g s R u le s e t S Approaches 0 1000 2000 3000 4000 5000 A m o u n t Rule-based approaches vs POS-Tag approach (Superlative) Figure 5.24: Results of the rule-based approach (superlative) Rule-Based: Manual Data Evaluation Figure 5.25 shows that this rule-based approach performed slightly worse than the other two approaches. Here one needs to 78 5.2. Opinion Type Classification consider that this is the approach without additional filters. Opinionated Non Opinionated Categories 0 20 40 60 80 100 120 A m o u n t Opinionated vs Non Opinionated Samples (Comparative - Rule) Figure 5.25: Comparative Opinionated and Non-Opinionated Statements (Rule-Based) Figure 5.26 shows that the performance using this approach is better than the perfor- mance achieved by the POS-Tagging approach and the data-driven approach as shown in Figure 5.16 and Figure 5.22. Opinionated Non Opinionated Categories 0 20 40 60 80 100 120 140 160 A m o u n t Opinionated vs Non Opinionated Samples (Superlative - Rule) Figure 5.26: Superlative Opinionated and Non-Opinionated Statements (Rule-Based) 79 5. Evaluation Correctly detected Not detected 0 100 200 300 400 500 600 A m o u n t Amount of detected and undetected alliterations in alliteration dataset Figure 5.27: Amount of detected and undetected alliterations in the alliteration dataset 5.3 Alliteration Detection This section presents the results that were gathered while conducting the experiments regarding the detection of alliterations, which are described in the experiments section above. Two different datasets were used to evaluate the algorithm, one of them consists of alliterations and one consists of political statements. The first one was used to evaluate whether the algorithm is able to detect correct alliterations. The second dataset was used to see whether the algorithm can detect alliterations in plain text. A subset of the results were manually annotated as there is no information about whether there are alliterations in the dataset or not. It was not possible to manually annotate all the results as the algorithm detected way to many cases, although the configuration Table 4.8 was strict compared to the configuration defined for the alliteration dataset Table 4.7 5.3.1 Alliteration Dataset Figure 5.27 shows the results that were achieved when using the detection algorithm on the alliterations dataset. In this case the performance was very good as 601 out of 605 alliterations were detected. The only alliterations that were not detected are four examples of alliteration by word combination, as the algorithm does not consider the possibility that there is an alliteration in a single word. The four examples are "habhaft", "Linkliste", "Wendewinkel" and "Diridari". 80 5.3. Alliteration Detection 5.3.2 Political Statement Dataset The following plots show the results of the experiments using the dataset consisting of political statements. The results are separated into four different sets: • standard • cologne phonetics • both • none The algorithm was configured so that it would return one of four results per statement. The four different options are standard, cologne phonetics, both or none. If the algorithm returns "standard" it means that the alliteration was detected based on the first letter only. The option "cologne phonetics" stands for alliterations that were only detected because of the numeric representation of the letters. The other two options are either both of the previous ones or none of them. 200 statements were manually evaluated for the sets "standard", "cologne phonetics" and "both". Figure 5.28 shows the overall results. With the hard restriction that an alliteration must contain at least four words without any steps in between the amount of detected alliterations went down drastically, without the restriction regarding the steps in between around 22.000 statements were marked as containing an alliteration. During the manual evaluation of the first of the three sets a few edge cases showed up, they were invalid alliterations but the algorithm would detect them as alliterations. After that few rules were defined, which were considered when deciding whether an alliteration is valid or not, they are listed below. The algorithm could be improved by implementing logic which could deal with these cases. The examples are all taken from the political statements. • An alliteration should only be considered if it is inside of a single sentence – Example: einen Entschließungsantrag ein. Er lautet • If there are words which are separated by hyphen they should be considered separately – Example: aller AMA-Gütesiegelprodukte ausschließlich auf • The initial sound of the words must not be different, even if the first letter is the same 81 5. Evaluation None Cologne Phonetics Both Standard 0 10000 20000 30000 40000 50000 60000 A m o u n t Amount of each different set detected by the algorithm Figure 5.28: Amount of each of the sets detected by the algorithm in the political statements dataset – Example: nicht nur sich selbst schützt, sondern – Example: es ein extrem erfreuliches • Repetitions should not be considered – Example: in Österreich zugelassen sind, sind sicher, sind – Example: vollkommen verständlich, vollkommen verständlich • Alliterations consisting of one unique word only should not be considered – Example: impfen, impfen, impfen, impfen – Example: Ja, ja, ja, ja 5.3.2.1 Standard Figure 5.29 shows the results of the manual evaluation of the statements where an alliteration was found based on the standard method. In this case around 130 out of 200 cases were valid alliterations. The rules defined above were considered, they were responsible for the invalid cases. The most common sets of letters that occurred in the standard set are shown in Figure 5.30. The set of letters for the detected alliteration was computed in the algorithm and appended to the dataset so that it can be used for the evaluation. Here it is interesting to see, that there is only a tiny amount of sets, containing the letters a, e, i, u, h and j. The reason for that is that code 0 and code - (from the cologne phonetics transformation table 82 5.3. Alliteration Detection Alliteration No Alliteration Categories 0 20 40 60 80 100 120 140 A m o u n t Amount of alliterations (standard) Figure 5.29: Amount of detected and undetected alliterations in the political statement dataset (standard) { 'a '} { 'u '} { 'e '} { 'i '} { 'h '} { 'j '} { ' '} Letter combination 0 10 20 30 40 50 60 70 80 A m o u n t Most common combinations (standard) Alliteration No Alliteration Figure 5.30: Most common sets of letters (standard) Table 4.6) were ignored as they cannot be used to detect alliterations. The alliterations based on words starting with one of these letters were the only ones that only showed up using the standard method, where only the initial letters were combined. All the others either showed up with cologne phonetics only or with both ways. 83 5. Evaluation Alliteration No Alliteration Categories 0 20 40 60 80 100 120 140 A m o u n t Amount of alliterations (cologne) Figure 5.31: Amount of detected and undetected alliterations in the political statement dataset (cologne) 5.3.2.2 Cologne Phonetics Figure 5.31 shows the results of the manual evaluation of the statements where an alliteration was found based on the cologne phonetics method. In this case more than 140 out of 200 cases were invalid alliterations, which is more than two thirds and a bad result compared to the results of the standard method. The following list presents negative and positive examples for a few of the transformation rules which are defined in Table 4.6, these examples might lead to the conclusion that a similar algorithm like cologne phonetics could be developed, which focuses on alliterations only, but it might turn out to be very difficult to find the full set of rules that needs to be considered. The examples are taken from the political statements. • Negative examples for rule number 3 (f, v, w and p if before h) – Example: fangen wir wieder von vorne an – Example: vor Wahlen wieder Wahlzuckerl verteilen wollen • Positive examples for rule number 3 (f, v, w and p if before h) – Example: für viele Frauen, für viele Familien – Example: Fremdübernahmen von Firmen vollzogen • Negative example for rule number 8 (Table 4.6) – sich zum Ziel setzt 84 5.3. Alliteration Detection { '3 '} { '2 '} { '8 '} { '6 '} { '4 '} { '3 ', ' 2 '} { '3 ', ' 8 '} { '8 ', ' 2 '} Letter combination 0 10 20 30 40 50 60 70 80 A m o u n t Most common combinations (cologne) Alliteration No Alliteration Figure 5.32: Most common sets of numbers (cologne) – dass der Thinktank Think Figure 5.32 shows the most common sets of numbers that showed up in the statements. It is interesting to see that only rule 2 led to valid alliterations in most of the cases, while all of the other significant ones led to invalid alliterations in most of the cases. 5.3.2.3 Both The results of the manual evaluation of the statements where an alliteration was found based on the cologne phonetics method and the standard method are shown in Figure 5.33. The performance in this case is comparable to the performance of the standard method. The plot in Figure 5.34 shows the most common letter + number sets and the amount of valid and invalid alliterations. Here it is interesting to see that there are so many invalid alliterations based on the letter "s", they are mostly related to the rule "The initial sound of the words must not be different, even if the first letter is the same", as the first number of the cologne phonetics encoding does not change if the letter "s" is in front of "p", "t" or "ch". One solution could be the consideration of multiple numbers of the cologne phonetics encoding, not just the first one. 85 5. Evaluation Alliteration No Alliteration Categories 0 20 40 60 80 100 120 A m o u n t Amount of alliterations (both) Figure 5.33: Amount of detected and undetected alliterations in the political statement dataset (both) { 'd ', ' 2 '} { '3 ', ' w '} { 's ', ' 8 '} { '6 ', ' m '} { '3 ', ' f' } { '3 ', ' v '} { 'l ', ' 5 '} { 's ', ' 6 ', ' 8 '} { 'a ', ' 2 '} { 'n ', ' 6 '} Letter combination 0 10 20 30 40 50 60 70 A m o u n t Most common combinations (both) Alliteration No Alliteration Figure 5.34: Most common sets of numbers (both) 86 5.4. Hyperbole Detection machine all machine 400 manually 400 LR 0.689092 0.6350 0.6700 KNN 0.644471 0.5875 0.6375 NB 0.667395 0.6425 0.6400 DT 0.600736 0.6300 0.5975 SVM 0.686295 0.6675 0.6875 LDA 0.689097 0.6375 0.6675 RF 0.668928 0.6675 0.6525 Table 5.3: Mean accuracy which was achieved with 10-fold cross validation 5.4 Hyperbole Detection This section presents the results that were gathered while conduction the experiments regarding hyperbole detection, which are described in the experiments section above. Two different kinds of datasets were used to evaluate the approach. The first kind of dataset was a translated version of the HYPO-L dataset. The HYPO-L dataset was created by Zhang et al. [ZW21]. The whole dataset was translated using the Google Translate function provided in Google Sheets. 400 sentences of the HYPO-L dataset were manually translated as well. The experiments were conducted using three different datasets: The whole machine translated HYPO-L dataset, the subset with the 400 manually translated sentences and the same subset but with the respective machine translation, so that the performance can be compared. The second kind of dataset contained political statements which were extracted from a protocol of the Austrian national council. 5.4.1 Translated HYPO-L dataset This subsection describes the results that were achieved when using the whole machine translated HYPO-L dataset and the two subsets (machine translated and manually translated). Table 5.3 shows the mean accuracy that was achieved by the different algorithms on the three different datasets when applying 10-fold cross validation. It is interesting to see that Support Vector Machine (SVM) had the best performance in two cases and was also quite close to the best performance in the third case. Generally all of the algorithms had a mean accuracy above 0.58 which is slightly better than random. The best mean accuracy that was achieved by Troiano et al. [TSÖT18] is 0.72, so the results are not as good but still in a 3% range in the best case (0.689 LDA). 87 5. Evaluation machine all machine 400 manually 400 LR 0.522323 0.632866 0.659049 KNN 0.395776 0.582831 0.626145 NB 0.464290 0.657486 0.621646 DT 0.366243 0.635871 0.603195 SVM 0.496337 0.692868 0.672423 LDA 0.520421 0.641219 0.651457 RF 0.443829 0.664046 0.657090 Table 5.4: Mean precision which was achieved with 10-fold cross validation machine all machine 400 manually 400 LR 0.099307 0.635 0.725 KNN 0.263139 0.630 0.670 NB 0.370327 0.595 0.715 DT 0.385228 0.610 0.610 SVM 0.087416 0.605 0.740 LDA 0.123168 0.615 0.740 RF 0.242198 0.685 0.655 Table 5.5: Mean recall which was achieved with 10-fold cross validation The data in Table 5.4 shows the mean precision that was achieved when applying 10-fold cross validation. One can see that the performance on the full machine translated HYPO-L dataset was rather low with 0.52 being the best value. The mean precision was way higher in the other two cases, with 0.69 and 0.67 as the highest values, but this could also be due to the smaller size of the dataset. It is interesting to see, that SVM, which had the highest mean precision, performs better on the machine translated subset then on the manually translated subset. The best mean precision that was achieved by Troiano et al. was 0.76, so in that case the results are not comparable. The next table, Table 5.5, shows the mean recall that was achieved when applying 10-fold cross validation. The performance on the full machine translated HYPO-L dataset was quite bad, with 0.385 being the highest mean recall. SVM and LDA both had a quite good mean recall on the manually translated subset though (0.74). When comparing the results to the best mean recall that was achieved by Troiano et al. (0.76) they are only comparable when considering the performance when using the manually translated dataset, in the two other cases, especially when using the full machine translated dataset, they are not comparable. Table 5.6 shows the mean f1-score that was achieved when applying 10-fold cross validation. The performance on the full machine translated dataset is low with 0.41 being the highest mean f1-score. When using the two subsets the highest mean f1-score was moderately high, with 0.67 and 0.70 being the best values. The best mean f1-score that was achieved by Troiano et al. was 0.76 - when looking at the performance on the full machine 88 5.4. Hyperbole Detection machine all machine 400 manually 400 LR 0.166267 0.632089 0.688706 KNN 0.315328 0.603822 0.645403 NB 0.411136 0.621979 0.661793 DT 0.375204 0.619045 0.604268 SVM 0.147001 0.643456 0.703524 LDA 0.198310 0.625849 0.689583 RF 0.312876 0.673094 0.653529 Table 5.6: Mean F1-score which was achieved with 10-fold cross validation translated dataset the performance is not comparable. 5.4.2 Political Statements Dataset This subsection describes the results that were achieved when using the political state- ments dataset. Random Forest (RF) was used as an algorithm. The model got trained on the whole machine translated HYPO-L dataset using the semantic features described above. The feature engineered version of the political statements dataset was then used as a test dataset. The approach was to manually annotate all the statements that were classified as being hyperbolic, so that the precision could be calculated. Figure 5.35 shows the amount of statements that were classified as either literal or hyperbolic when using the model that was trained on the full machine translated HYPO- L dataset. 98 Statements were classified as hyperbolic, 1229 were classified as literal. The hyperbolic statements were manually evaluated so that the precision of the algorithm could be calculated. The plot in Figure 5.36 shows the amount of hyperboles that were found based on the manual evaluation. As only one person was evaluating the statements the result contains a bias. 22 Statements were classified as being hyperbolic, the other 76 statements were classified as literal. This leads to a precision of 22%. The following paragraphs will show a few interesting examples for hyperboles and literal statements that were classified as being hyperbolic. The following list will show a few examples that are hyperboles according to the annotator, the reasoning behind the decisions will be explained below. 1. Da hat anscheinend alleine das Einbringen der Petition beim Gesundheitsminister Wunder bewirkt. 2. Wer will, dass die Welt so bleibt, wie sie jetzt ist, der will nicht, dass sie bleibt. 89 5. Evaluation literal hyperbolic Categories 0 200 400 600 800 1000 1200 A m o u n t Amount of hyperboles based on classification (political dataset) Figure 5.35: Amount of hyperboles based on the classification (political dataset) literal hyperbolic Categories 0 10 20 30 40 50 60 70 A m o u n t Amount of hyperboles after annotation (political dataset) Figure 5.36: Amount of hyperboles based on the manual evaluation (political dataset) 90 5.4. Hyperbole Detection 3. Zu Tode gefürchtet ist auch gestorben! 4. Den Menschen habt ihr die Zuversicht genommen, die sind alle depressiv, und schuld daran sind unter anderem diese Masken, eure Verordnungen, bei denen sich keiner auskennt. 5. Die Menschen laufen alle nur mehr apathisch herum. 6. Statt Mut impft ihr den Leuten Angst ein. 7. Wer heute einmal keine Maske trägt dafür kann er eine medizinische Begründung haben oder nicht , ist sowieso schon böse. 8. Das ist eine schreiende Ungerechtigkeit. Statement 1 claims that handing in a petition worked wonders, with is interpreted as being hyperbolic as it did not really cause any wonders. It is an idiom that might have been used to put more weight on the act of handing in a petition. Statement 2 seems to be a bit of an exaggeration as it expresses the thought that people who want the world to stay the way it is now would not mind if it would not exist at all anymore. Statement 3 is a quote by Johann Nestroy that was used in one of the speeches. The literal meaning is that being feared to death is the same as being dead, the interpretation in the context of this work is that having a lot of fear in the daily life could be seen as being dead as well. Using this interpretation the statements could be seen as hyperbolic. Statement 4 claims that the people which are being talked to by the speaker took all the hope from the people, potentially by implementing some new guidelines like the lockdown during the COVID-19 pandemic, and that all the people are depressive. This is interpreted as an exaggeration as there are no statistics or studies claiming that 100% of people in Austria were depressive. Statement 5 is similar to statement 4 as it claims that all people are running around apathetic, which is interpreted as an exaggeration as well. Statement 6 claims that someone is injecting fear into people’s minds. The word "impfen" is most commonly used as a medical term (vaccinate), therefore this is interpreted as being metaphorical and exaggerated. The speaker who included statement 7 in their speech claims that there are people who think that if someone who does not wear a surgical mask, most likely during the COVID-19 pandemic, is automatically evil, does not matter whether they have medical reasons or not. The part "ist sowieso schon böse" is interpreted as a slight exaggeration, 91 5. Evaluation as it might convey the idea that this is the general opinion, which is not only expressed by some people but actually the majority of people. Statement 8 is interpreted as an exaggeration as "schreiende Ungerechtigkeit" stands for "screaming injustice", which puts stronger emphasis on the injustice. 92 CHAPTER 6 Conclusion During the experiments that were conducted in the scope of this master thesis multiple approaches for analyzing political statements were explored, one being relevant to the task of topic classification, one that focuses on opinion type classification and two different types of figure of speech detection, the detection of alliterations and hyperboles. All of these approaches were implemented for the German language. The insights that were gathered as a result of the experiments allow the following research questions to be answered: RQ1: How high is the precision of a rule-based approach for topic classification in political statements using corpora extracted from Wikipedia? The results of the experiments using three different topics (climate change, feminism, European migrant crisis) show that the precision can vary in a strong way between the different topics. In the case of climate change the approach showed promising results with a precision of 89,03%, in the case of feminism the precision was 36,39% and in the case of the European migrant crisis the precision was 19,04%. Following the experiments a data analysis was conducted which showed some of the underlying problems. RQ2: How does an approach using a German tag set and a tagger for the German language, instead of an English tag set and a tagger for the English language, for opinion type classification in German sentences, hold up against the approach developed by Othman et al. regarding opinion type classification in English sentences? This question will be evaluated using precision This question was evaluated based on the achieved precision by Othman et al. [OHMI15] regarding the classification of the three opinion types they used (opinion, comparative opinion, superlative opinion). In the case of the standard opinion the precision was comparable, 76,6% precision were achieved by Othman et al. and 93 6. Conclusion 71% precision were achieved by this implementation. In the case of the comparative and superlative opinion the results were not comparable. Othman et al. achieved 78,3% precision regarding comparative opinions, this implementation achieved only 44% precision. Superlative opinions were classified with 82,1% precision by Othman et al. while this implementation only achieved a precision of 44%. The two additional approaches that were implemented and the data analysis provided interesting insights regarding the improvement of this implementation. RQ3: The Cologne phonetics algorithm can be used to transform a word into a numerical representation depicting the underlying phonetics. Which additional constraints need to be added so that the numerical representation resulting from the Cologne phonetics algorithm is feasible for alliteration detection with 95 percent precision? The algorithm which was implemented for detecting alliterations achieved a very high precision (99,33%) on the alliteration dataset by Ulrich Mehner [meh] which contains 605 alliterations. The successful setup combined a phonetic approach using the Cologne Phonetics algorithm with a simple approach which is based on the words starting with the same letter. In addition the algorithm was written in a way so that it is flexible regarding gaps between words and the length of an alliteration. These constraints allow the detection of alliterations like "Brot und Butter" and "Der frühe Vogel gibt frohe Töne von sich". Additional experiments were performed on the political statements dataset, the three different setups that were tested and manually evaluated achieved lower precision (65%, 30% and 66,5%) and provided interesting insights into additional constraints that can be added and tested in future work. RQ4: Troiano et al. developed an approach for classifying English sentences as either hyperbolic or not. Is this approach applicable to the German language as well? This question will be evaluated using accuracy, precision, recall and the F1-Score. To answer this question, the same approach that Troiano et al. [TSÖT18] im- plemented for the English language was implemented for the German language. Precision, recall, accuracy and F1-Score were calculated based on the achieved results and compared to the best results achieved by Troiano et al. [TSÖT18] across their experiments. The accuracy is comparable, the best results are between 66,75% and 68,90%, while the best result by Troiano et al. is 72%. In the three other cases the best results are not comparable, especially when using the large machine translated version of the HYPO-L dataset (the original dataset was published by Zhang et al. [ZW21]), in this case the best results for precision (52,23%), recall (38,52%) and F1-Score (41,11%) were far below the best results that were achieved by Troiano et al. [TSÖT18] (precision = 76%, recall = 76%, F1-Score = 76%). Future Work As a result of the data analysis that was performed during the evaluation of the experiments multiple interesting factors were found that could be made use of in 94 future work. Regarding topic classification it would be very interesting to see the effect that a large language model like GPT-4 [AAA+23] would have when being used instead of the method that was implemented for extracting topic-related words from a Wikipedia article. The general approach with the inverted index could still be used but the topic-related words could be generated using a tool like the ChatGPT API. A tiny example which is displayed in Figure 6.1 shows the potential of this approach, this could lead to a lot of time being saved when searching for political statements regarding a certain topic. Figure 6.1: A simple example for a prompt that could be used to generate topic-related words using ChatGPT The evaluation of the experiments that were conducted regarding opinion type classi- fication showed that it might make sense to create a separate list of comparative and superlative forms that are not used for expressing an opinion in many cases, for example the superlative form of "nah" (am nächsten), which was used to speak about a future sitting in most of the political statements that were analyzed. This dataset in combina- tion with a rule-based approach which covers the grammatical rules for comparatives and superlatives and the part-of-speech based approach could lead to a better performance. 95 6. Conclusion In the case of alliteration detection a few insights were gained during the data analysis of the results that were achieved on free text. One important improvement could be the implementation of a few additional rules like a restriction for the amount of stop words per alliteration and the exclusion of alliterations that contain the same word multiple times. These simple rules could already increase the precision on free text but it would still not be enough to solve the problem related to the phonetic aspect of an alliteration. In this regard it would be interesting to implement a simpler form of the Cologne phonetics algorithm by Postel [Pos69] which focuses on alliterations in the German language. The results of the hyperbole detection approach could be improved by creating a specific hyperbole dataset for the German language. In addition it might be interesting to distinguish between hyperboles that were spoken, for example in public debates, and hyperboles that were extracted from literature, as they might have different characteris- tics. ChatGPT could be useful in the regard of the dataset creation as well, as it could be used to generate hyperboles which could later be crowd evaluated as either being a realistic hyperbole example or not. 96 List of Figures 2.1 Transformer architecture by Vaswani et al. [VSP+17] . . . . . . . . . . . . 15 2.2 Scaled Dot-Product Attention and Multi-Head Attention by Vaswani et al. [VSP+17] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 A simple POS-tagging example in German using the STTS tag set [WSJB17] 23 2.4 Different approaches for sentiment analysis, taken from Wankhade et al. [WRK22] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.5 Overall pre-training and fine-tuning procedures for BERT by Devlin et al. [DCLT18] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.6 Example of an aspect-sentiment hierarchy as described by Afzaal et al. [AUFF19] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.1 Activity diagram visualizing the process that was implemented to extract the technical terms from Wikipedia . . . . . . . . . . . . . . . . . . . . . . . . 42 5.1 Value counts of relevant and irrelevant speeches (Topic = European Migrant Crisis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.2 Amount of texts containing a specific word (Top 10) (Topic = European Migrant Crisis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3 Amount of relevant texts containing a specific word (Top 10) (Topic = Euro- pean Migrant Crisis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.4 Amount of irrelevant texts containing a specific word (Top 10) (Topic = European Migrant Crisis) . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.5 Value counts of relevant and irrelevant speeches (Topic = Climate Change) 65 5.6 Amount of texts containing a specific word (Top 10) (Topic = Climate Change) 65 5.7 Amount of relevant texts containing a specific word (Top 10) (Topic = Climate Change) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.8 Amount of irrelevant texts containing a specific word (Top 10) (Topic = Climate Change) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.9 Value counts of relevant and irrelevant speeches (Topic = Feminism) . . . 68 5.10 Amount of texts containing a specific word (Top 10) (Topic = Feminism) 68 5.11 Amount of relevant texts containing a specific word (Top 10) (Topic = Femi- nism) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.12 Amount of irrelevant texts containing a specific word (Top 10) (Topic = Feminism) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 97 5.13 Comparison of the results of the two POS-Tag based approaches . . . . . 70 5.14 Opinionated and Non-Opinionated Statements (POS-Tagging) . . . . . . 71 5.15 Comparative Opinionated and Non-Opinionated Statements (POS-Tagging) 72 5.16 Superlative Opinionated and Non-Opinionated Statements (POS-Tagging) 72 5.17 Results of the data-driven approach (positives) . . . . . . . . . . . . . . . 74 5.18 Results of the data-driven approach (comparative) . . . . . . . . . . . . . 75 5.19 Results of the data-driven approach (superlative) . . . . . . . . . . . . . . 75 5.20 Opinionated and Non-Opinionated Statements (Data-Driven) . . . . . . . 76 5.21 Comparative Opinionated and Non-Opinionated Statements (Data-Driven) 76 5.22 Superlative Opinionated and Non-Opinionated Statements (Data-Driven) 77 5.23 Results of the rule-based approach (comparative) . . . . . . . . . . . . . . 78 5.24 Results of the rule-based approach (superlative) . . . . . . . . . . . . . . . 78 5.25 Comparative Opinionated and Non-Opinionated Statements (Rule-Based) 79 5.26 Superlative Opinionated and Non-Opinionated Statements (Rule-Based) . 79 5.27 Amount of detected and undetected alliterations in the alliteration dataset 80 5.28 Amount of each of the sets detected by the algorithm in the political statements dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.29 Amount of detected and undetected alliterations in the political statement dataset (standard) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.30 Most common sets of letters (standard) . . . . . . . . . . . . . . . . . . . 83 5.31 Amount of detected and undetected alliterations in the political statement dataset (cologne) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.32 Most common sets of numbers (cologne) . . . . . . . . . . . . . . . . . . . 85 5.33 Amount of detected and undetected alliterations in the political statement dataset (both) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.34 Most common sets of numbers (both) . . . . . . . . . . . . . . . . . . . . 86 5.35 Amount of hyperboles based on the classification (political dataset) . . . . 90 5.36 Amount of hyperboles based on the manual evaluation (political dataset) 90 6.1 A simple example for a prompt that could be used to generate topic-related words using ChatGPT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 98 List of Tables 2.1 Conversion Table from the Soundex algorithm by Russell and Odell [Rob18] 20 2.2 The transformation table of the cologne phonetics algorithm by Postel [Pos69] 21 2.3 POS-Tags from the STTS tagset [WSJB17] with explanation, related to the example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 Two examples from the German adjectives dataset . . . . . . . . . . . . . 37 4.1 The four different sentimental categories defined by Othman et al. [OHMI15] 44 4.2 Used POS-Tags by Othman et al. [OHMI15] with description . . . . . . . 44 4.3 Used POS-Tags in this experiment . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Description of the used POS-Tags in this experiment . . . . . . . . . . . . 45 4.5 Rules which are used in the rule-based approach . . . . . . . . . . . . . . 46 4.6 The transformation table of the cologne phonetics algorithm by Postel [Pos69] 53 4.7 Different setups of the algorithm which were combined in the best attempt on the alliteration dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.8 Different setups of the algorithm which were combined to search for allitera- tions in the dataset of political statements . . . . . . . . . . . . . . . . . 55 5.1 Results of the approaches that were manually evaluated . . . . . . . . . . 71 5.2 Sizes of the sets which are used in the data-driven approach . . . . . . . . 73 5.3 Mean accuracy which was achieved with 10-fold cross validation . . . . . 87 5.4 Mean precision which was achieved with 10-fold cross validation . . . . . 88 5.5 Mean recall which was achieved with 10-fold cross validation . . . . . . . 88 5.6 Mean F1-score which was achieved with 10-fold cross validation . . . . . . 89 99 List of Algorithms 4.1 Creation of a simple inverted index R . . . . . . . . . . . . . . . . . . . 40 4.2 Returns a list of sublists, each sublist has size s . . . . . . . . . . . . . . 49 4.3 Returns preprocessed words from sentence . . . . . . . . . . . . . . . . 50 4.4 Returns true if a sentence contains an alliteration, false otherwise . . . 51 4.5 Returns filtered list of sublists . . . . . . . . . . . . . . . . . . . . . . . 52 4.6 Returns true if one sublist has an alliteration with step st, false otherwise 54 101 Bibliography [A+18] Charu C Aggarwal et al. Neural networks and deep learning. Springer, 10(978):3, 2018. [AAA+23] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Alt- man, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. [AUFF19] Muhammad Afzaal, Muhammad Usman, Alvis CM Fong, and Simon Fong. Multiaspect-based opinion classification model for tourist reviews. Expert Systems, 36(2):e12371, 2019. [aus] Austria national council: Protocols. https://www.parlament.gv.at/ recherchieren/protokolle/. Accessed: 2023-10-05. [BBFZ09] Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 43:209–226, 2009. [BDH+02] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. The tiger treebank. In Proceedings of the workshop on treebanks and linguistic theories, volume 168, pages 24–41, 2002. [ber] Jacob devlin: Bert github repository. https://github.com/ google-research/bert. Accessed: 2024-02-21. [Ber03] Robert Berwick. An idiot’s guide to support vector machines (svms). Retrieved on October, 21:2011, 2003. [blo] Google blog article: Understanding searches better than ever before. https://blog.google/products/search/ search-language-understanding-bert/. Accessed: 2024-02- 21. [Bre96] Leo Breiman. Bagging predictors. Machine learning, 24:123–140, 1996. [Bre01] Leo Breiman. Random forests. Machine learning, 45:5–32, 2001. 103 [cle] Checkthat! lab at clef 2023. https://checkthat.gitlab.io/ clef2023/task2/. Accessed: 2023-10-30. [CSQH17] Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. Adversar- ial multi-criteria learning for chinese word segmentation. arXiv preprint arXiv:1704.07556, 2017. [CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995. [CWB+11] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12(ARTICLE):2493–2537, 2011. [DBB16] Derick F Davis, Rajesh Bagchi, and Lauren G Block. Alliteration alters: Phonetic overlap in promotional messages influences evaluations and choice. Journal of Retailing, 92:1–12, 2016. [DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [deWa] Website that describes the dewac corpus. https://www.sketchengine. eu/dewac-german-corpus/. Accessed: 2023-03-23. [deWb] Website that provides the frequency list. https://wacky.sslmit. unibo.it/doku.php?id=frequency_lists. Accessed: 2023-03-23. [emb] Embeddings described on deepset.ai. https://www.deepset.ai/ german-word-embeddings. Accessed: 2023-10-27. [FE13] Gertrud Faaß and Kerstin Eckart. Sdewac–a corpus of parsable sentences from the web. In Language Processing and Knowledge in the Web: 25th International Conference, GSCL 2013, Darmstadt, Germany, September 25-27, 2013. Proceedings, pages 61–68. Springer, 2013. [Fil20] Peter Filzmoser. Advanced methods for regression and classification - lecture notes, 2020. [FND19] Alexa K Fox, Chinintorn Nakhata, and George D Deitz. Eat, drink, and create content: a multi-method exploration of visual social media marketing content. International Journal of Advertising, 38:450–470, 2019. [GGT15] Lorenzo Gatti, Marco Guerini, and Marco Turchi. Sentiwords: Deriving a high precision and high coverage lexicon for sentiment analysis. IEEE Transactions on Affective Computing, 7(4):409–421, 2015. 104 [gita] Gitlab: Glove embeddings. https://gitlab.com/deepset-ai/ open-source/glove-embeddings-de. Accessed: 2023-10-27. [gitb] Gitlab: Word2vec embeddings. https://gitlab.com/deepset-ai/ open-source/word2vec-embeddings-de. Accessed: 2023-10-27. [HDY+12] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mo- hamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal process- ing magazine, 29(6):82–97, 2012. [HG14] Clayton Hutto and Eric Gilbert. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media, volume 8, pages 216–225, 2014. [HGC21] Pengcheng He, Jianfeng Gao, and Weizhu Chen. Debertav3: Improving de- berta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543, 2021. [HTFF09] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009. [hug] Huggingface: mdebertav3-subjectivity-german model. https: //huggingface.co/GroNLP/mdebertav3-subjectivity-german. Accessed: 2023-10-30. [Hun07] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. [IL+65] Aleksĕı Grigorevich Ivakhnenko, Valentin Grigorévich Lapa, et al. Cybernetic predicting devices. (No Title), 1965. [JM] Daniel Jurafsky and James H Martin. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. [JR99] G. Booch J. Rumbaugh, I. Jacobson. The Unified Modeling Language Reference Manual. Addison-Wesley, 1999. [kil] Markus killer: textblob-de. https://github.com/markuskiller/ textblob-de. Accessed: 2023-10-27. [KIW16] Maximilian Köper and Sabine Schulte Im Walde. Automatically generated affective norms of abstractness, arousal, imageability and valence for 350 000 german lemmas. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 2595–2598, 2016. 105 [KRKP+16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and Jupyter development team. Jupyter notebooks - a publishing format for reproducible computational workflows. In Fernando Loizides and Birgit Scmidt, editors, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90, Netherlands, 2016. IOS Press. [KRPA10] Adam Kilgarriff, Siva Reddy, Jan Pomikálek, and PVS Avinesh. A corpus factory for many languages. In LREC, 2010. [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet clas- sification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012. [KT04] Christopher D. Manning Kristina Toutanova. Stanford pos-tagger descrip- tion. https://nlp.stanford.edu/software/tagger.html, 2004. Accessed: 2023-01-07. [L+11] Bing Liu et al. Web data mining: exploring hyperlinks, contents, and usage data, volume 1. Springer, 2011. [LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015. [LJ98] Yong H Li and Anil K Jain. Classification of text documents. The Computer Journal, 41(8):537–546, 1998. [LKH+14] Steven Loria, Pete Keen, Matthew Honnibal, Roman Yankovsky, David Karesh, Evan Dempsey, et al. Textblob: simplified text processing. Secondary TextBlob: simplified text processing, 3:2014, 2014. [LLG+19] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrah- man Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019. [LMS+19] Xiaoya Li, Yuxian Meng, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li. Is word segmentation necessary for deep learning of Chinese representations? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3242–3252, Florence, Italy, July 2019. Association for Computational Linguistics. [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estima- tion of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 106 [meh] Ulrich mehner: Collection of alliterations. https://www.mehner.info/ html/alliteration.html. Accessed: 2023-10-01. [Mer14] Dirk Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux journal, 2014(239):2, 2014. [Mic88] Jörg Michael. Nicht wörtlich genommen–schreibweisentolerante suchroutinen in dbase implementiert. c’t Magazin für Computer und Technik, 10:126–131, 1988. [MP43] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943. [MP98] Elaine Marsh and Dennis Perzanowski. Muc-7 evaluation of ie technology: Overview of results. In Seventh Message Understanding Conference (MUC- 7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998, 1998. [MSM93] Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. 1993. Computational linguistics, 1993. [nat] 58. protocol of the austrian national council. https://www.parlament. gv.at/dokument/XXVII/NRSITZ/58/fnameorig_878722.html. Accessed: 2024-01-24. [NKK+18] Hiroki Nakayama, Takahiro Kubo, Junya Kamura, Yasufumi Taniguchi, and Xu Liang. doccano: Text annotation tool for human, 2018. Software available from https://github.com/doccano/doccano. [nou] Nouvertne: Cologne phonetics implementation. https://pypi.org/ project/cologne-phonetics/. Accessed: 2023-09-21. [num] pypi: num2words library. https://pypi.org/project/num2words/. Accessed: 2023-09-26. [OEC] OECD trust survey. https://www.oecd.org/governance/ trust-in-government/. Accessed: 2023-12-08. [OHMI15] Mahmoud Othman, Hesham Hassan, Ramadan Moawad, and Amira M Idrees. Using nlp approach for opinion types classifier. 2015. [Pen] Penn treebank tagset. https://www.ling.upenn.edu/courses/ Fall_2003/ling001/penn_treebank_pos.html. Accessed: 2023-01- 07. 107 [Pos69] Hans Joachim Postel. Die kölner phonetik. ein verfahren zur identifizierung von personennamen auf der grundlage der gestaltanalyse. IBM-Nachrichten, 19:925–931, 1969. [PSM14] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532– 1543, 2014. [PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825– 2830, 2011. [RHW86] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986. [Ris95] Eric Sven Ristad. A natural law of succession. arXiv preprint cmp- lg/9508012, 1995. [Rob18] C Russell Robert. The soundex coding system. Patent No. US1261167, 1918. [Ros58] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958. [RQH10] Robert Remus, Uwe Quasthoff, and Gerhard Heyer. Sentiws-a publicly available german-language resource for sentiment analysis. In LREC, 2010. [SHP23] Nina Schneidermann, Daniel Hershcovich, and Bolette Pedersen. Probing for hyperbole in pre-trained language models. In Vishakh Padmakumar, Gisela Vallejo, and Yao Fu, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 200–211, Toronto, Canada, July 2023. Association for Computational Linguistics. [SL08] Helmut Schmid and Florian Laws. Estimation of conditional probabilities with decision trees and an application to fine-grained pos tagging. In Pro- ceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 777–784, 2008. [SMT09] Carolin Strobl, James Malley, and Gerhard Tutz. An introduction to recur- sive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological methods, 14(4):323, 2009. 108 [Soua] Oracle: Soundex algorithm. https://docs.oracle.com/cd/B19306_ 01/server.102/b14200/functions148.htm. Accessed: 2023-09-21. [Soub] Postgres: Soundex algorithm. https://www.postgresql.org/docs/ 9.1/fuzzystrmatch.html. Accessed: 2023-09-21. [STT95] Anne Schiller, Simone Teufel, and Christine Thielen. Guidelines f ur das tagging deutscher textcorpora mit stts. Universität Stuttgart, Universität Tübingen, Germany, 1995. [Stu17] Mary E Stuckey. American elections and the rhetoric of political change: Hyperbole, anger, and hope in us politics, 2017. [Swa76] Marc J Swartz. Hyperbole, politics, and potent specification: the political uses of a figure of speech. Language and Politics, pages 100–116, 1976. [TBG+14] Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 248–258, 2014. [TKMS03] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 252–259, 2003. [TkSP21] Yufei Tian, Arvind krishna Sridhar, and Nanyun Peng. Hypogen: Hyperbole generation with commonsense and counterfactual knowledge, 2021. [TLPG19] Karsten Tymann, Matthias Lutz, Patrick Palsbröker, and Carsten Gips. Gervader-a german adaptation of the vader sentiment analysis tool for social media texts. In LWDA, pages 178–189, 2019. [TM00] Kristina Toutanvoa and Christopher D Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora, pages 63–70, 2000. [TSÖT18] Enrica Troiano, Carlo Strapparava, Gözde Özbal, and Serra Sinem Tekiroğlu. A computational exploration of exaggeration. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3296–3304, 2018. [TS18] Enrica Troiano, Carlo Strapparava, Gözde Özbal, and Serra Sinem Tekiroğlu. A computational exploration of exaggeration. pages 3296–3304, 2018. [Vap82] V Vapnik. Estimation of dependences based on empirical data berlin, 1982. 109 [Vor21] Vorakit Vorakitphan. Fine grained classification of polarized and propagan- dist text in news articles and political debates. PhD thesis, Université Côte d’Azur, 2021. [VRD09] Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. Cre- ateSpace, Scotts Valley, CA, 2009. [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [Wic19] Hadley Wickham. Dataset: Diamonds from hadley’s ggplot2, October 2019. [wika] German wikipedia article about climate change. https://de.wikipedia. org/wiki/Klimawandel. Accessed: 2023-02-05. [wikb] German wikipedia article about feminism. https://de.wikipedia. org/wiki/Feminismus. Accessed: 2023-02-05. [wikc] German wikipedia article about the european migrant crisis 2015/2016. https://de.wikipedia.org/wiki/FlÃijchtlingskrise_in_ Europa_2015/2016. Accessed: 2023-02-05. [wikd] German wikipedia article about the wikipedia api. https://de. wikipedia.org/wiki/Wikipedia:Technik/Datenbank/API. Ac- cessed: 2023-04-14. [wike] Wikipedia: Cologne phonetics. https://de.wikipedia.org/wiki/ KÃűlner_Phonetik. Accessed: 2023-09-21. [Wikf] Wiktionary: Collection of german adjectives. https://de.wiktionary. org/wiki/Kategorie:Adjektiv_(Deutsch). Accessed: 2023-01-07. [Wil05] Martin Wilz. Aspekte der kodierung phonetischer ähnlichkeiten in deutschen eigennamen. Master’s thesis, Universität zu Köln, Köln, 2005. [WRK22] Mayur Wankhade, Annavarapu Chandra Sekhara Rao, and Chaitanya Kulkarni. A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review, 55(7):5731–5780, 2022. [WSJB17] Swantje Westpfahl, Thomas Schmidt, Jasmin Jonietz, and Anton Borling- haus. Stts 2.0. guidelines für die annotation von pos-tags für transkripte gesprochener sprache in anlehnung an das stuttgart tübingen tagset (stts). 2017. [WW16] Hadley Wickham and Hadley Wickham. Data analysis. Springer, 2016. [YKD08] Bei Yu, Stefan Kaufmann, and Daniel Diermeier. Exploring the characteris- tics of opinion expressions for political opinion classification. 2008. 110 [Zar21] Stefan Zaruba. Using Natural Language Processing to Measure the Consis- tency of Opinions Expressed by Politicians. PhD thesis, Wien, 2021. [ZM06] Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM computing surveys (CSUR), 38(2):6–es, 2006. [ZW21] Yunxiang Zhang and Xiaojun Wan. Mover: Mask, over-generate and rank for hyperbole generation. arXiv preprint arXiv:2109.07726, 2021. 111