Using Natural Language Processing to Measure the Consistency of Opinions Expressed by Politicians

DIPLOMARBEIT
zur Erlangung des akademischen Grades Diplom-Ingenieur
im Rahmen des Studiums Logic and Computation
eingereicht von Stefan Zaruba, BSc, Matrikelnummer 01325853
an der Fakultät für Informatik der Technischen Universität Wien
Betreuung: Ao.Univ.Prof. Mag. Dr. Horst Eidenberger
Wien, 14. Oktober 2021
Stefan Zaruba    Horst Eidenberger

Using Natural Language Processing to Measure the Consistency of Opinions Expressed by Politicians

DIPLOMA THESIS
submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur
in Logic and Computation
by Stefan Zaruba, BSc, Registration Number 01325853
to the Faculty of Informatics at the TU Wien
Advisor: Ao.Univ.Prof. Mag. Dr. Horst Eidenberger
Vienna, 14th October, 2021
Stefan Zaruba    Horst Eidenberger

Erklärung zur Verfassung der Arbeit

Stefan Zaruba, BSc

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

Wien, 14. Oktober 2021
Stefan Zaruba

Acknowledgements

I want to thank my advisor, Professor Eidenberger, for being reliable, professional, and always quick to respond. I also want to thank my parents for supporting me during my studies.

Kurzfassung

In dieser experimentellen Studie wird eine Lösung implementiert, um Meinungen mithilfe von Techniken des überwachten maschinellen Lernens aus geschriebenem Text zu extrahieren und in weiterer Folge deren Konsistenz über die Zeit zu visualisieren. Wir prüfen sowohl die praktische Umsetzbarkeit als auch die Nützlichkeit des implementierten Ansatzes. Wir haben die vom österreichischen Parlament zur Verfügung gestellten Redeprotokolle gesammelt, um zwei Datensätze zu Themen bezüglich Maßnahmen gegen die Verbreitung des Coronavirus zu erzeugen. Um die Einträge für die Datensätze zu gewinnen, haben wir den Rohtext anhand der Satzgrenzen aufgeteilt und relevante Sätze mithilfe einer Schlüsselwortsuche identifiziert. Danach haben wir den Einträgen Meinungslabels per Hand zugewiesen. Anschließend haben wir zwei statistische Ansätze und drei Deep-Learning-Modelle verwendet, um die zuvor zugewiesenen Labels mithilfe von maschinellem Lernen zu bestimmen. Wir haben den Vorgang mehrmals wiederholt, um die erzielten Leistungen mithilfe einer Monte-Carlo-Kreuzvalidierung zu bewerten. Dann haben wir die vorhergesagten Labels des leistungsstärksten Modells verwendet, um die allgemeine Meinung sowie die Konsistenz von Meinungen über die Zeit grafisch darzustellen. Am größeren Datensatz (etwa 5000 Einträge) erzielte ein BERT-Netzwerk die beste Genauigkeit (70%), gefolgt von einem LSTM-Netzwerk (68%), einem MNB-Klassifikator (67%), einem Bag-of-Words-Netzwerk (62%) und einem BM25-Algorithmus aus dem Information Retrieval (42%). Auf dem kleineren Datensatz (etwa 500 Einträge) schnitt ebenfalls BERT am besten ab (56%), gefolgt vom MNB (53%), dem LSTM (51%), dem BM25-Ansatz (47%) und dem Bag-of-Words-Netzwerk (42%).
Die größten Hürden hinsichtlich der praktischen Umsetzbarkeit waren der manuelle Label-Aufwand, sowie die Herausforderung ein Thema mit einer ausreichenden Anzahl an Meinungsäußerungen zu finden. Daraus schließen wir, dass der umgesetzte Ansatz am besten geeignet ist, wenn geplant ist, ihn über einen längeren Zeitraum und für eine beschränkte Anzahl an Themen einzusetzen. Die Nützlichkeit der vorhergesagten Meinungskonsistenz ist von der Genauigkeit des zugrundeliegenden maschinellen Modells abhängig. Durch den Vergleich der tatsächlichen Graphen mit den vorhergesagten, befanden wir eine Modellgenauigkeit von 70% als ausreichend, um die allgemeinen Meinung zu einem Thema repräsentativ darzustellen. Andererseits erfordert eine nützliche Darstellung der Meinungskonsistenz eine höhere Modellgenauigkeit. ix Abstract This experimental study implements a solution for extracting opinions from written text with the help of supervised machine learning methods to visualize their consistency over time. We examine the practical feasibility and the usefulness of the implemented approach. We gathered speech transcripts of the Austrian Parliament to create two datasets on topics concerning measures against the spread of the Coronavirus. We split the raw text around sentence boundaries into dataset records and used a keyword search to select relevant sentences. Then, we manually assigned opinion labels and used two statistical machine learning algorithms and three deep learning models to predict the labels. We used Monte Carlo cross-validation to evaluate classification performance. Subsequently, we used the predictions of the best-performing algorithm to plot the general sentiments toward the topic and the consistencies of expressed opinions over time. On the larger dataset (around 5000 records), a BERT network achieved the best accuracy (70%), followed by an LSTM network (68%), an MNB classifier (67%), a Bag-of-Words network (62%), and a BM25 document ranking classifier (42%). On the smaller dataset (around 500 records), BERT also performed best (56%), followed by the MNB (53%), the LSTM (51%), the BM25 approach (47%), and the Bag-of-Words network (42%). The biggest challenge to practical feasibility was the manual annotation effort and choosing a topic for which enough training samples are available. Thus, the approach is best suited if the intention is to monitor a small selection of topics over a long period. We showed that the usefulness of the predicted opinion consistency values depends on the accuracy of the underlying opinion predictions. By comparing the graphs from actual opinion data to graphs of predicted data, we gathered that a model with 70% accuracy is sufficient to produce a representative impression of the overall sentiment towards a topic. On the other hand, visualizing the consistency of opinions requires a higher classification accuracy to be useful. xi Contents Kurzfassung ix Abstract xi 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Problem Statement and Research Questions . . . . . . . . . . . . . . . 2 1.3 Methodological Approach . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Literature 7 2.1 Machine Learning Architectures in NLP . . . . . . . . . . . . . . . . . 7 2.2 Linguistic Processing and NLP Tasks . . . . . . . . . . . . . . . . . . . 17 2.3 Supervised Machine Learning in NLP . . . . . . . . . . . . . . . . . . 
27 2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3 Design 41 3.1 Research Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Requirements: A Definition of Opinion Consistency . . . . . . . . . . . 42 3.3 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.4 Datasets: Dataset Creation and Analysis . . . . . . . . . . . . . . . . . 47 4 Experiments 53 4.1 Opinion Classification: First Experiment . . . . . . . . . . . . . . . . . 53 4.2 Opinion Classification: Second Experiment . . . . . . . . . . . . . . . 59 4.3 Visualizations of Opinion Data . . . . . . . . . . . . . . . . . . . . . . 68 4.4 Opinion Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5 Evaluation 81 5.1 Opinion Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.2 Opinion Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4 What is possible and future directions . . . . . . . . . . . . . . . . . . 89 xiii 6 Conclusion 91 List of Figures 95 List of Tables 97 Bibliography 99 CHAPTER 1 Introduction 1.1 Motivation The trust of citizens in their governments is generally low. A study from the OECD shows that, in 2015, only an average of 43% of people from OECD countries trusted their national government. [OEC] Trust is the necessary foundation for every strong and healthy relationship. A lack of trust severely jeopardizes the stability and effectiveness of any relation. Therefore, finding ways of restoring trust between governments and their citizens is important. One of the factors influencing the trustworthiness of politicians is the consistency of their expressed opinions. A politician that frequently expresses contrary views on the same topic cannot be trusted because a voter will not have security in how they will be represented by this politician in the future. Therefore, it is important for politicians to express congruent opinions, and for voters to be able to easily and objectively quantify this consistency. Currently, it would be rather difficult and time-consuming for the voter, to objectively quantify the consistency of speech expressed by politicians. They would have to invest a lot of time in gathering data, reading and analyzing, to make an informed decision. This time commitment, most people simply cannot make. Therefore, they are mostly left with the option of trusting their feelings, which can be easily wrong or considering experts’ opinions, which can be severely biased. Both options cannot be considered advantageous positions from which to make important decisions about one’s future. With the help of AI technologies, specifically natural language processing, the consistency of opinions expressed by politicians can potentially be made quantifiable and easily accessible for voters to make informed decisions. Speeches and written statements can be automatically analyzed by NLP algorithms to extract opinions, politicians and parties have, about various topics. The results can be made publicly available and visualized 1 1. Introduction in an accessible manner for everyone to grasp easily. As a result, it will be easier for politicians to convey their trustworthiness to potential voters and also for voters to choose a party on objective criteria. 
The potential effectiveness of such a platform depends not only on the performance of a machine learning algorithm but also on the definition of opinion consistency, i.e., how it is calculated from a set of labeled data records. Therefore, one motivation for this work is to find such a definition, which can be used as a basis for further reasoning. Finally, since there does not exist much work on that topic, another motivation is to advance the progress towards being able to effectively reason about the consistency of opinions, through means of NLP. 1.2 Problem Statement and Research Questions This experimental study aims to implement a solution for extracting opinions from written text with the help of supervised ML methods in order to visualize their consistency over time. In this study, opinions on a topic can be either positive (for), neutral (indifferent), or negative (against). The consistency of opinions should be high when the number of contradictions is low and vice-versa. Two opinions expressed on the same topic are said to be contradictory if one of them is positive and the other is negative. A more precise definition will be given later. Once an approach is implemented, the lessons learned are used to draw conclusions about the practical feasibility of the approach (Q1a) and the usefulness of produced results (Q1b). The effectiveness depends not only on the definition of opinion consistency but also on the quality of predicted opinions. Since there are not many studies on the model performance of ML algorithms predicting opinions on German texts in the political domain specifically, a performance comparison of different ML models is planned (Q2). In general, it would be helpful to get an idea of the model performances required to predict an opinion consistency value to some desired accuracy. (Q3) Besides opinion consistency, visualizing the opinions that an individual speaker holds on a topic is also valuable in understanding the speaker’s stance on the topic. Those visualizations can also be drawn for political parties by aggregating over opinions of their members. Again, the usefulness of visualizations created from predicted data will depend on the quality of predictions (Q4). In summary, we have the following five research questions: Q1a What is the practical feasibility of monitoring opinion consistency, a value repre- senting the consistency of opinions on a topic, through the means of supervised ML methods? Q1b What is the usefulness of measuring and visualizing the consistency of opinions based on opinion data predicted by supervised ML methods? 2 1.3. Methodological Approach Q2 What performance do various ML architectures achieve in predicting opinions in the domain of Austrian political speeches in the German language? Q3 What could be minimum performance thresholds for ML algorithms to predict the consistency of opinions to a desirable precision? Q4 How useful are visualizations of opinion data of speakers and parties that are based on predictions made by various supervised ML algorithms? Now, that the expected outcomes of this work—in the form of answers to the above questions—are established, the next section details a plan of how they are achieved. 1.3 Methodological Approach The methodological approach for achieving the expected results consists of the following steps. 1. Performing literature research. 
At least the following keywords will be included: natural language processing, natural language understanding, sentiment analysis, topic analysis, opinion mining, German natural language processing. The preferred search engine will be Scopus. Recent review papers will serve as a first entry point to assess the state-of-the-art. When the intended NLP algorithm architecture becomes clearer, more specific search terms will be included. 2. Defining discussion topics/aspects. In order to extract opinions from the speech protocols, discussion topics or aspects will be defined, which can be answered by for or against. They could be broader (e.g., TAXES, HEALTH CARE) or more specific (e.g., concrete discussion points). Based on the current understanding, broader ones should be preferred because they will be applicable over longer time periods and also make it easier to train and test since more data is available per class. 3. Gathering and pre-processing of data. The publicly available speech transcripts of the Austrian Parliament will be downloaded and brought from the HTML form into a structured (e.g., CSV ) form. One record will at least consist of the following information: Date of the speech, name of the speaker, party affiliation, type of speech, governing party (yes/no). Regarding the actual speech, it can be stored in different granularities. A decision will be made, whether it is going to be phrase-, sentence- or an even higher level. 4. Researching NLP methods and machine learning models. Since there are many possible approaches to solve this problem, a choice needs to be made, which machine learning models or which NLP methods should be applied. This step goes together with steps 5 and 6 and will be iterated several times, depending on performance results. 3 1. Introduction 5. Using NLP methods and ML models to overcome the gap between the raw speech data and having it labeled. Two types of labels need to be assigned: Topic labels and opinion labels (for/against/indifferent). Existing frameworks, models, and algorithms shall be used. The primary implementation language will be Python. 6. Evaluating classification performance. The implementation’s viability will be evaluated on its produced results, on common performance metrics like accuracy and F1-score. Additionally, depending on requirements and time available, the opinion classification algorithm can be benchmarked on already annotated German corpora, e.g., SB-10k [CDEU17], to verify the implementation. 7. Visualizing Opinion Data The predicted opinion labels will be used to plot the number of positive/negative/neutral opinions on a topic per speaker or per political party. The same graphs will be created from the actual opinion labels in order to compare them to the predicted ones. 8. Defining and Visualizing Opinion Consistency A formula for computing a value that represents the consistency of opinions will be defined. The computed value will be plotted over time to observe changes in opinion consistency. It will also be plotted for multiple speakers or parties in the same graph to compare their consistency values. Again, the graphs based on predicted opinion labels will be compared to those based on the actual opinion labels. 9. Determining minimum performance thresholds for the used ML algorithms. Finally, minimum performance thresholds for predicting the opinion consistency to some desired accuracy will be determined, either through calculating them or through a sufficient number of simulation runs. 
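To make steps 3 and 5 above more concrete, the following is a minimal, illustrative Python sketch (not the pipeline implemented later in this work) of how a raw speech text could be split around sentence boundaries and filtered by topic keywords into candidate records for manual labeling. The field names, the keyword list, and the splitting rule are assumptions made purely for illustration:

import csv
import re

# Illustrative topic keywords (hypothetical; the actual keyword lists are defined later in this work).
KEYWORDS = {"maske", "maskenpflicht", "lockdown"}

def split_sentences(text: str) -> list[str]:
    """Very naive sentence splitting around ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_records(speech: dict) -> list[dict]:
    """Turn one speech (metadata plus raw text) into sentence-level candidate records."""
    records = []
    for sentence in split_sentences(speech["text"]):
        if any(k in sentence.lower() for k in KEYWORDS):
            records.append({
                "date": speech["date"],
                "speaker": speech["speaker"],
                "party": speech["party"],
                "sentence": sentence,
                "opinion_label": "",  # to be assigned manually (for/against/indifferent)
            })
    return records

if __name__ == "__main__":
    speeches = [{
        "date": "2020-10-14",
        "speaker": "Max Mustermann",
        "party": "XYZ",
        "text": "Die Maskenpflicht ist eine sinnvolle Massnahme. Danke fuer Ihre Aufmerksamkeit.",
    }]
    rows = [r for s in speeches for r in extract_records(s)]
    with open("dataset_candidates.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

The actual data gathering and dataset creation are documented in Chapter 3.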
The next section will give the reader an overview of the things to come, such that they can decide which parts are relevant to them and in which order they want to continue reading. 1.4 Outline A brief overview of the remaining chapters is given. Chapter 2 covers NLP architecture, NLP tasks, supervised learning, and related work. 2.1 examines machine learning architecture in NLP with a focus on supervised methods for the application in opinion mining. It covers statistical methods 2.1.1, word embeddings 2.1.2, convolutional networks 2.1.3, recurrent networks 2.1.4, the attention mechanism and transformer models 2.1.5. 2.2 goes through linguistic processing techniques and NLP tasks relevant for this study. Text segmentation 2.2.1, morphological analysis 2.2.2, POS-tagging and dependency parsing 2.2.3, named entity recognition and coreference resolution 2.2.4, and sentiment analysis and opinion mining 2.2.5 are covered. We introduce the tasks by giving a motivation for their relevance, and we talk about where and how they can be applied, 4 1.4. Outline about different implementations, development history, and state-of-the-art. 2.3 focuses on the learning aspect of machine learning. Particularly on supervised text classification, which is based on training a model on a labeled dataset. An end-to-end perspective of the learning process—from dataset creation 2.3.1, over pre-processing 2.3.2, to training and model selection based on evaluation criteria 2.3.4—is given. 2.3.5 covers the tooling and infrastructure required for learning. 2.3.3 examines the difference between validation, verification, and evaluation in the context of machine learning. 2.4 examines work similar or related to opinion mining on political speeches in the German language. Chapter 3 defines the framework in which the experiments are performed. 3.1 describes the research method which is applied. 3.2 works out a definition of opinion consistency by which the consistency of opinions can be measured based on speech transcripts. 3.3 defines a plan for the experiments to be performed in order to answer the research questions 1.2. 3.4 details how the data was gathered and used to create the datasets. Furthermore, an analysis of the datasets is performed. Chapter 4 provides detailed documentation of performed steps and achieved outcomes of the conducted experiments. 4.1 documents the opinion mining process on the first dataset, including a comparison of model performances. 4.2 documents the improved opinion mining process on the second dataset. 4.3 visualizes opinion data aggregated per speaker and party. The chapter is concluded in 4.4 where the opinion consistency, as it was defined earlier, is visualized over time 4.4.1 and the impact of a model’s capability to classify opinions on the accuracy of calculated opinion consistency values is examined 4.4.2. Chapter 5 evaluates the results which are observed during the experiments. After an overall summary of the vision and early steps in the project, 5.1 evaluates the results of classifying opinions during the experiments. 5.2 evaluates the usefulness of opinion consistency charts that were plotted during experiments. 5.3 outlines the challenges related to a general topic-independent approach to monitoring opinions. 5.4 discusses where we are on the road to achieving a generic approach of monitoring the consistency of opinions. 
Finally, Chapter 6 provides an overall conclusion by summarizing the findings with regard to the research questions and makes suggestions about future work towards robust systems for monitoring the consistency of opinions over time.

CHAPTER 2 Literature

2.1 Machine Learning Architectures in NLP

This section examines machine learning architectures in NLP, focusing on supervised methods for the application in opinion mining.

2.1.1 Statistical Models

Statistical machine learning models became popular for NLP tasks in the 1990s, with the advent and popularization of the world wide web. The rapidly growing amount of textual data shared over the internet enabled the effective learning of such models. Before that, starting in the 1950s, mainly rule-based approaches were used in NLP research in the areas of word/sentence analysis, question answering (QA), and machine translation (MT). Statistical machine learning remained the preferred option roughly until 2012, when deep learning models were introduced to NLP tasks, quickly becoming state-of-the-art. [ZDLS20] We will introduce the two statistical models that are used in the experiments section—the Naive Bayes classifier and the BM25 document ranking algorithm.

The Naive Bayes classifier is a probabilistic method that works under the assumption that the input variables (of the classifier) are independent; thus, the observed outcome of one variable does not influence the likelihood of outcomes of another variable. For example, in document classification, the Naive Bayes assumption implies that the presence of words is independent of their context. Clearly, this is not the case, as some combinations of words appear more frequently together than others, but despite that, the classifier can still be effective in many cases. [MS99] We explain how Naive Bayes can be applied to the task of document classification, according to Li and Jain [LJ98]. Let C = (c_1, ..., c_m) be m document classes and D = (w_1, w_2, ...) a document represented as the set of its words. The most likely class \hat{c} for the document D can be estimated by

\hat{c} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{w_i \in D} P(w_i \mid c_j)    (2.1)

i.e., by taking the argmax over the classes c_j \in C of the product of the prior probability of class c_j and the conditional probabilities of the words in the document, given that it is of class c_j. The probabilities can be estimated from the observed data (labeled corpus documents). The prior probability of class c_j can be calculated by

P(c_j) = \frac{N_j}{N}    (2.2)

with N_j denoting the number of documents of class c_j and N the total number of documents. The conditional probability of a word w_i occurring in a document of class c_j is calculated as

P(w_i \mid c_j) = \frac{n_{ij} + 1}{n_j + k_j}    (2.3)

with n_j the total number of words in class c_j, n_{ij} the number of occurrences of word w_i in class c_j, and k_j the number of unique words in class c_j (the vocabulary size of c_j). Adding the +1 in the numerator and the +k_j in the denominator is a Laplace smoothing technique used to prevent zero-probability factors.

According to Robertson and Zaragoza [RZ09], BM25 is one of the most successful text retrieval algorithms, which comes from the field of information retrieval. It is based on the Probabilistic Relevance Framework (PRF), which originated in the 1970s–1980s through the works of Robertson and Spärck Jones and was further developed over the following 30 years.

w_i^{RSJ} = \log \frac{(r_i + 0.5)(N - R - n_i + r_i + 0.5)}{(n_i - r_i + 0.5)(R - r_i + 0.5)}    (2.4)

w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5}    (2.5)

w_i^{BM25}(tf) = \frac{tf}{k_1 \left( (1 - b) + b \frac{dl}{avdl} \right) + tf} \, w_i^{RSJ}    (2.6)

P(rel \mid d, q) = \sum_{i \in q,\; tf_i > 0} w_i^{BM25}    (2.7)
The PRF's basic idea is to build document-query pairs and to order them by decreasing probability of relevance. We walk through the calculation of document scores. First, some notation occurring in the formulas is explained. The probabilistic notion of the framework is expressed through the random variable Rel, which can take the values rel (document is relevant) and \overline{rel} (document is not relevant). The authors go into more detail about the probabilistic considerations behind the framework, but we will focus on the final formulas. All possible terms (which can occur in documents and queries) are indexed into the vocabulary set V. A document d := (tf_1, ..., tf_{|V|}) is a vector of term frequencies tf_i that count how often the i-th term of the vocabulary is present in the document. A query can either be represented as a vector of query term frequencies q := (qtf_1, ..., qtf_{|V|}) or as an index set of the terms occurring in the query, q := {i | qtf_i > 0}. As shown in equation 2.7, the probability of relevance for a given document d and query q is calculated by summing up the weights w_i^{BM25} of the query terms (i \in q) present in the document (tf_i > 0). The w_i^{BM25} calculation is the product of two components—a term-frequency component and a document-frequency component. Equations 2.4 and 2.5 show two ways to calculate the document-frequency component. If information about the relevance of documents is present, i.e., documents were judged as relevant or not relevant beforehand, the Robertson/Spärck Jones weight w_i^{RSJ} can be applied, in which R denotes the number of documents judged as relevant and r_i denotes the number of relevant documents containing term t_i (the index i comes from the index set of query terms q). N denotes the total number of (judged) documents, while n_i denotes the number of (judged) documents containing t_i. If no relevance information is present, then R and r_i can be set to zero, and the formula becomes w_i^{IDF}, a classical inverse document frequency (IDF) calculation, in which a term t_i becomes more relevant the fewer documents it appears in. The term weight w_i^{RSJ} or w_i^{IDF} is multiplied with the term-frequency component to get the BM25 term weight w_i^{BM25} (2.6). The term frequency tf normally denotes the number of times a term appears in the document. The expression (1 - b) + b \frac{dl}{avdl} represents a document length normalization, with 0 \le b \le 1 regulating its impact. The larger the length of a document (dl) in relation to the average document length (avdl), the less important t_i becomes. The authors suggest 0.5 \le b \le 0.8 and 1.2 \le k_1 \le 2 for the internal parameters. The BM25 method is typically used for document ranking in search engines. MG4J, Xapian, and Zettair are some examples of search engines implementing BM25. [RZ09]

2.1.2 Word Vectors and Word Embeddings

We now examine different ways of representing the textual input of NLP machine learning models. Statistical models are based on counting the occurrence of words and on word probabilities. Deep neural networks consist of multiple layers of neurons. The input layer takes a vector of numbers as its input, with the vector's dimensionality equal to the number of input neurons. A simple idea would be to pass the word counts to the network's input layer in the form of a one-hot-encoded input vector. That would result in input vectors with a dimension equal to the total number of unique words.
The problem with such sparse vectors is that they make it more challenging to train the network, as the higher the dimensionality, the more data are required. [WLS+15] Distributed representation solves this problem by projecting the high-dimensional word vectors into a relatively low-dimensional space by putting semantically more similar words in closer proximity to each other. [YHPC18] Now, the distance between pairs of word vectors has a meaning, compared to the other approach, in which words are arbitrarily numbered. Learning word representations dates back already to 1986, where Rumelhart, Hinton, and Williams [RHW86] trained neural networks to show that the internal units represent 9 2. Literature features of the task domain. Word embeddings were revolutionized in 2013 by the works of Mikolov et al. [MSC+13], who introduced continuous bag of words (CBOW) models and Skip-Gram models for training. They were significantly more computationally efficient than previous models, which greatly improved the quality of trained word vectors. They also found that the produced word vectors follow the rules of compositionality to some extent. For example, they found that vec("Madrid") - vec("Spain") + vec("France") is closest to the vector vec("Paris"). [MSC+13] CBOW models are trained to predict a word based on its surrounding context words (in a certain window size), while Skip-Gram models are trained to predict the context words, given a center word. N-grams are word sequences composed of n words. Training a distributed word representation from scratch is resource-intensive. [MCCD13] For that reason, often pre-trained word embeddings, which are a list of word vectors, are used in the first layer of a neural network architecture. On the other hand, training domain-specific word embeddings can improve performance on NLP tasks. [LL13] A compromise between the quality of word embeddings and training effort is an approach proposed by Labutov and Lipson. [LL13] They take a pre-trained general word embedding and tweak it for a specific sentiment classification task to achieve improved performance compared to the baseline. Although word embeddings have improved results in many NLP tasks, they come with shortcomings and challenges, as outlined in [YHPC18]. Traditional word embeddings work well for words with only one meaning, but they struggle with words that have different meanings in different contexts (known as polysemy). Research is investigating the effectiveness of multi-sense word embeddings, in which different word vectors can be inferred based on the word’s context. Interestingly, multi-sense embeddings may not give improved results in all NLP tasks. Li and Jurafsky [LJ15] show that multi-sense embeddings improved performance on some tasks (e.g., POS-tagging) but not in others (e.g., sentiment analysis and NER). Another problem is that phrases can have a different meaning than the sum of their constituting words. For example, "hot dog" or idioms, e.g., "beat around the bush." Some methods have approached this problem by learning embeddings for n-grams, e.g., Johnson and Zhang [JZ15]. Another challenge that is particularly relevant for sentiment analysis is that semantically similar words can have negative polarities. For example, the words good and bad are considered semantically similar since they are likely to occur in similar contexts but have opposite polarities. 
[SPH+11] The authors of [TWY+14] approach this problem with sentiment-specific word embeddings (SSWE), by which they encode sentiment information in the continuous representation of words.

2.1.3 Convolutional Neural Networks

We now examine different neural network architectures, starting with convolutional feed-forward neural networks (CNNs). Distributed representation made it possible to extract features from individual words, but the next step was to extract features of parts of a sentence and of an entire sentence. Convolutional neural networks, already successful in image recognition tasks, were found to be useful in NLP tasks as well. The idea of CNNs in NLP is to run filters with windows of many different sizes over the sentence, each one extracting a new feature. The network learns to extract relevant features automatically by continually updating the filters' weights to minimize the loss of an objective function—a process that previously required manual feature-engineering work. The downside of this automatic feature extraction is that the network is a black box, and it is more difficult to explain the extracted features. Of the features extracted by convolution, a max-pooling layer then selects the most relevant ones. Those features could then, for example, be used to perform classification tasks by running them through a dense layer with output neurons equivalent to the number of target classes.

Figure 2.1: CNN architecture for sentence classification [ZW15]

We describe the functioning of a CNN architecture for sentiment analysis in more detail, according to [YHPC18]. Figure 2.1 shows a sample architecture with one convolutional layer. In the first step, the embedding of a sentence is performed. Each word of the sentence is mapped to a vector of the size of the embedding dimension d. The result is a sentence matrix W \in \mathbb{R}^{n \times d}, with n being the number of words in the sentence. The i-th word of a sentence is denoted by w_i \in \mathbb{R}^d. Convolution is performed between regions of input vectors and filters (also called kernels) of different region sizes. The purpose of a filter is to extract features of a sentence for a specific n-gram size. The n-gram size corresponds to the region size, e.g., a filter with a region size of two extracts features of bi-grams in the sentence, while a filter with region size three extracts features of tri-grams. In this example, there are six filters in total, two for each of the three chosen region sizes. Let w_{i:i+j} be the concatenation of the vectors w_i, w_{i+1}, ..., w_{i+j} and k \in \mathbb{R}^{h \times d} be a filter of region size h; then new features c_i are extracted via

c_i = \Phi(w_{i:i+h-1} \cdot k^T + b)    (2.8)

In the above equation, \Phi denotes an activation function and b \in \mathbb{R} a constant bias term. We can observe that convolution is performed between sliding windows of the input vectors w and the filters k. Convolution is performed at all possible window positions, resulting in feature maps c = [c_1, c_2, ..., c_{n-h+1}] of size n-h+1. Subsequently, a max-pooling operation \hat{c} = \max(c) is performed on each feature map in order to produce a fixed-length output and to reduce dimensionality while still keeping the most important n-gram features for each filter. In this example, a 1-max pooling is performed, keeping the largest value of each feature map and resulting in six values, which are then concatenated to a single feature vector. Finally, a dense layer with softmax activation can be connected to the desired number of output classes, e.g., two output neurons, to represent a positive or negative sentiment in the input sentence. The weights of the word embeddings can be initialized randomly and trained from scratch, or pre-trained word embeddings can be used. This sample architecture features only one convolutional layer, but it is possible to chain multiple layers of convolution and max-pooling to achieve improved feature abstraction capabilities. [YHPC18]
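As an illustration of the architecture in Figure 2.1, the following is a minimal Keras sketch, assuming placeholder values for vocabulary size, embedding dimension, sequence length, and number of classes; it mirrors the described setup of three region sizes with two filters each, 1-max pooling, and a softmax output layer, and is not a model used in the experiments of this work:

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20_000, 100, 50, 2  # placeholder values

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)   # sentence matrix W (n x d)

pooled = []
for region_size in (2, 3, 4):                          # one filter group per n-gram size
    c = layers.Conv1D(filters=2, kernel_size=region_size, activation="relu")(x)  # feature maps
    pooled.append(layers.GlobalMaxPooling1D()(c))      # 1-max pooling per feature map

features = layers.Concatenate()(pooled)                # concatenated feature vector (six values)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(features)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()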
2.1.4 Recurrent Neural Networks

Compared to CNNs, in which more and more abstract features are extracted hierarchically, RNNs build an understanding of the sentence by processing it in sequential order, similar to what a human reader would do. Different RNN architectures use memory components to keep track of information across long distances and gates to filter out unimportant information and keep important information. The authors of [YHPC18] outline several motivating factors for using RNNs over CNNs for language processing. RNNs can deal much more easily with variable-length and very long input (e.g., long sentences, paragraphs, or documents). They are better suited for machine translation due to their ability to handle long-term dependencies and to summarize a sentence into a single vector that can be mapped back to a variable-length target sequence. In contrast, CNNs struggle with modeling long-distance contextual information and with preserving sequential order in their feature representation. Additionally, CNNs require more trainable parameters, which in turn requires more training data. One could think that the RNN architecture, being naturally suited to language, which is sequential by nature, should achieve better results than CNNs, but this is not the case, as [YKYS17] suggests. They found that performance depends on the task and dataset and that there is no clear winner.

Figure 2.2: Basic RNN architecture [LBH15]

There are different implementations of RNNs, e.g., long short-term memory (LSTM) and gated recurrent units (GRU). In the following, the functioning of RNNs is explained in more detail, according to [CGCB14]. Figure 2.2 shows how a traditional RNN works. We will explain it based on the example of a POS-tagging task. The network uses a hidden state h to keep and update information over time. The sentence is fed into the network word by word, each word denoted by a word vector x_t. The hidden state h_t is computed by

h_t = \Phi_1(U x_t + W h_{t-1})    (2.9)

with \Phi_1 being an activation function and U, V, W being the network's weight matrices. We see that the state at the current time depends on the state at the previous time plus the current input. Thus, information is propagated over time and influences the output based on previous inputs. The current output o_t (the POS-tag of the word x_t) can be calculated by

o_t = \Phi_2(V h_t)    (2.10)

with \Phi_2 being another activation function. Such a simple implementation of RNNs struggles with keeping long-term information because it is affected by the vanishing or exploding gradients problem, causing gradients to go towards zero or infinity, respectively. This effect becomes more pronounced the more timesteps are involved. [BSF94] The LSTM and GRU architectures overcome these problems by using gates, which control the flow of information over time. [YHPC18]
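As a minimal illustration of equations 2.9 and 2.10, the following NumPy sketch runs the forward pass of such a simple RNN over a sequence of dummy word vectors; the dimensions and randomly initialized weights are assumptions for illustration only, and a real tagger would of course learn U, V, and W from data:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4                      # embedding size, hidden size, number of tags (placeholders)

U = rng.normal(scale=0.1, size=(d_hidden, d_in))      # input-to-hidden weights
W = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(d_out, d_hidden))     # hidden-to-output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs):
    """xs: sequence of word vectors; returns one output distribution per timestep."""
    h = np.zeros(d_hidden)
    outputs = []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h)                  # equation 2.9 (Phi_1 = tanh)
        outputs.append(softmax(V @ h))                # equation 2.10 (Phi_2 = softmax)
    return outputs

sentence = [rng.normal(size=d_in) for _ in range(5)]  # five dummy word vectors
for t, o_t in enumerate(rnn_forward(sentence)):
    print(f"t={t}: predicted tag index {int(np.argmax(o_t))}")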
Figure 2.3 shows a schematic overview of the three RNN architectures. An empirical evaluation of the three RNN architectures by the authors of [CGCB14] showed a clear superiority of LSTMs and GRUs over simple RNNs. However, they could not determine a clear winner between LSTMs and GRUs.

Figure 2.3: Schematic overview of different RNN architectures [KKDC19]. Input vectors are denoted by x, output vectors by y, and hidden state vectors by c. Merging arrows indicate a concatenation of vectors, and a splitting arrow indicates a copy operation.

2.1.5 Attention and Transformer Models

The next milestones in the development of NLP architectures were the introduction of the encoder-decoder architecture and, based on that, the transformer architecture, which is built on the attention mechanism. The idea of the encoder-decoder first emerged in sequence-to-sequence (seq2seq) tasks like machine translation but is now also used in other tasks, since most NLP tasks can be cast as sequence-to-sequence tasks. The transformer architecture formed the basis for current state-of-the-art models, e.g., BERT and GPT-3. The encoder-decoder architecture uses two RNNs, one to encode a sequence of input tokens (e.g., a sentence in one language) and another to decode a sequence of output tokens (e.g., a translated version of the input sentence). The hidden state generated by the first RNN is used to initialize the hidden state of the second one. [Hu19] There are three major drawbacks to this architecture. First, it does not work well for long sequences because information tends to be forgotten the more timesteps are involved. Second, because the entire sequence is encoded before it is decoded, there is no alignment between input and output tokens. [Hu19] Third, the sequential nature of RNNs does not allow for parallel processing. [VSP+17] Intuitively, it would be easier to translate a text part by part instead of memorizing it in its entirety and then translating it from memory. That is the idea of the attention mechanism for NLP tasks. The intuition behind attention in transformers is that it allows the decoder to reference the most relevant parts of the input sequence "by focusing its attention" on those parts during the decoding process to improve decoding performance. We now explain the functioning of transformers in more detail, according to the famous paper "Attention is All You Need" [VSP+17] and to the explanations of Raschka [Ras21].

Figure 2.4: Overview of the transformer architecture [VSP+17]: (a) attention blocks, (b) transformer architecture.

The basic building blocks are attention blocks. An attention block aims to enhance each embedded token of an input sequence with context information relating it to all other tokens. The network then learns which relationships between pairs of tokens are more relevant than others. Thus, the network learns "to pay more attention" to the relevant context for a specific task. Figure 2.4a depicts a scaled dot-product attention block. It consists of six steps:

1. Given an embedded input sequence (x_1, ..., x_n) (e.g., a sentence), represented by the matrix X \in \mathbb{R}^{n \times d_e} with d_e being the embedding dimension, the inputs to the attention block are constructed by Q = X W^q, K = X W^k, and V = X W^v. Here, W^q, W^k \in \mathbb{R}^{d_e \times d_k} and W^v \in \mathbb{R}^{d_e \times d_v} are the embedding weights used to create queries, keys, and values, with embedding dimension d_k for keys and queries and embedding dimension d_v for values.
2. The matrix multiplication QK^T \in \mathbb{R}^{n \times n} is performed to determine the relationships between pairs of tokens. For example, the first row of QK^T contains the relationships between the first token x_1 and all tokens x_1, ..., x_n of the input sequence.

3. The scaling factor 1/\sqrt{d_k} is applied to counteract small gradients in the softmax function, which can occur for large values of d_k.

4. The next step is an optional masking operation, which is used only in the decoder block to allow the network to "focus attention" only on the current and previous positions in the sequence. Limiting the potential area of attention is achieved by adding a mask value M, which is either zero (no masking) or negative infinity (the position is masked out).

5. A softmax function is applied for normalization.

6. The final attention matrix A \in \mathbb{R}^{n \times d_v} is calculated by equation 2.11 and contains a different embedding of the input sequence. For example, the first row of A contains an embedding for the first token of the input sequence, enhanced with attention information relating it to the other tokens of the same input sequence.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T + M}{\sqrt{d_k}}\right) V    (2.11)

The scaled dot-product attention block is applied h times in a multi-head attention block (Figure 2.4a). Each scaled dot-product attention block can focus on a different aspect of the input by training its weight matrices W^q, W^k, W^v differently. The outputs of those blocks are then concatenated and projected through a fully-connected linear layer with weights W^O:

\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O    (2.12)

\mathrm{head}_i = \mathrm{Attention}(Q W_i^q, K W_i^k, V W_i^v)    (2.13)

with W^O \in \mathbb{R}^{h d_v \times d_e}.

Figure 2.4b depicts the transformer architecture, which is based on the previously discussed attention blocks. It consists of an encoder on the left side and a decoder on the right side. The network processes a sequence of input tokens in sequential order, where the encoder has access to the entire sequence from the beginning, and the decoder (up to after the masked multi-head attention block) only has access to the current token and the already processed tokens because the rest is masked. The encoder and decoder blocks are also stacked multiple times to extract increasingly abstract features. The encoder creates a new embedding of the input sequence with additional attention information (which can be seen as context information for every token in the sequence), and the decoder uses this new embedding to create an output sequence token by token. The positional encoding step applies position information to each token, making it possible for the network to distinguish between tokens that occur multiple times in the sequence (e.g., the same word occurring twice). The decoder's final layers look different depending on the NLP task. For example, in machine translation, the final hidden vector embeddings would be fed into a linear layer with softmax, with the number of output nodes corresponding to the vocabulary size of the target language. [DCLT19] Transformer models can also be adapted to single-token outputs, as described by [SQXH19], who performed text classification with BERT. The authors prepended every input sequence with a placeholder token that contains the classification embedding of the entire sequence. This embedding is used in a softmax classifier to predict the output class.
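To summarize the computations in equations 2.11 to 2.13, the following NumPy sketch implements scaled dot-product attention with an optional mask and a single multi-head pass over a toy input sequence; the dimensions and random weight matrices are illustrative assumptions, not values from any model used later in this work:

import numpy as np

rng = np.random.default_rng(0)
n, d_e, d_k, d_v, h = 4, 8, 6, 6, 2            # sequence length, embedding dims, number of heads (placeholders)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, M=None):
    """Scaled dot-product attention, equation 2.11."""
    scores = Q @ K.T                           # pairwise token relationships (n x n)
    if M is not None:
        scores = scores + M                    # 0 = keep, -inf = mask out (decoder only)
    A = softmax(scores / np.sqrt(K.shape[-1]))
    return A @ V                               # n x d_v

X = rng.normal(size=(n, d_e))                  # embedded input sequence

heads = []
for _ in range(h):                             # equation 2.13: one attention head per weight triple
    Wq, Wk, Wv = (rng.normal(size=(d_e, d)) for d in (d_k, d_k, d_v))
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))

W_O = rng.normal(size=(h * d_v, d_e))
multi_head = np.concatenate(heads, axis=-1) @ W_O   # equation 2.12
print(multi_head.shape)                        # (n, d_e): one enriched embedding per input token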
2.2 Linguistic Processing and NLP Tasks In this section, we go through linguistic processing techniques and NLP tasks relevant to this study. We introduce the tasks by giving a motivation for their relevance, and we talk about where and how they can be applied, about different implementations, development history, and state-of-the-art. 2.2.1 Text Segmentation Splitting a text into segments such that it can be further processed is a prerequisite for many natural language processing tasks. Depending on the task, different granularities are necessary or better suited. Possible segmentation levels range from the sub-word and word levels to the topic or document levels and everything in between. The desired granularity can also depend on the processed language. For example, sub-word tokenization on Chinese text may yield better results than word tokenization on subsequent language tasks, which was demonstrated by Peng et al. [PCZ17]. They achieved better results on a sentiment analysis task by using radical-based embeddings over word embeddings. If not otherwise stated, statements of the remaining chapter will per default concern the German or the English language. A popular way of implementing text segmentation algorithms is by using finite-state sequence-tagging models, like hidden Markov models, discriminative tagging models based on maximum entropy classification, conditional random fields, and large-margin techniques. [MCP05a] Segmenting the text on the word or sub-word level is known as tokenization. At first, the tokenization of words looks like an easy task—split the text around the white spaces and remove punctuation—but in fact there are many considerations to make. One of them is the treatment of compound words. Should time zone be considered as one or two words? What if there are different ways of writing the same word, e.g., timezone, time zone and time-zone. There are valid arguments both for treating them as the same word or as different words. In German, many compound words are naturally joined together. For example Lebensversicherung (life insurance). Even in that case, it might be desirable to 17 2. Literature split them up into separate words, depending on the use case. A possible implementation for splitting compound words is demonstrated by CharSplit [Tug16], which identifies the most likely position to split the word based on probabilities of n-grams occurring in that word. There are many more considerations to make, e.g., the treatment of commas and dots in dates, emoticons, web addresses, brand names, hyphens and apostrophes, differentiation between punctuation belonging to a word and punctuation not belonging to a word, to name a few. [MS99] These examples show, there does not exist only one correct definition of a word, but one has to implement a tokenization algorithm based on specific requirements. Since tokenization is a fundamental task, it is implemented by many NLP frameworks. [SLC17] Segmenting the text into sentences is also not a trivial task. Some of the challenges are easier to answer than others. For example, punctuation marks that appear mid-sentence, e.g., dots in dates and numbers, should probably not split the sentence. On the other hand, there is no clear answer to whether a semicolon or em dash should start a new sentence, given no additional information. Again, the implementation of sentence splitting will depend on the specific use case. There are also efforts to detect topic changes in text in order to split the text around those. 
Results are partly dependent on the definition of a topic change. It is a difficult task for humans because a precise definition of what a topic is can hardly be given, leading to bad inter-annotator agreement. As a result, according to Stede [Ste12], there exists a wide range of segmentation techniques that perform vastly differently on different datasets. He provides a comprehensive overview of approaches to tackle the problem of topic segmentation, divided into four categories: 1) exploiting surface cues, 2) lexical chains, 3) word distributions, and 4) probabilistic models. Beeferman et al. [BBL99] learn the change of a topic based on the boundaries of news articles. Yamron et al. [YCG+98] used hidden Markov models and classical language modeling techniques, to automatically detect boundaries of stories and achieved promising results. 2.2.2 Morphological Analysis, Stemming, Lemmatization and Normalization While many NLP tasks are concerned with the inter-word analysis, there are several motivations for analyzing the morphological information of individual words. One goal is to reduce vocabulary size by reducing related words to a common base form. For example, in a language like Finnish, in which a verb can have more than 10,000 forms, it is impractical to enumerate them all in the vocabulary. [MS99] There, the processes of stemming (truncating a word to its stem) or lemmatization (identifying a base form for the word depending on context) can be essential pre-processing steps. Another motivation for identifying related words comes from Information Retrieval (IR) systems, where it is used to improve indexing and search results. Singh and Gupta [SG16] evaluated the impact of different stemming algorithms across different languages on IR results. Their comparison clearly shows how morphological analysis is language-dependent. In English, the improvements in retrieval scores gained by applying stemming algorithms 18 2.2. Linguistic Processing and NLP Tasks over not applying them are relatively small compared to other languages, like Hungarian. Also, machine learning algorithms can benefit from morphological pre-processing steps. Singh and Gupta [SG16] show that an SVM algorithm for text classification benefits from stemming (all six evaluated stemmers show increased F-scores over a non-stemmed approach). Stemming and Lemmatization are also subsumed under the term word normalization. [TTJ06] In contrast, text normalization is concerned with transforming expressions to a canonical form, which is especially important in text-to-speech applications. For example, the written expression "€25" and the spoken expression "twenty-five euros" should be treated as the same entity. Another important application of text normalization is finding a canonical form of different archaic expressions of a word (e.g., normalizing the archaic expressions theire, theiare and thayr to the modern version their). [Bol19] Both stemming and lemmatization share the same goal of reducing different variants of a word to a common base form, but the approach and outcome are different. Stemming is the simpler and more syntactical approach, which usually works by removing affixes from a word. Lemmatization considers semantic information of a word by analyzing its context and applying a POS-tag in order to find a lemma that represents the underlying lexeme (a set of words with a similar meaning). As a result, lemmatizers are more difficult to implement. 
[Jiv11] The following examples (taken from [Jiv11]) illustrate the different outcomes a stemmer and a lemmatizer could have on the same words. • Stemming: introduction, introducing, introduces—introduc • Lemmatizing: introduction, introducing, introduces—introduce • Stemming: gone, going, goes—go • Lemmatizing: gone, going, goes, went—go A few observations can be made: 1. In contrast to the lexemes produced by the Lemmatizer, the stems do not have to be actual words found in a dictionary (also called bound stems) but can be (called free stems). [SG16] 2. A lemmatizer reduces the word went to the same root as the words gone, going, and goes, while the stemmer reduces went to a different root. 3. A stemmer and a lemmatizer can reduce the same word to the same root but do not have to. A brief overview of different implementations of stemmers and lemmatizers is given. Appl [Jiv11] divides stemming algorithms into truncating, statistical and mixed approaches. Truncating approaches work by (iteratively) applying a set of transformation rules for 19 2. Literature removing affixes. A popular example is the Porter Stemmer [Por80], in which different rules are applied over five steps to find a stem by removing suffixes. In statistical methods, word commonalities are identified by applying unsupervised learning to large corpora. For example, n-gram stemmers cluster words that share a high proportion of character n-grams. Of course, also neural models are applied to the tasks. For example, Lematus [BG18] is a sequence-to-sequence neural model for lemmatization, performing as well or better than the previous models, evaluated on 20 different languages. It is based on the neural machine translation framework Nematus of Sennrich et al. [SFC+17] Instead of taking a sequence of words in one language and outputting a translated sequence of words in the target language, in this case, the input is a space-separated sequence of characters of a word and its context, and the output is the lemma of the word, in the form of a space-separated sequence of characters. 2.2.3 POS-Tagging and Dependency Parsing After a sentence was split into tokens (during the process of tokenization) it is of interest to identify its parts of speech (POS-tagging) and to map its syntactical structure (constituent parsing and dependency tree parsing) or even its deep semantic structure (dependency graph parsing). While a syntactic parse, in the form of a dependency tree, also provides shallow semantic information, recently, the demand for other representations, allowing to carry deeper semantic information, grew, leading to the exploration of dependency graphs. [Zha20] POS-tagging is another stepstone towards natural language comprehension. The goal is to assign one tag (e.g., verb, noun, adjective) per word, designating its role in a sentence. A popular tagset is the Penn Treebank POS tagset [TMS03], which contains 48 different tags. A particular challenge is that the same word can have different tags in different environments. For example, the word play can be a noun or a verb. How to deal with this issue falls under the research area of word-sense disambiguation. A popular probabilistic algorithm that is used to solve POS-tagging is the Viterbi algorithm [Vit67], applied to Hidden Markov Models. The basic idea is to calculate probabilities with which a word has a certain tag based on the tags of surrounding words. Other categories of approaches include rule-based and transformation. 
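As a compact illustration of this probabilistic idea, the following sketch applies the Viterbi algorithm to a toy HMM with two tags; the tagset, vocabulary, and all transition and emission probabilities are invented purely for illustration:

import numpy as np

# Toy HMM for POS-tagging: all probabilities are made up for illustration.
tags = ["NOUN", "VERB"]
words = ["time", "flies"]
start_p = np.array([0.6, 0.4])                 # P(tag at position 0)
trans_p = np.array([[0.3, 0.7],                # P(next tag | current tag)
                    [0.8, 0.2]])
emit_p = np.array([[0.7, 0.4],                 # P(word | tag), indexed [tag][word]
                   [0.3, 0.6]])

def viterbi(observations):
    """Return the most likely tag sequence for a sequence of word indices."""
    n_tags, n_obs = len(tags), len(observations)
    prob = np.zeros((n_obs, n_tags))           # best path probability ending in tag j at step t
    back = np.zeros((n_obs, n_tags), dtype=int)  # backpointers
    prob[0] = start_p * emit_p[:, observations[0]]
    for t in range(1, n_obs):
        for j in range(n_tags):
            scores = prob[t - 1] * trans_p[:, j] * emit_p[j, observations[t]]
            back[t, j] = int(np.argmax(scores))
            prob[t, j] = scores[back[t, j]]
    # Follow the backpointers from the best final state.
    path = [int(np.argmax(prob[-1]))]
    for t in range(n_obs - 1, 0, -1):
        path.insert(0, back[t, path[0]])
    return [tags[j] for j in path]

print(viterbi([words.index("time"), words.index("flies")]))  # -> ['NOUN', 'VERB']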
Of course, deep learning models have also been applied to POS-tagging, achieving state-of-the-art results.

In syntactic parsing, the goal is to uncover the relationships of words in a sentence based on grammatical rules. There are two common ways of syntactic parsing—constituent parsing and dependency parsing. As the names suggest, in the former, the sentence is recursively split into constituents, starting from the entire sentence and arriving at individual words, while in the latter, the dependencies between words are uncovered. Figure 2.5 shows an example of the two parsing methods on the same sentence.

Figure 2.5: Different syntactic parses of the sentence "The ocean is a desert with its life underground" as produced by spaCy [HMVB20]: (a) a constituency parse, (b) a dependency parse.

On the left (Figure 2.5a), we see how the entire clause (S) is split repeatedly into noun phrases (NP) and verb phrases (VP) until we arrive at individual words. Note that one level above each individual word appears its POS-tag, indicating which kind of constituent phrase will be built. On the right (Figure 2.5b), the parse shows the relationship between the root word (ROOT) and its dependent words, recursively. The annotations are from the Penn Treebank syntactic tagset and the Penn Treebank POS tagset; [TMS03] contains the full list of tags.

According to Zhang [Zha20], the majority of approaches to the dependency parsing problem can be divided along two axes:

1. The first axis describes the framework with which dependency trees can be constructed. Frameworks can be transition-based or graph-based.

2. The second axis describes the learning approach used to train a classifier to predict the correct tree within the framework. Classifiers can be divided into statistical classifiers and neural model approaches.

Transition-based frameworks use a set of transition rules to parse a sentence in sequential order. For example, Nivre [Niv03] describes a stack-based algorithm with parser transitions for creating left and right arcs and for pushing and popping tokens to and from the stack. In the training process, a classifier is trained on a treebank (e.g., the Penn Treebank), learning to predict the correct parser transitions based on a given input sequence of tokens. Classifiers can be of a statistical nature (e.g., Nivre [Niv08] uses a support vector machine to evaluate four different transition-based algorithms) or neural models. The basic idea of graph-based dependency parsing is to reformulate the problem as a maximum spanning tree problem, e.g., as described in [MCP05b]. A graph is constructed by taking the words as nodes and adding arcs between them, representing the dependencies between words. Weights are assigned to the arcs according to the likelihood that there is a relationship of dependence between the connected words. By maximizing the sum of weights of a valid spanning sub-tree (some properties have to be satisfied), the most likely dependency parse is found. Zhang's comparison of 28 approaches [Zha20] shows that graph-based models perform better than transition-based ones and that neural models perform better than statistical ones.
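As a practical note, the spaCy pipeline used to produce Figure 2.5 exposes POS-tags and dependency relations through token attributes; the following minimal example prints them for the sentence from the figure, assuming the small English model has been installed beforehand (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with tagger and dependency parser
doc = nlp("The ocean is a desert with its life underground")

for token in doc:
    # token.pos_: coarse POS-tag, token.tag_: fine-grained Penn Treebank tag,
    # token.dep_: dependency label, token.head: governing token
    print(f"{token.text:12} {token.pos_:6} {token.tag_:5} {token.dep_:8} head={token.head.text}")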
2.2.4 Named Entity Recognition and Coreference Resolution

The term named entity (NE) was coined in 1996, during the Sixth Message Understanding Conference [GS96], when research was focused on information extraction (IE). The demand for identifying text passages that refer to real-world entities (e.g., persons, companies, or locations), but also for identifying numbers, dates, and percentages, increased. Besides marking the text sections which refer to an NE, the task of named entity recognition (NER) also involves assigning an appropriate category label (company, person, location, date, etc.). But what exactly is considered an NE, and which references should be considered? Some refer to NEs as proper nouns [PCV+00], while others refer to them as rigid designators [NS07]. According to the Stanford Encyclopedia of Philosophy, "a rigid designator designates the same object in all possible worlds in which that object exists and never designates anything else." [LaP18, para. 1] Today, research has come to the consensus of dividing NEs into two categories: generic NEs (e.g., person and location) and domain-specific NEs (e.g., stock ticker symbols, proteins, or genes). [LSHL20] Over the period from 1991 to 2006, the implementations of NER shifted from rule-based approaches to machine learning approaches. [NS07] Rule-based approaches are still used occasionally today, but machine learning and especially deep learning approaches are clearly more prevalent. Li et al. [LSHL20] divide the approaches into the following categories:

1. Rule-based systems work well when limited data are available. They are often applied to specialized domain-specific use-cases, where they achieve high precision but low recall and cannot be transferred to other domains.
2. Unsupervised learning approaches are often based on a clustering approach, in which NEs are extracted from the clusters based on semantic similarity.
3. In feature-based supervised learning approaches, machine learning algorithms are applied to carefully designed features, such as word-level information (e.g., case, morphology, POS-tag), lookup information in digital gazetteers, or document and corpus information (e.g., occurrence counts).
4. Finally, in deep learning NER—the dominant approach producing state-of-the-art results—a neural network learns (through training) to automatically extract features. The key element of learning is the combination of the forward pass (calculating the weighted sums of inputs) and the backward pass (calculating the gradient of an objective function with respect to the model's weights) through multiple processing layers.

While NER identifies real-world entities that are referred to directly by name, entities are often, once introduced by name, subsequently referred to by a descriptive phrase (noun phrase) or a pronoun. For example, we might introduce Michael Jackson by his name, but later in the text refer to him as "the king of pop," "the famous musician," or simply "he." For the sake of language understanding, it is important to comprehend which of the different referring expressions concern the same underlying real-world entity—a field of study known as coreference resolution. A distinction can also be made between descriptors that identify a real-world entity uniquely without additional context (rigid designators) and those that require contextual information to be understood (a property which in linguistics is referred to as deictic).
[Lan20] In this example, "the king of pop" is probably sufficient on its own in identifying the real-world entity Michael Jackson, while the noun phrase "the famous musician" needs context before it can be resolved. Two terms often coming up in relation to coreference resolution are discourse or discourse processing. According to Stede [Ste12] discourse processing refers to language processing beyond the sentence boundary. After processing information of individual sentences, discourse processing augments the information, e.g., by looking at relationships between words originating from different sentences or by examining causal relationships between sentences. Underlying this approach is the assumption of coherence in a text, by which its constituting sentences do not exist in isolation but form meaningful relationships (causal or coreferential in nature) around a common topic. In their review on neural Entity Coreference Resolution, Stylianou and Vlahavas [SV21] provide an overview of the development of CR approaches. They start with pre deep learning approaches in the following categories: Mention-Pair models, Mention-Ranking models, Entity-Based models, and Latent Structured models. Similar to other NLP tasks, deep learning models started to dominate also in CR, with the introduction of word embeddings by Mikolov et al. [MSC+13]. The DL methods evolved in an incremental way and in the same categories as the non-DL models by building on top of each other. The early Mention-Pair models quickly evolved into Mention-Ranking models, which are at the core of Entity-Based models. Latent Structured models and language models build on either Mention-Ranking models or Entity-Based models. A comparison of different implementations shows that the best results are currently achieved by latent structure approaches. 2.2.5 Sentiment Analysis and Opinion Mining The aim of this work is to extract opinions from written text. There exist many terms related to the process of extracting opinions, e.g., text classification, sentiment analysis, opinion mining, and many more. We start by mapping out the field and determining what is relevant for this work. The terms opinion mining and sentiment analysis are generally regarded as synonymous. In a comprehensive (over 400 references) survey book covering all important topics and latest developments in the field up to 2012, Liu [Liu12] provides the following definition: Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. [Liu12, p.1] 23 2. Literature He describes further that many different names exist under the umbrella terms sentiment analysis and opinion mining with slightly different meaning, e.g., opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion analysis and review mining. He further states, that sentiment classification is a text classification problem. Zhang et al. [ZZL15] describe text classification as the problem of assigning predefined categories to free-text documents. Kowsari et al. [KMH+19] describe text classification as labeling a set of data points (i.e., documents, text segments) with a class value from a set of k different discrete value indices. Based on the above analysis, the following conclusion is drawn. 
The most important terms for this work are sentiment analysis and opinion mining, which refer to the same area of research. The terms exist under many different names with slightly different scopes, which need to be considered. Information retrieval and text mining are areas where the techniques of opinion mining and sentiment analysis are applied to. The remainder of this section starts with a brief history of sentiment analysis and finishes with an exploration of the problem definition for various sub-problems. History and Motivation Although NLP research dates back to the 1950s [ZDLS20], sentiment analysis research only began in the early 2000s. [Liu12] The explosion of opinion data on social media and the potential advantage that can be gained from analyzing and understanding it led to strong interest from politics, industry, and science in the field. Additionally, the amount of opinionated data that could easily be harvested from social media platforms enabled effective research in the first place. [SLC17] Research on text classification dates back further than sentiment analysis, to the 1960s, and gained a major popularity boost in the early 1990s due to the availability of more powerful hardware. [Seb02] Sentiment analysis is a challenging task for many reasons. Detecting sarcasm can be difficult or even impossible without additional information, e.g., tone of voice and body language (in voiced opinions), the setting in which opinion is expressed, the history of the opinion holder in relation to the opinion target or the intent of the opinion holder. Another challenge is the potential for multiple (overlapping) sentiments in a single sentence. For example, the sentence "I am so glad I did not take the offer" expresses a positive sentiment of relief but also contains a negative sentiment towards "the offer." Problem Definition The definition by [Liu12] allows for an exhaustive capturing of all sentiments in a single sentence. According to the definition an opinion is a quadruple (g, s, h, t) with g as the opinion (or sentiment) target, s the sentiment about the target, h the opinion holder and t the time when the opinion was expressed. The earlier used example sentence "I am so glad I did not take the offer" would contain two opinion quadruples. In the first one, I (opinion holder) have a positive sentiment (s) towards my action of not taking the offer (g). In the second one, I (h) have a negative sentiment (s) towards the offer (g). The time when the statement was uttered (t) is the same for both opinions. 24 2.2. Linguistic Processing and NLP Tasks The problem definition of the sentiment analysis task as a simplified version of Liu [Liu12] can be defined as: Given an opinion document d, discover all opinion quadruples (g, s, h, t) in d. A document d can be any text of arbitrary length, e.g., a single word, a sentence, or also an entire book. In this work we will use the same definition of opinion as a quadruple but will use an adapted problem definition (refer to section 3.2 for more details). Sub-Disciplines We examine sub-problems of opinion mining as they are outlined by Liu [Liu12]. When the term sentiment analysis is used, often what is understood is the labeling of a text as positive or negative without considering a specific sentiment target. Liu refers to this as document sentiment classification. Hence, the opinion quadruple would take the form (_, s, _, _) since we are only concerned with the sentiment s. 
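To make this classification framing concrete, the following minimal sketch (not part of Liu's work; the tiny training set is invented) treats document sentiment classification as an ordinary supervised text classification problem, using a bag-of-words pipeline similar in spirit to the MNB baseline employed later in this thesis.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy documents labeled only with the sentiment s of the opinion
# quadruple; targets, holders, and times are ignored at this level.
docs = [
    "I am so glad I did not take the offer",
    "The offer was a disappointment",
    "A wonderful result for everyone involved",
    "This decision hurts the whole country",
]
labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["I am glad about this result"]))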
While in document sentiment classification the scope of a document is not explicitly specified, when the scope is fixed to a single sentence, this special case is referred to as sentence-level sentiment classification. [TQW+15] In this sub-problem, it is assumed that one sentence contains at most one sentiment. It can be solved as a three-class classification problem (positive, negative, neutral or no sentiment), or in a two-step classification process, first filtering out sentences containing no opinion and subsequently performing sentiment classification. Filtering out sentences that contain no sentiment can be done by performing a subjectivity classification [HW00], in which sentences are labeled as either subjective or objective. [WBO99] While document and sentence-level sentiment classification work on simplifying assump- tions to reduce problem complexity, aspect-based sentiment analysis is a more exhaustive approach. Here the task is to extract all opinion quintuples (target entity, target aspect of entity, sentiment, sentiment holder, time when sentiment was expressed) from a text. Opinion summarization is a field of study that deals with aggregating information from many opinions. One goal is to condense multiple different opinions into a single summarizing text. Another one is aspect-based opinion summarization, in which a summary text is created per entity and aspect, together with counts of positive and negative opinions. Contrastive view summarization deals with matching a positive and a negative opinion about the same aspect. Other sub-disciplines of sentiment analysis include generalization across language (cross- language sentiment analysis) or domain (cross-domain sentiment analysis). Multimodal sentiment analysis is the discipline of combining multiple input types to improve the performance of classification algorithms. [JH18] For example, a video file could provide three input types—spoken text of actors, background music, and the visual layer—all of which can be used for determining a sentiment. Implementations As sentiment analysis is such a broad field, the used approaches and algorithms depend on the specific problem. To get an idea of what is used and performs well, we look at SemEval, the international workshop on semantic evaluation. [PSS+21] 25 2. Literature It provides a yearly set of around 12 challenges concerning language understanding of computers. The most recent and relevant (towards sentiment classification) one is SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media. The following result summary is taken from [ZMN+19]. The challenge consisted of three sub-tasks: 1. Classification of the tweets in offensive and not offensive 2. Classification of the offensive tweets in targeted and untargeted 3. Classification of the targeted tweets into one of the target types Individual, Group, or Other Hence, based on the earlier problem definitions, (1) can be seen as a document-level sentiment classification task, and (2) and (3) can be approached as a general text classification task, or also as an aspect-based sentiment classification task. Nearly 800 teams participated in the challenge, of which 115 submitted their results. Figure 2.6: Distribution of submitted model types for the SemEval-2019 Task 6 (sub-task A) [ZMN+19] Figure 2.6 shows the distribution of participating models. To no surprise, the majority consisted of deep learning approaches, followed by traditional machine learning approaches (e.g., SVM, logistic regression). 
The models were evaluated using the F1-score. Overall, the best results were achieved by ensemble methods and state-of-the-art deep learning models such as BERT. In sub-task A, seven out of the top ten used BERT (the best had an F1-score of 0.829), and the best non-BERT model was an ensemble method (CNN with BLSTM+BGRU) ranked in sixth place (F1: 0.806). Interestingly, a rule-based approach took the top spot (0.755) in sub-task B, where ensemble methods outperformed pure BERT approaches by taking second (0.739) and third (0.719) place, followed by a pure BERT in fourth place (0.716). In sub-task C, the best model was again a pure BERT (0.660), followed in second place by an ensemble of OpenAI Finetune, LSTM, Transformer, SVM, and Random Forest (0.628).

2.3 Supervised Machine Learning in NLP

So far, we have covered architectures and tasks of machine learning. In this section, we focus on the learning aspect of machine learning, particularly on supervised text classification, which is based on learning a model on a labeled dataset.

2.3.1 Dataset Creation

Having access to a sufficient quantity of high-quality data is an essential part of every successful machine learning project. In this section, we will focus on a particular issue that can arise in manually annotated datasets for classification tasks. Manual annotation involves subjective judgment and is prone to human error, both of which can introduce label noise into the dataset. Frénay and Verleysen [FV14] distinguish between the true label of a sample and its observed label. The observed label is subject to a noise process, which is referred to as label noise. Thus, label noise refers to noise in the labeling process. Not in the scope of label noise is feature noise, which refers to noise in the measurement process. For example, the inaccuracy of a thermometer would introduce feature noise into measured temperature samples. Being aware of the issue of label noise and knowing how to mitigate it is important because it can negatively affect the accuracy of predictions, make a trained model more complex than necessary, and increase the required number of training samples. [FV14, GdCL15] While label noise in image classification has received extensive research attention, label noise in text classification has received less attention. [JPLN19] Still, there are several studies on approaches to mitigating the impact of label noise in the latter. One approach is to perform outlier detection to discard noisy labels. Garg, Ramakrishnan, and Thumbe [GRT21] train a noise model, in conjunction with the main classifier, to predict the likelihood of the presence of label noise. Samples with a higher likelihood of label noise are assigned a lower weight and thus have less impact on the network's training process. Ardehaly and Culotta [AC17] apply an enhanced label regularization technique to make their model more robust against noise. Malik and Bhardwaj [MB11] investigated an automatic label correction approach, in which samples with high-quality class labels are used to validate and correct the other samples. Overall, the methods to deal with label noise can be divided into three categories, as suggested by [FV14]:

1. Label noise-robust methods: Using models that are naturally more robust to label noise. Such models remain effective when there is only a small amount of label noise.
2. Data cleansing methods: Cleansing the dataset by correcting wrong labels or removing samples with wrong labels.
3.
Label noise-tolerant methods (probabilistic or model-based): When information about label noise or its consequences is available, then the models can be designed in a way that considers label noise. One method is to train a label noise model 27 2. Literature simultaneously to the classifier. These combined classifiers learn to predict the true label. Label noise-tolerant learning algorithms can be further divided into probabilistic methods and model-based methods. 2.3.2 Pre-processing Often, it is beneficial or even necessary to perform pre-processing operations on a dataset before the use in machine learning models. Conforming to the model’s input format, reducing processing time, or improving classification performance are possible motivations. Some examples of common pre-processing techniques specific to text processing are word stemming, lower-casing, and removal of unwanted tokens (e.g., URLs, HTML tags, stop words). Examples of pre-processing operations in a broader problem domain include reducing label noise and dealing with missing values. Pre-processing operations can significantly improve the classification performance of machine learning models. [NL18, SRS14, HLS13] Therefore, it is important to apply the right pre-processing steps. Which pre-processing steps should be applied depends on the dataset, the algorithm, the machine learning model, and the task, as [JX17] suggests. They performed a comparison of the impact of six different pre-processing methods on the sentiment classification performance of four different classification algorithms on five Twitter datasets. The results showed that some techniques had a significant impact on classification accuracy while others barely affected it. Another study on the impact of pre-processing techniques for Twitter sentiment analysis [SEA18] shows interesting results specifically for pre-processing in neural networks. Using two datasets, they evaluated 16 techniques on four model types (CNN, linear regression, Bernoulli Naive Bayes, and linear SVC). For the CNN, only 2 to 3 (depending on the dataset) of the 16 techniques improved classification accuracy, while the other techniques worsened it. For the non-neural-network approaches, significantly more (5 to 11) pre-processing techniques improved performance. The results suggest that deep learning models benefit less from pre-processing than statistical models. In the context of sentiment analysis, the techniques performing best were lemmatization, replacing repeating punctuations, replacing contractions, and removing numbers. Pre-processing can also reduce the complexity of machine learning models by removing information that is not relevant for the task (noise) because that information does not have to be modeled. Thus, pre-processing can lead to decreased model sizes, which result in faster training and classification times. [NL18] Additionally, cleansing the data from noise will probably reduce the amount of training data required to achieve the same level of classification performance. In conclusion, the proper selection of pre-processing techniques is essential, as it can improve classification performance, lower model complexity, as well as training and classification times. What constitutes the proper selection of techniques depends on the task, the dataset characteristics (e.g., language, text format, used vocabulary), and the model architecture. Since the impact of pre-processing techniques depends on many 28 2.3. 
Supervised Machine Learning in NLP factors and comprehensive studies exist mainly on specific domains (e.g., sentiment analysis of Twitter posts in English), experimentation with different pre-processing techniques should be performed in other domains. 2.3.3 Validation, Verification and Evaluation Validation, verification, and evaluation are three terms related to testing machine learning models, sometimes used synonymously. This section aims at exploring the differences, if there are any, and overlaps between the terms. The Encyclopedia of Machine Learning and Data Mining (by Sammut [SW17]) describes model evaluation as an assessment of the efficacy of a learned model. Most of the time, the primary consideration is the predictive efficacy, that is, how useful are the model’s predictions for the use case for which it was deployed. Some well-known criteria for measuring such usefulness are accuracy, precision, recall, and mean squared error. Other evaluation criteria include the model’s size or its execution time. Regarding the other terms, verification means to build the system right, while validation means to build the right system. [SW17] We try to interpret this definition in the context of machine learning models: Building the model (system) right implies adhering to some specifications. Building the right model (system) means building a model that is useful in solving a task. The lines between verification and validation are blurry because arguably, one needs a specification to determine what useful behavior is. In the other direction, the specification is usually written to ensure the useful behavior of the model. The difference seems to be in the approach towards determining a model’s properties. Verification appears to be more formal, while validation appears to be more empirical. More concretely, in terms of determining the predictive efficacy of the model, it would mean that verification aims at giving guarantees on all possible (unseen) inputs. At the same time, validation is testing the model with some unseen input and extrapolating expected behavior based thereon. In the textbooks covering the fundamentals of machine learning, we found only little mention of the terms model verification and model validation. Furthermore, the term verification appears hardly at all, and the term validation occurs only as part of other terms (e.g., cross-validation and validation set). [MRT18, WBK20, Lan95] Many papers use the term V&V (verification and validation), which also appeared in Boehm’s description of the "V-model" in 1984 [Boe84]. It describes the verification and validation (V&V) of software requirements and design specifications during the software lifecycle. In review papers about V&V of neural networks [BEW+18, TDM03], we did not find explicit definitions of the terms in the context of machine learning models. Therefore, we are left with the general interpretations in the context of software development and Boehm’s definitions from 1984: "Verification. The process of determining whether or not the products of a given phase of the software development cycle fulfill the requirements established during the previous phase." [Boe84, p.1] and "Validation. The process of evaluating software at the end of the software development process to ensure compliance with software requirements." 29 2. Literature [Boe84, p.1] Those definitions are different only in that verification checks requirements during development, whereas validation checks requirements at the end of development. 
In summary, the evaluation of machine learning models (or algorithms) is the process of evaluating their efficacy based on some property. Model evaluation subsumes the terms model verification and model validation. The difference between validation and verification in the context of machine learning seems not clearly defined. The general notion we have observed in literature is that if evaluation takes a formal approach (giving a guarantee on correctness), it can be called verification, and if it takes an empirical approach, it can be called validation. Additionally, the term V&V can be observed, but we did not find explicit definitions in the context of machine learning models. 2.3.4 Model Training and Evaluation A machine learning model has to be trained on available data in order to efficaciously make predictions on unseen data. As [MRT18] put it, machine learning refers to compu- tational methods that use experience to make accurate predictions. To determine the efficaciousness of such predictions, they also have to be evaluated. This section covers the basics of training and evaluation of models. Figure 2.7: Showing the concept of under- and overfitting in a binary classification task in two feature dimensions. The blue line represents a classifier splitting the feature space into two regions. The classifiers, from left to right, are likely to generalize too much, appropriately, and too little. Generalization A machine learning model should generalize from the data it was trained on to unseen data. Two dangers arise in the training phase. If the model is trained too specifically on the training data, it will likely not generalize well, which is called overfitting. In contrast, underfitting is if the model was not trained to be specific enough, meaning it generalizes too much and important details get lost. In both cases, classification performance on unseen data will drop. [MRT18] Figure 2.7 depicts the concepts of over- and underfitting. Various techniques can minimize the risks of over- or 30 2.3. Supervised Machine Learning in NLP underfitting. Another factor that has a major impact on classification performance is how well the samples used for training a model represent the total bandwidth of samples that can occur. For example, if the model is trained on outliers only, it has no chance to generalize to the real data. Data Splitting A machine learning model should not be evaluated on the same data it was trained on. One benefit of evaluating a model on "unseen" data is that it can show if a model was overfitted to the training data. [Ber19] If the performance is much better on the training data than it is on the test data, it could indicate overfitting. Different techniques exist for splitting a dataset into training and testing sets. We will discuss holdout, cross-validation, and stratification. The holdout method splits the data into two disjoint sets by "holding out" some data for evaluation. The method does not specify on which criterion the samples are selected into the holdout set. A common approach is to choose them randomly. A shortcoming of the holdout approach is that the model is evaluated only on a single subsample of the data. Hence it could coincidentally get good results on the subsample, even though it would perform worse overall. Cross-validation approaches tackle that problem by training and evaluating the model multiple times on different dataset parts. The evaluation results are then averaged over all runs to better estimate the model’s performance. 
[Ber19] In exhaustive cross-validation (CV) techniques, the model is validated on all possible (as defined by the method) test sets. [AC10] For example, in leave-p-out CV, in each run p samples are used for the test set and n − p for the training set. Since it is an exhaustive method, the total number of runs is \binom{n}{p}. In the non-exhaustive method k-fold CV, the dataset is split into k equally-sized sets. In each run, one set is used for validation, and the remaining k − 1 sets are used for training. In Monte Carlo CV, in each run, a random subset of fixed size is selected as the test set, while the complementary set forms the training set. The process is repeated an arbitrary number of times. A problem with randomly selecting samples into training and testing sets arises in imbalanced datasets. Those are datasets that contain significantly more samples of one class compared to samples of another class. By random selection, certain classes may become severely under- or overrepresented. A solution to this problem is stratification by class. The idea is to preserve the ratio between classes in all subsets. For example, if we were to choose randomly in a 20% holdout approach from a dataset containing 100 observations of class 1 and 10 observations of class 2, it can easily happen that we will not choose any observations of class 2. Stratification on classes would ensure that 20 samples of class 1 and two samples of class 2 are chosen. If it is not possible to preserve ratios exactly, then they should be approximated. [AC10]

Learning

Now that we have discussed different methods to choose a training set, we describe how a machine learning model learns from that data. Here we will look at learning in neural network models. To understand the training process, we need to understand the basic architecture of neural networks. The following explanations and formulas are based on [Agg18].

[Figure 2.8: Basic DNN architecture [Agg18]. (a) Perceptron architecture, applying an activation function Φ to the sum of five weighted input values to produce an output value y. (b) DNN architecture, with five inputs, two hidden layers, and a softmax layer which is used to perform multi-class classification.]

Figure 2.8a shows a perceptron consisting of only one neuron. The perceptron takes n input values x_1, \dots, x_n and outputs a value

y = \Phi\left( \sum_{i=1}^{n} w_i \cdot x_i \right) \quad (2.14)

with \Phi being an activation function (e.g., sigmoid) and w_1, \dots, w_n being the weights of the input connections. Those weights are adapted during training to approximate a function that maps the input values of samples from the training set to their actual classes. The perceptron is the simplest neural network architecture. In order to approximate more complex functions, multiple layers of neurons (as in DNNs) are required. If the network should learn to predict more than two classes, the softmax layer architecture can be used: let c_1, \dots, c_k be the possible classes and \bar{v} = (v_1, \dots, v_k) the values entering the softmax layer; then the i-th output is calculated by

\Phi(\bar{v})_i = \frac{\exp(v_i)}{\sum_{j=1}^{k} \exp(v_j)} \quad (2.15)

\Phi(\bar{v})_i corresponds to the probability that the input given to the model is mapped to class c_i. Running an input through a neural network to calculate an output value is called the forward pass; updating the network's weights by calculating the gradients of a loss function with respect to those weights, based on the outputted value, is called the backward pass.
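The following NumPy sketch (illustrative only, not taken from [Agg18]) puts these pieces together for a single softmax layer: a forward pass as in equations 2.14 and 2.15, followed by the cross-entropy loss and the mini-batch gradient descent update described next (equations 2.16 and 2.19); the toy data, dimensions, and learning rate are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

# Toy mini-batch: 4 samples with 5 features, 3 possible classes.
X = rng.normal(size=(4, 5))
y = np.array([0, 2, 1, 0])
W = rng.normal(scale=0.1, size=(5, 3))   # connection weights (no bias term)
alpha = 0.1                              # learning rate (step size)

def softmax(v):
    # Eq. 2.15: exponentiate and normalize each row (shifted for stability).
    e = np.exp(v - v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Forward pass: weighted sums of the inputs followed by the softmax activation.
probs = softmax(X @ W)

# Cross-entropy loss (eq. 2.16), averaged over the mini-batch.
loss = -np.log(probs[np.arange(len(y)), y]).mean()

# Backward pass: for a softmax layer with cross-entropy loss, the gradient
# with respect to W reduces to X^T (probs - one_hot(y)), averaged over samples.
one_hot = np.eye(3)[y]
grad_W = X.T @ (probs - one_hot) / len(y)

# Mini-batch (stochastic) gradient descent step, as in eq. 2.19.
W -= alpha * grad_W
print(f"loss before the update: {loss:.3f}")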
The loss function L tells the network how close its predictions were to the ground truth. Let \hat{y}_1, \dots, \hat{y}_k be the probabilities outputted by the network, where \hat{y}_i is the probability that the input belongs to class c_i. Then, the cross-entropy loss is calculated as

L = -\log(\hat{y}_r) \quad (2.16)

with c_r being the correct (ground-truth) class. \hat{y}_r assumes values between 0 and 1, corresponding to the probability of the network predicting the correct class. Therefore, the loss is greatest (approaching infinity) when the network is furthest from the truth (\hat{y}_r \to 0), and the loss is 0 when the network is sure to predict the correct class (\hat{y}_r = 1). Finally, we describe how each weight in the network is updated by calculating the gradient of the loss function with respect to that weight. We describe the calculation on a simplified architecture with only a single sequence of hidden units h_1, h_2, \dots, h_k followed by a single output unit o. A more complex calculation based on dynamic programming and the multivariable chain rule of derivatives must be performed in a network where multiple paths exist from input to output. In the simple example, the gradient of the loss function L with respect to the weight of the connection between h_{r-1} and h_r is calculated by

\frac{\partial L}{\partial w_{h_{r-1},h_r}} = \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial h_k} \left( \prod_{i=r}^{k-1} \frac{\partial h_{i+1}}{\partial h_i} \right) \frac{\partial h_r}{\partial w_{h_{r-1},h_r}} \quad \forall r \in 1 \dots k \quad (2.17)

We now describe the backpropagation process in more detail. The goal of backpropagation is to change the weights W of the network in such a way as to minimize the total classification error of all samples that were fed into the network. Its name comes from the fact that the incoming weights of a neuron are updated proportionally to the error produced by the neuron's activation value, starting from the last layer and proceeding layer by layer in the direction of the first layer. In other words, the error is propagated back through the layers. The calculation for a single weight in a simplified network is shown in equation 2.17. After the calculation has been carried out for all weights, we have the gradient \frac{\partial L}{\partial W}. It tells us which combination of relative changes to the weights will result in the maximum change of the loss. In other words, it tells us in which direction we have to move (from the current configuration of weights) to reduce the loss the most. The weights are updated via

W \leftarrow W - \alpha \frac{\partial L}{\partial W} \quad (2.18)

with \alpha being the learning rate or step size. Moving repeatedly in the direction of the negative gradient is known as gradient descent. It is common practice to perform the update on a batch B = \{j_1, \dots, j_m\} of randomly selected samples from the training set—referred to as mini-batch stochastic gradient descent—via

W \leftarrow W - \alpha \sum_{i \in B} \frac{\partial L_i}{\partial W} \quad (2.19)

The learning process is repeated on the training data until convergence or until another stopping criterion is reached. [Agg18]

Evaluation Metrics

After training a model, it has to be evaluated. There are different performance metrics for evaluating a machine learning model on its efficaciousness in making predictions. We will explain the metrics used for multi-class (more than two classes) classification in this study, based on [GBV20]. The calculations are based on the confusion matrix, which shows, per actual class, how often each class was predicted. Table 2.1 shows a sample confusion matrix with three classes c_1, c_2, and c_3. The rows show actual classes and the columns show predicted classes.

Table 2.1: Example of a confusion matrix

              Pred. c1   Pred. c2   Pred. c3   Total Actual
Actual c1          9          2          0             11
Actual c2          3          7          1             11
Actual c3          4         11          2             17
Total Pred.       16         20          3             39
For example, the third row shows that for the 17 samples belonging to c_3, four of them were wrongly classified as c_1, 11 were wrongly classified as c_2, and two were correctly classified as c_3. For calculating the following performance metrics, we introduce four counts: The true positives TP_k are the number of times the predicted class was k when the actual class was k. The true negatives TN_k are the number of times the predicted class was different from k when the actual class was also different from k. The false positives FP_k are the number of times the predicted class was k when the actual class was different from k. Finally, the false negatives FN_k are the number of times the predicted class was different from k when the actual class was k. The total number and the predicted number of samples with class c_k are denoted by Actual_k and Predicted_k, respectively:

\mathrm{Actual}_k = TP_k + FN_k \quad (2.20)
\mathrm{Predicted}_k = TP_k + FP_k \quad (2.21)

\mathrm{Precision}_k = \frac{TP_k}{TP_k + FP_k} \quad (2.22)
\mathrm{Recall}_k = \frac{TP_k}{TP_k + FN_k} \quad (2.23)

The metric Precision_k indicates how often the model is correct when it predicts class k. A high precision value is required in applications where it is costly to make a wrong prediction, for example, in spam filtering. The metric Recall_k indicates how many of the samples with actual class k the model predicts correctly. A high recall is important when it is more costly to miss a prediction than to make a wrong one. An example could be a cyber security system guarding sensitive information against possible intrusion. Accuracy is a popular metric that indicates how many samples are predicted correctly out of the total number of samples:

\mathrm{Accuracy} = \frac{\sum_{k=1}^{N} TP_k}{\sum_{k=1}^{N} (TP_k + FN_k)} = \frac{\sum_{k=1}^{N} TP_k}{\mathrm{Total}} \quad (2.24)

It is further possible to calculate the averages of precision and recall across all classes to get a better idea of the overall performance. There are two common ways—the macro average and the weighted average:

\mathrm{MacroAveragePrecision} = \frac{\sum_{k=1}^{N} \mathrm{Precision}_k}{N} \quad (2.25)
\mathrm{MacroAverageRecall} = \frac{\sum_{k=1}^{N} \mathrm{Recall}_k}{N} \quad (2.26)
\mathrm{WeightedAveragePrecision} = \frac{\sum_{k=1}^{N} \mathrm{Precision}_k \cdot \mathrm{Actual}_k}{\mathrm{Total}} \quad (2.27)
\mathrm{WeightedAverageRecall} = \frac{\sum_{k=1}^{N} \mathrm{Recall}_k \cdot \mathrm{Actual}_k}{\mathrm{Total}} \quad (2.28)

For the macro-average values, the sum over the individual class values is taken and divided by the number of classes N. The macro average treats each class with equal importance, regardless of how many samples belong to each class. The weighted averages calculate averages per class, weighted by the number of samples in each class. The F1-score provides an aggregated metric of a model's precision and recall. It is the harmonic mean of precision and recall:

\mathrm{F1Score}_k = \frac{2}{\mathrm{Precision}_k^{-1} + \mathrm{Recall}_k^{-1}} = \frac{2 \cdot \mathrm{Precision}_k \cdot \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k} \quad (2.29)

When calculating the averaged F1-score over all classes, two different approaches can be found in the literature. In the first approach [GBV20], the averaged F1-scores are the harmonic means of their respective averaged precision and recall values:

\mathrm{MacroAverageF1} = \frac{2 \cdot \mathrm{MacroAveragePrecision} \cdot \mathrm{MacroAverageRecall}}{\mathrm{MacroAveragePrecision} + \mathrm{MacroAverageRecall}} \quad (2.30)
\mathrm{WeightedAverageF1} = \frac{2 \cdot \mathrm{WeightedAveragePrecision} \cdot \mathrm{WeightedAverageRecall}}{\mathrm{WeightedAveragePrecision} + \mathrm{WeightedAverageRecall}} \quad (2.31)

Another way of calculating the averaged F1-scores (found in [NPK+16]) is to first determine the F1-score per class and then take the averages:

\mathrm{MacroAverageF1} = \frac{\sum_{k=1}^{N} \mathrm{F1Score}_k}{N} \quad (2.32)
\mathrm{WeightedAverageF1} = \frac{\sum_{k=1}^{N} \mathrm{F1Score}_k \cdot \mathrm{Actual}_k}{\mathrm{Total}} \quad (2.33)
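To connect these formulas with common tooling, the following sketch rebuilds the hypothetical confusion matrix of Table 2.1 and computes accuracy as well as macro- and weighted-averaged precision, recall, and F1 with scikit-learn.

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Rebuild label vectors from the cell counts of Table 2.1
# (rows = actual class, columns = predicted class).
counts = np.array([[9, 2, 0],
                   [3, 7, 1],
                   [4, 11, 2]])
y_true, y_pred = [], []
for actual in range(3):
    for predicted in range(3):
        y_true += [actual] * counts[actual, predicted]
        y_pred += [predicted] * counts[actual, predicted]

print(confusion_matrix(y_true, y_pred))           # reproduces Table 2.1
print("accuracy:", accuracy_score(y_true, y_pred))
for average in ("macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=average)
    print(average, round(p, 3), round(r, 3), round(f1, 3))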
The framework scikit-learn implements the latter way of calculating averaged F1-scores.

2.3.5 Tooling and Infrastructure

Tools and infrastructure play an essential role in the effective training of machine learning models. Depending on the infrastructure, training times can vary significantly. Models can be trained on multi-purpose processors (e.g., conventional CPUs and GPUs) or on specialized hardware (e.g., Tensor Processing Units, TPUs). In order to make use of the many cores that GPUs and TPUs offer, the network architecture must allow for parallelized computations. For example, attention models allow for better parallelization than simple RNNs, as discussed earlier. Wang, Wei, and Brooks [WWB19] have performed a detailed study benchmarking CPU, GPU, and TPU (v2/v3) platforms for deep learning. According to them, domain-specific hardware becomes more and more relevant since improvements in computing power for general-purpose processors have become increasingly difficult to achieve. Additionally, they found that the TPU architecture makes good use of parallelism due to batch size but does not exploit parallelism due to model depth to the same degree. Furthermore, GPUs show better flexibility for small batch sizes and computations other than matrix multiplication. CPUs achieve the highest FLOPS utilization on RNNs and support the largest models because of their large memory capacity. They found that the speedup of TPU over GPU depends heavily on the nature of the workload. They measured speedups on real workloads of between 3 and 6.8 times. In summary, TPUs are not always superior, although they are most of the time, and optimizing a model's architectural details is essential to gain the maximum benefit on the respective infrastructure. As specialized and powerful hardware is expensive, renting cloud-based infrastructure can be a viable alternative. Many tasks in machine learning and language processing have to be performed over and over again. This led to the creation of machine learning frameworks, libraries, and tools. There are general-purpose machine learning frameworks, e.g., Tensorflow1 and Pytorch2, and there are specialized libraries specifically for natural language processing, e.g., Spacy3 and NLTK4. Due to the steadily increasing demand for language processing, companies started to offer it as a paid service (e.g., OpenAI5).

1https://www.tensorflow.org/, accessed: 2021-09-20
2https://pytorch.org/, accessed: 2021-09-20
3https://spacy.io/, accessed: 2021-09-20
4https://www.nltk.org/, accessed: 2021-09-20
5https://openai.com/, accessed: 2021-09-20

The choice of a machine learning framework not only affects obvious aspects, e.g., ease of use and general capabilities to solve certain tasks, but also impacts performance. [WWB19] Many users share annotated and unannotated datasets on the web. If sufficiently close to the task domain of a project, such datasets can provide additional data on which machine learning models can be trained and benchmarked. To this end, we examine corpora (datasets) that could be helpful for this work. As is expected, finding unannotated corpora is easier than finding annotated ones, and finding annotated corpora containing political speeches is even more challenging. Annotation types are manifold,
but the only ones which are certainly relevant to this work (both during training and validation) are sentiment annotations. The problem with topic annotations is that it is unlikely that an annotated corpus containing topics applicable to our task can be found. Other common annotation types (e.g., part-of-speech, named entity recognition) could improve classification performance by providing additional information to the classification model. Various annotated corpora for German sentiment analysis exist on the web. However, most of them seem to focus on shorter messages and simple language, e.g., Twitter posts, comments on news sites, movie/product reviews, and news headlines. It is unclear how well a model trained on these kinds of texts will perform on political speech, which is very different in many aspects (e.g., longer sentences, more accurate grammar, a larger vocabulary, no emoticons, fewer slang words, less offensive language, fewer made-up words, and fewer spelling errors). "One Million Posts" [SST17] is a dataset containing posts from an Austrian newspaper website in the German language. Of those, 3599 are annotated with a sentiment label. Unfortunately, only 1% of those labels are positive, while 52% are neutral and 47% are negative, which will make it challenging to train for positive sentiment. SB-10k [CDEU17] is a sentiment corpus of 9738 Twitter posts featuring the following labels (with frequencies): positive (1682), negative (1077), neutral (5266), mixed (330), and unknown (1428). Barbaresi [Bar18] published a searchable text archive containing German political speeches from 1990 onwards. The speeches contain no sentiment or topic annotations but are searchable via a query language. GermaParl [Bla] contains unannotated plenary speech protocols from the German Bundestag. There also exist sentiment word lists (e.g., SentiWS [RQH10]), which assign to each word a value between −1 and 1, indicating its connotation (negative/positive). Such lists can be used to calculate a cumulative polarity value for a text to determine the overall sentiment. In summary, the supply of applicable annotated corpora that are useful for this task is low. The most promising corpus found was SB-10k [CDEU17], which is in the German language and has a well-balanced distribution of relevant sentiment labels. The selection of English corpora is significantly more extensive, and there even exists an annotated corpus of political debates. [GBZ18]

2.4 Related Work

This section examines related work with the following four goals in mind.

1. Narrowing down the field of NLP, identifying sub-fields and related areas, to get a better understanding of terms to search for.
2. Finding sources on sentiment and topic analysis, focusing, as much as possible, on German corpora in the political domain.
3. Finding corpora in the German language, annotated with sentiment and topic labels, since they are helpful for training and validation.
4. Searching for papers that use NLP methods to quantify the consistency of opinions over time. This implies looking for sources that combine sentiment analysis and topic analysis.

This project deals with the NLP sub-field of natural language understanding (NLU)—the discipline of machine reading comprehension. It has to be considered that the analyzed text is written in German, for which the state of the art in reading comprehension lags behind that for English. There are different goals in NLU.
Question answering, sentiment analysis, determining the topic of a text, and machine translation are some of them. Relevant for this work will be a combination of topic classification and sentiment analysis. 2.4.1 NLP in General and Related Fields In their survey paper, the authors of [ZDLS20] view NLP from three perspectives: modeling, learning, and reasoning. We will describe each of them briefly and relate them to this project. Modeling describes the task of creating a neural network structure that can take an encoded natural language sentence and turn it into a sequence of labels or another natural language sentence. In our case, we want a sequence of labels, i.e., the discussed topics and the politician’s sentiment. The main modeling techniques used are word and sentence embedding and sequence-to-sequence modeling. Learning deals with the training of the network parameters. In NLP, a multitude of learning algorithms is used. Supervised methods perform very well when enough labeled data are available. If that is not the case, unsupervised methods can be applied. In this project, we use supervised methods based on a manually annotated dataset. Finally, they describe reasoning as the process of generating answers to unseen questions (i.e., questions for which an NLP algorithm did not produce the answers right away) by inferring from available information. In our case, the reasoning part would entail making statements about politicians’ consistency on their opinions by analyzing their expressed sentiments for specific topics (determined by the NLP algorithm) over time. Keyphrase extraction is the technique of automatically reducing a text to some key phrases, containing a summary of the original text that preserves essential information only. Among other tasks, it is helpful for document clustering and classification. [PT20] Keyphrase extraction could be used as a pre-processing step to improve classification performance. Transfer learning is a subfield of machine learning that studies how the knowledge gained in one domain can apply to other domains. The work of Ruder [Rud19] deals with transfer learning in natural language processing. While transfer learning concepts might come up indirectly during this work, they will not be of primary concern. 2.4.2 Opinion Consistency in Politics This section checks if specific research exists on using NLP to measure the consistency of expressed opinions over time. The aim is to find those that satisfy as many of the 38 2.4. Related Work following aspects as possible: • Combining sentiment analysis and topic classification • The evaluation of change in opinions over time • The domain of political speeches • A German-language corpus The technical term for combining sentiment analysis with topic analysis is called aspect- based sentiment analysis (ABSA). This sub-discipline of NLP considers the aspects of a text as targets for sentiments perceived in the same text. [NGK20] ABSA in the Political Domain The authors of [GBZ18] performed ABSA on presi- dential debates between Hillary Clinton and Donald Trump. Their work provides two main contributions. Firstly, they provide an annotated corpus with sentiments and the aspects agenda, united states, group, opposition, self, women, and other in two different annotation schemata. Secondly, they show that the chosen schema has a substantial impact on result performance. The authors of [AMPZ17] performed ABSA on political news articles. Their work is especially relevant for several reasons. 
First, it is one of the few works that apply ABSA to larger documents (compared to most, which work with shorter social media posts) in the political domain. Second, they share the annotated corpus, which contains both sentiment and aspect annotations. Unfortunately, the language is not German, and it remains an open question whether corpora in other languages are useful for this project. Third, they develop a classification algorithm and share performance evaluations. Finally, they interpret the results by fitting them into the political and social context.

ABSA on German Language Corpora Only one relevant paper was found. Kersting and Geierhos [KG20] implemented a neural network algorithm to perform ABSA. In order to evaluate their algorithm, they collected German physician reviews and manually annotated them. The dataset contained 11,237 sentences annotated on the aspects "friendliness", "competence", "time taken", and "explanation". The authors tested different opinion extraction methods, e.g., using frequent nouns, making use of opinion and target relations, supervised learning, and topic modeling. They concluded that only supervised approaches were promising. Their algorithm, mainly based on a bidirectional LSTM, achieved an average F1-score of 0.8 over all four aspects. Their contribution is especially relevant because it was the only one that performed ABSA on a German language corpus.

Evaluating Consistency of Opinions Chandio and Sah [CS19] analyzed changing opinions on four topics relating to the UK's decision to leave the EU—Brexit, EU, Theresa May, and Jeremy Corbyn. For each topic, they collected Twitter messages from four periods. They used a keyword search, with a representative keyword for each topic, against Twitter's API to collect the messages. Then, they used NLTK6 and TextBlob7 to calculate a polarity value for each tweet and plotted the proportions of positive, negative, and neutral posts as a pie chart per topic and time period. The results show that the proportion of positive tweets for Brexit was larger in 2017 (around 32%) than in January 2019 (29%). After the parliamentary vote in February 2019, the proportion of positive tweets dropped to around 27%. However, the proportion of negative tweets on Brexit decreased as well—from 23% to 16.3%. For the keyword EU, the results show a shrinking proportion of positive tweets (from 38% to 30%) as well as of negative tweets (from 24% to 18%). According to the authors, the data further shows that people are more supportive of Jeremy Corbyn than of Theresa May. The authors of [CGG+07] present a framework for letting users express their opinions and for visualizing individual and collective sentiments over time. Although the paper focuses more on design considerations of a front-end application and less on an actual algorithm, it still provides input for designing a platform that visualizes opinions.

Evaluating Trustworthiness of Statements Although there is a decent number of articles dealing with assessing the trustworthiness of statements (e.g., fake news detection), none could be found on assessing the trustworthiness of people (or politicians). In summary, we found no studies combining all of the aspects defined above. The highest number of satisfied aspects in a single paper was two, which shows that this work is a novel contribution.
6https://www.nltk.org/, accessed: 2021-09-20 7https://textblob.readthedocs.io/en/dev/, accessed: 2021-09-20 40 CHAPTER 3 Design This chapter describes the research method, derives requirements for the experiments from the research questions, defines opinion consistency, and documents the dataset creation process. 3.1 Research Method In this work, we performed experimental research, i.e., we started with a vision in mind but did not know at the time how to get there or how far we could reach with the available resources. That is why the project followed an iterative approach of multiple phases of exploration, design, and implementation. The project specifications were kept loose and open initially and were narrowed down and concretized with increasing project duration, based on gathered insights along the way. The research questions (Q1–Q4) were designed accordingly: Open enough to allow for different implementations but concrete in answering how useful and practically feasible the chosen implementation will be. The vision was to develop a system that is capable of monitoring the consistency of opinions over time. In order to do that and to answer the research questions, we had to define a formula that can make the consistency of opinions quantifiable. Coming up with such a formula was relatively easy and is described in Section 3.2. The hard part was to define how exactly an opinion can be extracted from a piece of text. It was not easy to put the process that a human performs for identifying opinions into an exact definition. As such, this was an exploratory process (described in Section 3.3) of different ideas, trading off the subtlety of captured opinions with the feasibility of implementation. After the initial exploration phase, we decided to progress iteratively by first performing supervised opinion classification on a single topic, with the option to expand to more topics or different methods in subsequent iterations. Also, the dataset creation (described in Section 3.4) involved an exploratory process because the manual labeling of opinions 41 3. Design involves many uncertainties. The first iteration of opinion classification (Section 4.1) yielded a low accuracy. With the suspected reason being the small dataset size, we performed a second iteration (Section 4.2) on a larger dataset and were able to improve classification performance significantly. The results of those classification experiments were used to answer research question Q2. After the second iteration of classification experiments, we moved on to utilizing opinion data. In Section 4.3 we explored how opinion data can be visualized in a useful way (Q4) and in 4.4.1 we investigated the usefulness of visualizing opinion consistency (Q1b). To answer research question Q3, we analysed the impact of model performance on the resulting visualizations in Section 4.4.2. The answer to Q3 also helped in the better answering of Q1a—the practical feasibility of monitoring opinion consistency through the means of supervised ML methods. 3.2 Requirements: A Definition of Opinion Consistency We start by deriving requirements from the research questions outlined in Section 1.2: R1 Precise Definitions: The terms opinion consistency and opinion have to be well- defined, because all other results (Q1–Q4) depend on those definitions. Therefore, those definitions must be easily comprehensible in order to put the results into context. R2 Extracting Opinions: A method for extracting opinions from text has to be estab- lished and documented. 
The extracted opinions are subsequently also referred to as opinion data and are required for calculating the opinion consistency. (Q1, Q2, Q4)

R3 Measurability and Comparability: The classification results of the ML algorithms used to predict opinions have to be measurable in order to make them comparable. (Q2)

R4 Transparency and Reproducibility: The experiment conditions have to be well documented in order to make the experiments reproducible. (Q1–Q4)

R5 Visualizations: To help in determining the feasibility and usefulness of monitoring and visualizing opinion consistency and opinion data in general (Q1, Q4), at least the following graphs should be created:
– A graph that compares the opinion consistencies of multiple speakers over time.
– A comparison of the actual vs. predicted opinion consistencies.
– A comparison of the actual vs. predicted opinion data of extracted opinions.

R6 Estimation of Accuracy: In addition to visualizing opinion consistency based on experimental data, an effort should be made to directly determine the theoretical accuracy of predicted opinion consistency values based on the classification capabilities of the underlying machine learning algorithms. (Q3)

In fulfillment of R1, the remainder of this section establishes the definitions of the terms opinion and opinion consistency. To answer the research questions, we need a way of quantifying the consistency of opinions. We build on Liu's [Liu12] definition of opinions as quadruples, as described in Section 2.2.5. An opinion (g, s, h, t) has a target g ∈ G, expresses a sentiment s ∈ S, and is held by the opinion holder h ∈ H at time t ∈ ℕ. O denotes the set of all extracted opinions. In our case, the following interpretations apply:

1. The set of all opinion targets G contains ideas discussed in the Austrian parliament.
2. The possible sentiments S := {POSITIVE, NEGATIVE, NEUTRAL} represent the speaker's stance towards the opinion target. A POSITIVE/NEGATIVE sentiment means that the opinion supports/resists the idea. A NEUTRAL sentiment means that the opinion neither clearly supports nor resists the idea.
3. The set of opinion holders H contains all speakers who expressed an opinion in parliament. A speaker h can belong to a political party P, denoted by h ∈ P. The set of all parties is denoted by 𝒫.
4. The time t is a timestamp of the date when the speaker expressed the opinion.

Opinion consistency should be high when the number of contradicting opinions is low and vice versa. Two opinions (g1, s1, h1, t1) and (g2, s2, h2, t2) contradict each other if g1 = g2 (they refer to the same topic), s1 ≠ s2, s1 ≠ NEUTRAL, and s2 ≠ NEUTRAL (one has a positive sentiment and the other one has a negative sentiment). In this study, we focus on the opinion consistency of a single speaker (h1 = h2) or of the speakers of a political party P (h1, h2 ∈ P). Next, we define three variables to count the number of positive, negative, and neutral opinions. We count opinions for a subset of topics G′ ⊆ G and a subset of speakers H′ ⊆ H, up to a point t′ in time:

Positive(G′, H′, t′) = | { (g, POSITIVE, h, t) ∈ O | g ∈ G′ ∧ h ∈ H′ ∧ t ≤ t′ } |   (3.1)
Negative(G′, H′, t′) = | { (g, NEGATIVE, h, t) ∈ O | g ∈ G′ ∧ h ∈ H′ ∧ t ≤ t′ } |   (3.2)
Neutral(G′, H′, t′) = | { (g, NEUTRAL, h, t) ∈ O | g ∈ G′ ∧ h ∈ H′ ∧ t ≤ t′ } |   (3.3)
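As a concrete illustration, the counting functions (3.1)–(3.3) and the two consistency formulas proposed just below can be implemented directly over a list of opinion quadruples; this is a minimal sketch, not the project's actual code, and the sample opinions are invented.

from datetime import date

# Opinions as quadruples (target g, sentiment s, holder h, time t).
opinions = [
    ("lockdown", "POSITIVE", "Speaker A", date(2020, 3, 15)),
    ("lockdown", "NEGATIVE", "Speaker A", date(2020, 11, 2)),
    ("lockdown", "NEUTRAL",  "Speaker A", date(2021, 1, 20)),
]

def count(sentiment, targets, holders, until):
    # Equations 3.1-3.3: opinions with the given sentiment, restricted to
    # the targets G', the holders H', and times t <= t'.
    return sum(1 for g, s, h, t in opinions
               if s == sentiment and g in targets and h in holders and t <= until)

def op_cons(targets, holders, until, count_neutral=True):
    pos = count("POSITIVE", targets, holders, until)
    neg = count("NEGATIVE", targets, holders, until)
    neu = count("NEUTRAL", targets, holders, until) if count_neutral else 0
    total = pos + neg + neu
    # Equation 3.4 when count_neutral is True, equation 3.5 otherwise.
    return (max(pos, neg) + neu) / total if total else None

print(op_cons({"lockdown"}, {"Speaker A"}, date(2021, 12, 31)))         # OpCons1
print(op_cons({"lockdown"}, {"Speaker A"}, date(2021, 12, 31), False))  # OpCons2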
Neutral opinions should never affect opinion consistency negatively, but they could increase opinion consistency. We propose two formulas:

OpCons1(G′, H′, t′) = (max{Positive, Negative} + Neutral) / (Positive + Negative + Neutral)   (3.4)

OpCons2(G′, H′, t′) = max{Positive, Negative} / (Positive + Negative)   (3.5)

In Equation 3.4, neutral opinions increase opinion consistency, while in Equation 3.5 they have no effect on it. The proposed definitions fulfill our requirements of making the consistency of opinions quantifiable and comparable. The value correlates positively with the proportion of opinions expressing a non-contradicting sentiment. Furthermore, these definitions are flexible. For example, to calculate a value for a single speaker h on a single topic g, we set H′ := {h} and G′ := {g}. Or, if we want to calculate the opinion consistency of all speakers in a party P, we can set H′ := P. These definitions imply that the values become more insensitive to changing opinions the more opinions are collected over time. Other, more complex calculations could counter that problem. Improved methods could use a rolling window in which opinions are considered or weigh recent opinions more strongly. In this work, we will use the simple definitions from Equation 3.4 and Equation 3.5 and leave the study of more complex ones to future work.

3.3 Experiment Design

In fulfillment of R2, we had to establish a method for extracting opinions from text. This was an exploratory process that started in the design phase (this section) and continued through the dataset creation phase (Section 3.4) and the early stages of opinion classification (Section 4.1). This section documents the process up to the point where we had enough information to start creating a dataset. After we had a definition of opinion consistency, and before we could proceed with the experiments, the following questions required an answer:

1. How to identify an opinion in a text document?
2. On which topics should we extract opinions?
3. How should an opinion be extracted on the technical level?
4. How should the data be represented?

We started with the first two questions. To that end, we downloaded a number of speech protocols from the Austrian parliament website1 and tried to extract opinions manually by reading through them and highlighting text passages from which opinions could be derived. It quickly became apparent that this was a considerably complex task for the following reasons: Without narrowing down the scope of what to look for, each statement can potentially have multiple layers of opinions. Some opinions only become apparent with more context information. Furthermore, some opinions are more apparent than others. Unless an opinion is stated directly and without room for interpretation, which is rarely the case, its identification involves a subjective judgment.

We concluded that it is best to start with a simple method and a narrow scope. For that reason, we decided to extract opinions on the sentence level. In order to determine whether a sentence is of relevance to the topic of interest, we chose to use a keyword search. Regarding the second question, we decided to focus on a single topic, at least in the first experiment. The topic should be polarizing so that diverse opinions exist, and it should be relevant so that enough opinions are expressed.

1https://www.parlament.gv.at/, accessed: 2021-09-24
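Before narrowing the scope to a concrete topic, the two consistency formulas from Section 3.2 can also be illustrated in code. The snippet continues the counting sketch above; the function names are ours and the snippet is purely illustrative:

```python
def op_cons1(positive, negative, neutral):
    """Equation 3.4: neutral opinions increase consistency."""
    total = positive + negative + neutral
    if total == 0:
        return None  # undefined when no opinions were expressed
    return (max(positive, negative) + neutral) / total

def op_cons2(positive, negative):
    """Equation 3.5: neutral opinions have no effect."""
    if positive + negative == 0:
        return None  # undefined when only neutral (or no) opinions exist
    return max(positive, negative) / (positive + negative)

# Example: 6 positive, 2 negative, 2 neutral opinions
print(op_cons1(6, 2, 2))  # 0.8
print(op_cons2(6, 2))     # 0.75
```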
The chosen topic that fulfilled both requirements at the time was the discussion about lockdowns as a measure to prevent the spread of the coronavirus. Now that we narrowed the scope to a specific topic, it was unclear how to extract opinions on that topic on the technical level. First, we had to decide between supervised and unsupervised methods. The advantage of the latter would be that a manual annotation process would not be necessary. Ultimately, we decided to use supervised methods because, in opinion mining, they usually outperform unsupervised methods. [SLC17] We dedicated some time to explore rule-based approaches but realized after a short time that statistical and neural network approaches were more promising. Intermediate Data Formats At that point, we had an answer for the second question and narrowed down the answers to the other questions. Before we could progress further, we required more insight that could be gathered only through experimentation on the data. To support experimentation and analysis on the data, we had to bring the raw data of speech protocols from the HTML format to a format suited for the processing by machine learning algorithms. Since we did not yet know how we would identify opinions, we designed the intermediate data formats with maximum flexibility in mind. We came up with three file formats, called primary, secondary, and tertiary. The primary format is a comma-separated values (CSV) file that contains the following fields: • speaker: Contains the speaker’s title(s), name, and party affiliation. • speech: Contains the speaker’s transcribed speech from the moment they begin to speak up to the moment they are interrupted by the president or are done speaking. In the secondary format (also CSV), the speeches of the primary format are split along sentence boundaries and enhanced with additional information, resulting in the following fields: 45 3. Design • sent_id: A unique identifier of the sentence. • date: The date when the sentence was uttered by the speaker. • protocol_id: The id of the protocol in which the sentence is contained. • party: The party affiliation of the speaker that uttered the sentence. • speaker: The speaker that uttered the sentence. • governing: A truth value that indicates whether the party of the speaker was governing at the time the sentence was uttered. • text: The transcribed sentence. The tertiary format is in the CONLL-X format and contains for each sentence all fields of the secondary format as a comment and additionally the sentence analysis in the CONLL-X format with the following columns (descriptions from [BM06]): • ID: Token counter, starting at 1 for each new sentence. • FORM: Word form or punctuation symbol. • LEMMA: Lemma or stem of word form. • CPOSTAG: Coarse-grained part-of-speech tag. • POSTAG: Fine-grained part-of-speech tag. • FEATS: Unordered set of syntactic and/or morphological features. • HEAD: Head of the current token, which is either a value of ID or zero (’0’). • DEPREL: Dependency relation to the HEAD. • PHEAD: Projective head of current token, which is either a value of ID or zero (’0’), or an underscore if not available. • PDEPREL: Dependency relation to the PHEAD, or an underscore if not available. After the three intermediate formats were defined, we started with downloading session protocols in the HTML format from the government website. It has to be noted that it takes the transcribers a considerable amount of time before they make the final protocols available. 
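To make the secondary format tangible, the following sketch writes a single invented record with the fields listed above; the field values and file name are placeholders, not actual project data:

```python
import csv

SECONDARY_FIELDS = ["sent_id", "date", "protocol_id", "party",
                    "speaker", "governing", "text"]

# One invented record in the secondary format (all values are placeholders).
record = {
    "sent_id": "12345",
    "date": "2020-11-05",
    "protocol_id": "PROTOCOL_001",
    "party": "PARTY",
    "speaker": "Jane Doe",
    "governing": "True",
    "text": "Ein Beispielsatz aus einem Redeprotokoll.",
}

with open("secondary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=SECONDARY_FIELDS)
    writer.writeheader()
    writer.writerow(record)
```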
At the time, the finalized protocols were lacking behind approximately five months. To bring the speeches from HTML format to the primary format, we implemented a document parser in Python. We chose to directly apply the pre-processing step of removing HTML tags from the speeches since the goal was to extract opinions only from words uttered by the speaker, without access to additional meta-information. 46 3.4. Datasets: Dataset Creation and Analysis After transforming all available protocols from the HTML format to the primary format, we implemented two additional parsers, one to convert the primary format into the secondary format and one to convert the secondary format into the tertiary format. To generate the dependency parse for the tertiary format, we used the ParZu library [SVS13]. Finally, we had the data of all available protocols, available in the intermediate formats, for further processing. 3.4 Datasets: Dataset Creation and Analysis In Section 3.3 we defined some parameters of the first experiment. We chose the topic LOCKDOWN and decided that we would gather opinions on the sentence level. Furthermore, we decided to use a keyword search to identify sentences of relevance, i.e., those that are concerning the chosen topic. We used the regular expression [lL]ock.?[dD]own to filter for sentences that concern the topic. This gave us a selection of 492 sentences. As a consequence of the decision to use supervised machine learning approaches to extract opinions, we had to manually annotate the 492 sentences. In the first approach, we tried to assign one of three labels representing the speaker’s opinion on the question of lockdowns to each sentence. This proved to be difficult because we constantly doubted whether our definition of opinion was still the same as in the beginning. Therefore, we wrote a definition down, but that did not solve the problem because there constantly appeared border cases that were not covered by the definition. (a) Opinion distribution per sentiment (b) Overall opinion distribution Figure 3.1: Relationship between sentiment and opinion categories in the first dataset Additionally, in the first classification attempt, we were conservative with giving subjective opinion labels (+ or -), and as a result, only 49 out of 492 samples were subjective. We anticipated that this could be a problem for the machine learning algorithm to learn from such a small sample size. Therefore, we performed a second classification attempt with a bias towards assigning subjective labels. In the second attempt, the number of 47 3. Design subjective labels increased to 366. Figure 3.1b shows the distribution of labels of the second attempt. In an attempt to approach the annotation process in a more objective way, we assigned labels for multiple categories per sample, each with a more specific definition. We annotated the data on the following seven categories: 1. General sentiment of the sentence (+/-/o) 2. Speaks about somebody else’s opinion (x = no, # = speaks about somebody else’s opinion, ## = speaks about somebody speaking about somebody else’s opinion) 3. Explicit support (+ = expresses explicit support for lockdowns, - = expresses explicit resistance against lockdowns, x = expresses neither support for nor resistance against lockdowns) 4. Impact (+ = mentions explicitly that lockdowns have a positive impact, - = mentions explicitly that lockdowns have a negative impact, o = talks neutrally about the impact of lockdowns, x = does not mention the impact of lockdowns) 5. 
Organisation (+ = expresses positive sentiment towards the organisational aspects surrounding the implementation of lockdowns, - = expresses negative sentiment towards the organisational aspects surrounding the implementation of lockdowns, o = talks neutrally about the organisational aspects surrounding the implementation of lockdowns, x = does not mention the organisational aspects surrounding the implementation of lockdowns) 6. Overall opinion, less subjective: The overall opinion on lockdowns with a bias towards neutral opinions (+/-/o) 7. Overall opinion, more subjective: The overall opinion on lockdowns with a bias towards non-neutral opinions (+/-/o) The labels of categories 3–5 can be prepended by # or ## to denote the opinions of other speakers. To illustrate the nuanced and subjective nature of the labeling process, we go through two example sentences from the LOCKDOWN set. Sentence 1: Egal wann es einen solchen Lockdown gibt, für die Wirtschaft gibt es keinen guten Zeitpunkt für eine solche Maßnahme, und gerade vor dem anlaufenden Weihnachtsgeschäft ist dieser Schritt natürlich besonders schmerzhaft. and Sentence 2: Wir wissen, dass wir keine tatsächliche Berichtigung von einer tatsächlichen Berichtigung machen können, aber, Herr Loacker, ich möchte das hier schon richtigstellen: Sie behaupten, Kollegin Hebein hätte gesagt, sie kann sich einen zweiten Lockdown vorstellen. 48 3.4. Datasets: Dataset Creation and Analysis Table 3.1 shows the labels we have assigned in the seven categories. For the first sentence, we have assigned an overall negative sentiment (C1 = -). The speaker mentioned that a lockdown is bad for business and especially bad for the Christmas business (C4 = -). The speaker did not explicitly express whether they are for or against a lockdown (C3 = x). They did not talk positively or negatively about the organizational aspect, but they talked neutrally about the timing of a lockdown (C5 = o). We found that even though they mentioned the negative impact of a lockdown, the way in which they have formulated the sentence implies that they think there is no alternative to a lockdown, which means they are ultimately for a lockdown. Since this is a rather subjective interpretation, we have assigned a neutral opinion in the conservative category (C6 = o) and a positive opinion in the interpretative category (C7 = +). The second sentence shows an example of a sentence in which the speaker addressed a statement of colleague Loacker, in which he addressed colleague Hebein, who presumably expressed a positive opinion on a lockdown. Accordingly, we have assigned C2 = ## and C3 = ##+. We considered an overall sentiment of neutral and negative but ultimately went with negative since the sentence is confrontational (C1 = -). Furthermore, the speaker did not talk about the impact (C4 = x) or organizational aspects (C5 = x) and we cannot deduce an opinion for or against lockdowns (C6 = C7 = o). C1 C2 C3 C4 C5 C6 C7 Sentence 1 - x x - o o + Sentence 2 - ## ##+ x x o o Table 3.1: The annotations on the two example sentences, according to the seven categories. The first idea was to rely on explicit support (category 3) only since this would be the most objective way of determining the opinion. An analysis of these data revealed that politicians rarely expressed explicit support or resistance (only 107 out of 492 times). 
The politicians more frequently expressed a subjective (non-neutral) opinion on the effects of a lockdown (172 times, category 4) and on the implementation details of a lockdown (138 times, category 5). Considering the low amount of explicitly expressed opinions, we concluded that we had to include more subjective opinions as well. Another idea was to determine the opinion directly from the sentence’s overall sentiment (category 1) if the correlation between the overall sentiment and the opinion (category 6 and 7) would be high enough. We plotted the sentence’s sentiment against the overall opinion of category 7. Figure 3.1a shows the relative frequency of opinions per sentiment. For example, in the third column, we see that when a sentence has a negative sentiment, it is labeled as a supporting opinion in 8%, as a neutral opinion in 13%, and as a resisting opinion in 79%. Although the correlation is relatively high for negative sentiments, it is not high enough for positive and neutral ones. Ultimately, we came to the conclusion that it was easiest to predict the opinions directly from category 7 and proceeded with the classification on the first dataset (see 4.1). 49 3. Design Second Dataset The classification results of the first experiment were relatively poor. One explanation for the poor results was that the dataset was too small compared to the complexity of the task. Under that assumption, the classification algorithm would not have access to enough training samples to learn all the features that can be used to extract opinions. To examine the impact that dataset size has on the classification performance, we planned to collect a second dataset that was significantly larger than the first one. The first dataset contained approximately 500 entries. The second one should contain at least ten times as many records. With a defined target for the dataset size, we had to choose a topic that could produce around 5000 records. We tried various terms that were related to the coronavirus pandemic, but it turned out that none of the topics produced nearly enough samples. We looked at possibilities to gather more data from the government website and found a section with tentative speech protocols that are in a preliminary state. Initially, we avoided those protocols because their format is more difficult to parse than that of the finalized ones. Since we required additional data, we had no other choice than to implement another parser that could transform the preliminary protocols to the primary format. After parsing the preliminary protocols into the intermediate file formats, the number of available sentences increased significantly. We experimented with different regular expressions and grouped similar ones to form topics. The finalized topics, together with their respective regular expressions, are: • MASKS: mask|ffp2|mund.?nasen • VACCINES: impf • TESTING: testet|testung|tests|testen|pcr • DISTANCING: distanc|abstand|social.d • LOCKDOWN: lock.?down We filtered for sentences that contained at least one of the patterns. The resulting number of sentences per topic is shown in the following table: Topic Sentences MASKS 799 VACCINES 2298 TESTING 1641 DISTANCING 410 LOCKDOWN 855 Table 3.2: Number of sentences per topic 50 3.4. Datasets: Dataset Creation and Analysis Since none of the topics came close to the target of 5000 samples, we decided to combine all five topics under the topic MEASURES. The topic MEASURES contains opinions on measures against the spread of the coronavirus. 
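The keyword filter that produced these topic counts can be approximated with a short sketch. We assume case-insensitive matching here, and the example sentence is invented:

```python
import re

# Topic patterns as listed above; we assume they were applied case-insensitively.
TOPIC_PATTERNS = {
    "MASKS": r"mask|ffp2|mund.?nasen",
    "VACCINES": r"impf",
    "TESTING": r"testet|testung|tests|testen|pcr",
    "DISTANCING": r"distanc|abstand|social.d",
    "LOCKDOWN": r"lock.?down",
}
COMPILED = {t: re.compile(p, re.IGNORECASE) for t, p in TOPIC_PATTERNS.items()}

def topics_of(sentence):
    """Return the set of topics whose pattern matches the sentence."""
    return {t for t, rx in COMPILED.items() if rx.search(sentence)}

# A sentence belongs to MEASURES if it matches at least one topic pattern.
sentence = "Der Lockdown und die FFP2-Maskenpflicht wurden verlängert."  # invented example
print(topics_of(sentence))  # e.g. {'MASKS', 'LOCKDOWN'} (set order may vary)
```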
In total, we could gather 5573 sentences on this topic. The total does not equal the sum of individual topics because one sentence can belong to multiple topics. One of the reasons for gathering the second dataset was to examine if a larger dataset improved classification performance. We argue that the fact that the second dataset is a superset of the first one could help to a minor degree in making the results comparable, but more importantly, it does not hurt the results. Let us say the classification performance is a function of dataset size and the complexity of samples. If we had two mutually exclusive datasets, with the samples in one of the datasets being significantly easier to predict than those of the other, the impact of dataset size would become less important. We argue that the impact of this relation is low in this case because the first dataset has only one-tenth of the samples of the second one. To better study the impact of dataset size, we could have used cross-validation on multiple subsets of the second dataset, but we leave that to future study. Figure 3.2: Screenshot of the annotation software that aided in the annotation process of the second dataset After we had gathered the samples for the MEASURES dataset, we had to label them 51 3. Design with the opinion labels. Since this required a considerable effort, we decided to implement an annotation software (Figure 3.2) that aided in the process. The key features of the software are: • Displaying of unlabelled records in random order and without the speaker’s name or their party affiliation to ensure that the decision is not influenced by meta- information that would not be available to the machine learning algorithm. • Loading of unlabeled records from one file and storage of the labeled record in another file. The software remembers which records were labeled already. • Convenience features to speed up the process. They include assigning a label via hotkey and highlighting keywords. • Displaying of context information in the form of the previous and the subsequent sentence. If context information is used in the labeling process, then the dataset will also contain it, i.e., the machine learning algorithm has the same information as the human annotator has. In the next chapter, we document how we used different classification algorithms to predict the opinions of records from the two datasets. 52 CHAPTER 4 Experiments This chapter documents the opinion mining process, with the help of various machine learning algorithms, on the first (Section 4.1) and second (Section 4.2) dataset. In Section 4.3, we visualize opinion data from the second dataset aggregated per speaker and party. The chapter is concluded in Section 4.4, where we calculate the opinion consistency values for speakers and political parties. In Section 4.4.1, we visualize actual and predicted opinion consistency values. Finally, in Section 4.4.2, we explore the impact of a model’s capability to classify opinions on the accuracy of opinion consistency values. 4.1 Opinion Classification: First Experiment After we labeled the first dataset, we examined the capability of various machine learning models to predict the speakers’ opinions on lockdowns. 4.1.1 Classification For the first run, we started with a simple deep learning network. The network’s architecture consisted of a bag-of-words embedding layer with 64 dimensions, followed by a fully connected linear output layer with three output neurons. 
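A PyTorch sketch of this kind of architecture is shown below. The vocabulary size is a placeholder, and the snippet illustrates the described layer structure rather than the exact model or framework configuration used in the experiments:

```python
import torch
from torch import nn

class BagOfWordsClassifier(nn.Module):
    """Bag-of-words embedding layer (64 dimensions) followed by a linear
    output layer with three neurons, one per opinion class."""
    def __init__(self, vocab_size: int, embed_dim: int = 64, num_classes: int = 3):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        # EmbeddingBag averages the token embeddings of each sentence.
        return self.fc(self.embedding(token_ids, offsets))

model = BagOfWordsClassifier(vocab_size=20_000)  # vocabulary size is hypothetical
```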
We initialized the weights randomly and started training. We used a stratified holdout approach to split the data 80-20. From the training set, we split off another 20% to be used for validation during training. We used a stochastic gradient descent optimizer with a batch size of 32 and a cross-entropy loss function. We trained for 20 epochs with no early stopping criteria. Furthermore, no pre-processing was applied. We manually ran the first setup a few times and achieved classification accuracies between 64% and 80% on the test split. We did not expect the results to be representative of the true performance because we did not use weighted class labels for training, which should be done on imbalanced datasets. We suspected that the network had learned to predict the majority class and therefore achieved good accuracy for purely statistical reasons. A confusion matrix could have been used to verify that hypothesis.

At the time, we decided to approach the problem differently by creating a training set with the same number of samples in each of the three classes. In the second data-splitting approach, with an even distribution of class labels in the training set, we achieved classification accuracies between 20% and 70%. We explained the high variance by the small sample size: depending on the quality of the randomly chosen training samples, the results on the test set could be significantly better or significantly worse. Initially, the minority class of our dataset was the positive opinion with only 50 samples. Therefore, with the approach of evenly distributing the classes, the training set consisted of only 150 samples. To verify whether that was indeed a cause of the high variance in classification performance, we labeled the dataset a second time with a bias towards non-neutral opinions (referred to as category 7 in Section 3.4). This time, the minority class was still the positive opinion, but with 90 samples instead of 50. We split the data into an evenly distributed training set (90 samples of each class) and put the rest into the test set. With classification accuracies between 10% and 50%, the results were not better.

At this point, we decided that, due to the high variance in the results, we had to perform multiple runs of training and evaluation on different splits of the data and average the results to get a better understanding of the true performance. Another explanation for the high variance could be the small validation set, which we used to dynamically adjust the learning rate during training. Since the validation set consisted of only 20% of randomly selected samples from the training set, the variance in classification accuracy will be high for such a small set. Due to the overall dataset being small, we performed some runs without a validation set, i.e., we used the training set in place of a separate validation set. This approach increases the risk of overfitting, but it might be the better compromise on our dataset. Again, we used the same holdout approach as before and performed 50 runs of training and evaluation. The results showed that most of the runs achieved a classification accuracy between 30% and 50% on the test set, with some outliers performing significantly better or worse. Of course, the resulting average accuracy of close to 40% was not satisfying, as a random-guessing approach would not be significantly worse, with an average accuracy of 33%.
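The repeated train-evaluate procedure can be sketched as follows; `build_and_train` and the data arrays are placeholders for the actual model code, and the split parameters are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate_once(texts, labels, build_and_train, seed):
    """One run: stratified 80-20 holdout split, train, return test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)
    model = build_and_train(x_train, y_train)  # placeholder for any of our models
    predictions = np.asarray(model.predict(x_test))
    return float(np.mean(predictions == np.asarray(y_test)))

def monte_carlo(texts, labels, build_and_train, runs=50):
    """Average the test accuracy over repeated random splits."""
    scores = [evaluate_once(texts, labels, build_and_train, seed)
              for seed in range(runs)]
    return float(np.mean(scores)), float(np.std(scores))
```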
To test the impact of pre-processing, we performed another run after applying stop-word removal. The results showed practically no difference in performance. A more complex model At that point, we had explored various data-splitting methods and one pre-processing method. Next, we wanted to test a more complex model architecture. It consisted of an embedding layer with 300 dimensions, followed by a bi-directional LSTM with two layers and a linear layer with three output neurons. For 54 4.1. Opinion Classification: First Experiment the embedding layer we used pre-trained GloVe word embeddings1. We had to apply the pre-processing step of lower-casing all words because the pre-trained word vectors were all in lower case. Interestingly, with the data splitting method of even distribution, the results were worse than random guessing. Next, we tried a different approach to counter the class imbalance. We used a stratified train-test split but used weighted training samples. When calculating the loss, we made the training samples of underrepresented classes more important and those of overrepresented classes less important. The results showed the superiority of this approach compared to the one we had used previously. An attention model Since we had achieved promising results by using a recurrent network, we were interested in examining the capabilities of the attention mechanism in transformer architectures. For the subsequent runs, we used a BERT architecture with around 109M trainable parameters. It consists of an input layer that takes a sequence of word ids, followed by a pre-trained German BERT model, completed by a fully connected layer. Due to the large size of this BERT model, it was not feasible to train it on a personal CPU. Instead, we performed all experiments related to BERT on tensor processing units (TPUs) at Google Colab servers, thus managing to reduce training times to reasonable durations. Since the Colab servers disconnect users from time to time, we had to implement a mechanism to store and resume the training progress after each epoch. To test the capabilities of this pre-trained BERT model, we first used it on the 10kGNAD2 dataset. We achieved an overall accuracy of 89%, which gave us confidence in the model’s architecture. After verifying the model on the 10kGNAD dataset, we moved on to the LOCKDOWN dataset. We used an 85-15 stratified split, with no validation set still. We applied the ad- ditional pre-processing techniques of lower-casing, stopword removal, and stemming. For now, we used the same pre-processing steps in all BERT runs. In the second experiment (Section 4.2), we also compared the impact of different pre-processing techniques. The model architecture quickly brought the machine’s hardware capabilities to their limits. Therefore, we had to truncate the sentences to a maximum length of 128 tokens and use a batch size of 64 samples. In the case of the LOCKDOWN dataset, the truncation did not lose any data, as Figure 4.1 shows. We trained the network for a maximum of 20 epochs but used an early stopping policy if there was no improvement in the loss for more than three epochs. Usually, the network converged in epoch 10–12. The classification report, showing the performance per class, revealed a weakness in the training method. The model predicted zero positive opinions. We explain that circumstance by the low amount of positive samples in the training set. 
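One standard remedy, and roughly what the next step amounts to, is to weight the loss by inverse class frequency so that rare classes contribute more. A minimal PyTorch sketch, with invented class counts:

```python
import torch
from torch import nn

# Hypothetical label counts for negative, neutral, and positive opinions
# (positive is the minority class in our data).
class_counts = torch.tensor([250.0, 150.0, 90.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

# Samples of rare classes now contribute more to the loss, frequent ones less.
criterion = nn.CrossEntropyLoss(weight=class_weights)
```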
After using a 1https://www.deepset.ai/german-word-embeddings, accessed: 2021-09-27 2https://github.com/tblock/10kGNAD, accessed: 2021-09-30 55 4. Experiments Figure 4.1: Speech lengths in the LOCKDOWN dataset weighted loss function to give the positive samples more importance during training, the model also learned to predict positive opinions and achieved an impressive performance of 59% accuracy. Out of curiosity, we also used a second data-splitting method, creating a test set consisting of precisely 20 samples from each class, for a total of 60 samples. We trained the model on the training set consisting of the remaining samples with a weighted loss function. In this case, the evaluation performance dropped to 49% accuracy on the test set. We conclude that it is still more difficult for the model to predict positive samples simply because there are fewer examples to train on, even though they have more impact due to the weighted loss function. Statistical Models At this point, we had a good overview of the capabilities of deep learning models. Additionally, we wanted to test the capabilities of some statistical models. We started with a multinomial Bayes (MNB) model, as described in Section 2.1.1. Since in MNB, we did not have to train a network with many parameters but count frequencies of words per class, fitting the model to the training data took considerably less time than for the deep learning models. Therefore, the experiments could be run on an average personal computer with acceptable time investment. As before, we weighted the samples proportionally to the class sizes to account for the class imbalance. We applied the same pre-processing methods as we did for the BERT runs and performed 1000 runs on two data-splitting methods. First, we utilized a random 85-15 stratified split and achieved an accuracy of 53% averaged over all runs. On the second split—by randomly selecting 20 samples from each class into the test set and the remaining samples into the training set 56 4.1. Opinion Classification: First Experiment —the model achieved an averaged accuracy of 51%. Interestingly, the difference between results on the two splits is only 2% compared to the 9% observed for the BERT model. Finally, we applied the BM25 document ranking algorithm (described in Section 2.1.1) to classify opinions. Based on a text query, the document ranking function calculates a relevance score for each document in a set of documents. To make it work for text classification, we used the speech to be classified as the query and used BM25 to calculate relevance scores for all speeches in the training set. Then, we used the samples with the n highest scores to determine the label of the query sample. We applied the same pre-processing steps as in the BERT run. In each run, we randomly selected 20 samples from each class into the test set and used the remaining samples for calculating the document scores. We used two methods for determining the class label from the ordered list of samples of the lookup set of size N—one weighted by class frequency, the other not. Let (s1, l1), . . . , (sN , lN ) with si ≥ sj ∀i < j be the ordered list of tuples with the relevance score si describing how relevant the i–th document is in relation to the query, and li the class label of the i–th document. 
Then, we assign class c to the query by calculating:

Score_c = Σ_{i ≤ n ∧ l_i = c} s_i · w_c   (4.1)

c* = argmax{Score_1, Score_2, Score_3}   (4.2)

In the non-weighted case we set w_c = 1, and in the weighted case we calculate w_c = N/N_c, with N_c being the number of samples belonging to class c. We experimented with different values of n, both with and without class weighting. We attained the best results with n = 7 and class weighting. The average accuracy over 500 runs was about 45%. In the next section, we present a table of the overall results and provide an interpretation.

4.1.2 Summary of Results

With the five models that we used to predict opinions, we had a good overview of the performance we could expect. The results, averaged over all runs, are shown in Table 4.1. The accuracy values are slightly different from those of the previous section. The reason is that after we had performed the second experiment, we reran all algorithms on the first dataset to also record the macro-average F1-Scores, which we did not record initially.

Approach        Mean Acc  Mean F1  Std Dev  Min. F1  Max. F1  Runs
MNB             0.53      0.48     0.055    0.31     0.66     1000
BERT            0.56      0.47     0.088    0.23     0.66     100
LSTM            0.51      0.42     0.051    0.30     0.55     100
Embedding Bag   0.42      0.38     0.066    0.23     0.58     100
BM25            0.47      0.28     0.057    0.10     0.42     100

Table 4.1: Performance comparison of various machine learning approaches on the LOCKDOWN set, sorted by F1-Score.

It was expected that BERT outperforms the LSTM, which in turn outperforms the Embedding Bag, but it comes as a surprise that the MNB slightly outperforms the BERT in terms of mean F1, even though it is slightly behind in accuracy. The BM25 approach had the biggest gap between accuracy and F1-Score, meaning it was affected most strongly by the dataset imbalance. The classification results on the LOCKDOWN set can be considered mediocre, and there are several possible explanations:

1. The dataset size is small, which makes it difficult to train a generalized model.
2. Due to the class imbalance, there are even fewer samples to train from in the minority class, making it more challenging to achieve good macro-average F1-Scores. Another reason could be that the domain is complex, making it more difficult for the model to learn.
3. The subjectivity of opinions is high, making it difficult to label the samples consistently and increasing the likelihood of introducing label noise, which makes the task more difficult for the model.

In the next section, we describe how we applied the same algorithms on the more extensive MEASURES set to study the impact of dataset size on model performance.

4.2 Opinion Classification: Second Experiment

In the second experiment, we performed opinion classification on the MEASURES dataset (refer to Section 3.4 for the collection process), which is over ten times larger than the LOCKDOWN dataset.

4.2.1 An Improved Test Pipeline

Before we ran the classification algorithms on the second dataset, we decided to define an improved test pipeline that provides more detailed documentation of results. The goal was to unify the test conditions as much as possible across the algorithms to allow for a more meaningful comparison. We decided to perform a random 85-15 train-test split with stratification by class for all algorithms. Where applicable, a stratified validation set should be selected by randomly choosing 15% of the training samples.
We used the cross-entropy loss for models that use a loss function commonly used in multiclass classification tasks. We weighed the training samples proportional to their occurrence frequencies when calculating the loss, making less frequent classes more important. Furthermore, we selected the model that achieved the lowest loss on the validation set at the end of each training phase and evaluated it on the test set. To allow us the easy change of different combinations of experiment parameters, we extracted the following parameters from the code as variables: 1. RUNS: How many times the model should be trained and evaluated. Not to be confused with the number of epochs for which a network is trained in each run. Default: 100 2. TEST_SPLIT: How many percent of the total data should be used for testing. The remainder is used for training. Default: 15% 3. VALID_SPLIT: How many percent of the training data should be used for validation during training. Default: 15% 4. SHUFFLE: If the data splits should be sampled randomly. Default: True 5. STRATIFY: If the data splits should be stratified by class. Default: True 6. CLASS_WEIGHTS: If the training samples should be weighed proportionally to their class’ frequency. Default: True 7. REMOVE_STOP_WORDS: Whether the pre-processing step of stop word removal should be performed. Default: False 8. STEMMING: Whether the pre-processing step of stemming should be performed. Default: False 59 4. Experiments 9. LOWERING: Whether the pre-processing step of lower-casing all words should be performed. Default: False 10. NO_PUNCTUATION: Whether the pre-processing step of removing punctuation marks should be performed. Default: False 11. N_BEST: How many of the best-matching samples should be considered for deter- mining a class label. (only BM25) Default: 7 If not otherwise stated, we used the specified default values. Additionally, we implemented automatic storage of evaluation metrics after each run into a file that contains all of the above parameter values in its file name. That way, we could easily plot graphs of the results later and keep track of the experiment parameters. 4.2.2 Classification After implementing the improved test pipeline changes, we began to run the models on the MEASURES dataset. The bag-of-words model achieved significantly better accuracies and F1-Scores than on the previous dataset. Figure 4.2 shows the performance metrics of 100 individual train-evaluation runs. As expected, due to the larger dataset size, we observe an overall reduction in the variance of results compared to the previous dataset. The red horizontal line indicates the performance of a random-guessing approach that would select each of the three classes with equal probability. Due to how the F1-Score is calculated, a random-guess approach would achieve only 32% and not 33%, as is the case with accuracy. (a) Accuracies (b) F1-Scores Figure 4.2: Results of the bag-of-words neural network on the MEASURES dataset Table 4.2 shows the classification report for the bag-of-words (BOW) model on the MEASURES dataset. All values are averaged over the 100 runs. We observe a direct correlation between the ability to predict a class with the number of samples from that class (Column Support). As was the case in the LOCKDOWN set, the network was able 60 4.2. Opinion Classification: Second Experiment to predict more frequent classes better, even though we used a weighted loss function. 
Considering that opinion classification is a complex task, the overall classification accuracy of 62% was already a decent accomplishment. Precision Recall F1-Score Support - 0.55 0.57 0.56 196 o 0.39 0.53 0.45 155 + 0.79 0.67 0.73 416 Accuracy 0.62 767 Macro Avg. 0.58 0.59 0.58 767 Weighted Avg. 0.65 0.62 0.63 767 Table 4.2: Classification report: BOW on the MEASURES set Next, we ran the second deep network—the LSTM. With the same architecture as in the first experiment and all of the test parameters set to default, we achieved an average accuracy of 68%; six percent better than the bag-of-words model. The classification performance on the test sets of 100 runs can be seen in Figure 4.3. (a) Accuracies (b) F1-Scores Figure 4.3: Results for the LSTM neural network on the MEASURES dataset Table 4.3 shows the classification performance per class and the averaged scores over all classes. We observe that for neutral opinions, the recall is higher than the precision, which could result from giving neutral training examples more weight during training. It is the other way around for the majority class (positive opinions) since we weigh each training example less than samples of the other two classes. Moreover, with medium support, the values of precision and recall are well balanced for negative opinions. The same pattern can be observed in all other classification reports in this section (except for OpenAI, but there the sample size is considerably lower). We assume that we could achieve a better balance between precision and recall if we would increase the importance of the majority class and decrease the importance of the minority class slightly. We leave 61 4. Experiments that as a topic for future study. Precision Recall F1-Score Support - 0.63 0.63 0.63 196 o 0.47 0.57 0.51 155 + 0.82 0.75 0.78 416 Accuracy 0.68 767 Macro Avg. 0.64 0.65 0.64 767 Weighted Avg. 0.70 0.68 0.69 767 Table 4.3: Classification report: LSTM on the MEASURES set Since the BERT architecture had considerably more trainable parameters than the other models, performance considerations played a more important role. The maximum speech lengths in the MEASURES dataset were longer than those in the LOCKDOWN dataset. This time, we had to trim some speeches to stay within the server’s memory capacity. Figure 4.4 shows a plot of the speech lengths in the MEASURES set. The maximum length that the servers could handle was around 128 tokens. Fortunately, 98% of the speeches were shorter, requiring trimming on less than two percent. In terms of pre-processing, we did not apply any. The idea was to adapt the complexity of input to the complexity of the model. Since BERT is a complex model, we decided to provide input with a high information density. For example, removing punctuation marks or lower-casing all words would reduce the complexity of the input, but it could lose vital information. A single comma can alter the meaning of a sentence, and also the capitalization of words can carry semantic information. Figure 4.4: Speech lengths in the MEASURES dataset 62 4.2. Opinion Classification: Second Experiment Figure 4.5 shows the classification performances of the BERT model on the test sets. With 70% accuracy, it could achieve only a 2% improvement over the LSTM on the MEASURES set compared to a 6% improvement that it had achieved on the LOCKDOWN set. The results may suggest that the BERT can deal better with smaller datasets than the LSTM. One explanation could be the pre-training on a large corpus that the BERT received. 
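For orientation, a minimal fine-tuning setup of the kind described above could look as follows. The model name, tokenizer settings, and inference call are illustrative and not the exact configuration used in our experiments:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative choice of a pre-trained German BERT; not necessarily the checkpoint we used.
MODEL_NAME = "bert-base-german-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

sentences = ["Der Lockdown ist eine notwendige Maßnahme."]  # invented example
batch = tokenizer(sentences, truncation=True, max_length=128,
                  padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits      # shape: (batch_size, 3)
predictions = logits.argmax(dim=-1)     # predicted opinion class ids
```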
As with other models, the accuracy is slightly above the macro-average F1-Score, because the performance on minority classes (neutral and negative opinions) is below the average. (a) Accuracies (b) F1-Scores Figure 4.5: Results for the BERT neural network on the MEASURES dataset The detailed class report (Table 4.4) follows the same patterns as for the other models, but with overall higher numbers. We observe an increased precision in the majority class, an increased recall in the minority class, and a balance between precision and recall in the medium support class. As usual, the accuracy is slightly higher than the macro-average F1-Score. Precision Recall F1-Score Support - 0.64 0.65 0.64 195 o 0.53 0.63 0.57 155 + 0.82 0.75 0.78 416 Accuracy 0.70 766 Macro Avg. 0.66 0.67 0.66 766 Weighted Avg. 0.72 0.70 0.70 766 Table 4.4: Classification report: BERT on the MEASURES set Next, we applied the BM25 for classification on the MEASURES dataset. We initially applied all four pre-processing techniques: removing stop words, stemming, lower-casing, and removing punctuation. We achieved slightly better results by limiting the choice to removing stop words and punctuation, so we went with that for the BM25 runs. To determine class labels with BM25, we used the same method as in the previous exper- 63 4. Experiments iment. Because it took a considerable amount of time to run BM25 on the MEASURES set, it was not feasible to perform 100 runs for all combinations of the parameters N_BEST and CLASS_WEIGHTS. Therefore, we performed a preliminary study with ten runs each on the impact of some combinations of the parameters on classification performance. Figure 4.6 shows, that the best results were achieved with n = 3 and no class weights. Looking only at the results without weights, the best accuracy values (blue bars) tie between n = 3 and n = 20, but when considering the F1-Scores (green bars), the clear winner is n = 3. Figure 4.6: N-Best comparison for the BM25 model After determining the best combination of N_BEST and CLASS_WEIGHTS, we performed 100 train-evaluation runs on the dataset. For the first time, the graphs of classification performance (Figure 4.7) follow a different pattern. We observe most results between 36% and 42% macro-average F1-Score, but around 15% being significantly worse. We explain that result by expecting that there are some especially important samples in the dataset. If those samples end up in the test set and not in the training set (used for lookup), performance drops significantly. By looking at Table 4.5 we observe that the F1-Scores for negative and neutral opinions are below 32%, the threshold of a random-guessing approach. Additionally, the metrics are the lowest of all models. The results indicate that the way we used the BM25 algorithm is not well suited for opinion classification. Next, we ran the Multinomial Bayes (MNB) with different combinations of pre-processing and found that they had no significant impact on the outcome. Finally, we performed 100 train-evaluation runs on the MEASURES set, with no prior pre-processing. The results, shown in Figure 4.8, are significant because MNB was only slightly worse than BERT, was on par with the LSTM, and better than the simple bag-of-words neural network. We were impressed because an MNB model, due to its simplicity, can be trained in a 64 4.2. 
Opinion Classification: Second Experiment (a) Accuracies (b) F1-Scores Figure 4.7: Results for the BM25 approach on the MEASURES dataset Precision Recall F1-Score Support - 0.31 0.31 0.31 195 o 0.25 0.38 0.30 154 + 0.60 0.48 0.53 415 Accuracy 0.42 764 Macro Avg. 0.39 0.39 0.38 764 Weighted Avg. 0.46 0.42 0.42 764 Table 4.5: Classification report: BM25 on the MEASURES set fraction of the time necessary to train an LSTM or BERT model. This result indicates that MNB classification could be a viable alternative when training a complex neural network on a large dataset would be too costly. (a) Accuracies (b) F1-Scores Figure 4.8: Results for the Multinomial Bayes approach on the MEASURES dataset 65 4. Experiments The classification report (Table 4.6) reveals a gap of 25% between the class-specific F1- Scores of the minority and majority class. This value is in the medium range compared to the other models, with the highest (28%) found in the BOW model and the lowest (21%) observed in the BERT model. Additionally, the MNB achieved a slightly higher precision on positive opinions, a slightly higher recall on neutral opinions, and a slight improvement in F1-Score on negative opinions, compared to the BERT model. All other metrics were equal or slightly lower than those of BERT. Precision Recall F1-Score Support - 0.69 0.62 0.65 196 o 0.42 0.66 0.52 154 + 0.84 0.71 0.77 416 Accuracy 0.67 766 Macro Avg. 0.65 0.66 0.65 766 Weighted Avg. 0.72 0.67 0.69 766 Table 4.6: Classification report: MNB on the MEASURES set Additionally, on the MEASURES set, we examined the capabilities of a sixth model—the Davinci model from Open AI, which is based on GPT-3. It can be accessed through their web API3. Due to the API nature of the model, the test process was different than it was for the other models. We did not have to perform model training since it Davinci is an already trained model. Additionally, the API uses a single endpoint for all NLP tasks. The API detects which task should be performed based on the input format. Furthermore, it requires examples, either provided together with the classification sample or by uploading a file. We chose the second approach since we provided the entire train split, which is many examples. A single query can make use of at most 200 example records. If there are more examples in the file, then the algorithm selects the most relevant ones. The API works with a credit system. Depending on the length of the query, the used model (Davinci is the most expensive), and the number of provided examples, the API charges a different amount. To our surprise, we managed to label only 262 samples before we had spent our free budget of $18. Since we could only classify 262 out of 767 samples, the results are likely not to be representative. The results of the 262 samples we managed to classify are shown in Table 4.7. With an F1-Score of 49% and an accuracy of 50%, the results are worse than expected. Our first explanation for the bad results is that the algorithm derives meaning from the class labels. Since we named them "0,1,2", we would expect to perform better by naming them "positive, negative, neutral." Since we used up the budget so quickly, we could not experiment with different input formats. Therefore, the results are likely not to be representative of the potential performance we could have achieved. 3https://beta.openai.com/, accessed: 2021-10-01 66 4.2. 
Opinion Classification: Second Experiment Precision Recall F1-Score Support - 0.51 0.82 0.63 78 o 0.30 0.52 0.38 50 + 0.84 0.31 0.46 134 Accuracy 0.50 262 Macro Avg. 0.55 0.55 0.49 262 Weighted Avg. 0.64 0.50 0.49 262 Table 4.7: Classification report: Open AI Davinci (GPT-3) on the MEASURES set 4.2.3 Summary of Results For the second experiment, we had access to a significantly larger dataset (~10x). It contains speeches that talk about measures for containing the spreading of the Coronavirus. We ran the same algorithms as on the LOCKDOWN set, with the addition of a GPT-3 model over the Open AI API. Due to the increased dataset size, we could comfortably split a validation set from the training set and use it for model selection. Additionally, we experimented with different combinations of pre-processing steps but found that they had no significant impact on performance. Approach Mean Acc Mean F1 Std Dev Min. F1. Max. F1. Runs BERT 0.70 0.66 0.022 0.59 0.71 100 MNB 0.67 0.65 0.017 0.60 0.70 1000 LSTM 0.68 0.64 0.021 0.58 0.69 100 Embedding Bag 0.62 0.58 0.022 0.52 0.63 100 Open AI 0.50 0.49 - 0.49 0.49 <1 BM25 0.42 0.38 0.042 0.25 0.43 100 Table 4.8: Performance comparison of various machine learning approaches on the MEASURES set. The standard deviation refers to the F1-Score. Table 4.8 displays the averaged results of algorithms on the MEASURES set, ordered by F1-score. Overall, the results are significantly better than on the LOCKDOWN set, so dataset size seems to have played a significant role. Like on the LOCKDOWN set, the deep neural network models are ranked in order of sophistication, except for Open AI. Open AI is displayed with less than one run because it was only evaluated on a subset of the test data once. Therefore, the Open AI results are likely not representative of its actual capabilities. The MNB approach again performed very well and is only one percent behind BERT, but also the LSTM performs almost equally well. The BM25 approach has a significantly better F1-score than before, and the gap to the accuracy is not as large as it was on the LOCKDOWN set, meaning it dealt better with the class imbalance this time. 67 4. Experiments 4.3 Visualizations of Opinion Data In the previous two sections, we explored the capabilities of different machine learning models to predict opinions. In this section, we want to work on the partial fulfillment of Requirement R5 and visualize the opinion data produced by the best-performing algorithm. In order to examine the usefulness of such visualizations, we compare the actual opinion data with the predicted opinion data. Precision Recall F1-Score Support - 0.63 0.67 0.65 1301 o 0.52 0.62 0.57 1032 + 0.83 0.75 0.78 2773 Accuracy 0.70 5106 Macro Avg. 0.66 0.68 0.67 5106 Weighted Avg. 0.72 0.70 0.71 5106 Figure 4.9: Classification report of the labels, predicted with BERT, used in the opinion data visualizations and opinion consistency comparisons We used the BERT model to generate the predicted opinion data for the entire MEA- SURES dataset. To this end, we used a stratified k-fold technique by splitting the dataset into ten equally large parts. We predicted the labels of each part by the BERT model trained on the other nine parts. All predictions were aggregated to cover the entire set. These predictions were used to plot the predicted opinion distributions per party and per speaker and also, in the next section, to plot the predicted opinion consistency over time. 
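The out-of-fold prediction scheme described above can be sketched with scikit-learn's StratifiedKFold; `train_bert` and `predict` stand in for the actual model code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def out_of_fold_predictions(texts, labels, train_bert, predict, n_splits=10):
    """Predict every record with a model that never saw it during training."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    predictions = np.empty(len(labels), dtype=labels.dtype)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, holdout_idx in skf.split(texts, labels):
        model = train_bert(texts[train_idx], labels[train_idx])  # placeholder
        predictions[holdout_idx] = predict(model, texts[holdout_idx])
    return predictions  # covers the entire dataset
```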
Table 4.9 shows the classification report of the overall results on the MEASURES dataset. The values are very close to those achieved in the previous section. First, we plotted the opinion data per political party. Figure 4.10a shows the actual values and Figure 4.10b shows the predicted values. There are five different parties and the category "ohne Klubzugehörigkeit," which means without party affiliation. The difference between the actual and the predicted values is small enough that the predicted graph provides a representative picture of the actual graph, with the caveat that a certain sample size needs to be exceeded. As we can observe, the difference between predicted values and actual values is greatest for "ohne Klubzugehörigkeit" since the sample size is small for that category. For the other parties, the differences are within an acceptable level. We have to note that the predicted graph is based on values produced by an algorithm that has only 70% accuracy, and with higher accuracy values, the predicted graph would become even more accurate. Multiple interpretations of the data are possible. In our view, we observe a pro-measures attitude in the governing parties Grüne and ÖVP since they are the ones implementing the measures. The strongest opposition comes from the FPÖ. The NEOS have the highest number of neutral opinions since they appear to follow a problem-solving approach, i.e., objectively discussing facts. The SPÖ is somewhere in the middle, indicating that they 68 4.3. Visualizations of Opinion Data (a) Actual opinion distribution per party (b) Predicted opinion distribution per party Figure 4.10: Opinions on MEASURES per party want to "leave multiple doors open" by appealing to a broad audience. They lean towards a pro-measures opinion but do not take extreme positions. Then, we plotted the opinion data for the twenty speakers that expressed the most opinions on the COVID-19 measures. As we did for the parties, we compare the representative quality of the predicted graph (Figure 4.11b), to the actual graph (Figure 4.11a). In our opinion, the predicted graph provides an adequate representation of the actual graph. 69 4. Experiments (a) Actual opinions of the top 20 speakers (b) Predicted opinions of the top 20 speakers Figure 4.11: Opinions of the top 20 speakers on MEASURES The general sentiment towards the measures of all the speakers is preserved. For example, it does not happen that a speaker that is clearly for the measures in the actual graph is suddenly against the measures in the predicted graph. Generally, the sentiments of individual speakers are aligned with those of their affiliated parties, with a few exceptions, e.g., Gerhard Kaniak from the FPÖ. In conclusion, the visualizations of opinions of parties and speakers, based on a prediction 70 4.4. Opinion Consistency accuracy of 70%, are considered sufficiently representative to be used for reading the overall sentiment towards a topic. 4.4 Opinion Consistency In the previous section, we examined the sentiment of speakers and political parties on a specific topic. In this section, we want to visualize the consistency of speakers’ and political parties’ opinions over time. 4.4.1 Visualizing Opinion Consistency To calculate the opinion consistency values, we used the opinion data from the MEA- SURES dataset, in addition to the labels we have predicted with the BERT model (refer to Table 4.9). 
As was the case for the previous section, the actual opinion data is collected from the MEASURES dataset, and the predicted opinion data is constructed from the predicted labels. We construct the opinion tuples (g, s, h, t), by setting g to the topic MEASURES, h to the speaker and t to the timestamp. The speaker and timestamp are taken from the MEASURES dataset. For the actual opinion tuples, we set s to the actual label (found in the dataset), and for the predicted ones, we set s to the predicted label (provided by the BERT model). In this manner, we construct one actual and one predicted opinion tuple per record in the MEASURES dataset. Based on the opinion data, we calculated the opinion consistency scores OpCons1(G, H, t) and OpCons2(G, H, t)—as defined in Section 3.2—for every speaker and every party. In the set of opinion targets G, we included the topics of the MEASURES set. To calculate the scores of a speaker, we set the set of opinion holders H to include only the speaker. In order to calculate the aggregated score of a party, we set H to include all speakers of that party. We calculated all scores for each day by moving the timestamp t in 24-hour increments. First, we plot the actual and predicted opinion consistencies for the parties over time. The actual graph (Figure 4.12a) is reasonably helpful. We can see which parties are more consistent, which ones express a more diversified opinion, and how the values change over time. We observe a strong fluctuation of values in the early period, as the consistency value gets more stable with increased sample size. A rolling window could also be interesting to get a more time-local view of the change in opinion consistency. On the other hand, the predicted graph (4.12b) is of limited usefulness. Parties with a lower consistency score (SPÖ and NEOS) get approximated better, but those with a higher score, especially the ÖVP, are far off the actual values. In the middle periods, the predicted FPÖ value is higher than those of Grüne and ÖVP, which is misleading considering the actual values. Towards the end, the predicted ÖVP value is significantly below the actual value, which also delivers a wrong impression. For some parties, the predicted consistencies are very accurate, but there is a big difference for others. It seems there is a high element of chance involved, based on a model with 70% accuracy. 71 4. Experiments (a) Actual opinion consistency per party (b) Predicted opinion consistency per party Figure 4.12: Opinion consistency over time per party, based on OpCons2(G, H, t) 72 4.4. Opinion Consistency Figure 4.13: The predicted vs. actual opinion consistencies per party, including neutral opinions Figure 4.13 shows the prediction errors of the opinion consistencies per party, based on the values of OpCons1(G, H, t). We observe the largest error for values of "ohne Klubzugehörigkeit" since the sample size is very small for this category, which also becomes apparent in the big jumps the graph performs. Additionally, we observe larger errors in the parties ÖVP and Grüne, which could be the result of the fact that especially high or low values are harder to predict than those in the medium range (as will be proved in Section 4.4.2). The predictions of the other parties are relatively accurate. Since the graphs for OpCons2(G, H, t) look similar, but with overall lower values and overall higher prediction error, we do not show them here. 73 4. 
Figure 4.14: Opinion consistency over time per speaker, based on OpCons2(G, H, t) (a: actual, b: predicted values for one selected speaker from each party)

In Figure 4.14, we look at the opinion consistencies of five representative speakers, one from each party. The graph drawn from actual values (Figure 4.14a) is of particular interest, as it provides a diverse set of scenarios. We can observe that some speakers stay extremely consistent over the observed period, e.g., Sebastian Kurz and Rudolf Anschober, while others remain rather inconsistent (Gerald Loacker). We can identify when a speaker started out inconsistently and became more consistent over time (Herbert Kickl), and we can also identify the opposite (Pamela Rendi-Wagner). The predicted graph (Figure 4.14b) provides a reasonable approximation for speakers of lower consistency but struggles to capture the ones with high consistency. For example, we can observe a divergence of almost 20% between Herbert Kickl's and Rudolf Anschober's consistency values at the end of the predicted graph, whereas the difference is less than five percent in the actual graph.

Figure 4.15: The predicted vs. actual opinion consistencies for selected speakers, including neutral opinions

In Figure 4.15, we visualize the error between the predicted and the actual opinion consistency values per speaker, based on OpCons1(G, H, t). Contrary to expectations, we can observe a high opinion consistency value that is accurately predicted (Rudolf Anschober). An explanation is that the sample size for individual speakers is smaller than for parties, leading to an amplified impact of the element of chance. With Sebastian Kurz, we observe a high consistency value for which the predictions struggle, as expected. When looking at Herbert Kickl, we also observe the factor of chance, with close to perfect predictions in the middle part but strong divergence toward the end. Overall, the opinion consistencies of speakers are more difficult to predict than those of parties due to the lower amount of available data.

Figure 4.16: The predicted vs. actual opinion consistencies for selected speakers, excluding neutral opinions

In Figure 4.16 we also show the error, but this time excluding neutral opinions. In line with expectations, the consistency values are overall lower, especially for speakers expressing more neutral opinions (e.g., Gerald Loacker). Contrary to expectations, the prediction errors are not significantly higher, which we again explain by the larger element of chance due to the smaller sample size. Overall, we suggest using OpCons1(G, H, t) if the goal is to produce more accurate predictions. If a high-accuracy model is available, we suggest using OpCons2(G, H, t), since it will better visualize contradicting opinions.

4.4.2 Impact of Model Performance on Opinion Consistency

In the previous section, we analyzed the impact of model performance empirically. In this section, we aim to get a general understanding of the impact of model performance on the accuracy of predicted opinion consistency values. To this end, we want to determine the minimum required model accuracies for predicting opinion consistency within a certain margin of error. This effort is in fulfillment of Requirement R6.
We want to determine the minimum model accuracies required to predict opinion consistency within selected confidence intervals for different margins of error and sample sizes. We chose a 95% confidence interval, with seven margins of error between 0.5% and 10%, and eight sample sizes between ten and 3000. The margin of error extends in both directions around the actual opinion consistency value. For example, a 5% margin around an actual opinion consistency of 70% means that acceptable predictions have to fall between 65% and 75% predicted opinion consistency.

In addition to error margin and sample size, the required minimum accuracies also change based on the ratios between the opinion classes. It is easiest to predict the opinion consistency resulting from a 1:1:1 ratio between negative, neutral, and positive opinions. That is because a uniform distribution of opinions is reproduced even by random guessing, which requires no model accuracy at all. The opinion consistency becomes more difficult to predict accurately the more it deviates from 0.67, the value resulting from a uniform opinion distribution. The most challenging opinion consistency value to predict is 1.0, the value resulting from a 1:0:0 or 0:0:1 distribution of opinions. The lowest possible opinion consistency of 0.5 (1:0:1) is easier to predict than 1.0 but harder than 0.67.

To determine the minimum accuracies, we ran a simulation. We created a sequence of actual opinion labels and then used Algorithm 4.1 to create the predicted labels with the help of a virtual machine learning model with accuracy acc. The virtual model iterates through the list of actual labels; with probability acc it outputs the correct label, and otherwise it randomly outputs one of the other two labels. In the listing, ops_pred is initialized as an empty list, the operator :: denotes list concatenation, and ⊕ denotes the exclusive OR (symmetric difference) of two sets. The function random() generates a random real number in the interval [0, 1); if a list or set is passed to random(), it returns a random element from it. The expression {0, 1, 2} denotes the set containing the negative, neutral, and positive opinion labels.

Algorithm 4.1: Simulate predictions of a virtual machine learning model
    Data: a sequence of actual opinion labels actual; an assumed model accuracy acc
    Result: a sequence of predicted opinion labels ops_pred
    ops_pred ← empty list
    for op_act ∈ actual do
        if random() < acc then
            op_pred ← op_act
        else
            op_pred ← random({0, 1, 2} ⊕ {op_act})
        end
        ops_pred ← ops_pred :: op_pred
    end

For each combination of error margin and sample size, we predicted the actual opinions 1000 times with every accuracy between 0.0 and 1.0 in increments of 0.01 and calculated the opinion consistency value for each of the 1000 prediction sequences. We then chose the lowest accuracy for which 95% of the predicted opinion consistencies still lie inside the margin of error.
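The following Python sketch mirrors this simulation. The opinion consistency formula used here is our inference from the values reported for the different ratios (0.67 for 1:1:1, 1.0 for 1:0:0, 0.5 for 1:0:1, 0.8 for 3:1:1), i.e., the majority side among positive and negative opinions plus the neutral ones, divided by the total; the function names and the example call are ours and not the thesis implementation.

import random

LABELS = (0, 1, 2)  # negative, neutral, positive

def simulate_predictions(actual, acc):
    """Algorithm 4.1: with probability acc output the true label,
    otherwise one of the two other labels at random."""
    predicted = []
    for op_act in actual:
        if random.random() < acc:
            predicted.append(op_act)
        else:
            predicted.append(random.choice([l for l in LABELS if l != op_act]))
    return predicted

def opinion_consistency(labels):
    """Assumed consistency score: (majority of negative/positive + neutral) / total,
    inferred from the consistency values quoted for the 1:1:1, 1:0:0, 1:0:1,
    and 3:1:1 ratios (0.67, 1.0, 0.5, 0.8)."""
    neg, neu, pos = (labels.count(l) for l in LABELS)
    return (max(neg, pos) + neu) / len(labels)

def minimum_accuracy(actual, margin, runs=1000, confidence=0.95):
    """Lowest accuracy for which `confidence` of the simulated consistency
    values fall inside +/- margin around the actual consistency."""
    target = opinion_consistency(actual)
    for acc in (i / 100 for i in range(101)):
        hits = sum(
            abs(opinion_consistency(simulate_predictions(actual, acc)) - target) <= margin
            for _ in range(runs)
        )
        if hits / runs >= confidence:
            return acc
    return 1.0

# Example: a perfectly consistent speaker (ratio 1:0:0) observed for 100 samples.
print(minimum_accuracy([0] * 100, margin=0.10))

Running minimum_accuracy for every margin and sample size reproduces, up to simulation noise, the kind of grid shown in Figures 4.17–4.20.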
We ran the simulations on the opinions of four imaginary individuals. The first one expresses opinions in a uniform distribution (Figure 4.17); the second one is absolutely consistent and only states negative opinions (Figure 4.18); the third one is as inconsistent as possible, with opinions alternating between positive and negative (Figure 4.19); and the last one is somewhere in between: he always expresses three negative opinions followed by one neutral and one positive opinion (Figure 4.20).

From Figures 4.17–4.20 we can read the minimum accuracies that are required to predict opinion consistency with the desired precision. For example, if we want to predict the opinion consistency of a perfectly consistent individual within a maximum error margin of 10% at a 95% confidence interval (meaning we want to be right in 19 out of 20 cases), then we read from Figure 4.18 that we need a minimum model accuracy of 89% after 100 samples and 85% after 500 samples.

By looking at the tables, we can make various observations. For low sample sizes and low error margins, almost perfect model accuracy is required. The predictions become more feasible with increased sample sizes or if higher error margins are tolerated. The minimum required accuracies depend on the ratio of the underlying opinions used to calculate opinion consistency. As previously mentioned, a ratio of 1:1:1 is easiest to predict, 1:0:0 is the most difficult, and 1:0:1 is in between. If all ratios should be predicted within the desired error margin, then the minimum model accuracy must be taken from the 1:0:0 table (Figure 4.18); otherwise, a value between this table and the 1:1:1 table (Figure 4.17) can be assumed.

Error \ Samples   10     30     50     100    300    500    1500   3000
0.5%              0.99   1.00   1.00   1.00   1.00   1.00   0.99   0.97
1.5%              0.99   1.00   1.00   0.99   0.98   0.95   0.86   0.67
2.5%              0.99   1.00   0.98   0.97   0.93   0.85   0.37   0.00
3.5%              0.99   0.98   0.98   0.95   0.83   0.68   0.00   0.00
5%                0.99   0.98   0.94   0.85   0.60   0.00   0.00   0.00
7.5%              0.99   0.92   0.88   0.67   0.00   0.00   0.00   0.00
10%               0.99   0.81   0.77   0.00   0.00   0.00   0.00   0.00

Figure 4.17: Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 1:1:1 (opinion consistency 0.67)

Error \ Samples   10     30     50     100    300    500    1500   3000
0.5%              1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
1.5%              1.00   1.00   1.00   1.00   0.99   0.99   0.98   0.98
2.5%              1.00   1.00   0.99   0.99   0.98   0.98   0.97   0.96
3.5%              1.00   0.98   0.99   0.98   0.96   0.96   0.95   0.95
5%                1.00   0.98   0.97   0.96   0.94   0.93   0.92   0.92
7.5%              1.00   0.95   0.94   0.92   0.90   0.89   0.87   0.87
10%               0.93   0.91   0.91   0.89   0.86   0.85   0.83   0.82

Figure 4.18: Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 1:0:0 (opinion consistency 1.0)

Error \ Samples   10     30     50     100    300    500    1500   3000
0.5%              1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
1.5%              1.00   1.00   1.00   1.00   0.99   0.99   0.98   0.97
2.5%              1.00   1.00   0.99   0.99   0.97   0.96   0.95   0.94
3.5%              1.00   0.98   0.99   0.98   0.95   0.94   0.92   0.90
5%                1.00   0.98   0.98   0.96   0.92   0.90   0.87   0.85
7.5%              1.00   0.95   0.95   0.91   0.85   0.82   0.78   0.76
10%               0.96   0.93   0.90   0.86   0.77   0.75   0.69   0.67

Figure 4.19: Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 1:0:1 (opinion consistency 0.5)
Error \ Samples   10     30     50     100    300    500    1500   3000
0.5%              1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.99
1.5%              1.00   1.00   1.00   1.00   0.99   0.98   0.96   0.96
2.5%              1.00   1.00   0.99   0.99   0.96   0.95   0.93   0.92
3.5%              1.00   0.98   0.99   0.96   0.93   0.91   0.88   0.87
5%                1.00   0.98   0.96   0.94   0.88   0.87   0.82   0.80
7.5%              1.00   0.92   0.91   0.87   0.79   0.76   0.72   0.70
10%               0.99   0.92   0.87   0.80   0.70   0.66   0.60   0.57

Figure 4.20: Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 3:1:1 (opinion consistency 0.8)

Depending on the application, different minimum accuracies would be necessary. For example, should the algorithm spot contradicting opinions confidently, or should it only give a broad idea of the general consistency of opinions? To the question of whether the consistency of opinions can be measured over time with the help of NLP methods, it can be said: yes, under certain circumstances it is feasible. Enough data needs to be available, and the use case needs to be able to tolerate an error of at least 5% to reach achievable accuracies. In the future, with the advancement of machine learning models and language understanding, it will become more feasible.

With the simulations performed in this section—providing an understanding of the model accuracies required to predict opinion consistency values within a reasonable error margin—we conclude the experiments. All requirements (R1–R6), as defined in Section 3.2, have thereby been addressed. In the next chapter, we summarize and discuss the results of the experiments.

CHAPTER 5 Evaluation

The overall vision for this project was to automatically extract opinions from text and monitor their change over time. Since this is novel and difficult, the goal of this project was to examine what is possible with current technology, to conclude what can reasonably be achieved, and to identify which areas require more attention to realize the vision. It quickly became evident that it was too ambitious to capture opinions in general, so the scope was restricted to the opinion on a specific question: "Should a Lockdown be implemented?" The algorithm should output one of three answers to this question based on a given text: (1) Yes, (2) No, or (3) Neither, where the last answer should be given if no subjective opinion is expressed.

In the beginning, we examined how much the sentiment of a sentence correlates with its derived opinion (Figure 5.1). If the correlation had been reasonably high, we could have applied regular sentiment analysis to predict the opinion. After examination, the correlation was not considered high enough; thus, an alternative approach was necessary. Sub-figure 5.1a displays the distribution of opinions per sentiment label. We can see, for example, that when a document has a positive (+) sentiment, its derived opinion is positive in 65%, neutral in 29%, and negative (-) in 6% of the cases. Sub-figure 5.1b shows the other direction, the sentiment distribution per opinion label. Since we did not consider the sentiment sufficient to predict the opinions, we decided to predict the opinion directly.

Figure 5.1: Relationship between sentiment and opinion categories in the LOCKDOWN dataset (a: opinion distribution per sentiment label, b: sentiment distribution per opinion label)
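The sentiment-versus-opinion breakdown in Figure 5.1 is essentially a normalized cross-tabulation. A minimal pandas sketch follows; the column names and toy records are assumptions for illustration, not the thesis data.

import pandas as pd

# Hypothetical LOCKDOWN records; column names are assumptions for illustration.
df = pd.DataFrame({
    "sentiment": ["+", "+", "-", "0", "-", "+"],
    "opinion":   [2,   1,   0,   1,   0,   2],
})

# Distribution of opinions per sentiment label (each row sums to 1),
# i.e., the kind of breakdown shown in Sub-figure 5.1a.
opinion_per_sentiment = pd.crosstab(df["sentiment"], df["opinion"], normalize="index")

# The other direction (Sub-figure 5.1b): sentiment distribution per opinion label.
sentiment_per_opinion = pd.crosstab(df["opinion"], df["sentiment"], normalize="index")

print(opinion_per_sentiment.round(2))

The 65%/29%/6% split quoted above would appear as one row of such a table.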
At first, we explored hard-coded rule-based classifiers and considered unsupervised methods, but then moved to supervised machine learning approaches because, in opinion mining, they usually outperform unsupervised methods [SLC17]. It became apparent that an opinion can be highly subjective, making it difficult to consistently assign labels during the manual labeling of the dataset. Therefore, we expect the presence of label noise, i.e., some samples being assigned wrong labels. Due to this label noise, the performance of machine learning algorithms can suffer, as discussed in Section 2.3.1. As a result, the highest achievable performance of the algorithms might be lower than 100%.

5.1 Opinion Classification

We tested different algorithms on the dataset. Since it was comparatively small, the resulting performance depended heavily on the distribution of samples between the training and test set. As a countermeasure, we performed Monte Carlo cross-validation, i.e., we ran the experiments multiple times, each time with a random train-test split, and finally took the averages. The resulting averages provide a good idea of the actual performance of the algorithms.

The LOCKDOWN dataset used in the experiments was relatively small because the targeted opinion concerned a very specific question. Additionally, due to the chosen topic, there is a considerable class imbalance: there are many more negative samples than positive ones (see Figure 5.2a). This makes sense, since a Lockdown is generally perceived by Austrians as a restriction of freedom, and politicians are therefore wary of openly speaking in favor of it. When somebody did speak in favor of the Lockdown, they generally indicated that there was no other choice.

Figure 5.2: Absolute and relative class frequencies of the two datasets (a: LOCKDOWN set, b: MEASURES set)

The method of splitting the LOCKDOWN dataset into training and test partitions evolved during the experiments. We started with a stratified 85-15 split but encountered a problem with that approach. The accuracy of random guessing should be 33%, since there are three possible classes. However, due to the imbalance of the dataset, it was possible to achieve a higher accuracy simply by always predicting the most frequent class; with the stratified dataset, this would lead to an accuracy of 56%. Thus, an algorithm like BM25 could achieve a high result simply because it is more likely to choose the majority class. To make it impossible for such a primitive approach to achieve an accuracy above 33%, we utilized a holdout approach: we chose the same number of samples from each class to form the test set and assigned the rest to the training set, ensuring a uniform class distribution in the test set. It could be desirable in some applications to have a model learn to predict a class based on frequency, but in this case every class has the same importance; hence, frequency should not impact the model's decision. We therefore first used this class-balanced holdout approach; later, we switched to a stratified split but used the macro-average F1-score to evaluate performance, which we considered a more effective measure for dealing with the class imbalance.
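A minimal sketch of this evaluation protocol—repeated random stratified splits ("Monte Carlo cross-validation") scored with accuracy and macro-averaged F1—using scikit-learn. The TF-IDF features and the Multinomial Naive Bayes placeholder are assumptions for illustration and not necessarily the exact pipeline used in this work.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def monte_carlo_cv(texts, labels, runs=100, test_size=0.15):
    """Repeat random stratified train-test splits and average the metrics."""
    accs, f1s = [], []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=test_size, stratify=labels, random_state=seed
        )
        clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
        clf.fit(X_tr, y_tr)
        y_pred = clf.predict(X_te)
        accs.append(accuracy_score(y_te, y_pred))
        f1s.append(f1_score(y_te, y_pred, average="macro"))
    return np.mean(accs), np.mean(f1s), np.std(f1s), np.min(f1s), np.max(f1s)

Averaging over many random splits in this way is what makes the results on the small dataset comparable across models.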
We tested three deep learning models: a simple network based on a bag-of-words embedding, a more advanced LSTM network, and a pre-trained BERT model that we fine-tuned for the downstream task. Besides the deep learning models, we used a Multinomial Bayes (MNB) classifier and adapted the BM25 document ranking algorithm to perform classification. The BM25 algorithm calculates document scores based on how well they match a query. In order to apply it to text classification, the samples of the training set served as a lookup table: given a sample of the test set, we used the algorithm to rank the samples in the lookup table and returned the average class of the top n samples (a sketch of this adaptation is given further below).

Table 5.1 shows a performance comparison between the best versions of each algorithm on the LOCKDOWN set. Each approach was trained and evaluated for the displayed number of runs. The MNB was run 1000 times, instead of 100 like the others, because it executes quickly. We evaluated the performance first on the accuracy, which is a solid general-purpose metric. Additionally, we collected the macro-average F1-score, since it provides a better evaluation of model performance on imbalanced datasets. We chose it over the micro-average F1-score, which would treat each sample with equal importance and thus come closer to the accuracy metric. To capture the spread of the F1-scores across the runs, we also display the standard deviation and the minimum and maximum values.

Approach         Mean Acc   Mean F1   Std Dev   Min. F1   Max. F1   Runs
MNB              0.53       0.48      0.055     0.31      0.66      1000
BERT             0.56       0.47      0.088     0.23      0.66      100
LSTM             0.51       0.42      0.051     0.30      0.55      100
Embedding Bag    0.42       0.38      0.066     0.23      0.58      100
BM25             0.47       0.28      0.057     0.10      0.42      100

Table 5.1: Performance comparison of various machine learning approaches on the LOCKDOWN set, sorted by F1-Score

As expected, BERT outperforms the LSTM, which in turn outperforms the Bag-of-Words model (Embedding Bag), but it is a surprise that the MNB slightly outperforms BERT in terms of mean F1, even though it is slightly behind in terms of accuracy. The BM25 approach had the most significant gap between accuracy and F1-score, indicating a strong influence of the dataset imbalance.

The classification results on the LOCKDOWN set can be considered mediocre, and there are several possible explanations. First, the dataset is small, making it difficult to train a model that generalizes. Additionally, there are even fewer samples to train on in the minority class due to the class imbalance, making it more challenging to achieve good macro-average F1-scores. Another reason could be that the domain is complex, making it more difficult for the models to learn. Furthermore, the subjectivity of opinions is high, making it difficult to label the samples consistently and increasing the likelihood of introducing label noise, which makes learning harder still.

It is quite possible that the results would improve with a more extensive dataset. However, it is not always possible to increase the number of samples; that is especially true when tracking the consistency of opinions on a particular topic of interest. Still, it would be beneficial to have a comparison with a larger dataset. In order to assess the impact of dataset size on classification performance, a second, much more extensive (roughly 10x larger) dataset was collected, referred to as the MEASURES set (see Section 3.4). It contains speeches about measures for containing the spread of the Coronavirus.
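A minimal sketch of the BM25-based classification described above, using the rank_bm25 package (an assumed library choice; this work does not name its implementation). For simplicity, the sketch derives the class from the top-n matches by majority vote, whereas the text describes returning the average class of the top n samples; the simple lowercase whitespace tokenization is likewise only illustrative.

from collections import Counter
from rank_bm25 import BM25Okapi  # assumed library choice

def bm25_classify(train_texts, train_labels, test_text, n=5):
    """Rank the training samples against the test sample and derive the class
    from the top-n matches (here via majority vote)."""
    tokenized_train = [text.lower().split() for text in train_texts]
    bm25 = BM25Okapi(tokenized_train)
    scores = bm25.get_scores(test_text.lower().split())
    top_n = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return Counter(train_labels[i] for i in top_n).most_common(1)[0][0]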
The possible opinions are 0 (against measures), 1 (neither for nor against measures), and 2 (for measures). The MEASURES set is also imbalanced (Figure 5.2b), roughly to the same degree as the LOCKDOWN set, but due to the increased overall size, this should be less of a concern. Because of the increased labeling effort for the MEASURES set, we used a custom annotation tool, which sped up the process and helped increase the quality of the labels. We reran the same algorithms on the MEASURES set with an 85-15 stratified train-test split. Due to the increased dataset size, a separate stratified validation set of 15% was split randomly from the training set during the training of the deep neural network models.

Additionally, we tested the Open AI API on the second dataset. Contrary to high expectations, the actual results were much lower, and the exact reasons were not clear. One issue was the high cost per query, which caused us to use up the available budget quickly, i.e., we only managed to classify 262 samples, with an accuracy of 50%. Since the budget was used up in the first run already, there was no possibility of querying the API with different inputs to investigate the reasons behind the low accuracy. One explanation could be that the API derives meaning from the class labels; in our API calls, we defined them as the numbers 0, 1, and 2. Possibly, labels like "Negative Opinion" would have achieved better results.

Approach         Mean Acc   Mean F1   Std Dev   Min. F1   Max. F1   Runs
BERT             0.70       0.66      0.022     0.59      0.71      100
MNB              0.67       0.65      0.017     0.60      0.70      1000
LSTM             0.68       0.64      0.021     0.58      0.69      100
Embedding Bag    0.62       0.58      0.022     0.52      0.63      100
Open AI          0.50       0.49      -         0.49      0.49      <1
BM25             0.42       0.38      0.042     0.25      0.43      100

Table 5.2: Performance comparison of various machine learning approaches on the MEASURES set. The standard deviation refers to the F1-Score.

Table 5.2 displays the averaged results of the algorithms on the MEASURES set, ordered by F1-score. Overall, the results are significantly better than on the LOCKDOWN set, so dataset size appears to have played a significant role. As on the LOCKDOWN set, the deep neural network models are ranked in order of sophistication, except for Open AI. Open AI is displayed with less than one run because it was only evaluated once, on a subset of the test data; therefore, the Open AI results are likely not representative of its actual capabilities. The MNB approach again performed very well and is only one percentage point behind BERT in accuracy, and the LSTM performed almost equally well. The BM25 approach has a significantly better F1-score than before, and the gap to the accuracy is not as large as on the LOCKDOWN set, meaning it dealt better with the class imbalance this time.

After concluding the performance comparison, we used the BERT model to predict the opinions on the entire MEASURES set as a preliminary step for comparing actual to predicted graphs of opinion data. The overall accuracy of those predictions was 70%. Before we visualized opinion consistencies, we plotted the opinion distributions per party and speaker. The graphs of actual and predicted opinions were viewed side by side (Figure 5.3). We considered the differences small enough for the predicted graphs to convey a representative impression of the actual sentiments. We concluded that these graphs require a considerably lower model accuracy to remain useful than those for opinion consistency.

Figure 5.3: Opinions of the parties on the measures against the Coronavirus (a: actual, b: predicted opinion distribution per party)
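For reference, fine-tuning a pre-trained BERT model for this three-class opinion task can be sketched with the Hugging Face Transformers library as below. The checkpoint name, hyperparameters, and data handling are assumptions for illustration; this is not the exact training setup used in this work.

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed German checkpoint; the exact pre-trained model used is not specified here.
checkpoint = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

def fine_tune(train_texts, train_labels, epochs=3, batch_size=16, lr=2e-5):
    """Minimal fine-tuning loop for the 3-class opinion task (labels 0/1/2)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    data = list(zip(train_texts, train_labels))
    model.train()
    for _ in range(epochs):
        for texts, labels in DataLoader(data, batch_size=batch_size, shuffle=True,
                                        collate_fn=lambda b: list(zip(*b))):
            enc = tokenizer(list(texts), padding=True, truncation=True,
                            max_length=128, return_tensors="pt")
            out = model(**enc, labels=torch.tensor(labels))
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model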
5.2 Opinion Consistency

Ultimately, the idea is to monitor the consistency of opinions based only on written text, in order to compare the consistency of different groups or individuals in general and over time. We used two formulas (Section 3.2) to calculate the consistency values based on opinion data. The first is defined by the relative share of samples belonging to the majority class among the positive and negative opinions. So, for example, a group or individual whose statements about a particular topic consist of 70% positive and 30% negative statements would be considered 70% consistent. The other definition also includes neutral opinions in the calculation.

The validity of conclusions drawn from opinion consistency graphs depends on the accuracy of the machine learning models with which they are predicted. To get an idea of the usefulness, we look at the predicted vs. the actual opinion consistency of the different political parties (see Figure 5.4) in the case of our MEASURES dataset. We can observe that the predictions for the SPÖ, FPÖ, and NEOS are better than those for the ÖVP and Grüne, which is reasonable, since it is more challenging to predict opinion consistencies the more they deviate from average values. As we determined before, the more the opinion ratios deviate from an even distribution (1:1:1), the higher the model accuracies required to predict the opinion consistencies with the same precision. We can also observe that the predicted consistency deviates a lot in the case of "ohne Klubzugehörigkeit" (no party affiliation), which is expected because the actual consistency is high and the sample size is small. Regarding the usefulness of these specific plots, we consider them rough estimates of the actual values. Unfortunately, the predictions do not capture very high or very low consistency values as well. Also, a significant number of samples is required before the predicted values become reliable.

Figure 5.4: The predicted vs. actual opinion consistencies per party, including neutral opinions

Additionally, we determined the accuracy necessary for a machine learning algorithm to predict the actual opinion consistency within a 95% confidence interval. The required minimum accuracies depend on the ratios between negative, neutral, and positive opinions. We determined values for four different ratios (refer to Section 4.4.2 for all results). In Figure 5.5 we see the minimum accuracies for a ratio of 3:1:1, i.e., for three negative opinions there is one neutral and one positive opinion each. For example, we read from the figure that if we want to predict opinion consistency with a maximum error of 5%, we need a model accuracy of at least 94% after 100 expressed opinions (88% after 300).
Error \ Samples   10     30     50     100    300    500    1500   3000
0.5%              1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.99
1.5%              1.00   1.00   1.00   1.00   0.99   0.98   0.96   0.96
2.5%              1.00   1.00   0.99   0.99   0.96   0.95   0.93   0.92
3.5%              1.00   0.98   0.99   0.96   0.93   0.91   0.88   0.87
5%                1.00   0.98   0.96   0.94   0.88   0.87   0.82   0.80
7.5%              1.00   0.92   0.91   0.87   0.79   0.76   0.72   0.70
10%               0.99   0.92   0.87   0.80   0.70   0.66   0.60   0.57

Figure 5.5: Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 3:1:1 (opinion consistency 0.8)

We made various observations by looking at the tables of minimum accuracies (Section 4.4.2). For low sample sizes and low error margins, almost perfect model accuracy is required. The predictions become more feasible with increased sample sizes or if higher error margins can be tolerated. The minimum required accuracies depend on the ratio of the underlying opinions used to calculate opinion consistency. A ratio of 1:1:1 is easiest to predict, 1:0:0 is the most difficult, and 1:0:1 is in between. Therefore, if all ratios are equally important, the minimum model accuracy must be read from the 1:0:0 table. Depending on the application domain, if we assume that extreme values of opinion consistency will be rare, the lower values of the other tables can be used as a guideline.

Depending on the application, different minimum accuracies would be necessary. For example, if the algorithm should spot contradicting opinions confidently, a higher accuracy is required; a lower one suffices if it should provide only a broad idea of the general consistency of opinions. To the initial question of whether the consistency of opinions can be monitored over time with the help of NLP methods, we answer: yes, under certain circumstances it is feasible. Enough data needs to be available, and the use case needs to tolerate an error of at least 5% to reach achievable accuracies. In the future, with the advancement of machine learning models and language understanding, it will become more feasible.

5.3 Challenges

We encountered several challenges during the implementation of our approach to monitoring opinion consistency.

• As seen in our case, sentiment does not always match a statement's opinion, making it more challenging to use a general sentiment classifier to predict an opinion.

• Sometimes the speaker does not talk about their own opinion but about someone else's, which is an additional challenge for a machine learning model.

• Datasets can get very small, depending on the specificity of the topic. The smaller a dataset gets, the more difficult it is to train a reliable model. Additionally, when a dataset becomes small enough, it could be more feasible to examine the few samples by hand instead of training a machine learning model.

• In supervised learning approaches, it is necessary to label a training set manually, which was the biggest challenge to the practical feasibility of the implemented approach. Additionally, opinions are subjective, and thus this approach is especially prone to label noise.

• Each topic can be drastically different with regard to what identifies a positive or negative opinion, making it necessary to train a separate model for each topic.
Even if transfer learning is used, the model probably still has to be fine-tuned on each dataset.

• The understanding of what an opinion consists of is diverse; predicting an opinion from a text can mean something different to different people. We have used a simple definition, trading off the nuance of captured opinions against implementation complexity. Finding the right balance is an additional challenge.

• When aiming to predict an opinion in the sense of "what the person really meant," the task might be too subjective to predict in a meaningful way. When a person does not explicitly state an opinion, it might be impossible to know for sure what the actual opinion is. Depending on the application domain, the issue might become less pronounced, but there will always be some level of subjectivity.

5.4 What is possible and future directions

A generic approach to monitoring opinions in general is not yet within reach. However, it is possible to use NLP methods to predict consistency values for opinions on a specific topic. To make meaningful predictions, though, the accuracy of the predictions would have to be significantly higher (at least for graphs of opinion consistency) than what we achieved in this project, as was shown in Section 5.2. Whether such values can be achieved depends on multiple factors, like the complexity of the domain, the definition of an opinion, and the size and quality of the dataset.

Concretely, based on the performed experiments, the following approach is suggested. A topic is defined that can be identified easily by specific keywords. Manual labeling of opinions is performed on sentences containing those keywords. The best available pre-trained attention model is chosen and trained on the dataset. When the achieved performance is satisfactory, it can be used to monitor opinions on the chosen topic.

Since the amount of data for such an approach needs to be large, it can only be used on widely discussed topics. In the case of the project's domain (parliamentary speech protocols), this requires a long-term time horizon on an established topic. Therefore, the approach cannot be used when insight needs to be gathered quickly in response to a newly arising topic; it is better suited to general topics that have already been discussed for several years. However, in other domains, such as social media (e.g., Twitter messages), this approach could be easier to implement, as overall a lot more data are produced, making it easier to accumulate a larger dataset. Additionally, the data itself is less complex, which would make it easier to predict the opinions.

Finally, humans use context information that does not come from the text itself when interpreting it. Simpler models have no access to such context information, but advanced models based on the concept of transfer learning, like the BERT model, can be said to make use of such information: they are pre-trained on a large set of text documents (e.g., Wikipedia articles) before they are trained on the target dataset. The performance of such models could be further improved by providing relevant context information alongside the expression, but this is only an idea. In theory, with enough context information, a general model could be built that comes close to or even surpasses the accuracy of humans, because such models can store and process more data than a typical human could.
The transfer learning approach seems to be the most promising for achieving the vision of a general opinion extraction system.

CHAPTER 6 Conclusion

In this experimental study, we explored the possibilities of measuring the consistency of opinions with the help of NLP methods. We defined the term opinion and implemented a method to extract opinions from textual data. We provided two formulas for calculating opinion consistency, a value that makes the consistency of opinions quantifiable. We gathered two datasets, annotated them, and ran different machine learning algorithms to extract opinions. We calculated the opinion consistency values for speakers and parties and visualized them. In addition, we examined the impact of a model's accuracy on the accuracy of predicted opinion consistency values. We used the insight gained by implementing such an approach to answer the following research questions:

Q1a What is the practical feasibility of monitoring opinion consistency, a value representing the consistency of opinions on a topic, by means of supervised ML methods?

Using supervised machine learning methods to extract opinions requires an annotated dataset on which a model is trained. The biggest challenge to feasibility in the proposed approach is that every topic requires creating and annotating a different dataset. Thus, the approach is best suited if the intention is to monitor a small selection of topics over a long period of time. The other factors, e.g., training the algorithms and computing opinion consistency values, are minor considerations, since they can be automated.

Q1b What is the usefulness of measuring and visualizing the consistency of opinions based on opinion data predicted by supervised ML methods?

As shown in Section 4.4.1, the usefulness depends on the accuracy of the predicted opinion consistency values, which, in turn, depends on the number of opinions and the accuracy of the chosen ML method. The precision of the predicted opinion consistency increases with the number of opinions. Additionally, we observed an improved prediction accuracy of opinion labels on the larger dataset compared to the smaller one. We conclude that the proposed method is most useful for topics with a high number of opinions.

Q2 What performance do various ML architectures achieve in predicting opinions in the domain of Austrian political speeches in the German language?

In Sections 4.1 and 4.2, we compared the classification performances of five different machine learning models on two datasets constructed from speech transcriptions of Austrian politicians. On the first dataset, with around 500 records, we found that according to accuracy, BERT performed best (56%), followed by the MNB (53%), the LSTM (51%), the BM25 (47%), and the Bag-of-Words model (42%). On the second dataset, with around 5000 records, BERT also achieved the highest accuracy (70%), followed by the LSTM (68%), the MNB (67%), the Bag-of-Words (62%), and the BM25 (42%). Notable is the high performance of the MNB, which was almost on par with BERT.

Q3 What could be minimum performance thresholds for ML algorithms to predict the consistency of opinions to a desirable precision?

In Section 4.4.2, we determined the minimum model accuracies that are required to predict opinion consistency within a certain margin of error.
We found that the minimum accuracy depends on the number of opinions on which the opinion consistency value is calculated and the ratio between positive, negative, and neutral opinions. Extreme values (exceptionally high or low opinion consistencies) are more difficult to predict accurately than moderate ones. Additionally, we found that almost perfect prediction accuracy is required to achieve low margins of error on low numbers of opinions. Above a certain threshold for error margin and sample size, the required model accuracies become achievable. Q4 How useful are visualizations of opinion data of speakers and parties that are based on predictions made by various supervised ML algorithms? In Section 4.3 we visualized the actual and predicted opinion data. We saw that to be useful those visualizations required a lower prediction performance than the visualizations of opinion consistency. We found that the visualizations of opinions of parties and speakers, based on prediction accuracy of 70%, are sufficiently representative to be used for recognizing the overall sentiment towards a topic. We take some space to reflect critically on the used methodology and achieved results. Supervised machine learning methods are not the only way we could have achieved our goal, but we have chosen them over unsupervised methods because of their better state- of-the-art performance in opinion mining tasks. Future work can explore unsupervised 92 methods to eliminate the manual annotation process, which we have found to be the greatest obstacle to the feasibility of the presented approach. We think that we have covered the landscape of deep learning methods reasonably well. We could have covered more statistical models, but based on the state-of-the-art results on similar tasks (e.g., sentiment analysis), deep learning methods are likely to perform better in any case. How we have applied BM25 is debatable, but we think we did a reasonable job considering that it is a document ranking algorithm and not primarily a text classification algorithm. We could have tested more combinations of pre-processing steps to increase performance, but we do not expect significant gains beyond a few percentage points, based on the observations on the combinations we have tested. We provided detailed performance reports on the models we evaluated and used the same metrics to make results comparable. We believe the used Monte Carlo cross-validation approach made the results reliable. We are satisfied with our definition of opinion as a quadruple, and we are partially satisfied with the definition of opinion consistency. We believe the definition of opinion consistency can be improved by weighing recent opinions more strongly than older ones. To investigate the usefulness of opinion consistency visualizations, we utilized the graphs of Section 4.4.1. We could have improved the comparisons on opinion data by calculating the divergence between actual and predicted values in addition to the visual comparisons. However, we believe the visual comparisons are sufficient because we additionally provided the tables in Section 4.4.2, which tell precisely how accurate the predictions will be, based on the model’s accuracy. In summary, our main contributions are: 1. Proposing a definition of opinion consistency, a value that makes the consistency of opinions quantifiable. 2. Designing, implementing, and testing a method for visualizing the consistency of opinions over time of individuals and groups. 3. 
Creating various visualizations of opinion data and examining their usefulness. 4. Evaluating the performance of different supervised machine learning models on two datasets from the domain of political speeches in the German language. 5. Determining a table of minimum accuracies that a machine learning model would have to achieve for predicting opinion consistency with a certain accuracy. In the next section, we propose ideas for future research on the topic that we could not cover in this work. 93 6. Conclusion Future Work In this work, we employed supervised machine learning methods to extract opinions from textual data. The most significant disadvantage with this approach is the manual annotation effort that is involved. In the future, it would be interesting to examine unsupervised methods to eliminate the manual labeling process. One of several ideas is to apply topic modeling to group text passages per topic and then use a sentiment lexicon to determine the opinion’s sentiment. Future work should explore how well a model can generalize across topics. In this work, we have trained separate classifiers per dataset. It would be interesting to examine how well a model trained on one topic will perform on another topic. If it would be possible to train a model that performs well on many topics, it becomes more feasible to apply the technique proposed in this work in a generalized way. The formulas for calculating the opinion consistency values determine which type of insights can be derived. In this work we have used a formula, that is good at showing the overall consistency over long periods. Future work should investigate other formulas that can answer different questions. For example, if the interest lies more in spotting contradicting opinions in short intervals, a rolling window could be used, or opinions could be weighed by how far they date back. In this work, we have proposed a two-phase method of first extracting text passages related to a chosen topic and then classifying the opinion on those passages. It would be interesting to test the performance of classifiers that perform both tasks at the same time. Such a classifier could output the additional label of unrelated if the text is not about the desired topic. Alternatively, a classifier that outputs two labels—one for the topic and another for the opinion’s sentiment—could be employed. Finally, improving the classification performance is a reliable way to improve the accuracy of predicted opinion consistency values and thus the conclusions formed thereon. In this work, we have achieved a performance of 70% with a BERT model on a dataset of around 5000 records. Considering that the domain is complex, we consider it a fair achievement. Future work could improve performance in that domain, e.g., by hyperparameter tuning, applying different pre-processing techniques, utilizing transfer learning, or using other models. It would also be interesting how such an approach performs in different domains, e.g., on social media. In this work, we have successfully implemented a method for visualizing the consistency of opinions of individuals and groups. We gathered valuable insights in regards to the practical feasibility and usefulness of the implemented approach. Based on our evaluation, we predict that considerable efforts are required before such an approach becomes useful for a broad range of application domains. 94 List of Figures 2.1 CNN architecture for sentence classification [ZW15] . . . . . . . . . . . . . 
11 2.2 Basic RNN architecture [LBH15] . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Schematic overview of different RNN architectures. [KKDC19] Input vectors are denoted by x, output vectors by y and hidden state vectors by c. Merging arrows indicate a concatenation of vectors and a splitting arrow indicates a copy operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 Overview of the transformer architecture [VSP+17] . . . . . . . . . . . . . 15 2.5 Different syntactic parses of the same sentence as produced by spaCy [HMVB20] 21 2.6 Distribution of submitted model types for the SemEval-2019 Task 6 (sub-task A) [ZMN+19] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.7 Showing the concept of under- and overfitting in a binary classification task in two feature dimensions. The blue line represents a classifier splitting the feature space into two regions. The classifiers, from left to right, are likely to generalize too much, appropriately, and too little. . . . . . . . . . . . . . . 30 2.8 Basic DNN architecture [Agg18] . . . . . . . . . . . . . . . . . . . . . . . 32 3.1 Relationship between sentiment and opinion categories in the first dataset 47 3.2 Screenshot of the annotation software that aided in the annotation process of the second dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.1 Speech lengths in the LOCKDOWN dataset . . . . . . . . . . . . . . . . . 56 4.2 Results of the bag-of-words neural network on the MEASURES dataset . 60 4.3 Results for the LSTM neural network on the MEASURES dataset . . . . . 61 4.4 Speech lengths in the MEASURES dataset . . . . . . . . . . . . . . . . . 62 4.5 Results for the BERT neural network on the MEASURES dataset . . . . 63 4.6 N-Best comparison for the BM25 model . . . . . . . . . . . . . . . . . . . 64 4.7 Results for the BM25 approach on the MEASURES dataset . . . . . . . . 65 4.8 Results for the Multinomial Bayes approach on the MEASURES dataset . 65 4.9 Classification report of the labels, predicted with BERT, used in the opinion data visualizations and opinion consistency comparisons . . . . . . . . . . 68 4.10 Opinions on MEASURES per party . . . . . . . . . . . . . . . . . . . . . 69 4.11 Opinions of the top 20 speakers on MEASURES . . . . . . . . . . . . . . 70 4.12 Opinion consistency over time per party, based on OpCons2(G, H, t) . . . 72 95 4.13 The predicted vs. actual opinion consistencies per party, including neutral opinions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.14 Opinion consistency over time per speaker, based on OpCons2(G, H, t) . . 74 4.15 The predicted vs. actual opinion consistencies for selected speakers, including neutral opinions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.16 The predicted vs. actual opinion consistencies for selected speakers, excluding neutral opinions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.17 Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 1:1:1 (opinion consistency 0.67) 79 4.18 Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 1:0:0 (opinion consistency 1.0) . 
79 4.19 Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 1:0:1 (opinion consistency 0.5) . 79 4.20 Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 3:1:1 (opinion consistency 0.8) . 80 5.1 Relationship between sentiment and opinion categories in the LOCKDOWN dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2 Absolute and relative class frequencies of the two datasets . . . . . . . . . 83 5.3 Opinions of the parties on the measures against the Coronavirus . . . . . 86 5.4 The predicted vs. actual opinion consistencies per party, including neutral opinions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.5 Minimum required model accuracies for predicting the opinion consistency inside a 0.95 confidence interval within a certain margin of error after a certain amount of samples with opinion ratios of 3:1:1 (opinion consistency 0.8) . 88 96 List of Tables 2.1 Example of a confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1 The annotations on the two example sentences, according to the seven cate- gories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2 Number of sentences per topic . . . . . . . . . . . . . . . . . . . . . . . . 50 4.1 Performance comparison of various machine learning approaches on the LOCK- DOWN set, sorted by F1-Score. . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Classification report: BOW on the MEASURES set . . . . . . . . . . . . . 61 4.3 Classification report: LSTM on the MEASURES set . . . . . . . . . . . . 62 4.4 Classification report: BERT on the MEASURES set . . . . . . . . . . . . 63 4.5 Classification report: BM25 on the MEASURES set . . . . . . . . . . . . 65 4.6 Classification report: MNB on the MEASURES set . . . . . . . . . . . . . 66 4.7 Classification report: Open AI Davinci (GPT-3) on the MEASURES set . 67 4.8 Performance comparison of various machine learning approaches on the MEA- SURES set. The standard deviation refers to the F1-Score. . . . . . . . . 67 5.1 Performance comparison of various machine learning approaches on the LOCK- DOWN set, sorted by F1-Score . . . . . . . . . . . . . . . . . . . . . . . . 84 5.2 Performance comparison of various machine learning approaches on the MEA- SURES set. The Standard Deviation refers to the F1-Score. . . . . . . . 85 97 Bibliography [AC10] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics surveys, 4:40–79, 2010. [AC17] Ehsan Mohammady Ardehaly and Aron Culotta. Learning from noisy label proportions for classifying online social data. Social Network Analysis and Mining 2017, 8(1):1–18, November 2017. [Agg18] Charu C Aggarwal. Neural networks and deep learning. Springer, 10:973–978, 2018. [AMPZ17] R. Ahmad, H. Mannan, A. Pervaiz, and F. Zaffar. Aspect based sentiment analysis for large documents with applications to US presidential elections 2016. In AMCIS 2017 - America’s Conference on Information Systems: A Tradition of Innovation, volume 2017-August, 2017. [Bar18] Adrien Barbaresi. A corpus of German political speeches from the 21st century. 
In Proceedings of the Eleventh International Conference on Lan- guage Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA). [BBL99] Doug Beeferman, Adam Berger, and John Lafferty. Statistical models for text segmentation. Machine Learning, 34(1):177–210, 1999. [Ber19] Daniel Berrar. Cross-Validation. Encyclopedia of Bioinformatics and Com- putational Biology: ABC of Bioinformatics, 1-3:542–545, January 2019. [BEW+18] Markus Borg, Cristofer Englund, Krzysztof Wnuk, Boris Duran, Christoffer Levandowski, Shenjian Gao, Yanwen Tan, Henrik Kaijser, Henrik Lönn, and Jonas Törnqvist. Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry. Journal of Automotive Software Engineering, 1(1):1– 19, December 2018. [BG18] Toms Bergmanis and Sharon Goldwater. Context Sensitive Neural Lemmati- zation with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human 99 Language Technologies, Volume 1 (Long Papers), pages 1391–1400, New Orleans, Louisiana, 2018. Association for Computational Linguistics. [Bla] Andreas Blaette. GermaParl. Corpus of Plenary Protocols of the German Bundestag. https://github.com/PolMine/GermaParlTEI. Accessed: 2020-05-21. [BM06] Sabine Buchholz and Erwin Marsi. CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of the Tenth Conference on Compu- tational Natural Language Learning (CoNLL-X), pages 149–164, New York City, 2006. Association for Computational Linguistics. [Boe84] Barry W Boehm. Verifying and validating software requirements and design specifications. IEEE software, 1(1):75, 1984. [Bol19] Marcel Bollmann. A Large-Scale Comparison of Historical Text Normalization Systems. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1:3885–3898, April 2019. [BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. [CDEU17] Mark Cieliebak, Jan Milan Deriu, Dominic Egger, and Fatih Uzdilli. A twitter corpus and benchmark resources for german sentiment analysis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 45–51, 2017. [CGCB14] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Workshop on Deep Learning, December 2014. [CGG+07] Ofelia Cervantes, Francisco Gutiérrez, Ernesto Gutiérrez, Esteban Castillo, J. Alfredo Sánchez, and Wanggen Wan. Expression: Visualizing Affective Content from Social Streams. In Proceedings of the Latin American Con- ference on Human Computer Interaction - CLIHC ’15, pages 1–8, Córdoba, Argentina, 2007. ACM Press. [CS19] Muntazar Mahdi Chandio and Melike Sah. Brexit Twitter Sentiment Analysis: Changing Opinions About Brexit and UK Politicians. In International Conference on Information, Communication and Computing Technology, pages 1–11. Springer, Cham, October 2019. [DCLT19] Jacob Devlin, Ming Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 
NAACL HLT 2019 - 2019 Conference of the North American Chapter of the 100 Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1(Mlm):4171–4186, 2019. [FV14] Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014. [GBV20] Margherita Grandini, Enrico Bagli, and Giorgio Visani. Metrics for Multi- Class Classification: an Overview. arXiv preprint arXiv:2008.05756, August 2020. [GBZ18] Darina Gold, Marie Bexte, and Torsten Zesch. Corpus of aspect-based senti- ment in political debates. 14th Conference on Natural Language Processing - KONVENS 2018, (Konvens):89–99, 2018. [GdCL15] Luís P.F. Garcia, André C.P.L.F. de Carvalho, and Ana C. Lorena. Effect of label noise in the complexity of classification problems. Neurocomputing, 160:108–119, July 2015. [GRT21] Siddhant Garg, Goutham Ramakrishnan, and Varun Thumbe. Towards Robustness to Label Noise in Text Classification via Noise Modeling. ICLR 2021 RobustML and S2D-OLAD Workshops, 2021. [GS96] Ralph Grishman and Beth Sundheim. Message Understanding Conference- 6: A Brief History. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, 1996. [HLS13] Emma Haddi, Xiaohui Liu, and Yong Shi. The Role of Text Pre-processing in Sentiment Analysis. Procedia Computer Science, 17:26–32, January 2013. [HMVB20] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength Natural Language Processing in Python, 2020. https://spacy.io/. Accessed: 2021-10-11. [Hu19] Dichao Hu. An Introductory Survey on Attention Mechanisms in NLP Problems. Advances in Intelligent Systems and Computing, 1038:432–448, September 2019. [HW00] Vasileios Hatzivassiloglou and Janyce M. Wiebe. Effects of adjective orienta- tion and gradability on sentence subjectivity. COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics, pages 299–305, 2000. [JH18] Yahia Hasan Jazyah and Intisar O. Hussien. Multimodal Sentiment Analysis: A Comparison Study. Journal of Computer Science, 14(6):804–818, June 2018. 101 [Jiv11] Anjali Jivani. A Comparative Study of Stemming Algorithms. Int. J. Comp. Tech. Appl., 2:1930–1938, 2011. [JPLN19] Ishan Jindal, Daniel Pressel, Brian Lester, and Matthew Nokleby. An Effective Label Noise Model for DNN Text Classification. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1:3246–3256, March 2019. [JX17] Zhao Jianqiang and Gui Xiaolin. Comparison research on text pre-processing methods on twitter sentiment analysis. IEEE Access, 5:2870–2879, 2017. [JZ15] Rie Johnson and Tong Zhang. Semi-supervised Convolutional Neural Net- works for Text Categorization via Region Embedding. Advances in neural information processing systems, 28:919, 2015. [KG20] J. Kersting and M. Geierhos. Aspect phrase extraction in sentiment analysis with deep learning. In ICAART 2020 - Proceedings of the 12th International Conference on Agents and Artificial Intelligence, volume 1, pages 391–400, 2020. [KKDC19] Zadid Khan, Sakib Mahmud Khan, Kakan Dey, and Mashrur Chowdhury. Development and Evaluation of Recurrent Neural Network-Based Models for Hourly Traffic Volume and Annual Average Daily Traffic Prediction. Transportation Research Record, 2673(7):489–503, 2019. 
[KMH+19] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. Text Classification Algorithms: A Survey. Information, 10(4):150, 2019.
[Lan95] Pat Langley. Elements of Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1995.
[Lan20] Oxford Languages. Oxford Languages Dictionary, 2020. https://languages.oup.com/. Accessed: 2021-10-11.
[LaP18] Joseph LaPorte. Rigid Designators, 2018. https://plato.stanford.edu/archives/spr2018/entries/rigid-designators/. Accessed: 2021-10-11.
[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[Liu12] Bing Liu. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies, 5(1):1–184, May 2012.
[LJ98] Yong H. Li and Anil K. Jain. Classification of text documents. The Computer Journal, 41(8):543–545, 1998.
[LJ15] Jiwei Li and Dan Jurafsky. Do Multi-Sense Embeddings Improve Natural Language Understanding? Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, pages 1722–1732, June 2015.
[LL13] Igor Labutov and Hod Lipson. Re-embedding words. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 489–493, 2013.
[LSHL20] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions on Knowledge and Data Engineering, March 2020.
[MB11] Hassan H. Malik and Vikas S. Bhardwaj. Automatic training data cleaning for text classification. Proceedings - IEEE International Conference on Data Mining, ICDM, pages 442–449, 2011.
[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings, January 2013.
[MCP05a] Ryan McDonald, Koby Crammer, and Fernando Pereira. Flexible text segmentation with structured multilabel classification. In HLT/EMNLP 2005 - Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pages 987–994. Association for Computational Linguistics (ACL), 2005.
[MCP05b] Ryan McDonald, Koby Crammer, and Fernando Pereira. Online Large-Margin Training of Dependency Parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 91–98, 2005.
[MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2018.
[MS99] Christopher Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, 1999.
[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. Neural Information Processing Systems Foundation, 2013.
[NGK20] Nikola Nikolić, Olivera Grljević, and Aleksandar Kovačević. Aspect-based sentiment analysis of reviews in the domain of higher education. The Electronic Library, 38(1):44–64, January 2020.
[Niv03] Joakim Nivre. An Efficient Algorithm for Projective Dependency Parsing. In Proceedings of the Eighth International Conference on Parsing Technologies, pages 149–160, Nancy, France, April 2003.
[Niv08] Joakim Nivre.
Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553, December 2008.
[NL18] V. V. Nhlabano and P. E. N. Lutu. Impact of Text Pre-Processing on the Performance of Sentiment Analysis Models for Social Media Data. 2018 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD), pages 1–6, September 2018.
[NPK+16] Harikrishna Narasimhan, Weiwei Pan, Purushottam Kar, Pavlos Protopapas, and Harish G Ramaswamy. Optimizing the multiclass F-measure via biconcave programming. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1101–1106. IEEE, 2016.
[NS07] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, August 2007.
[OEC] OECD. Trust in Government. https://www.oecd.org/gov/trust-in-government.htm. Accessed: 2021-10-11.
[PCV+00] Georgios Petasis, Alessandro Cucchiarelli, Paola Velardi, Georgios Paliouras, Vangelis Karkaletsis, and Constantine D Spyropoulos. Automatic adaptation of Proper Noun Dictionaries through cooperation of machine learning and probabilistic methods. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’00, New York, New York, USA, 2000. ACM Press.
[PCZ17] Haiyun Peng, Erik Cambria, and Xiaomei Zou. Radical-based hierarchical embeddings for Chinese sentiment analysis at sentence level. In FLAIRS 2017 - Proceedings of the 30th International Florida Artificial Intelligence Research Society Conference, pages 347–352. AAAI Press, 2017.
[Por80] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, March 1980.
[PSS+21] Alexis Palmer, Nathan Schneider, Natalie Schluter, Guy Emerson, Aurelie Herbelot, and Xiaodan Zhu, editors. Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, August 2021. Association for Computational Linguistics.
[PT20] Eirini Papagiannopoulou and Grigorios Tsoumakas. A review of keyphrase extraction. WIREs Data Mining and Knowledge Discovery, 10(2), March 2020.
[Ras21] Sebastian Raschka. L19.5.1 The Transformer Architecture - Online Lecture, 2021. https://www.youtube.com/watch?v=tstbZXNCfLY. Accessed: 2021-10-11.
[RHW86] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[RQH10] Robert Remus, Uwe Quasthoff, and Gerhard Heyer. SentiWS - A Publicly Available German-language Resource for Sentiment Analysis. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), 2010.
[Rud19] Sebastian Ruder. Neural transfer learning for natural language processing. PhD thesis, NUI Galway, 2019.
[RZ09] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
[SEA18] Symeon Symeonidis, Dimitrios Effrosynidis, and Avi Arampatzis. A comparative evaluation of pre-processing techniques and their interactions for Twitter sentiment analysis. Expert Systems with Applications, 110:298–310, November 2018.
[Seb02] Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[SFC+17] Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nădejde. Nematus: a Toolkit for Neural Machine Translation. 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of the Software Demonstrations, pages 65–68, March 2017.
[SG16] Jasmeet Singh and Vishal Gupta. A systematic review of text stemming techniques. Artificial Intelligence Review, 48(2):157–217, August 2016.
[SLC17] Shiliang Sun, Chen Luo, and Junyu Chen. A review of natural language processing techniques for opinion mining systems. Information Fusion, 36:10–25, July 2017.
[SPH+11] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 151–161, 2011.
[SQXH19] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer, 2019.
[SRS14] T. Sree Sharmila, K. Ramar, and T. Sree Renga Raja. Impact of applying pre-processing techniques for improving classification accuracy. Signal, Image and Video Processing, 8(1):149–157, 2014.
[SST17] Dietmar Schabus, Marcin Skowron, and Martin Trapp. One Million Posts: A Data Set of German Online Discussions. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1241–1244, Shinjuku, Tokyo, Japan, August 2017. ACM.
[Ste12] Manfred Stede. Discourse Processing. Synthesis Lectures on Human Language Technologies, 4(3):1–167, December 2012.
[SV21] Nikolaos Stylianou and Ioannis Vlahavas. A neural Entity Coreference Resolution review. Expert Systems with Applications, 168, April 2021.
[SVS13] Rico Sennrich, Martin Volk, and Gerold Schneider. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 601–609. INCOMA Ltd. Shoumen, September 2013.
[SW17] Claude Sammut and Geoffrey I Webb. Encyclopedia of machine learning and data mining. Springer Publishing Company, Incorporated, 2017.
[TDM03] Brian J Taylor, Marjorie A Darrah, and Christina D Moats. Verification and validation of neural networks: a sampling of research in progress. In Kevin L Priddy and Peter J Angeline, editors, Intelligent Computing: Theory and Applications, volume 5103, pages 8–16. International Society for Optics and Photonics, SPIE, 2003.
[TMS03] Ann Taylor, Mitchell Marcus, and Beatrice Santorini. The Penn Treebank: An Overview. Treebanks, pages 5–22, 2003.
[TQW+15] Duyu Tang, Bing Qin, Furu Wei, Li Dong, Ting Liu, and Ming Zhou. A Joint Segmentation and Classification Framework for Sentence Level Sentiment Classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11):1750–1761, 2015.
[TTJ06] Michal Toman, Roman Tesar, and Karel Jezek. Influence of word normalization on text classification. Proceedings of InSciT, 4:354–358, 2006.
[Tug16] Don Tuggener. Incremental coreference resolution for German. PhD thesis, University of Zurich, 2016.
[TWY+14] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin.
Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference, 1:1555–1565, 2014.
[Vit67] Andrew J. Viterbi. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems, pages 5999–6009, June 2017.
[WBK20] Jeremy Watt, Reza Borhani, and Aggelos K Katsaggelos. Machine learning refined: foundations, algorithms, and applications. Cambridge University Press, 2020.
[WBO99] Janyce Wiebe, Rebecca Bruce, and Thomas P. O’Hara. Development and Use of a Gold-Standard Data Set for Subjectivity Classifications. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 246–253, 1999.
[WLS+15] Xin Wang, Yuanchao Liu, Cheng-Jie Sun, Baoxun Wang, and Xiaolong Wang. Predicting Polarities of Tweets by Composing Word Embeddings with Long Short-Term Memory. ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference, 1:1343–1353, 2015.
[WWB19] Yu Emma Wang, Gu-Yeon Wei, and David Brooks. Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. arXiv preprint arXiv:1907.10701, July 2019.
[YCG+98] J. P. Yamron, I. Carp, L. Gillick, S. Lowe, and P. Van Mulbregt. A hidden Markov model approach to text segmentation and event tracking. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, volume 1, pages 333–336. Institute of Electrical and Electronics Engineers Inc., 1998.
[YHPC18] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.
[YKYS17] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative Study of CNN and RNN for Natural Language Processing. arXiv preprint arXiv:1702.01923, 2017.
[ZDLS20] Ming Zhou, Nan Duan, Shujie Liu, and Heung Yeung Shum. Progress in Neural NLP: Modeling, Learning, and Reasoning. Engineering, 6(3):275–290, 2020.
[Zha20] Mei Shan Zhang. A survey of syntactic-semantic parsing based on constituent and dependency structures. Science China Technological Sciences, 63(10):1898–1920, October 2020.
[ZMN+19] Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). arXiv preprint arXiv:1903.08983, March 2019.
[ZW15] Ye Zhang and Byron Wallace. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1510.03820, October 2015.
[ZZL15] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28:649–657, 2015.