Safer Internet Chatbot
Chatbot for the Safer Internet Program
DIPLOMARBEIT
zur Erlangung des akademischen Grades
Diplom-Ingenieur
im Rahmen des Studiums
Logic and Computation
eingereicht von
Dipl. Ing. Kresimir Kasal, BSc
Matrikelnummer 0026127
an der Fakultät für Informatik
der Technischen Universität Wien
Betreuung: Ao.Univ.Prof. Mag.rer.soc.oec. Dr.rer.soc.oec. Horst Eidenberger
Wien, 12. Mai 2023
Kresimir Kasal Horst Eidenberger
Technische Universität Wien
A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at

Safer Internet Chatbot
Chatbot for the Safer Internet Program
DIPLOMA THESIS
submitted in partial fulfillment of the requirements for the degree of
Diplom-Ingenieur
in
Logic and Computation
by
Dipl. Ing. Kresimir Kasal, BSc
Registration Number 0026127
to the Faculty of Informatics
at the TU Wien
Advisor: Ao.Univ.Prof. Mag.rer.soc.oec. Dr.rer.soc.oec. Horst Eidenberger
Vienna, 12th May, 2023
Kresimir Kasal Horst Eidenberger
Technische Universität Wien
A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at

Erklärung zur Verfassung der
Arbeit
Dipl. Ing. Kresimir Kasal, BSc
Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwen-
deten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der
Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder
dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter
Angabe der Quelle als Entlehnung kenntlich gemacht habe.
Wien, 12. Mai 2023
Kresimir Kasal
v

Danksagung
Ich möchte mich an dieser Stelle bei all jenen bedanken, die mich während des Studiums
und der Anfertigung meiner Diplomarbeit unterstützt haben. Ganz besonderer Dank
gilt hierbei meiner Famile, die immer für mich da war und mir den Rücken gestärkt hat.
Zudem möchte ich mich bei meinem Betreuer Prof. Horst Eidenberger für die hilfreichen
Anregungen und die vielen Verbesserungsvorschläge bedanken, ohne die diese Arbeit
nicht möglich gewesen wäre.
vii

Acknowledgements
I would like to take this opportunity to thank all those who supported me during my
studies and during the preparation of my thesis. Special thanks go to my family, who
were always there for me and supported me. I would also like to thank my supervisor
Prof. Horst Eidenberger for the helpful suggestions and the many ideas for improvement,
without which this thesis would not have been possible.
ix

Kurzfassung
Virtuelle Assistenten oder Konversationsagenten - allgemein als Chatbots bekannt -
werden zunehmend zu einem festen Bestandteil unserer modernen Gesellschaft. Häufig
werden diese für Aufgaben eingesetzt, bei welchen ständige Erreichbarkeit von Vorteil ist.
Dies ist beispielsweise dann der Fall, wenn die Beantwortung von Fragen zu Produkten
oder Dienstleistungen rund um die Uhr gewährleistet werden soll. In unserem konkreten
Anwendungsfall wird ein bereits betriebener deutschsprachiger Chatbot dafür eingesetzt,
Fragen von Kindern zu beantworten. Bei den gestellten Fragen geht es darum, ob die
von den Kindern über unterschiedlichste Kanäle erhaltenen Nachrichten Kettenbriefe
darstellen. In einem typischen Szenario stellt ein Kind eine Frage, und sendet die erhal-
tene Nachricht um zu erfahren ob diese ernst genommen oder ignoriert werden sollte.
Die Aufgabe des Chatbots besteht nun darin, die Absicht der Frage zu erkennen und
festzustellen ob es sich bei der erhaltenen Nachricht um einen Kettenbrief handelt, sowie
angemessen darauf zu antworten. Dabei sollte das Gespräch möglichst so gelenkt werden,
dass potentielle Ängste vorgebeugt sowie sinnvolle, konkrete Ratschläge für das weitere
Vorgehen mitgegeben werden. Im Rahmen dieser Arbeit tragen wir zur Verbesserung
des aktuell betriebenen deutschsprachigen Chatbots bei, indem wir (i) eine auf Open-
Source-Technologien basierende Implementierung bereitstellen und (ii) eine quantitative
Bewertung von 120 verschiedenen - auf maschinellem Lernen basierenden - Ansätzen
durchführen, bei welchen wir eine Vielzahl von Algorithmen, neuronalen Netzwerkarchitek-
turen und Word-Embeddings kombinieren. Durch die Anwendung von Transfer-Learning
auf der Grundlage des BERT-Sprachmodells konnten wir eine Klassifikationsleistung von
0.9 für die beiden Metriken F-Score und Genauigkeit erreichen. Ebenso haben wir (iii) den
Einfluss von Emojis untersucht, wobei wir zu unserer Überraschung keinen eindeutigen
Effekt auf die Klassifikationsleistung feststellen konnten. Darüber hinaus haben wir (iv)
eine qualitative Bewertung unserer Implementierung durchgeführt, und einen Fragebogen
zu den erwarteten und tatsächlichen Ergebnissen hinsichtlich des Systemverhaltens er-
stellt. Das Feedback, welches wir über den Fragebogen erhalten haben, war sehr positiv
und zeigte, dass wir die wahrgenommene Qualität der (v) Absichtserkennung, (vi) der
Erkennung von Kettenbriefen sowie der (vii) Antwortgenerierung steigern konnten.
xi

Abstract
Virtual assistants or conversational agents - widely known as chatbots - are becoming
an increasingly pervasive part of our modern society, and are already widely used to
take on tasks where permanent accessibility is beneficial. In our particular use case,
an already operational German language chatbot is used to answer children’s questions
regarding chain letters. It is not a conversational chatbot, but serves a particular goal.
In a typical scenario, a child asks a question and sends the received message, for which it
wants to know whether it should be taken seriously or whether the message can be safely
ignored. The chatbot’s task is to recognise the intent of the question, detect whether the
received message represents a chain letter, and to respond appropriately and steer the
conversation such that potential fears are alleviated and advice is given on how to proceed
further. Throughout this work, we have improved the current German language chatbot
by (i) providing an implementation based on open-source technologies, and conducted
a (ii) quantitative evaluation of 120 different approaches based on machine learning,
where we have combined a variety of algorithms, neural network architectures and text
embedding methods. By applying transfer learning based on the BERT language model,
we were able to achieve a classification performance of 0.9 for both metrics F-Score and
accuracy. We have also examined (iii) the influence of emojis on the overall classification
performance, where to our surprise we could not identify any clear effect. Furthermore,
we have conducted (iv) a qualitative evaluation of our implementation and have therefore
compiled a questionnaire regarding the expected and actual results concerning the system
behavior and performance. The feedback received through the questionnaire has been
very positive, and it showed that we were able to increase the perceived quality of (v)
intent recognition, (vi) chain letter detection and (vii) response generation.
xiii

Contents
Kurzfassung xi
Abstract xiii
1 Introduction 1
1.1 Motivation and Problem Statement . . . . . . . . . . . . . . . . . . . . 1
1.2 Aim of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 5
2.1 Chatbots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Components and Architectures . . . . . . . . . . . . . . . . . . 5
2.1.2 Dialogue Management . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Natural Language Understanding . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . 11
2.2.2 Text Representation . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Classical Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Transfer learning with pretrained language models . . . . . . . . . . . 29
3 System Design 31
3.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Design Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Text-to-Text Transformer . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.3 Answer Generation . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.4 Conversation Memory . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.5 Attachements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Selected Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 38
xv
3.4.2 Emoji Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.3 Training, Validation and Test Datasets . . . . . . . . . . . . . . 39
3.4.4 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Implementation 41
4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Intent Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Answer Generation . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Evaluation 47
5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Evaluation of Algorithms for Intent Classification . . . . . . . . . . . . 50
5.3 Evaluation of Algorithms for Intent Labelling . . . . . . . . . . . . . . 52
5.4 User Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.1 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.2 Perceived Performance . . . . . . . . . . . . . . . . . . . . . . . 62
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6 Conclusion 79
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
List of Figures 83
List of Tables 89
Bibliography 91
CHAPTER 1
Introduction
1.1 Motivation and Problem Statement
Conversational agents - widely known as chatbots - are becoming an increasingly pervasive
part of our modern society, and are already widely used to take on tasks where permanent
accessibility is beneficial, for instance by answering questions regarding products or ser-
vices around the clock. During recent months, chatbots have gained great popularity with
the emergence of ChatGPT [Ope22]. In our particular use case, an already operational
German language chatbot is used to answer children’s questions regarding chain letters.
It is not a conversational chatbot, a so-called chatterbot, but serves a particular goal
and can thus be considered a task-oriented conversational system. Hence, in a typical
scenario, a child asks a question and sends the received message, for which it wants to
know whether it should be taken seriously or whether the message can be safely ignored.
The chatbot’s task now is to recognise the intent of the question, and to detect whether
the received message represents a chain letter. After the request has been processed and
analysed, an answer shall be generated and sent back to the conversation partner. The
chatbot should make it clear for the child that it is not participating in a conversation
with a human, it should reduce the amount of exchanged messages and keep the focus on
helping the child handle the situation when a chain letter has been received. The chatbot
is not required to keep track of the whole interaction with the conversation partner, but
to steer the conversation such that potential fears are alleviated and advice is given
on how to proceed further (e.g. delete the message, call a helpline, etc.). Moreover,
occasionally it happens that audio messages or documents are sent by children as well.
Thus, the chatbot shall be able to recognize documents and understand audio messages.
1
1. Introduction
1.2 Aim of the Work
The aim of the work is to improve the current German language chatbot, particularly
with regard to the quality of (i) intent recognition, the quality of (ii) chain letter detection,
as well as to (iii) increase the quality of generated responses. Furthermore, it shall also
be verified (iv) whether the improvement can be accomplished by using open source
technologies. In order to do so, classic machine learning algorithms, deep learning
neural network architectures as well as a pre-trained German language model, fine-tuned
with a problem specific dataset, shall be evaluated and compared with regard to their
classification performance. To tackle the requirement of improving the quality of intent
detection and to be able to generate better suitable answers, we intend to orthogonally
complement the classifier by an additional label detector. In our particular case, we
intend to utilize labels to signal a context which is not captured well by classes contained
in our problem specific dataset. Thus, by combining classes and labels, we intend
to refine the captured meaning of messages where no specific intent can be derived
from the assigned category. By doing so, we intend to improve the perceived quality
of conversation. Furthermore, as it occasionally happens that documents and audio
messages are sent by children as well, the chatbot shall also be able to retrieve text from
several document formats, and to detect and understand audio messages. We will use
the magic and filetype Python libraries to detect the type of the uploaded file, and
the SpeechRecognition Python library to transform audio data to text, since the
models expect text as their corresponding input data format.
1.3 Research Questions
Following research questions are addressed in this thesis:
1. How do classic machine learning algorithms and conventional deep neural networks
compare to approaches based on fine-tuned, pre-trained language models?
The reason for asking this question is to find the most promising approach. We
intend to explore possible combinations that arise from the variety of existing
algorithms and text representation methods in order to find the most suitable
combination, and thus to be best equipped when addressing the challenge of intent
recognition and solving the particular problem at hand.
2. To what extent do emojis contribute to the overall classification performance?
Emojis represent a significant part of communication between children. Hence,
this raises the question of how much they add to the intended meaning behind the
message. From a more practical point of view, this question intends to explore
whether emojis can be reasonably utilized to contribute to the overall classification
performance, and therefore to enhance and improve the chatbot’s capability to
recognize the intent behind received messages.
2
1.4. Outline
3. Is it possible to improve the existing chatbot with open-source technology?
The goal is to provide a cost effective solution which offers higher quality in intent
recognition and chain letter detection. Hence, in order to lower costs, it is crucial to
verify whether an implementation based on open source technology can be provided.
1.4 Outline
Chapter 2 presents the background knowledge and related work upon which the thesis is
based on. In particular, a short overview is given on chatbot components and architectures,
natural language processing and understanding, classical machine learning algorithms
and deep neural network architectures. In Chapter 3, system requirements are delineated
and the design of the system is illustrated. Furthermore, design decisions and selected
methods are explained and justified. Chapter 4 explains the operating principles of the
chatbot, as well as how training of language models used for classification and labelling
is performed. In Chapter 5, the evaluation of the implemented chatbot is described. On
one hand, classification and labelling models are quantitatively evaluated. On the other
hand, actual results from user’s point of view are presented. Chapter 6 concludes the
work, discusses its outcomes and opens questions and possibilities for further work.
3

CHAPTER 2
Related Work
This chapter provides an overview of related work and gives a background information
for the following presentation of the thesis. First, an overview is given regarding chatbots,
components they are comprised of and common architectures. Afterwards, natural lan-
guage processing and understanding are presented. Finally, machine learning algorithms
and architectures are examined.
2.1 Chatbots
2.1.1 Components and Architectures
In [Gal19], a chatbot is described as a program that serves as an interface between a
human and an application, utilizing natural language as primary means of communication.
Figure 2.1: High-level basic architecture of a chatbot, as depicted in [Gal19].
A chatbot architecture, represented in its simplest form in figure 2.1, comprises a
Natural Language Understanding (NLU) component, which is responsible for generating
a meaning or an interpretation of a user’s statement, such as the intent behind the
utterance or a logical representation of it. After the NLU component, the next module is
5
2. Related Work
the dialogue manager (DM), which plays a crucial role in managing the conversation flow
and communication between various sub-systems and components. It can be thought of
as a meta-component, that enables smooth interaction between the chatbot and the user.
A further crucial module is the Natural Language Generator (NLG), which in [Gal19]
is considered to be a part of the dialogue manager. The NLG module receives input
information, such as the retrieved intent behind the user’s statement, and produces a
corresponding textual representation afterwards. In the subsequent paragraphs, additional
components of a common chatbot, as described in [Gal19], are enumerated and further
details are provided. The list is not exhaustive, as we have focused on what we consider
to be essential.
Natural Language Understanding
One of the main functions of an NLU is parsing. It involves taking a sequence of words,
identifying keywords and entities, and creating a linguistic structure for the statement
that can be processed by other components in the architecture [Gal19]. How the parsing
process is accomplished, greatly varies and depends on the specific implementation. It
may involve rule-based approaches such as context-free grammars or pattern matching
techniques, machine learning algorithms, statistical models or data-driven approaches
such as large language models (LLM) which have become very popular recently.
Dialogue Manager
A dialogue manager shall enable a smooth interaction between the chatbot and the
user. To engage in flexible conversations, the DM needs to model a formalized dialogue
structure, perform contextual interpretation, manage domain knowledge and potentially
select appropriate chatbot actions. As is described in [Gal19], the process of contextual
interpretation often involves maintaining a certain level of dialogue context that can be
utilized to resolve anaphoric references, that is to resolve meaning of words which refer
to other ideas or words for their meaning. Additionally, a dialogue manager is expected
to have the ability to reason about the particular domain within which it operates.
Topic Detection
In order to facilitate a more enjoyable and interesting conversation, topic detection is
employed to monitor and maintain context and subject matter. In general, this is achieved
by utilizing a text classifier to categorize incoming messages into various topics. One
valuable source of training data for this purpose can be found within Reddit comments,
as they are often organized according to specific areas of interest [Gal19].
Named Entities and their Templates
This component is composed of two key elements: a Named Entity Recognition and
Disambiguation (NER) model and a template selection model. The NER model links
entities mentioned within a given text to an associated knowledge base, thereby enabling
the chatbot to comprehend the nature of these entities and engage with conversation
topics appropriately. Once a list of entities has been generated, pre-written templates are
employed to formulate responses. By taking into account the required information for
6
2.1. Chatbots
each template and the attributes of all available entities, a related template is selected at
random in order to promote conversational diversity [Gal19].
Information Retrieval
The objective of this module is to generate responses that are more natural and up-to-
date compared to those produced by the entity-based template and dialogue generation
modulesm, as described in [Gal19]. The module can obtain information from various
sources, such as tweets obtained through the Twitter search API.
As described in [Gal19], further possible modules of a chatbot architecture are the so-
called Personalization module, in which a model of user’s personality is built and utilized
throughout the conversation, the Multimodal Interaction module which incorporates
multiple modalities for communication, such as voice and hand gestures. The so-called
Context Tracking module is important for coreference resolution, which means that all
expressions which refer to the same entity in the text are identified. For instance, the
produced coreference chain is utilized to alter the original input message by substituting
pronouns with the corresponding entities to which they refer to.
In figure 2.2, a more complex architecture is given. It includes components for topic
detection, intent analysis and entity linking. The component responsible for topic
detection calculates the probability of each covered topic by analyzing the intents.
Depending on the NLU result, the chatbot will follow various paths as per its built-in
conversational strategies.
Figure 2.2: Architecture of a chatbot comprising components outlined in this section, as
proposed in [LLS+17] and described and summarized in [Gal19].
7
2. Related Work
2.1.2 Dialogue Management
As already described above, the Dialogue Management component is concerned with
describing the flow of conversation. In the following paragraphs, we will give an overview
of approaches suitable to tackle the dialogue management task, as described in [Gal19].
Deterministic
Some of the prominent representatives of a deterministic approach to dialogue management
are rule-based, state-machine based and template-based dialogue management approaches.
Rule-based approaches are often compared to production systems which involve logic
programming and rules that implement reasoning. The precondition of clauses in logic
programming which make up the rules, may be instantiated from the user’s input or
triggered by pattern matching. Such a component usually deterministically matches
the input and returns the corresponding, single output. Rule-based chatbots are more
flexible than most script-based dialogue systems that follow a fixed, pre-determined flow.
However, the expressiveness of rules is not sufficient to handle all types of variability and
dynamics in human conversations. Therefore, these are only applicable to domains where
users are restricted to a predetermined set of actions and phrasings. This is not the case
in undirected, free conversations.
Finite state machines (FSM) have also been utilized in chatbots, as they establish a
predetermined sequence of steps that depict the conversation’s stages at any given moment.
Each transition encodes a communicative act between the chatbot and the user. The
use of FSM-based dialogue management may seem unnatural, as it does not accurately
reflect the way in which conversations between humans typically unfold in the real world.
Furthermore, although incorporating additional states to an FSM is straightforward, it
becomes increasingly challenging when the chatbot domain is complex and extensive.
Hence, FSM-based chatbots are better suited for narrow vertical domains whose scope
can be adequately defined, as described in [Gal19]. One of the key advantages of such
chatbots is their predictability, as any utterance generated by the chatbot can be traced
back to the preceding state that produced it.
One of the standard ways to specify templates is to use Artificial Intelligence Markup
Language (AIML). In order to generate a response, a chatbot would typically use a
collection of AIML templates that take into account the conversation history and the
user’s input. These templates can be adapted to align with the chatbot’s objectives, such
as filtering out inappropriate language or steering the conversation towards certain topics.
However, AIML templates can sometimes result in incomplete or incorrect sentences
since they repeat the user’s input. To address this issue, it can be helpful to incorporate
web mining techniques to generate more coherent and accurate responses, as described
in [Gal19].
8
2.1. Chatbots
Statistical
This section outlines the ways in which statistical learning techniques are utilized, often
referred to as data-driven methods due to their reliance on large datasets for learning
dialogue strategies. In such cases, the system is provided with a corpus of input data,
which allows it to learn suitable responses. Although remarkable results have been
achieved by utilizing large amounts of data, at the same time the reliance upon data
to support the learning process is also the primary disadvantage of such approaches, as
described in [Gal19].
A classic representative of statistical methods are the so-called Bayesian networks,
which model a probabilistic distribution between events, such as dialogue utterances
for instance. They are based on Bayes’ theorem, which describes the probability of an
event based on prior knowledge of conditions related to the event. Bayesian networks
consist of a directed acyclic graph, and conditional probabilities for transitions between
the individual nodes. Similarly to rule-based chatbots, the network structure is designed
by domain experts and is thus associated with a significant degree of development effort.
Furthermore, initial conditional probabilities for transitions between the nodes need to be
computed during system inception as well. Some criticism of Bayesian networks is, that
they have limited ability to handle dynamic input and that their predefined methodology
is restrictive. Additionally, they are criticized for their inability to transition naturally
between topics in conversations, as described in [Gal19].
A further prominent statistical method are the so-called Markov models, also known as
Markov Decision Processes or Markov Chains. These are based on the idea of a Markov
property, which states that the future state of a system depends only on its current state,
and not on any previous states. In a general case, a Markov Decision Process consists of
• a discrete set of states H,
• a discrete set of actions A,
• the transition distribution or probability function Pa(h, h′), which describes how
likely it is that action a in state h at time t will lead to state h′ at time t + 1, and
• the reward distribution function Ra(h, h′), which describes the reward for an agent,
which is received after transitioning from state h to state h′ due to action a.
An agent aims to maximize its reward. For instance, at time step t, let the agent be in state
ht ∈ H , and take action at ∈ A. Then, after transitioning to a new state ht+1 due to action
at ∈ A with probability P (ht, ht+1), the agent receives a reward R(ht, ht+1). Higher-
order Markov models could also be used. With higher-order models, the probability of
transitioning to a new state is based on two or more previous states. Markov model-based
systems require training in order to learn dialogue strategies. Supervised learning has been
utilized for the initial training, but also reinforcement learning has been proposed to learn
9
2. Related Work
optimal strategies. In [Gal19], relying on automatically created - e.g. via reinforcement
learning - dialogue management strategies is described to be disadvantageous, as there is
no control to make sure that dialogue flow is adequate and relevant.
In the context of chatbots, neural network based techniques have been applied in speech
recognition, sequence matching, prediction, sequence-to-sequence learning and response
generation based on corpus training. A prominent example of the sequence-to-sequence
learning approach are so-called large language models (LLM), which have gained huge
popularity recently due to emergence of ChatGPT [Ope22]. LLMs are based on the
transformer neural network architecture, which is described in section 2.3.2, and are
moreover trained on vast amounts of data. To further improve the performance, the
so-called Reinfocement Learning with Human Feedback (RLHF) approach has been
applied to LLMs. In [OWJ+22], the authors report that simply increasing the size of
language models did not necessarily make them better at understanding and addressing
the user’s intent. In fact, larger models may produce harmful or unhelpful outputs,
since they are not aligned with their users. Hence, to address this issue, an additional
step involving fine tuning by applying reinforcement learning with human feedback has
been introduced after an initial fine-tuning step based on supervised learning. The
authors report that the approach has lead to improvements in model’s truthfulness and
reductions in toxic output generation, despite having significantly fewer parameters than
the comparative model, which has not been fine-tuned by applying RLHF.
Example-Based
Example-based dialogue management is a popular approach, which involves collecting
pairs of initial user utterances and corresponding chatbot responses in a database [Gal19].
These examples are then used to generate chatbot responses for new user inputs. Example-
based dialogue managers can be easily modified by updating the dialogue examples in the
database, making them flexible and effective for scenarios where the dialogue system’s
domain or the task frequently change. However, to ensure coverage of a variety of inputs
in the dialogue, a large number of dialogue examples are needed.
Transfer-Learning
Transfer-Learning, or reusing pre-trained models on a new problem, is a popular technique
in the field of Deep Learning (see section 2.4), since it enables developers to train the
chatbot with a relatively small dataset [Gal19]. This approach thus proves to be beneficial,
as complex models generally require vast amounts of labeled data samples, which are not
usually available for real-world problems.
2.2 Natural Language Understanding
In this chapter, methods are introduced which enable machines to interpret the intent
behind written text. First, Natural Language Processing (NLP) techniques such as
10
2.2. Natural Language Understanding
tokenization, stemming and lemmatization, are covered. These are necessary to prepare
the text such that it can be interpreted afterwards. For instance, in the tokenization
process, text is broken into smaller units, such as words or parts of words. Lemmatization,
on the other hand, is the process of reducing a word to its root form. By doing so,
different spellings of a word can be mapped to the intended meaning. Results achieved
by processing steps such as tokenization or lemmatization are then interpreted. Thus, in
the following section, methods to represent the intended meaning of text are illustrated.
These are called word embeddings, and allow words or text to be represented as numerical
vectors or matrices in a dense, high-dimensional vector space. By doing so, words and
text are represented in a way such that they can easily be processed by machine learning
algorithms.
2.2.1 Natural Language Processing
Pattern Matching
One possible approach to NLP is pattern-matching, which means that formal languages
are used to specify natural language structures which can occur in a conversation. One
prominent example of such a formalism are regular expressions. In [JM09], a regular
expression is defined as a formula in a special language that specifies simple classes of
strings, where a string is described as a sequence of symbols; for most text-based search
techniques, a string is any sequence of alphanumeric characters (letters, numbers, spaces,
tabs and punctuation). The authors further write that formally, a regular expression is an
algebraic notation for characterizing a set of strings, which can specify search strings as
well as define a language in a formal way. Thus, by using regular expressions, a special
kind of production system called regular grammars can be defined [HMU07]. These have
a predictable, deterministic and provable behaviour. Despite these very useful properties,
since one is greatly inclined to think that they are very limiting, regular grammars are
still flexible enough to power dialog engines used in popular products such as Amazon
Alexa or Google Now, as delineated in [LHH19].
Tokenization
In [MRS08], the authors define tokenization as the process of chopping the document
into pieces, called tokens, while at same time throwing away certain characters such
as punctuation. Tokens are often loosely referred to as terms or words, and are more
precisely defined by the authors as instances of a sequence of characters in some particular
document that are grouped together as a useful semantic unit for processing. Hence, with
tokenization unstructured data is broken into smaller units of information, which can
further be processed e.g. by applying stemming or lemmatization, or even counted as
discrete elements to represent the document as a vector, as described in [LHH19].
11
2. Related Work
Stemming
In order to accommodate grammatical variations, documents often feature different forms
of a word, such as organize, organizes, and organizing. Furthermore, there are groups
of words with related meanings, such as democracy, democratic, and democratization.
As Manning et al. describe in [MRS08], the aim of stemming is to simplify inflectional
forms to a shared base form. Hence, stemming often involves removing derivational
affixes and uses an unsophisticated heuristic method that truncates word endings with
the hope of correctly achieving this objective. The Porter algorithm is the most widely
used approach for stemming English, and has demonstrated to be remarkably effective.
This technique consists of five phases of word reduction, applied sequentially, with various
rules to be employed within each stage. Stemmers employ language-specific rules and
require less knowledge than a lemmatizer, which necessitates a complete vocabulary and
a morphological analysis to accurately determine lemmata of words, see [MRS08].
Lemmatization
In [MRS08], lemmatization is described as a process which targets the same goal as
stemming. However, it differs by approaching the problem in a more sophisticated manner.
It involves the use of a vocabulary and morphological analysis of words with the goal of
identifying and returning the root form of a word, known as the lemma. The process
typically involves removing only inflectional endings while retaining the root of the word.
Unlike stemming, which may produce crude results such as returning s for the word saw,
lemmatization attempts to return the appropriate base form, such as see or saw, based
on the context and part of speech of the input token, as described in [MRS08].
Part-of-Speech-Tagging
In [JM09], part-of-speech tagging is described as the process of assigning a part of speech
or other syntactic class marker to each word in a corpus. Since in general tags are also
applied to punctuations, tagging requires those to be separated from words. Hence,
tokenization (described in section 2.2.1) is usually applied before. To perform the task, a
tagging algorithm takes in a sequence of words and a designated set of tags, and outputs
the most appropriate tag for each word in the sequence. Tagging algorithms can be
divided into two classes: rule-based taggers and stochastic taggers. In rule-based taggers,
a substantial collection of disambiguation rules that are handwritten, is usually utilized.
For instance, the rules help in specifying whether a certain word should be tagged as
a noun or as a verb. On the other hand, stochastic taggers resolve tagging ambiguities
through the use of a training corpus that calculates the probability of a given word having
a particular tag, based on the context it is used in. A typical stochastic tagger would be
based on Markov models, as already described in 2.1.2.
12
2.2. Natural Language Understanding
Named Entity Recognition
In [JM09], Named Entity Recognition (NER) is described as a fundamental step in
information extraction, as it involves detecting and categorizing named entities in a
text. Here, named entity means anything that can be referred to with a proper name,
such as people, organizations, and locations. Named entity recognition involves two
steps: identifying the span of the text that constitutes a proper name, and classifying the
entity based on its type. While generic systems focus on identifying people, places, and
organizations, specialized systems can also identify commercial products or other entities.
Typically, named entity recognition is approached as a word-by-word labeling task, where
each assigned tag captures both the boundary and the type of a detected named entity.
Statistical sequence labeling techniques, such as Markov models or conditional random
fields, are common approaches to implement named entity recognition.
2.2.2 Text Representation
In this section, we will have a look at different approaches on how text can be represented
and its meaning captured. A common naming for such representations is the term
embeddings, as texts are represented or "embedded" as numerical vectors or matrices
in a high-dimensional vector space. The advantage of doing so is that words and text
can easily be processed by machine learning algorithms. Hence, in the following section,
commonly used approches will be examined.
TF-IDF
TF-IDF is an abbreviation and stands for Term Frequency - Inverse Document Frequency,
which is a short and concise description of measures that are used to calculate vectors in
the underlying embedding space. The basic idea here is that texts can be described as
so-called bags-of-words. In this model, a text is represented as a multiset of its words while
the underlying grammar and word ordering are ignored. What is considered important
are word frequencies and their relative occurrences - as a measure of their importance -
in the whole document corpus. The dimensionality of the vector representing the text is
equal to the size of the vocabulary of words which occurs across all documents in the
corpus. In order to calculate the "length" of each component (which corresponds to a
single word or term) in the vector representation of the document d, two measures are
required: (i) the term frequency which is equal to the number of occurences of term t in
the document d, and (ii) the inverse document frequency, which is equal to the logarithm
of the ratio of the number of all documents in the collection (N) relative to the number
of documents containing the term t, denoted as dft (as described in [MRS08]):
idft = log
N
dft
(2.1)
The value for each dimension (that is, the length of the vector for a specific term or word)
is then calculated as the product of the frequency of the term in the document tft,d and
13
2. Related Work
the inverse document frequency idft:
tf − idft,d = tft,d × idft (2.2)
Hence, the embedding for a single document d in a corpus is a result of calculations delin-
eated in equations 2.1 and 2.2, performed for each term t in the corpus, for the particular
document of interest d. For dictionary terms that do not occur in a document, this value
equals to zero. The resulting vector is a sparse and high-dimensional representation
of the document, which can be used for text classification or clustering. Although the
approach provides a simple and effective way to handle a wide range of text data, it has
some limitations. For instance, it suffers from high-dimensionality, since each term in the
corpus is represented as a separate feature. Furthermore, TF-IDF representations do not
capture the meaning of words in a comprehensive way, which is obvious particularly when
different words are used to describe same or similar meanings, such as the words boat or
ship. Hence, in this particular case, although both words are very similar in meaning,
they are represented by different dimensions and thus indicate no similarity at all.
Word2Vec
Word2Vec denotes a method proposed to overcome problems that come with techniques
which treat words as atomic units, such as TF-IDF, and has been proposed by Mikolov et
al. in [MCCD13a]. The core idea of the approach is to use weights in a neural network to
represent words. The reasoning is as follows - if the network has been trained to predict
words, then the weights which lead to the recognition of a certain word are in fact a
representation of that particular word, a representation in a high-dimensional vector
space. The dimensionality of the vector space corresponds to the number of weights
contained in the network layer, which contributes to predicting the word. To predict a
target word, its surrounding context, that is its surrounding words, are taken as input.
In [MCCD13a], two architectures have been proposed for this purpose: (i) the continuous
bag-of-words (CBOW) model, and the (ii) continuous skip-gram model, both depicted in
figure 2.3. As can be seen, both models differ in their corresponding input and output
structures. The CBOW model takes a context of words as input and predicts a single
word. The order of words does not matter here, since the words get projected onto the
same position, and the vectors get averaged, hence also in this case the term bag-of-words.
The continuous skip-gram model, on the other hand, takes a target word as input and
tries to predict the words that are likely to appear in its context. In this case, the
input is a single word, and the output is the context of the target word, which means its
surrounding words. Both models are effective at generating high-quality word embeddings.
However, the CBOW model is faster and tends to perform better on frequent words, while
the skip-gram model is better suited for infrequent words and capturing rare relationships
between words [MCCD13b].
14
2.2. Natural Language Understanding
Figure 2.3: CBOW and Skip-gram models, as proposed in [MCCD13a].
GloVe
Global Vectors for Word Representation (GloVe) is a further method to represent words
in dense vector spaces. Similar to word2vec, the underlying assumption is that words
occurring in the same context tend to have similar meanings. However, instead of using
neural networks to construct the embeddings, in GloVe a co-occurence matrix is used for
this purpose. The co-occurence matrix is constructed by counting the number of times
each word appears in the context of every other word that is contained in the corpus.
Hence, the advantage of GloVe is that global statistics (word co-occurrences) are used to
obtain the word vectors, and not only local information (that is, the context of the target
word). Afterwards, the matrix is factorized by applying matrix factorization techniques
to obtain low-dimensional vectors for each word, as described in [PSM14]. A drawback
of the approach is that in case of large corpora, large amounts of memory are required to
store the co-occurrence matrix.
BERT Embeddings
In [DCLT18], the authors propose a pre-trained language representation model called
BERT (Bidirectional Encoder Representations from Transformers), which was designed
to generate contextualized embeddings that capture the meaning of words and their
relationships within a sentence. The difference to previous approaches is, that here the
vectors are contextualized. With word2vec or GloVe, each word is represented by exactly
one embedding vector. For instance, the German word Bank - which can refer to a park
15
2. Related Work
bench or to a financial institution - has only one embedding vector. This one vector is
the result of all sentences that were included in the training corpus, the average value of
all contained contexts so to say. Contrary to that, in case of BERT embeddings, values
of vectors are dependent on their context. Hence, in different contexts, the embedding
vectors of a single word are different.
To build the model, a transformer-based architecture is used (see 2.3.2). This is a
type of neural network that processes entire input sequences at once, rather than word-
by-word. BERT is trained on large amounts of unlabeled text data using a process
called masked language modeling. During this process, part of words in the input text is
randomly masked out, and the model is trained to predict the masked words based on
the surrounding, non-masked words. By doing so, (i) relationships between words in a
sentence are learned, and (ii) embeddings generated that capture the context in which a
word appears.
2.3 Machine Learning
In this section, machine learning algorithms utilized throughout this work are presented.
First, classic machine learning algorithms such as decision trees or K nearest neighbours
are examined. Afterwards, deep learning architectures, such as convolutional and recurrent
neural networks, are introduced. Then, transformers are presented, the latest state-of-
the-art neural network architecture. And finally, transfer learning based on transformers,
are covered.
2.3.1 Classical Algorithms
Decision Trees and Random Forests
In [Mit97], decision tree learning is described as as method for approximating discrete-
valued target functions, where the function to be learned is expressed through a decision
tree. The learned decision tree can also be transformed into a set of if-then rules, making
it easier for humans to understand how results are obtained. The author describes the
functioning principle behind predictions based on a decision tree as follows:
Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute. An
instance is classified by starting at the root node of the tree, testing the attribute specified
by this node, then moving down the tree branch corresponding to the value of the attribute.
This process is then repeated for the subtree rooted in the new node.
The main idea behind the algorithm to construct such a decision tree can be expressed
recursively. In [WFH11], the algorithm is described to consist of the following steps:
• select an attribute to place the root node and make a branch for each possible value,
since this splits up the example set into subsets, one for every value of the attribute
16
2.3. Machine Learning
• repeat the process recursively for each branch, using only those instances that
actually reach the branch
In order to select the attribute which split up data partitions such that the simplest
possible decision tree is created, selection measures such as entropy, Gini index or the gain
ratio are used, as described in [WFH11]. After a tree-like model of decisions and their
possible consequences has been constructed, each decision node represents a condition on
an input variable, and each leaf node represents a class label or a numeric value, since
the algorithm can be used for both classification and regression tasks. For classification,
the leaf nodes represent class labels. For regression, they represent numeric values.
Overfitting describes the situation when the model fits the training data too closely, but
exhibits poor generalization performance on unseen data and thus fails to predict future
data reliably. This effect usually occurs when decision tree learning is applied, because
decision trees can become very complex, with many branches and leaves. Overfitting can
be addresses by pruning the tree, which means removing nodes or branches to simplify
the tree, limiting the depth of the tree or using techniques such as regularization (that is,
adding penalty to the loss function that the model is optimizing during training).
A further approach has been proposed to mitigate overfitting - random forests. This is
an ensemble learning method that combines multiple decision trees. The basic principle
is that multiple decision trees are created on different subsets of the training data, such
that each tree is different. The output of these decision trees is combined to make the
final prediction. Random forests exhibit an improved accuracy, less sensitivity to outliers
and can easly be parallelized.
K Nearest Neighbours
In [Mit97], the K Nearest Neighbours (KNN) algorithm is described as a conceptually
straightforward approach to approximate real-valued or discrete-valued target functions,
as it assumes that all instances correspond to points in an n-dimensional vector space.
The nearest neighbors of an instance are thus defined in terms of Euclidean distance,
where an arbitrary instance x is described by the feature vector
< a1(x), a2(x), ..., an(x) > (2.3)
with ar(x) denoting the value of the rth attribute of x, and the distance between two
points xi and xj being defined as the square root of the sum of all component differences:
d(xi, xj) =
n
r=1
(ar(xi) − ar(xj))2 (2.4)
The target value is then predicted based on the majority vote or the average of the K
nearest neighbours. The algorithm can be computationally expensive for large datasets,
17
2. Related Work
since significant computation can be required to process each new query. Furthermore,
the algorithm is considered to be resistant to overfitting due to its lazy learning approach.
However, choosing the value of K and the distance metric can very much affect the
algorithm’s performance.
Naive Bayes
The Naive Bayes algorithm is a probabilistic method that relies on the Bayes theorem to
calculate the probability of an instance that it belongs to a certain class. The algorithm
is considered naive, since it makes the strong assumption of independence between the
individual features. While this assumption in general is not true, it very much simplifies
the estimation. The Bayes theorem is defined as
P (h|D) = P (D|h) × P (h)
P (D) (2.5)
where P (h) is called the prior probability of hypothesis h, and can be thought of the
knowledge we have that h is a correct hypothesis. P (D) denotes the prior probability
that training data D will be observed, and P (D|h) denotes the probability that data D
will be observed in case the hypothesis h holds true, as described in [Mit97]. Hence, we
want to know P (h|D), the probability that hypothesis h is true when data D is observed.
For instance, we want to know how likely it is that a document belongs to a particular
class (hypothesis h) when certain words and their corresponding frequencies are observed
(data D). Using the Bayes theorem, this probability can be calculated, as exhibited in
examples in [WFH11]. Thus, by simply counting words and applying the Bayes theorem,
we can make predictions on how likely it is that a certain instance, e.g. a document,
belongs to a particular class, e.g. that the document covers a subject such as medicine.
Multilayer Perceptron
The multilayer perceptron (MLP) is a type of fully connected feedforward artificial neural
network, and represents the basis for more complex deep learning architectures. A fully
connected network means here that each unit from layer n is connected to all units from
subsequent layer n+1, each unit from layer n+1 is connected to all units from subsequent
layer n + 2, and so forth. Feedforward, on the other hand, refers to the fact that no loops
are contained in the network. All connections between the individual units or neurons
have weights assigned to them, which increase or decrease the signal while it is being
forwarded through the network, as depicted in figure 2.4 (e.g. weight w1 influences signal
x1, since it is the product w1x1 which is contributing to the activation of the neuron).
MLPs consist of three or more layers of neurons, where each neuron receives input from
the previous layer and passes its output to the next layer. The input layer is the first
layer in a multilayer perceptron, and receives the input data. It is followed by one or
more hidden layers, which perform nonlinear transformations on the input data. For an
example, see figure 2.5. The ability to approximate nonlinear relationships between the
18
2.3. Machine Learning
Figure 2.4: The sigmoid threshold unit applied to the sum of weighted inputs [Mit97].
input and output data comes from nonlinear activation functions in individual neurons,
as the output of each neuron is computed by the activation function applied to the
sum of its inputs. Typical activation functions are the sigmoid function (see figure 2.4),
rectified linear unit (ReLU) or the hyperbolic tangent function (tanh). Moreover, it has
been shown that multilayered feedforward neural networks are universal approximators,
and thus can approximate any function if sufficiently many hidden units are available
[HSW89]. After forwarding the signal through all the hidden layers, the output layer
produces the final result.
Figure 2.5: Multilayer perceptron with a single hidden layer composed of three units or
neurons. The input layer receives the data and forwards it to the hidden layer, from
where it is then finally passed to the output layer. The example was taken from [Mit97].
The weights of connections between the neurons are learned through backpropagation,
an algorithm to train parameterized networks with differentiable nodes, in which the
error is backpropagated through the network to update the weights. These are updated
in such a way that the difference between the predicted (that is, calculated) and the
true output is minimized. The chain rule is utilized in the backpropagation algorithm
to calculate the gradient of the cost (or error) function. It necessitates the calculation
of the derivative, which requires computing the partial derivative of each weight. As
a result, the gradient is obtained, which enables the adjustment of weights through a
technique known as gradient descent - an optimization algorithm that is used to find the
19
2. Related Work
weights that minimize the error function. This is done by altering the weight vector in
the direction that produces the steepest descent along the error surface, as depicted in
figure 2.6. The process continues until the optimum is reached.
Figure 2.6: The error function for a unit with weights w0 and w1. The depicted arrow
indicates the steepest descent along the error surface towards the minimum error, as
described and depicted in [Mit97].
MLPs are prone to overfitting when the network is too large relative to the size of the
training data. In order to prevent overfitting, regularization techniques such as weight
decay (that is, penalizing large weights and encouraging the model to learn simpler
functions) or dropout (modifying the network by dropping random neurons) can be
utilized.
2.3.2 Deep Learning
Convolutional Neural Networks
Convolutional neural networks (CNNs) are a class of deep neural networks for processing
data that has a known grid-like topology [GBC16]. They are commonly used in computer
vision, but have also been successfully applied to natural language processing. A crucial
difference between densely or fully connected layers and convolution layers is that the
former learn global patterns, while the latter learn local patterns such as edges, textures
and other features, as described in [Cho17]. For instance, for a MNIST image representing
a digit - where MNIST is a handwritten digit image dataset that is widely used as a
benchmark in computer vision and machine learning research, see [LC10] - fully connected
layers learn the whole image and its pixels, but do not inspect parts of it or search for
patterns that comprise parts of the image, but only take the whole picture and its pixels
20
2.3. Machine Learning
into account. Convolution layers, on the other hand, learn local patterns. In case of
images, patterns such as edges an textures are found in small two-dimensional windows
which are part of the input data, as illustrated in figure 2.7. The image is broken into
smaller subparts or modules, which are then - if present - detected as straight or curved
lines, horizontal or vertical lines, edges, textures, etc. In a following layer, those subparts
are then recognized as objects such as ears, eyes or a nose. In a final layer, the objects
are combined into a high-level concept, such as cat, as illustrated in figure 2.8.
Figure 2.7: Local features, such as edges and textures, can be extracted from images.
These are contained in a small window of the original image, as illustrated in [Cho17].
Figure 2.8: In the world of visual perception, a spatial hierarchy of modules exists, which
includes elementary lines or textures that combine into basic objects like eyes or ears. It
eventually culminates in higher-level concepts such as cat, as illustrated in [Cho17].
Hence, in order to be able to detect objects at different granularity levels, a CNN
consists of multiple layers, where each layer is dedicated to objects at a particular level
21
2. Related Work
of granularity (e.g., lines and edges at the lowest granularity level, eyes and ears at
the middle granularity level, and finally concepts such as cat on the highest granularity
level). Basic layers of a CNN are the convolutional, pooling, and the fully connected layer.
Convolutional layers apply filters to the input image in order to extract relevant features.
The filter slides over the image, performing a dot product at each location and producing
an output value that represents the degree of similarity between the filter and the input
image. More formally, it is a mathematical function - called convolution - where an input
and a filter are taken as inputs. The output of the function is a new image, in which
patterns from the input are highlighted (or filtered) in the output, as depicted in figure
2.9 and described in [Raf22]. Hence, in a CNN with two convolutional layers, the first
convolutional layer would extract patterns such as edges and textures, while the second
layer would learn larger patterns made of features of the first layer.
Figure 2.9: Convolution operation, visualised on a simple example from [Raf22].
Pooling layers, on the other hand, are used to (i) reduce the spatial dimensionality of
feature maps (these refer to the output of a convolutional filter applied to an input
image), and to (ii) achieve the so-called translation invariance, which means to allow the
CNN to recognize patterns in images regardless of their location (e.g. to tolerate shifting
up or down, left or right). This is done by aggregating or downsampling the information
in local neighborhoods of feature maps, for instance by using the so-called max pooling,
which is - similar to convolution - a mathematical operation. In max pooling, a window
is moved over the input, and the maximum value within each window is selected to
represent that region, instead of a dot product as is the case with convolution. Hence,
after each convolutional layer, a pooling layer is applied in order to reduce dimensions
and to achieve translation-invariance. Therefore, as previosuly indicated, convolutional
and pooling layers are stacked, such that spatial hierarchies of patterns can be learned.
For instance, corners and edges are learned in the first stack, objects such as eyes or
ears in the second, and so on (see figure 2.8). Finally, the fully connected layer uses
the output of the previous layers to classify the input image into one of the pre-defined
classes, which define the concepts. CNNs have been widely used in various applications,
such as object recognition, image segmentation, and natural language processing. In
natural language processing, they have exhibited competitive performance to Recurrent
Neutral Networks (RNNs), usually at a cheaper computational cost, as illustrated by
Chollet in [Cho17].
22
2.3. Machine Learning
Recurrent Neural Networks
Recurrent Neural Networks, or RNNs, are a class of neural networks designed to process
sequential data, such as time-series or natural language text [GBC16]. They handle
sequences by consecutively processing the contained elements, and retain a state which
includes information related to the input that has been observed so far [Cho17]. Thus,
RNNs are a type of network with an internal loop, as depicted in figure 2.10. When
independent sequences are processed by a recurrent neural network, the state of the
network is reset between each sequence. Hence, a single sequence is still considered a
single data point which is fed into the network. However, the data point is no longer
processed in a single step. Instead, the network iterates over the elements contained in
the sequence, as described in [Cho17].
Figure 2.10: A recurrent neural network - a network with a loop. Taken from [Cho17].
Hence, the activation function is applied to the sum of (i) the input data, and (ii) the
internal state. The transformation is parameterized by the matrices W and U , and an
additional bias vector b, as described in [Cho17]. The function is very similar to the
activation operated by a densely connected layer in a feedforward network, such as the
multilayer perceptron. In the following listing, a simple pseudocode for a basic RNN is
given. The output of the activation function is used as the internal state for the next
iteration, as described in [Cho17]. In figure 2.11, this principle is depicted in an unrolled
RNN iteration over time.
Listing 2.1: Pseudocode for a basic RNN [Cho17]
s tate_t = 0
for input_t in input_sequence :
output_t = a c t i v a t i o n ( dot (W, input_t ) + dot (U, state_t ) + b)
state_t = output_t
# end f o r
In summary, each neuron has a memory so-to-say, which enables it to remember previous
inputs and carry this information forward in time. The output of each neuron is fed back
into the network as input to the next time step, creating a feedback loop that enables
the network to retain information about previous inputs. This feedback loop allows
the RNN to operate on input sequences of varying lengths and to generate predictions
23
2. Related Work
Figure 2.11: A simple RNN, unrolled over time. Taken from [Cho17].
that are conditioned on the entire input sequence, rather than just the most recent
input. In general, RNNs can suffer from vanishing gradients, which means that gradients
can become extremely small during backpropagation, making it difficult to learn long-
term dependencies, e.g. in long text. Furthermore, RNNs can also be computationally
expensive when dealing with large datasets, due to all the necessary iterations.
Long Short Term Memory
Long Short-Term Memory (LSTM) is a type of RNN, which has been proposed by
Schmidhuber and Hochreiter in 1997 to address the vanishing gradient problem commonly
encountered in conventional RNNs, see [HS97]. In RNNs, the gradient of the loss function
can almost vanish with respect to parameters in the earlier layers of the network, as it is
backpropagated through time steps which can make it difficult for the network to learn
long-term dependencies. Hence, in order to prevent vanishing of gradients in earlier layers,
an additional data flow that carries information across timesteps has been introduced.
In [Cho17], Chollet writes that LSTMs are intendend to allow past information to be
reinjected at a later time, thus fighting the vanishing-gradient problem. Hence, LSTMs are
able to selectively store and retrieve information over extended periods of time, and are
thus particularly well suited for modeling long-term dependencies in sequential data. The
key innovation of the LSTM architecture is the introduction of memory cells and gating
mechanisms, which allows the network to selectively read, write, and forget information
over time. LSTMs have been shown to outperform standard RNNs in sequence modeling
tasks such as speech recognition, machine translation, and handwriting recognition.
Transformers
Transformers are the most recent, state-of-the-art neural network architecture which has
been introduced in 2017 by Vaswani et al. in [VSP+17], and have since then become
the dominant architecture for natural language processing tasks. They are based on the
24
2.3. Machine Learning
Figure 2.12: Internal structure of an LSTM, as depicted in [Cho17].
so-called self-attention mechanism, which allows the network to weigh different words
according to their relative importance. In natural language, not all information or all
the communicated words are equally important. Hence, it is natural to require that the
model prioritizes or pays more attention to certain features, and less to others [Cho21].
Figure 2.13: Input features (pixels) in the original representation and the corresponding
attention scores. The higher (brighter) the attention score, the more important the
corresponding pixel in the image. Example was taken from [Cho21].
25
2. Related Work
Max pooling in convolutional neural networks serves a similar purpose, since it examines
a set of features and chooses only one feature to retain (the maximum). In transformers,
this principle is used to make features context-aware, which means to provide a vector
representation for a word depending on the other words surrounding it [Cho21]. In
figure 2.14, relevancy scores between the vector for the word "station" and all the other
word vectors are computed. For this purpose, the dot product is used. The result of
this calculation are so-called attention scores. The softmax function is applied to the
attention-scores to obtain a probability distribution. The reason for this is, that this
tells us the importance of each particular item. Then, the sum of all word vectors in
the sentence is computed. The resulting vector is the new representation for the word
"station".
Figure 2.14: Attention scores are computed between the word “station” and every other
word in the sequence. These are then used to weight a sum of word vectors that becomes
the new “station” vector. Example was taken from [Cho21].
In transformers, Multi-Head Attention is an addition to the self-attention mechanism. It
allows the model to focus on different aspects of the input sequence, thus capturing more
diverse patterns. Since the softmax function of one head tends to focus on one aspect of
similarity - and thus learns one linguistic phenomenon only - having multiple heads allows
to focus on several similarity aspects at once. This is very similar to having multiple filters
in convolutional neural networks. For instance, in CNNs one filter can be responsible for
detecting faces, while another one might be in charge of finding wheels of cars in images
[TvWW22]. In Multi-Head Attention, the input sequence is first transformed into three
different vectors, namely Query, Key, and Value. These vectors are then used to compute
26
2.3. Machine Learning
a set of attention scores, which measure the relevance between each Query and Key pair.
The attention scores are used to weight the corresponding Value vectors and produce a
weighted sum of them, which represents the output of the Multi-Head Attention layer.
When a fully-connected feed-forward layer is added to the Multi-Head Attention Layer,
this enables the attention layer to learn something, hence those two components together
compose the so-called transformer encoder - one of two critical parts that make up the
transformer architecture [Cho21]. The encoder is a very generic module which can be
used for text classification (since it contains a dense layer). In [DCLT18], the authors
have proposed BERT (Bidirectional Encoder Representations from Transformers), a
language model based on the transformer architecture, containing only the encoder. In
this work, we have based our chainletter classifier on a fine-tuned BERT language model,
thus utilizing the encoder component of the transformer architecture.
However, the original transformer architecture consists of two parts - an encoder which
processes the input sequence, and the decoder which generates a transformed version of
the input sequence, see figure 2.15. The decoder consists - in addition to the attention
and feedforward neural networks which are also contained in the encoder - a third
component called the encoder-decoder attention. This attention mechanism allows the
decoder to attend to the encoder’s representation of the input sequence, it draws relevant
information from the encodings and helps to generate the output tokens for the input
sequence. Furthermore, the decoder is trained in a teacher-forcing manner, where at each
time step the ground truth from a previous time step is used as input [Raf22].
Figure 2.15: The transformer model architecture [VSP+17]
27
2. Related Work
Unlike recurrent neural networks, transformers can process entire sequences in parallel,
which is making them faster and more efficient then RNNs. Furthermore, transformers
currently represent the state-of-the-art method for natural language processing tasks.
Sequence-to-sequence models
Sequence-to-sequence models accept a sequence as input, and convert or transform it into
a different sequence. This is the core task of many natural language processing problems,
such as machine translation, text summarization, question answering or text generation
[Cho21]. The principle behind such models is depicted in figure 2.16. An encoder model
turns the source or input sequence into an intermediate representation. Afterwards, the
decoder predicts the next token in the target sequence by looking at (i) previous tokens
and (ii) the encoded input sequence.
Figure 2.16: During training, the source sequence is processed by the encoder and then
sent to the decoder. The decoder looks at the target sequence so far, and predicts the
offset by one step in the future. During inference, one target token is generated at a time
and fed back into the decoder. Taken from [Cho21]
Possible approaches to tackle the sequence-to-sequence problem are recurrent neural
networks and transformers. With transformers, both decoder only (e.g. the GPT model
family, see [RNSS18]), and encoder-decoder model approaches (such as the T5 model
family, described in [RSR+19]), have been proposed and are prominent representatives
of so-called large language models (LLMs), transformer based neural networks trained on
vast amounts of data, e.g. a dataset such as The Pile [GBB+21].
28
2.4. Transfer learning with pretrained language models
2.4 Transfer learning with pretrained language models
Transfer learning is a method where a model is used that has already been trained on
a related task. The underlying idea is that a model which has been pre-trained on a
large and diverse dataset, can capture a lot of general knowledge about the domain.
This knowledge can then be leveraged for downstream tasks, even if those particular
tasks have different characteristics and requirements. Hence, the success of transfer
learning through pre-trained models largely depends on the ability to learn robust, widely
applicable features. In order to use a pre-trained model for a specific NLP task, the
model needs to be fine-tuned with labelled data. In such a scenario, the pre-trained
weights of the model are frozen and only the final layers of the model are trained on
the new task. This approach, also called fine-tuning, allows the model to adapt to the
new task with much less labelled data. However, this approach did not yield significant
success for natural language processing tasks only until recently. With the introduction
of transformer-based algorithms, the situation has changed drastically. A family of
transformer-based architectures - particularly the encoder-only models - has significantly
improved the quality of results that have been achieved on text problems [Raf22]. Models
such as BERT [DCLT18] or DistilBERT [SDCW19] have been pre-trained on large
datasets, made public and are available for downstream tasks, e.g. from platforms like
HuggingFace, see [TvWW22]. Furthermore, also decoder only (e.g. the GPT or GPT-2
models, for details see [RNSS18] and [RWC+19]) as well as encoder-decoder models (e.g.
the T5 family, described in [RSR+19]) have been proposed as a further variant of transfer
learning. For instance, the T5 model (T5 is a short form for Text-to-Text Transfer
Transformer) has been trained on a variety of natural language processing tasks, such
as translation, summarization and question answering. In figure 2.17, this principle is
illustrated. The utilized "text-to-text" treatment of every problem, allows for directly
applying the same model, objective, training procedure, and decoding process to every
task which has been considered [RSR+19].
Figure 2.17: The basic idea is to treat every text processing problem as a “text-to-text”
problem. This allows for reusing the model, loss function and hyperparameters across a
diverse set of tasks. Taken from [RSR+19]
29
2. Related Work
Due to huge popularity of GPT-3 [BMR+20] and ChatGPT [Ope22], further development
of large language models (LLMs) has even more accelerated during last months. LLMs
seem to have emerged as the most recent approach to transfer learning, and models such
as FLAN-T5 (which is in fact a further fine-tuned T5 model, see [CHL+22]) or LLaMA
[TLI+23] are some of the most recent developments in the field. Unfortunately, further
fine tuning of such models demands high computational power and is, as of this writing,
not suitable for consumer or low-end hardware.
30
CHAPTER 3
System Design
In this section, first we will have a look at the functionality required to replace the
currently used and operated chatbot. Afterwards, we will analyze the dataset which will
be utilized for intent recognition and chain letter detection. As well, we will present the
design of the system and justify why certain decisions have been made. Finally, we will
demonstrate and explain the methods which we have selected to implement the system.
3.1 Requirements
In our particular use case, an already operational German language chatbot which is
used to answer children’s questions regarding chain letters, shall be replaced by an
implementation based on open source technology. In a typical scenario, a child would
ask a question and send the message for which it would want to know whether the letter
needs to be taken seriously, or whether it can be safely ignored. The chatbot’s task here
is to recognise the intent of the question, and to detect whether the received message
represents a chain letter. Depending on the detected intent and the type of the message,
an appropriate answer shall be generated and sent back to the conversation partner.
Furthermore, as it occasionally happens that audio messages and text documents are sent
by children as well, the chatbot shall also be able to recognize the file type, understand
several audio and document formats, as well as to be able to extract text from the
received audio files. It is not necessary to keep track of the conversation, but to increase
the quality of intent recognition and classification accuracy, as well as to reduce - when
appropriate - the number of sent responses in order to lower the costs incurred. Common
errors and problems with the chatbot currently in operation are misclassifications, often
caused by typing and spelling errors, too long response times, as well as multiple - instead
of single - responses. The requirements stated by the customer are listed in table 3.1.
31
3. System Design
Nr. Requirement
R1 Improve the quality of chain letter detection
R2 Improve the quality of intent recognition
R3 Generate an appropriate answer for the received message
R4 Recognize the format of a received file
R5 Extract text from a received file
R6 Lower the costs by reducing the number of sent responses
Table 3.1: Requirements.
3.2 Dataset
The dataset consists of 27 classes which can be partitioned into groups, whereas each
group serves a specific purpose within a dialogue, such as
• a group of 13 categories representing chain letters, that is messages in which a
sender requests the receiver to forward the received message to several other persons,
either just for fun or because otherwise supposedly something bad would happen,
• a group of six categories which represent questions that are frequently asked by
children, for instance children would ask what a chain letter is, what the project
Saferinternet.at is all about, what a chatbot is, and similar,
• four categories representing messages containing statements about something, for
instance stating that a URL should be clicked, a message where the chatbot is
insulted, a hint that the chatbot makes mistakes,
• three categories represent parts of a dialogue, such as greeting the chatbot, saying
thanks and saying goodbye to the chatbot,
• and finally the category None, for which a specific intent is unknown and probably
not relevant in general.
As can be seen in table 3.2, the classes are very unequally distributed, which thus might
lead to very unequal classification performance between the classes. Furthermore, three
classes contain only a single instance. Thus, it is obvious that additional data instances
must be generated. Furthermore, emojis are a significant part of the messages contained
in the dataset, as can be seen in figure 3.1. Therefore, it is important to handle them
appropriately, since otherwise the information they contain might be lost and the message
interpreted inadequately.
3.3 Design Decisions
In order to improve the classification performance, and thus to fulfill the requirements R1
and R2 stated in table 3.1, we have chosen to evaluate several classification approaches.
32
3.3. Design Decisions
Instances
Category Original
Dataset
Augmented
Dataset
Experimental
Dataset
none 3487 3487 3487
chainletter-general 762 762 4572
chainletter-spiel 514 514 3084
chainletter-socialbarometer 268 268 268
greeting 247 247 247
express-thanks 213 213 213
statement-openurl 212 212 212
chainletter-scary 179 179 179
question-conversation 140 840 840
chainletter-whatsapp 139 834 834
chainletter-love 132 792 792
question-advicenecessary 89 534 534
chainletter-fakewarnung 79 474 474
chainletter-event 59 354 354
chainletter-prank 50 300 300
chainletter-poesiealbum 40 240 240
chainletter-ageunsuitable 35 210 210
question-bot 32 192 192
question-wasisteinkettenbrief 27 162 162
bye 17 102 102
statement-insult 10 60 60
question-saferinternet 6 36 36
chainletter-wiederbetaetigung 5 30 30
statement-dumachstfehler 2 12 12
chainletter-hatespeech 1 6 6
Delete-Request 1 6 6
question-wasistrataufdraht 1 6 6
Total 6747 11072 17452
Table 3.2: The original, augmented and experimental datasets, with the enlisted counts
for each category.
Initially, we have considered utilizing approaches which would support and complement
even a relatively weak classifier - e.g. using a majority voting ensemble or applying
an embedding based semantic search, in addition to predictions made by the classifier.
However, as is documented in sections 5.2 and 5.3, the transfer learning approach based
on the transformer architecture has delivered very good results - with an F-Score around
90%, averaged over all 27 classes. Therefore, utilizing complementary approaches was not
necessary. However, with messages belonging to classes none, question-conversation and
question-advicenecessary, the children sometimes seem to demand an open conversation.
33
3. System Design
3.3.1 Text-to-Text Transformer
In order to handle requests which seem to require the ability to have an open talk, we
have experimented with multilingual mT5 [XCR+21] and German variants of GPT-2
[RWC+19] models. Unfortunately, due to the size of our dataset, we have not been able
to train those models to meaningfully steer a conversation towards the subject of chain
letters. Very often, seemingly random words and word sequences were generated by
those models, which were not related to the subject of chain letters at all. Furthermore,
generating huge amounts of additional synthetic data to train those models did not seem
very promising either, since the model would then learn structures imposed by the data
generator, which would not represent a widely varied language spectrum contained in
data gathered from real conversations. Hence, we have chosen to label messages belonging
to certain categories and thus to refine overall intent understanding.
Label Description Labelled
Rows
kettenbrief-zugeschickt The message indicates that a chain letter has
been received.
67
hilfe-was-tun The child does not know what to do with the
chain letter.
106
nervig The chain letter is annoying. 6
angst The message indicates that the child is afraid. 71
wahr-oder-nicht The child does not know whether the chain
letter is actually true or not.
21
bitte-um-anwtort The bot did not respond, thus the child asks
for an answer.
7
bin-roboter It makes sense to tell the child that it does not
communicate with a human.
106
link-nicht-oeffnen The message indicates that a link has been
received, which should not be clicked on.
4
rat-auf-draht The child either asks about the Rat auf Draht
telephone helpline, or it would make sense to
tell the child that it should ask for help there.
12
warum-kettenbriefe The child wonders and asks why chain letters
are sent.
2
wer-hat-erstellt The message indicates that the child wants to
know who has created the chain letter.
4
erkenne-kettenbriefe It makes sense to tell the child that the con-
versation should focus on chain letters, since
the chatbot is not knowledgeable in any other
subject area.
146
Table 3.3: Labels, the corresponding contexts when they are set, and the counts of rows
labelled with that specific label.
34
3.3. Design Decisions
3.3.2 Labels
We have labelled adequate messages belonging to the the none category, as well as all
messages belonging to the question-conversation and question-advicenecessary categories.
In table 3.3, the utilized labels are depicted. Each label signals a specific context, refines
the meaning of messages where no specific intent can be derived from the assigned
category, and thus should help to generate an adequate answer. For instance, in case
the message "Soll ich das wirklich tun?" is received (the message belongs to the category
question-conversation), the label hilfe-was-tun is recognized. By doing so, the refined
intent of the question can be taken into account, and a better suitable answer can
be generated. Another example would be the message "Wie geht es Ihnen?", with its
corresponding recognized label bin-roboter. The label signals that the child should be
told that it is not communicating with a human being. Similar to the previous case, this
message also belongs to the category question-conversation.
3.3.3 Answer Generation
For all categories except the already mentioned categories none, question-conversation
and question-advicenecessary, a specific list of answers is determined. The list of answers
consists of equally well suited responses which are adequate for that specific category.
Here, only the actual message needs to be considered, any previously received message can
safely be ignored. Thus, a mapping from the category to a list of possible, predetermined
answers, is perfectly adequate. Hence, in case a message is received, a random answer
is selected from the corresponding list. Merely for the three remaining categories, a
different approach is taken. Here, instead of the category, each label is mapped to a list
of corresponding, predetermined answers. Thus, in case a message triggers several labels,
the generated answer consists of a concatenation of responses for each label, whereas for
each label the corresponding response is chosen randomly from the list of responses for
that particular label.
3.3.4 Conversation Memory
Although initially assumed otherwise, it has turned out that keeping track of conversations,
e.g. by maintaining a conversation state in order to improve the quality of the dialogue, is
not required. As can be seen in chapter 5.4, it would be preferable to reduce the amount
of outgoing responses to lower the costs, since these are proportional to the number of
sent messages. Hence, this can easily be achieved by simply ignoring irrelevant messages.
3.3.5 Attachements
Messages containing chain letters sometimes come in audio or document form, thus our
chatbot needs to be able to handle that form of communication as well. In order to do
so, we extract text from the attached file - be it a word document or an audio file. After
extraction, the retrieved text is processed in the same way as messages received via a
web browser or via Whatsapp.
35
3. System Design
3.4 Selected Methods
In order to achieve best possible intent recognition results, we have decided to test a
combination of a variety of algorithms and popular text embedding techniques. Hence,
we have chosen following machine learning algorithms for the classification task:
• The Naive Bayes algorithm has been selected due to its simplicity and speed to
gather quick results, and to serve as a baseline while performing experiments with
different embeddings, emoji handling methods and different dataset sizes.
• The K Nearest Neighbors algorithm has been selected - similar to reasons given
for the Naive Bayes algorithm - mainly because of its suitability to perform
quick explorative testing, as well as to have a second algorithm to form a more
comprehensive baseline.
• The Decision Tree algorithm has delivered pretty good results in our initial tests
already. For this reason, we have ’kept’ it and included it in our further evaluations.
• The Multilayer Perceptron can be seen as a "bridge" between the classic machine
learning and the deep learning worlds. For this reason we have included this
algorithm in our evaluation as well.
• Since the Random Forest algorithm is based on the Decision Tree algorithm, but
prevents overfitting since it is an ensemble learning method, we have decided to
evaluate the algorithm.
• Convolutional Neural Networks are a type of deep-learning architecture mainly
used in computer vision, which however also can be applied to process text. Thus,
out of curiosity, we have chosen to evaluate the approach.
• Long Short Term Memory is a popular deep-learning architecture, aimed at process-
ing sequential data such as natural language. Since it has been the most popular
approach before the Transformer architecture has emerged, we have included it in
our evaluation.
• In recent years, the Transformer architecture has gained in popularity due to its
superior performance in natural language processing, and has thus become the
defacto standard. Therefore, we have included it in our evaluation.
Several methods can be used to represent text, and in combination with different
algorithms these can exhibit various behaviours which can lead to varying results. Since
there is no best algorithm and no best text representation method suited for all the
possible problems, we have chosen following embedding methods to be combined and
evaluated with the algorithms and architectures enumerated above.
36
3.4. Selected Methods
• With the TF-IDF embedding, a document in the corpus is represented as a single,
high-dimensional sparse vector. This text representation method is very much
used in combination with classical machine learning algorithms, since it does not
represent texts as sequences. Each document is represented as a single sparse
vector, and therefore the embedding is not suitable for algorithms or architectures
which require vector sequences, such as the Long Short Term Memory architecture.
Furthermore, the TF-IDF embedding is also not suitable - at least not without a
modification or some additional processing - to be combined with the Convolutional
Neural Network architecture. Networks of this type rely on dense representations,
and the filters - which are a characteristic basic building block of this architectural
type - are used to identify combinations of words and the proximities between
those words. All this information can not be captured or represented in a single
sparse vector that is representing the whole document. Hence, in convolutional
neural networks local patterns are learned, but the TF-IDF embedding does not
capture local characteristics such as the order or proximity of words, since it only
considers frequencies. For this reason, we have evaluated only classic machine
learning algorithms in combination with the TF-IDF embedding.
• The Global Vectors for Word Representation (GloVe) algorithm is used to create
dense representations of words, based on the aggregated global word-word co-
occurence statistics from a corpus. We have used a freely available, 300-dimensional
embedding for the German language, which we have evaluated in combination with
all the classical machine learning algorithms, as well as with the Convolutional
Neural Network and Long Short Term Memory deep-learning architectures. For
the classical machine learning algorithms, we have used the mean average value
of all the words contained in the message in order to calculate the embedding
vector for the message. Averaging word vectors was not necessary for deep learning
architectures - vector sequences have been used instead.
• The Word2Vec algorithm is, analogous to GloVe, used to create dense vector
representations of words, that is embeddings, based on training a neural network to
predict the context of each word in a corpus. The embeddings are extracted from
the weight matrix of the hidden layer in the network. As with GloVe, we have used
a freely available 300-dimensional embedding for the German language, which we
have evaluated in combination with the classical machine learning algorithms as
well as with deep-learning architectures. Similar to GloVe, also for Word2Vec, in
the case of classical machine learning algorithms we have used the mean average
value of all the words contained in the message to represent the message vector.
For deep learning architectures, averaging word vectors was not necessary since
vector sequences have been used instead.
• BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained
transformer-based neural network architecture, trained on large amounts of unan-
notated text data, which generates contextualized word embeddings. This means
37
3. System Design
that the generated embeddings are based on the context in which the word appears,
and not only on the word itself. This is in contrast to traditional word embedding
methods such as Word2Vec and GloVe, where a fixed embedding is generated for
each word, thus combining all the different senses for that word in one single vector.
For instance, this means that the BERT embedding for the word pool differs for
the sentences "There is a pool table in the room" and "They are swimming in the
pool". In both Word2Vec and GloVe embeddings, words vectors for the word pool
would be the same in both contexts. In BERT, those two word vectors differ from
each other since their both embedding contexts are different. We have used the
BERT embedding in combination with the transformer architecture only, and did
not make any experiments where BERT embeddings would be combined with other
types of machine learning algorithms or deep-learning architectures.
3.4.1 Data preprocessing
The first step in our preprocessing procedure is the correction of encoding errors contained
in the original dataset (German umlauts were encoded incorrectly due to copy and paste
from different sources, e.g. web browsers). Afterwards, the exact procedure depends
on the utilized word embedding, as for each word embedding a different preprocessing
technique is applied. For the TF-IDF embedding, we have removed punctuations, numbers,
stopwords, and have transformed the remaining words into their corresponding word
stems. For the Word2Vec and GloVe embeddings, we have removed punctuations and
numbers, we did not remove the stopwords as these are part of the word context, and
we have transformed the remaining words into their corresponding word lemmata. The
reason for utilizing lemmatization is that word lemmata do form real words which are
likely to be contained in the embedding dictionary, whereas word stems do not necessarily
form real words, and are thus much less likely to be part of the embedding dictionary.
Hence, by utilizing lemmatization, more information remains preserved during data
preprocessing. For the BERT embedding, we did not do any preprocessing steps at all.
Embedding Preprocessing Steps
TF-IDF remove punctuations
remove numbers
remove stopwords
stemming
GloVe remove punctuations
remove numbers
lemmatization
Word2Vec remove punctuations
remove numbers
lemmatization
BERT -
Table 3.4: Preprocessing steps for each embedding method.
38
3.4. Selected Methods
3.4.2 Emoji Handling
We have experimented with three methods for handling emojis, that is (i) removing
them, (ii) treating them as symbols, and (iii) converting them to text. We have combined
each of the methods with the algorithms and embeddings described in previous sections.
Removing emojis means removing information in all combinations. When treating emojis
as symbols, there are different effects with each of the delineated word embeddings. For
the TF-IDF embedding, treating emojis as symbols means that these are considered as
additional words of the vocabulary, and weighed according to the TF-IDF schema as
all the other words. For deep-learning architectures, we need to treat emojis differently
since these are not part of the embedding vocabularies. Therefore, we have created a
so-called emoji vector, where for each emoji we count the number of its occurrances,
normalize the vector and concatenate it with the word embedding vector. Together, the
two vectors comprise the overall input which is then fed to the neural net. In the case of
BERT embeddings - as is the case with GloVe and Word2Vec embeddings - emojis are
not contained in the vocabulary, they are treated as unknown symbols and thus largely
ignored. In the third variant, we convert emojis to text. As an example, a smiley is
converted to the German expression "lächelndes Gesicht". In that particukar case, for all
embeddings emojis are treated as normal text, and no additional vectors are needed.
Figure 3.1: An example of a message contained in the dataset, belonging to the category
chainletter-spiel. As can be seen, a variety of emojis is used to express the meaning of
the message.
3.4.3 Training, Validation and Test Datasets
For training and evaluating the model, we use different chunks of data such that no
instance which is used for evaluation has been seen by the classifier previously during
training. Thus, we have split the dataset into a training dataset and a test datasets, with
the training dataset accounting for 80 percent of the total data, and the test dataset
accounting for the remaining 20 percent. For model training, we use a 90 percent split
of the training dataset. For model validation, we use the remaining 10 percent of the
training dataset, which we call the validation dataset. Thus, the weights of the neural
network are calculated and determined with 90 percent of the training dataset (in other
words, with 72 percent of the complete dataset). After training, the fitted model - which
is the model with the adjusted weights - is validated on the validation dataset. That
is, the model is used to predict responses for observations contained in the validation
dataset, which contains 8 percent of total data. The purpose for doing so is to fine tune
the model’s hyperparameters (e.g. the learning rate). Finally, the test set is used to
provide an objective evaluation of the model’s performance. All results given below in
chapter 5 have been obtained during an evaluation conducted on the test dataset.
39
3. System Design
3.4.4 Data Augmentation
As is depicted in table 3.2, the dataset instances are distributed very unevenly over the
27 classes. Thus, this can have an overall bad influence on the learning and subsequent
classification performance, since for certain classes there are simply too less instances to
learn from. For this reason we have generated additional, synthetic data instances which
we have derived from the original ones. Data generation is done by performing following
operations on the dataset:
• substitute suitable words by words with similar meaning by using the already
described BERT embedding,
• insert additional, suitable words into the message using the aforementioned BERT
embedding,
• swap words randomly,
• delete words randomly, and
• split one word to two words randomly.
The described operations have been applied to all classes with a population size less
than 150 instances, thus adding additional five instances (where one additional instance
is created by one described operation) for each data point contained in the sparsely
populated class. Hence, for each class with a reasonably small population size, in the
augmented dataset the amount of instances increases sixfold, as can be seen in table 3.2, in
the column "Augmented Dataset". In the experimental dataset, classes chainletter-general
and chainletter-spiel are augmented as well.
40
CHAPTER 4
Implementation
In this section, we describe and explain the operating principles of our chatbot, as well
as how the models required for intent recognition - that is, the models which we use for
classification and labelling - have been trained and evaluated.
4.1 System Architecture
The chatbot has been implemented as a web application using the Flask Python framework.
In figure 4.1, components comprising the chatbot are depicted. In a first step, the user
provides a message, either in text form or via an uploaded file. In case a file is provided,
the media type is recognized first to determine how to proceed - e.g. either to extract
text from a document, or to convert an audio file into text form. This is done in the
so-called media recognition module. For recognizing the file type, we use magic and
filetype Python libraries. To retrieve text from various document formats, we use Tika.
To convert speech to text, we use the SpeechRecognition Python library, from which we
use the integrated Google-Speech recognizer.
After the user input is available in pure text form, in the preprocessing module emojis
are converted to text. For this purpose, we use the emoji Python package. By applying
its demojize function, each emoji is converted into its textual representation, which is
then used as part of text since this very much helps to better recognize the intended
meaning of a message.
Subsequently, in the intent recognition module, the message is categorized and labelled.
In case of none, question-conversation and question-advice-necessary categories, labels are
considered since the intent of the message can not be identified clearly. The corresponding
answer is thus either determined by the category of the message, or constructed from
sentences that correspond to each label in the answer generation module.
41
4. Implementation
Figure 4.1: Modules comprising the chatbot.
4.2 Components
4.2.1 Intent Recognition
For intent recognition, we use two classifiers. One is used to recognize the class of the
message, while the other is used to label the message. Labelling makes sense when the
class is indicating a demand for an open conversation, such as when the recognized class
is none, question-conversation or question-advicenecessary. The recognized class and
labels serve as input parameters to generate the response, which is then sent back to the
user. This principle is visualised in figure 4.2.
4.2.2 Answer Generation
In case the class is sufficient to recognize the intent, a simple table is enough to hold
the information needed to generate an adequate answer. In that particular case, the
class can be mapped to a possible answer, which is suitable and sufficient for that
corresponding class. This is true for all the chainletter classes, since an adequate an-
swer very clearly addresses the content contained in the message, and instructs the
child not to take the letter seriously as well as not to send it further. Hence, for that
particular case we have used a simple Python dictionary which holds the class name
as a key, and a list of suitable answers as the corresponding value. Thus, when gen-
erating a response, a random answer is retrieved from the list of possible answers for
42
4.2. Components
Figure 4.2: Intent recognition.
that particular class. By doing so, we ensure that a certain variability of answers is
guaranteed. In case labels are needed, a similar principle is used. However, the answer
here is generated by concatenating random answers for all recognized labels for that
particular message. Hence, in case two labels are recognized for a message, for each label
a random answer is retrieved from the dictionary and concatenated to form the final
response, which is then sent back to the user. The principle is depicted below in figure 4.3.
Figure 4.3: Answer generation.
43
4. Implementation
4.3 Training
In order to find suitable classifier models needed for intent recognition, we have decided
to utilize and compare several different machine learning approaches in combination with
different text representation methods, different emoji handling methods and different
dataset sizes, as already described in section 3.4. In table 4.1, combinations of algorithms
and text embeddings that we have used are summarized. For each given algorithm-
embedding combination, we perform the training for each emoji handling method as well,
that is for each given combination we (i) remove emojis, (ii) treat them as symbols and
(iii) convert them to text, as already set out in section 3.4.2. Furthermore, we train our
models on the original and augmented datasets, as described in chapters 3.2 and 3.4.4.
Thus, we obtain 120 combinations in total which result in 120 models that need to be
evaluated and compared, in order to find the combination which is best suited for intent
recognition, applied to our particular problem.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes x x x -
K Nearest Neighbors x x x -
Decision Tree x x x -
Multilayer Perceptron x x x -
Random Forests x x x -
Deep Learning Convolutional Neural Nets - x x -
Long Short Term Memory - x x -
Transformers - - - x
Table 4.1: Feasible combinations of algorithms and embeddings.
In figure 4.4, training flows are depicted for (i) classic machine learning algorithms, (ii)
deep neural network architectures such as Convolutional Neural Networks (CNN) and
Long Short Term Memory (LSTM), as well as for the recent (iii) transformer architecture.
In all three variants, text preprocessing is a necessary step since stemming is required for
classic algorithms, lemmatization is needed for deep learning techniques, and for all three
method types emojis either need to be removed or converted to text, as described in
section 3.4.1. Hence, the preprocessing step is contained in all variants. In the subsequent
step retreieve emoji embeddings - which is not used for the transformer architecture -
a binary vector is retrieved. The number of dimensions here is equal to the number of
all possible emojis, 3521 to be precise. In case an emoji is contained in the message,
its corresponding value in the vector is set to one. Otherwise, it is set to zero. Since
such a binary vector is large and sparse, we furthermore apply principal components
analysis (PCA) to (a) reduce the dimensionality of the vector to 20, and (b) to make the
vector "more dense". Thus, the emoji contained in a message are finally represented by a
20-dimensional, normalized and dense embedding.
44
4.3. Training
The resulting emoji vector is then used differently for each individual algorithm-embedding
combination. In case TF-IDF embedding is used together with classical machine
learning algorithms, the emoji vector is not considered at all since all emojis are
already contained in the TF-IDF embedding. If emojis need to be removed or represented
as text, this has been done in the preprocessing step and is thus already implicitly
considered. In case word2vec or GloVe embeddings are utilized together with the classic
algorithms, in a dedicated step - after retrieving the embedding matrix by iterating
over the lemmas of each individual word contained in the message - the mean vector
representing the message embedding is calculated and concatenated with the emoji
embedding. The dataset is then split, and the training starts. In a final step, the created
machine learning model is evaluated on the test dataset, and the results written on disk.
For deep learning architectures such as CNNs and LSTMs, only word2vec and GloVe
embeddings are utilized. Here, as is the case with classic algorithms, the embedding
matrix is constructed by iterating over the lemmas of each individual word contained
in the message. However, no average vector embedding of the message is calculated for
the CNN or LSTM architectures, since in a sequence manner each word is fed into the
network one by one, and not the whole message at once. The network is created layer by
layer in the create neural network architecture step, and both word and emoji embeddings
serve as input for the network. The data is split into training and test sets, and the
training starts. Finally, the created model is evaluated on the test set.
For transformers, the preprocessing step contains removing or converting emojis only
(hence, there is no stemming or lemmatization). Afterwards, data is split into train and
test sets, and tokenized using a dedicated BertTokenizer, sine this is how data has been
tokenized during pre-training and thus needs to be tokenized in the same manner as well.
This is done in the tokenize dataset step. After that, the pre-trained model is loaded and
additional data structures - called dataloaders - are created, since these are required for
the utilized PyTorch Python library. Then the training starts. After the training has
been finished, in a final step the resulting language model is evaluated on the test set,
and corresponding classification reports and confusion matrices are written to disk.
45
4. Implementation
Figure 4.4: Training workflows for the classic machine learning algorithms, for older deep
neural network architectures such as Convolutional Neural Networks (CNN) and Long
Short Term Memory (LSTM), and for the recent transformer architecture.
46
CHAPTER 5
Evaluation
In this section, the evaluation of the implemented chatbot is described. First, the quanti-
tative methodology used to evaluate the trained language models is explained. In order
to evaluate the results, confusion matrices as well as common classification performance
measures such as precision, recall and F-Score, are utilized. Then the intent classifica-
tion and labelling models are evaluated. Here, several machine learning algorithms in
combination with different text representation methods (that is, word embeddings) are
compared. Afterwards, a qualitative user survey in form of a questionnaire, together
with the resulting user feedback, is given. In it, on the one hand, expectations from the
system, and on the other hand, actual results from the user’s point of view are presented.
Finally, in a conclusive section, the evaluation results are summarized.
5.1 Methodology
A very important question which arises when predictive classification is used, is how
good is the model? In order to answer this question, we need to distinguish between four
possible outcomes, which our classification model can produce as a result, see [Ski17]:
• True Positives (TP): an item has been correctly classified as belonging to a particular
class.
• True Negatives (TN): an item has been correctly classified as not belonging to a
particular class.
• False Positives (FP): an item has been falsely classified as belonging to a particular
class.
• False Negatives (FN): an item has been falsely classified as not belonging to a
particular class.
47
5. Evaluation
Resulting from the counts detailed above, following useful evaluation statistics can be
computed, as described in [Ski17]:
• Accuracy is the ratio of correct predictions over the total amount of predictions,
and thus indicates how accurate the classifier is:
accuracy = TP + TN
TP + TN + FN + FP
• Precision is the ratio of correct positive predictions over the total amount of
positive predictions, thus indicating how often the classifier is correct when a
certain class is predicted:
precision = TP
TP + FP
• Recall is the ratio of correct positive predictions over the total amount of positive
instances, thus indicating how often a certain class is correctly predicted amongst
all the positive instances:
recall = TP
TP + FN
• F-Score is a combination of precision and recall, thus representing the harmonic
mean of those two measures:
F = 2 · precision · recall
precision + recall
All given measures are well suited for both, binary and multi-class classification problems.
Furthermore, the F-Score is a commonly used measure to assess the quality of classifica-
tion results since it equally considers and weighs both quality indicators, precision and
recall. A further useful evaluation tool is the confusion matrix M, where M[X,Y] reports
the amount of instances of class X which have been labelled as class Y, as delineated in
[Ski17]. Furthermore, normalized confusion matrices can be useful as well, where ratios
are used instead of absolute numbers. An example is depicted in figure 5.1.
As a further means of evaluation, we have compiled a questionnaire that we made available
to the client, which has been filled out after an evaluation procedure. We have decided to
do so in order to obtain a direct and valuable user feedback, as questionnaires have proven
to be useful since they offer an efficient and inexpensive means of gathering qualitative
information and insight on how the software is perceived by the user, in addition to the
quantitative, statistical methods described above.
48
5.1. Methodology
(a) Total counts of the
entire population
(b) Counts normalized over
the entire matrix
(c) Counts normalized over
each column
(d) Counts normalized over
each row
Figure 5.1: An example of a confusion matrix, depicted in four different forms: (a) counts
of each category are reported, (b) ratios of counts divided by the entire population are
reported, (c) ratios of counts divided by the sum of each column are reported (precision),
and finally (d) ratios of counts divided by the sum of each row are reported (recall).
49
5. Evaluation
5.2 Evaluation of Algorithms for Intent Classification
In this section, we present classification results which we have obtained by running several
machine learning algorithms in combination with different text representations (TF-IDF,
GloVe, Word2Vec, BERT), different methods of handling emojis (that is, (i) removing
emojis, (ii) representing them as symbols and (iii) converting them to text), as well
as evaluating the performance on the original and on an augmented (in other words,
extended) dataset. For the quantitative evaluation, we have used 20 percent of the data
which we have not used during the training process. As can be seen throughout tables 5.1
to 5.6 - where performance results for several machine learning algorithms applied to the
original dataset are depicted - champions of each algorithm class deliver approximately
equally good results. However, with more data - in our particular case we have generated
additional instances of sparsely populated classes and added them to the dataset - the
transfer learning approach performed best, as can be seen throughout tables 5.7 and 5.12.
Furthermore, this combination of a state-of-the-art approach with an enhanced dataset has
helped to reduce class intermingling, as the effect can be observed when confusion matrices
for the random forest algorithm - applied on the original dataset - and for the transformer
architecture - applied on the augmented dataset - are compared. Normalized matrices
(which are normalized over the columns, thus showing precision) for both algorithms are
given in figures 5.3, 5.5, 5.9 and 5.11. One can see that, in the case of random forest
classifier, all depicted classes have been confused with the none class. Further confusions
have often occurred with the chainletter-general class. By combining transfer leaning
with the augmented dataset, confusions for both classes have been reduced. Further
"sources of confusion" are messages with relatively "openly" formulated content - such
as those belonging to the question-conversation and question-advicenecessary categories.
By comparing figures 5.3 and 5.9, it can be seen that these confusions have been reduced
as well. Furthermore, for a more complete comparison, it makes sense to consider the
non-weighted F-Score as well (that is, the macro average F1-Score). Tables 5.13 and
5.16 indicate that less populated classes are recognized better when the transformer
architecture is combined with the augmented dataset, since the approach convinces with
a significantly higher macro F-Score (e.g., 0.59 vs. 0.82).
We have assumed that further generation of additional synthetic data would improve
the results even more. Therefore, we have generated additional instances for classes
chainletter-general and chainletter-spiel, based on the ones already available in the
augmented dataset, as described in section 3.4.4 and shown in table 3.2, in the column
Experimental Dataset. The reason for picking these two chain letter classes is that
those are the ones with a lot of "mixes". Thus, we hoped that adding more data
would sharpen the classifier’s discriminatory capacity, and thus help to reduce the
intermingling. However, confusion matrices of results obtained by applying the transfer
learning approach on the experimental dataset, as depicted in figures 5.12 and 5.13
and in the corresponding classification report shown in table 5.18, indicate that it is
not the number of instances that matters anymore at this stage of model development,
but the variety and structure of available data as well as the class distribution balance.
50
5.2. Evaluation of Algorithms for Intent Classification
We have made some more experiments with additionally generated instances for the
chainletter-poesiealbum, question-conversation and question-advicenecessary categories,
which also "mix" a lot. However, the obtained results have been very similar without any
noteworthy improvement, thus indicating that with any further synthetic data generation,
we are just overfitting. Hence, we still believe that additional data would help to improve
the results, but it needs to vary more in order to be genuinely representative for the
particular categories, and not just - as is the case when data is generated from a limited
amount of original instances - a huge number of similar texts with a small variability.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.49 0.42 0.41 -
K Nearest Neighbors 0.75 0.62 0.62 -
Decision Tree 0.83 0.72 0.72 -
Multilayer Perceptron 0.83 0.78 0.70 -
Random Forests 0.85 0.82 0.82 -
Deep Learning Convolutional Neural Nets - 0.85 0.84 -
Long Short Term Memory - 0.85 0.80 -
Transformers - - - 0.87
Table 5.1: Accuracy results when emojis are removed from the original dataset.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.52 0.46 0.45 -
K Nearest Neighbors 0.74 0.63 0.64 -
Decision Tree 0.82 0.72 0.72 -
Multilayer Perceptron 0.83 0.78 0.69 -
Random Forests 0.84 0.81 0.81 -
Deep Learning Convolutional Neural Nets - 0.85 0.84 -
Long Short Term Memory - 0.84 0.79 -
Transformers - - - 0.87
Table 5.2: F-Score results when emojis are removed from the original dataset.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.49 0.31 0.32 -
K Nearest Neighbors 0.70 0.59 0.57 -
Decision Tree 0.78 0.66 0.67 -
Multilayer Perceptron 0.81 0.78 0.72 -
Random Forests 0.82 0.79 0.80 -
Deep Learning Convolutional Neural Nets - 0.83 0.85 -
Long Short Term Memory - 0.81 0.78 -
Transformers - - - 0.87
Table 5.3: Accuracy results when emojis are represented as symbols in the original
dataset.
51
5. Evaluation
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.53 0.35 0.37 -
K Nearest Neighbors 0.69 0.62 0.59 -
Decision Tree 0.78 0.66 0.67 -
Multilayer Perceptron 0.80 0.78 0.72 -
Random Forests 0.80 0.77 0.78 -
Deep Learning Convolutional Neural Nets - 0.83 0.84 -
Long Short Term Memory - 0.81 0.78 -
Transformers - - - 0.86
Table 5.4: F-Score results when emojis are represented as symbols in the original
dataset.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.48 0.38 0.37 -
K Nearest Neighbors 0.72 0.60 0.61 -
Decision Tree 0.81 0.63 0.66 -
Multilayer Perceptron 0.81 0.78 0.72 -
Random Forests 0.83 0.78 0.78 -
Deep Learning Convolutional Neural Nets - 0.85 0.84 -
Long Short Term Memory - 0.79 0.78 -
Transformers - - - 0.87
Table 5.5: Accuracy results when emojis are represented as text in the original
dataset.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.51 0.42 0.41 -
K Nearest Neighbors 0.70 0.61 0.62 -
Decision Tree 0.80 0.64 0.66 -
Multilayer Perceptron 0.81 0.78 0.70 -
Random Forests 0.82 0.76 0.77 -
Deep Learning Convolutional Neural Nets - 0.85 0.83 -
Long Short Term Memory - 0.78 0.77 -
Transformers - - - 0.87
Table 5.6: F-Score results when emojis are represented as text in the original dataset.
5.3 Evaluation of Algorithms for Intent Labelling
In this section, we present our results for the multi labelling problem, which we have
introduced in order to be able to respond better to messages belonging to so-called
"open" categories, such as question-conversation and question-advicenecessary, as justified
in sections 3.3 and 3.3.2. The results were obtained by applying the transfer learning
approach, since it has turned out to be the best performing one, and are shown in figures
5.14, 5.15 and in table 5.19. Although the overall labelling performance is quite good -
52
5.3. Evaluation of Algorithms for Intent Labelling
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.57 0.40 0.42 -
K Nearest Neighbors 0.77 0.68 0.68 -
Decision Tree 0.77 0.63 0.64 -
Multilayer Perceptron 0.81 0.77 0.67 -
Random Forests 0.83 0.78 0.78 -
Deep Learning Convolutional Neural Nets - 0.85 0.86 -
Long Short Term Memory - 0.84 0.84 -
Transformers - - - 0.90
Table 5.7: Accuracy results when emojis are removed from the augmented dataset.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.58 0.40 0.42 -
K Nearest Neighbors 0.76 0.68 0.67 -
Decision Tree 0.76 0.63 0.64 -
Multilayer Perceptron 0.81 0.77 0.66 -
Random Forests 0.82 0.77 0.77 -
Deep Learning Convolutional Neural Nets - 0.84 0.85 -
Long Short Term Memory - 0.84 0.83 -
Transformers - - - 0.90
Table 5.8: F-Score results when emojis are removed from the augmented dataset.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.60 0.29 0.32 -
K Nearest Neighbors 0.75 0.68 0.64 -
Decision Tree 0.76 0.61 0.60 -
Multilayer Perceptron 0.80 0.78 0.69 -
Random Forests 0.82 0.77 0.77 -
Deep Learning Convolutional Neural Nets - 0.86 0.86 -
Long Short Term Memory - 0.84 0.83 -
Transformers - - - 0.88
Table 5.9: Accuracy results when emojis are represented as symbols in the augmented
dataset.
the average weighted F-Score value is above 0.95 - the performance for the particular
label link-nicht-oeffnen is nevertheless quite remarkable. It is a notable outlier for which
the individual F-Score is around 0.67. Unfortunately, only four rows of training data have
been labelled with that particular label, as is delineated in table 3.3. Hence, additional
data would help to increase the performance for that particular label.
53
5. Evaluation
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.61 0.30 0.32 -
K Nearest Neighbors 0.74 0.67 0.63 -
Decision Tree 0.76 0.61 0.60 -
Multilayer Perceptron 0.80 0.78 0.68 -
Random Forests 0.81 0.76 0.76 -
Deep Learning Convolutional Neural Nets - 0.85 0.86 -
Long Short Term Memory - 0.84 0.82 -
Transformers - - - 0.88
Table 5.10: F-Score results when emojis are represented as symbols in the augmented
dataset.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.59 0.39 0.40 -
K Nearest Neighbors 0.75 0.66 0.67 -
Decision Tree 0.77 0.60 0.61 -
Multilayer Perceptron 0.81 0.82 0.74 -
Random Forests 0.83 0.77 0.76 -
Deep Learning Convolutional Neural Nets - 0.85 0.85 -
Long Short Term Memory - 0.85 0.83 -
Transformers - - - 0.89
Table 5.11: Accuracy results when emojis are represented as text in the augmented
dataset.
TF-IDF GloVe Word2Vec BERT
Classic ML Naive Bayes 0.60 0.38 0.40 -
K Nearest Neighbors 0.74 0.65 0.66 -
Decision Tree 0.77 0.60 0.60 -
Multilayer Perceptron 0.81 0.82 0.74 -
Random Forests 0.83 0.76 0.75 -
Deep Learning Convolutional Neural Nets - 0.85 0.85 -
Long Short Term Memory - 0.85 0.83 -
Transformers - - - 0.89
Table 5.12: F-Score results when emojis are represented as text in the augmented
dataset.
5.4 User Evaluation
In order to figure out how users perceive the overall system performance and the
classification performance in particular, we have compiled a questionnaire and sent
it to three experts in the organization which has commissioned the chatbot, and
who know (i) how children communicate with the system, (ii) what kind of response is
suitable, (iii) the weaknesses of the current system, as well as (iv) where an improvement
54
5.4. User Evaluation
Precision Recall F-Score Instances
Per class bye 0.00 0.00 0.00 5
chainletter-ageunsuitable 1.00 0.44 0.62 9
chainletter-event 0.88 0.54 0.67 13
chainletter-fakewarnung 1.00 0.62 0.77 16
chainletter-general 0.64 0.74 0.69 155
chainletter-love 0.90 0.55 0.68 33
chainletter-poesiealbum 0.83 0.50 0.62 10
chainletter-prank 0.93 0.82 0.87 17
chainletter-scary 0.90 0.63 0.75 30
chainletter-socialbarometer 0.69 0.55 0.61 49
chainletter-spiel 0.74 0.60 0.67 86
chainletter-whatsapp 0.96 0.75 0.84 36
chainletter-wiederbetaetigung 0.00 0.00 0.00 1
delete-request 0.00 0.00 0.00 1
express-thanks 0.91 0.78 0.84 41
greeting 0.81 0.60 0.69 50
none 0.84 0.98 0.90 692
question-advicenecessary 0.89 0.47 0.62 17
question-bot 0.80 0.50 0.62 8
question-conversation 0.85 0.41 0.55 27
question-saferinternet 0.00 0.00 0.80 1
question-wasisteinkettenbrief 1.00 0.20 0.33 5
statement-dumachstfehler 1.00 1.00 1.00 1
statement-insult 1.00 0.33 0.50 3
statement-openurl 0.93 0.86 0.89 44
Per dataset Macro Avg 0.74 0.52 0.59 1350
Weighted Avg 0.82 0.82 0.80 1350
Accuracy - - 0.82 1350
Table 5.13: Classification report obtained for the random forest classifier, applied on
the original data set with the TF-IDF embedding as the utilized text representation
method, with emojis represented as symbols.
is desirable. Thus, the questionnaire on the one hand aims to figure out what the users
expect, and on the other hand it targets at the performance perceived during the test
run. In sections 5.4.1 and 5.4.2, the compiled questionnaire, together with the received
user feedback, is provided. We have received a common position, a common standpoint
of the expert group so to speak, which has been provided as a single document.
5.4.1 Expectations
In this section, we ask four questions to find out what the user expects from the system.
After each question, the user feedback is provided. After each user feedback, we delineate
our point of view on how the system does, or does not, meet the user expectations.
Question 1: Was wird von einem Kettenbrief-Chatbot für Kinder im Hinblick auf
55
5. Evaluation
Precision Recall F-Score Instances
Per class bye 1.00 0.40 0.57 5
chainletter-ageunsuitable 1.00 0.78 0.88 9
chainletter-event 0.91 0.77 0.83 13
chainletter-fakewarnung 1.00 0.75 0.86 16
chainletter-general 0.72 0.81 0.76 155
chainletter-love 0.91 0.64 0.75 33
chainletter-poesiealbum 0.86 0.60 0.71 10
chainletter-prank 0.93 0.82 0.87 17
chainletter-scary 0.96 0.80 0.87 30
chainletter-socialbarometer 0.67 0.61 0.64 49
chainletter-spiel 0.79 0.64 0.71 86
chainletter-whatsapp 0.96 0.75 0.84 36
chainletter-wiederbetaetigung 0.00 0.00 0.00 1
delete-request 0.00 0.00 0.00 1
express-thanks 0.93 0.66 0.77 41
greeting 0.80 0.64 0.71 50
none 0.86 0.98 0.92 692
question-advicenecessary 1.00 0.82 0.90 17
question-bot 0.80 0.50 0.62 8
question-conversation 0.82 0.52 0.64 27
question-saferinternet 0.00 0.00 0.80 1
question-wasisteinkettenbrief 1.00 0.40 0.57 5
statement-dumachstfehler 1.00 1.00 1.00 1
statement-insult 0.00 0.00 0.00 3
statement-openurl 0.93 0.91 0.92 44
Per dataset Macro Avg 0.75 0.59 0.65 1350
Weighted Avg 0.85 0.85 0.84 1350
Accuracy - - 0.85 1350
Table 5.14: Classification report obtained for the random forest classifier, applied on
the original data set with the TF-IDF embedding as the utilized text representation
method, with emojis being removed.
die Dynamik der Gesprächsführung erwartet, mit anderen Worten, wie spezifisch oder
generisch soll ein Chatbot im Gesprächsverlauf sein (z.B. bei der Frage “Was machst du
eigentlich” – wie würde hier ein erwarteter Gesprächsverlauf aussehen)?
Answer 1: Unsere Idee für den Chatbot rührt daher, dass wir gemerkt haben, wir sind
zu den Zeiten an denen Kinder schreiben nicht verfügbar, es geht um Zeiten außerhalb
der Schulzeit – ab 17 Uhr ungefähr und bis in die Abendstunden hinein. Auch haben wir
gemerkt, dass sie instantly eine Antwort erwarten, und auch das zu leisten, bräuchte einen
enormen personellen Aufwand. Die Antwort muss rasch kommen, weil sie in dem Moment
sonst die Weiterleitung an ihre Bekannten und Freundinnen machen und der Schaden
nicht minimiert wird ansonsten. Gleichzeitig sind die Ansprüche an die Gesprächsführung
möglichst transparent zu sein und kein falsches Gefühl zu vermitteln, dass das Gegenüber
56
5.4. User Evaluation
Precision Recall F-Score Instances
Per class greeting 0.86 0.76 0.81 50
statement-insult 0.85 0.79 0.81 14
chainletter-hatespeech 1.00 1.00 1.00 2
chainletter-scary 0.77 0.79 0.78 34
express-thanks 0.92 0.81 0.86 43
question-advicenecessary 0.87 0.84 0.85 122
chainletter-fakewarnung 0.97 0.99 0.98 84
chainletter-love 0.97 0.97 0.97 156
chainletter-general 0.82 0.67 0.74 153
chainletter-socialbarometer 0.57 0.53 0.55 57
chainletter-whatsapp 0.93 0.97 0.95 168
chainletter-wiederbetaetigung 1.00 1.00 1.00 8
chainletter-spiel 0.70 0.79 0.74 103
question-saferinternet 1.00 0.43 0.60 7
bye 0.57 0.57 0.57 14
chainletter-ageunsuitable 1.00 0.93 0.97 45
chainletter-event 0.89 0.94 0.91 78
question-wasisteinkettenbrief 0.75 0.77 0.76 31
chainletter-prank 0.85 0.93 0.89 55
question-bot 0.69 0.63 0.66 35
none 0.90 0.90 0.90 705
chainletter-poesiealbum 0.90 0.92 0.91 39
statement-openurl 0.85 0.90 0.87 50
delete-request 1.00 0.50 0.67 2
question-conversation 0.73 0.84 0.78 158
question-wasistrataufdraht 0.00 0.00 0.00 1
statement-dumachstfehler 0.67 1.00 0.80 2
Per dataset Macro Avg 0.82 0.78 0.79 2216
Weighted Avg 0.86 0.86 0.86 2216
Accuracy - - 0.86 2216
Table 5.15: Classification report obtained for the convolutional neural network
classifier applied on the augmented data set with the Word2Vec embedding as text
representation method, and emojis represented as symbols.
– selbst wenn menschlich – eine adäquate Beratung wäre. Das ist nicht im Rahmen
der Kompetenz von Saferinternet.at und sollte auch nicht automatisiert erfolgen, da
die Risiken sehr groß sein könnten bei Fehlern. Kettenbriefe als Thema eignen sich
aber gut für einen Chatbot aus unserer Sicht, weil wir eben über Automatisierung gern
sicherstellen würden, dass Kinder sehr schnell eine Entwarnung erhalten, gleichzeitig
auch die Chance sehen, dass die Gespräche kontrolliert kurz sind. Das Ziel ist, dass
Kettenbriefe ihrer Wirkmächtigkeit genommen werden, das sie also inhaltlich für nicht
bedeutsam erklärt werden, für nicht wahr bzw. andersweitig richtig eingeordnet werden
und, dass mögliche damit zusammenhängende Sorgen erkannt werden, damit auf andere
Stellen verwiesen wird. Klar herüber kommen soll, dass Kettenbriefe nicht verschickt
57
5. Evaluation
Precision Recall F-Score Instances
Per class chainletter-ageunsuitable 1.00 1.00 1.00 38
statement-openurl 0.95 0.88 0.91 41
chainletter-socialbarometer 0.68 0.67 0.67 48
chainletter-scary 0.87 0.87 0.87 45
chainletter-poesiealbum 0.96 0.96 0.96 51
chainletter-whatsapp 0.97 0.96 0.97 170
chainletter-spiel 0.74 0.68 0.71 100
chainletter-fakewarnung 0.96 1.00 0.98 91
none 0.88 0.94 0.91 695
chainletter-love 0.97 0.98 0.97 151
express-thanks 0.84 0.80 0.82 51
chainletter-prank 0.85 0.95 0.90 61
chainletter-general 0.70 0.71 0.71 157
chainletter-event 0.93 0.86 0.89 73
question-advicenecessary 0.92 0.87 0.89 106
question-bot 0.91 0.78 0.84 40
question-conversation 0.91 0.90 0.91 138
question-wasisteinkettenbrief 1.00 0.83 0.91 35
statement-insult 0.75 0.38 0.50 16
bye 0.72 0.72 0.72 25
chainletter-hatespeech 1.00 1.00 1.00 2
chainletter-wiederbetaetigung 1.00 1.00 1.00 6
greeting 0.91 0.71 0.80 59
question-saferinternet 0.60 0.86 0.71 7
delete-request 1.00 1.00 1.00 1
statement-dumachstfehler 0.50 0.75 0.60 4
question-wasistrataufdraht 0.00 0.00 0.00 4
Per dataset Macro Avg 0.83 0.82 0.82 2215
Weighted Avg 0.88 0.88 0.88 2215
Accuracy - - 0.88 2215
Table 5.16: Classification report obtained for the transformer based classifier, applied
on the augmented data set with the BERT embedding as text representation method,
and emojis represented as symbols.
werden sollten an andere und auch, dass keine zweifelhaften Links oder Programme
wie teilweise darin empfohlen heruntergeladen werden sollten. Es soll jedenfalls aber
nicht zu langen Gesprächen kommen, weil wir davon ausgehen müssen, dass Kinder das
Gegenüber sonst im Sinne eines Freundes, Freundin wahrnehmen und damit einhergehend
die Verantwortung groß wäre und der Erwartung nicht entsprochen werden könnte.
From our point of view, the implementation is very much in alignment with the stated user
expectations. This is on one hand due to the relatively high classification performance,
and on the other hand due to short and focused dialogues which ensure that the chatbot
is not considered a friend, but rather an auxiliary and useful tool.
58
5.4. User Evaluation
Precision Recall F-Score Instances
Per class chainletter-ageunsuitable 1.00 0.97 0.99 38
statement-openurl 0.95 0.95 0.95 41
chainletter-socialbarometer 0.74 0.67 0.70 48
chainletter-scary 0.84 0.84 0.84 45
chainletter-poesiealbum 0.93 0.98 0.95 51
chainletter-whatsapp 0.98 0.99 0.99 170
chainletter-spiel 0.70 0.74 0.72 100
chainletter-fakewarnung 0.94 1.00 0.97 91
none 0.92 0.94 0.93 695
chainletter-love 0.97 0.97 0.97 151
express-thanks 0.88 0.86 0.87 51
chainletter-prank 0.84 0.97 0.90 61
chainletter-general 0.76 0.72 0.74 157
chainletter-event 0.93 0.86 0.89 73
question-advicenecessary 0.90 0.87 0.88 106
question-bot 0.90 0.88 0.89 40
question-conversation 0.86 0.91 0.89 138
question-wasisteinkettenbrief 0.97 0.89 0.93 35
statement-insult 0.75 0.38 0.50 16
bye 0.95 0.80 0.87 25
chainletter-hatespeech 1.00 1.00 1.00 2
chainletter-wiederbetaetigung 1.00 1.00 1.00 6
greeting 0.93 0.92 0.92 59
question-saferinternet 0.71 0.71 0.71 7
delete-request 1.00 1.00 1.00 1
statement-dumachstfehler 1.00 0.75 0.86 4
question-wasistrataufdraht 0.00 0.00 0.00 4
Per dataset Macro Avg 0.87 0.84 0.85 2215
Weighted Avg 0.90 0.90 0.90 2215
Accuracy - - 0.90 2215
Table 5.17: Classification report obtained for the transformer based classifier, applied
on the augmented data set with the BERT embedding as text representation method,
and emojis being removed.
Question 2: Inwiefern sollte der Chatbot über ein Gedächntis verfügen, also über die
Inhalte vorheriger Nachrichten Bescheid wissen bzw. für die Generierung von Antworten
Teile des Gesprächsverlaufs mitberücksichtigen?
Answer 2: Oft leiten die Kinder einen Kettenbrief direkt weiter, und sie hängen daran
auch noch weitere Fragen oder Sätze an. Das wäre zum Beispiel die Angabe „Hier ist ein
Kettenbrief“ oder „Stimmt der“ oder auch eine Begrüßung. In diesem Zusammenhang
wäre es für uns wichtig, dass der Chatbot nicht jede eingehende Nachricht beantwortet,
sondern das erkennt – eventuell über die zeitliche Nähe? Ansonsten ist keine weitere
Sache notwendig, wenn jemand zwei Tage wieder etwas schickt ist kein inhaltlicher
59
5. Evaluation
Precision Recall F-Score Instances
Per class chainletter-ageunsuitable 1.00 0.98 0.99 42
statement-openurl 0.88 0.90 0.89 42
chainletter-socialbarometer 0.76 0.52 0.62 54
chainletter-scary 0.91 0.86 0.89 36
chainletter-poesiealbum 0.79 0.88 0.83 48
chainletter-whatsapp 0.98 0.99 0.99 167
chainletter-spiel 0.95 0.96 0.95 617
chainletter-fakewarnung 0.99 0.95 0.97 95
none 0.92 0.91 0.91 698
chainletter-love 0.97 0.97 0.97 159
express-thanks 0.88 0.98 0.92 43
chainletter-prank 0.98 0.92 0.95 60
chainletter-general 0.95 0.96 0.95 915
chainletter-event 0.92 0.99 0.95 71
question-advicenecessary 0.89 0.91 0.90 107
question-bot 0.80 0.87 0.84 38
question-conversation 0.85 0.88 0.87 168
question-wasisteinkettenbrief 0.83 0.75 0.79 32
statement-insult 0.91 0.83 0.87 12
bye 0.84 0.80 0.82 20
chainletter-hatespeech 1.00 1.00 1.00 1
chainletter-wiederbetaetigung 1.00 1.00 1.00 6
greeting 0.87 0.82 0.84 49
question-saferinternet 1.00 0.43 0.60 7
delete-request 0.00 0.00 0.00 1
statement-dumachstfehler 0.67 1.00 0.80 2
question-wasistrataufdraht 1.00 1.00 1.00 1
Per dataset Macro Avg 0.84 0.83 0.83 3491
Weighted Avg 0.93 0.93 0.93 3491
Accuracy - - 0.93 3491
Table 5.18: Classification report obtained for the transformer based classifier, ap-
plied on the experimental data set with the BERT embedding as the utilized text
representation method, and emojis represented as text.
Zusammenhang da, abseits von „ich schicke dir hier noch zwei Kettenbriefe“ und wir
brauchen insofern keine weiteren Verlaufsverfolgungen.
We did not implement any conversation memory or state machine, as the available data
did not indicate that this was necessary. For each message, its intent can very well be
recognized by classifying and labelling it. Thus, in case it is not considered relevant that
an answer is given - which means the message either belongs to the class none or no
significant label has been recognized - a reply does not necessarily have to be sent.
Question 3: Welche Teile der Konversation sollten besonders gut erkannt werden (z.B.
die Begrüßung, unterschiedliche Kategorien der Kettenbriefe, Angst, etc...)?
60
5.4. User Evaluation
Precision Recall F-Score Instances
Per class kettenbrief-zugeschickt 1.00 1.00 1.00 58
hilfe-was-tun 1.00 1.00 1.00 91
nervig 1.00 0.86 0.92 7
angst 0.97 1.00 0.99 71
wahr-oder-nicht 1.00 1.00 1.00 20
bitte-um-anwtort 1.00 1.00 1.00 7
bin-roboter 1.00 1.00 1.00 80
link-nicht-oeffnen 0.50 1.00 0.67 2
rat-auf-draht 1.00 1.00 1.00 12
warum-kettenbriefe 1.00 1.00 1.00 2
wer-hat-erstellt 1.00 1.00 1.00 4
erkenne-kettenbriefe 0.99 1.00 1.00 103
Per dataset Micro Avg 0.99 1.00 0.99 457
Macro Avg 0.96 0.99 0.96 457
Weighted Avg 0.99 1.00 0.99 457
Sample Avg 0.95 0.95 0.95 457
Table 5.19: Multilabel classification report obtained for the transformer based classi-
fier applied on the multilabel data set with the BERT embedding as text representation
method, and emojis represented as text.
Answer 3: Die Kettenbriefe sollten als Kettenbriefe erkannt werden – und sofern es sich
um heikle Themen handelt, also teilweise Kettenbriefe mit illegalen Inhalten oder auch
Angst, sollte dies erkannt werden, weil es aus unserer Sicht die größten Risiken sind.
Wenn eine Kategorie falsch erkannt wird, ist dies für uns zu bemerken, aber manchmal
für das Gegenüber nicht direkt.
In the classification report obtained for the transfer learning based classifier, the perfor-
mance results given in tables 5.16 and 5.17 show that relevant classes such as chainletter-
prank, chainletter-hatespeech, chainletter-scary and chainletter-wiederbetaetigung are
recognized very well, as on average the corresponding F-Scores rank around or well above
the 0.85 margin.
Question 4: Wie sollte mit großen Dateien umgegangen werden (z.B. hochgeladene
Dokumente und Audiodateien)
Answer 4: Zwar gibt es Audio-Kettenbriefe und auch Videokettenbriefe, aber gleichzeitig
ist es nicht prioritär für uns sie zu analysieren und wir erkennen an, dass es auch Risiken
bergen könnte wie hohe Kosten, insofern sofern es einfach möglich ist, könnte es Sinn
machen – aber eventuell wäre es auch hilfreich genug, wenn wir sehen würden, was an
Inhalten hineinkommt.
We do not analyse or check the size of transmitted files, as - from the beginning - this
functionality was not given high priority.
61
5. Evaluation
5.4.2 Perceived Performance
In this section, we ask seven questions in order to find out how the system interaction
has been perceived. After each question, the obtained user feedback is provided. After
the user feedback, we delineate our point of view.
Question 5: Wie gut ist die Qualität der Intent-Erkennung, d.h. wie gut wird die
Kategorie der Nachricht erkannt (bspw. ist es eine Begrüßung, welche Art des Kettenbriefs,
Frage um Rat, etc.)?
Answer 5: Wir finden beim Test, dass die Qualität sehr gut ist – vor allem wenn es um
Erkennen von Kettenbrief und ihrer Kategorien geht bzw. auch um Gefühle.
This confirms our results obtained during the quantitative evaluation, delineated in
sections 5.2 and 5.3.
Question 6: Gibt es spezifische Intent-Klassen, bei welchen die Qualität der Erkennung
nicht ausreichend ist bzw. wo diese erhöht werden sollte (da ja im erhaltenen Datensatz
die Kategorien nicht alle gleich vertreten waren und die korrekte Erkennung mancher
Klassen evtl. von höherer Bedeutung ist)?
Answer 6: Die Fehler, die wir sehen konnten, betrafen vor allem die None-Inhalte –
also Fragen, die hineinkommen und eigentlich keine Antwort von uns verlangen. Kinder
halten sich meist nicht an das Skript und das ist eine große Schwierigkeit, weil wir wohl
zum einem großen Teil Anfragen erhalten werden, die nicht beantwortet werden sollten.
Although intermingling of classes has been improved by using the transformer deep
learning architecture - as can be seen by comparing confusion matrices given throughout
figures 5.3 and 5.9 - confusing the none and chainletter-general, chainletter-poesiealbum
or chainletter-spiel categories still occurs. Furthermore, the none category is sometimes
confused with categories without a specific intent, such as the question-conversation
category. In order to reduce the amount of confusions, additional data for these classes is
necessary. As delineated in sections 3.3 and 3.3.2, we have introduced labels to overcome
the drawbacks which come with classes that do not serve a specific intent and thus are
not clearly distinguishable from each other, since they represent a wide area of possibly
overlapping messages. However, also for the multilabel classification problem, a larger
dataset would also be beneficial.
Question 7: Wie gut ist die Qualität generierter Antworten für die beiden Kategorien
“none” und “question-conversation” ?
Answer 7: Es wäre super, wenn wir schaffen könnten einen besseren Umgang mit der
None Kategorie zu finden, also dass das besser gelingt. Der Test der Konversationsanfra-
gen wirkte gut, da waren wir zufrieden und eher, dass man überlegen könnte wie quasi
die zusammengehörigen Fragen nicht jeweils beantwortet werden einzeln, das ist auch
finanziell für uns wichtig, weil wir pro ausgehende Nachricht zahlen.
As we understand it, the overall quality of conversation - also for the none and question-
conversation categories - is pretty good. However, in order to reduce costs, it would make
62
5.5. Summary
sense not to answer every message that is not classified as relevant. Thus, a possible
improvement would be not to send a reply in case no label or no relevant label has been
detected, e.g. when labels such as kettenbrief-zugeschickt or wer-hat-erstellt are detected.
Question 8: Wie gut sind die eingeführten Labels geeignet, um auf Nachrichten der
Kategorien “none” und “question-conversation” adäquat antworten zu können? Werden
womöglich zusätzliche Labels benötigt? Sämtliche Labels sind in der Datei “datensatz-
labels.xlsx” angeführt.
Answer 8: Dazu fehlt uns glaube ich die Kompetenz für die Bewertung.
Question 9: Wie gut bzw. akzeptabel ist die Response-Zeit auf Textnachrichten (d.h.
jene Zeit welche zwischen dem Absetzen der Anfrage bis zum Erhalt der Antwort vergeht)
Answer 9: Die Response Zeit ist sehr gut.
Question 10: Wie gut bzw. akzeptabel ist die Response-Zeit auf hochgeladene Dokumente
(d.h. jene Zeit welche zwischen dem Hochladen des Dokuments bis zum Erhalt der Antwort
vergeht) ?
Answer 10: Auch sehr gut.
Question 11: Wie gut bzw. akzeptabel ist die Response-Zeit auf hochgeladene Audio-
dateien (d.h. jene Zeit welche zwischen dem Hochladen der Audiodatei bis zum Erhalt
der Antwort vergeht) ?
Answer 11: Auch sehr gut!
5.5 Summary
In summary, it can be said that applying the transformer-based transfer learning approach
- in combination with the augmented dataset - has led to very good classification results.
This goes hand in hand with outcomes obtained from the user feedback, which was also
very positive. However, improving the recognition performance of messages with relatively
"openly" formulated content - such as those belonging to question-conversation, question-
advicenecessary and none categories - would further enhance the quality of the system. In
particular, reducing class confusion would be beneficial, as it happens that comprehensive
chain letter categories such as chainletter-general, as well as aforementioned messages with
"openly" formulated content, are sometimes confused with each other. Our experiments
furthermore show that additional genuine and representative data is necessary to solve
this problem, as just simply adding more and more synthetically generated data - after
an initial, meaningful and successful augmenting step which is necessary to balance the
underrepresented classes - does not lead to any further improvement in performance.
Moreover, it can also be said that older approaches - such as random forests or long short
term memory - while producing slightly worse results, have performed quite good as well.
More precisely, classic machine learning algorithms performed best in combination with
the TF-IDF embedding method. More recent embedding methods, such as Word2Vec
63
5. Evaluation
or GloVe, exhibited a poorer performance in combination with these algorithms, and
were best suited for deep learning approaches, such as CNN and LSTM neural network
architectures. A further interesting observation - indeed a surprising one - was that it did
not really matter how emojis were handled. What did matter was increasing the dataset
size, since generating additional instances for underrepresented classes clearly helped to
recognize and distinguish classes with higher precision. Hence, it was a combination of the
utilized learning approach and the dataset size which has exhibited a noticeable influence
on the classification performance, and has contributed around 0.08 to the overall F-Score,
as is illustrated in tables 5.20 and 5.21. This is clearly visible when the performance of
classic ML algorithms applied on the original dataset is compared with the performance
of the transfer learning approach, applied on the augmented dataset (e.g. 0.80 vs. 0.88).
Classic ML Deep Learning Transfer learning
Emojis removed from text 0.84 0.85 0.87
Emojis as symbols 0.80 0.84 0.86
Emojis as text 0.82 0.85 0.87
Table 5.20: Summarized F-Scores in comparison, applied on the original dataset.
Classic ML Deep Learning Transfer learning
Emojis removed from text 0.82 0.85 0.90
Emojis as symbols 0.81 0.86 0.88
Emojis as text 0.83 0.85 0.89
Table 5.21: Summarized F-Scores in comparison, applied on the augmented dataset.
Tables 5.20 and 5.21 summarize results given throughout tables 5.1 and 5.12. Classic
ML denotes the best performing classic machine learning algorithm combined with the
TF-IDF embedding, applied on the corresponding emoji handling method and dataset
size. For instance, the random forest algorithm has performed best on the original dataset
when emojis were removed from text, and has delivered an F-Score of 0.84. Analogous,
Deep Learning refers to the best preforming combination within the combination set
when CNN and LSTM architectures are combined with Word2Vec and GloVe embedding
techniques, applied on the corresponding emoji handling method and dataset size. For
instance, the CNN and Word2Vec combination has performed best when applied on the
augmented dataset with emojis represented as symbols, and has delivered an F-Score
of 0.86. Transfer learning denotes the transformer-based BERT model, applied on the
corresponding emoji handling method and dataset size. Both tables 5.20 and 5.21 indicate
that emojis do not exhibit a noticeable influence on the classification performance, but
that the learning approach and the dataset size do.
64
5.5. Summary
Figure 5.2: Confusion matrix representing the total counts over the entire population,
obtained for the random forest classifier applied on the original data set using the
TF-IDF embedding as text representation method, with emojis represented as symbols.
Less populated classes have been removed from the figure, and the remaining ones
renamed, such as e.g. chainletter-scary was renamed to c-scary, question-conversation to
q-conversation and statement-openurl to s-openurl, with the aim to reduce the size of the
matrix and the text length of each class label, in order to increase readability.
65
5. Evaluation
Figure 5.3: Confusion matrix normalized over the columns (precision), obtained for the
random forest classifier applied on the original data set using the TF-IDF embedding
as text representation method, with emojis represented as symbols. Less populated
classes have been removed from the figure, and the remaining ones renamed, such as e.g.
chainletter-scary was renamed to c-scary, question-conversation to q-conversation and
statement-openurl to s-openurl, with the aim to reduce the size of the matrix and the
text length of each class label, in order to increase readability.
66
5.5. Summary
Figure 5.4: Confusion matrix representing the total counts over the entire population,
obtained for the random forest classifier applied on the original data set using
the TF-IDF embedding as text representation method, with emojis being removed.
Less populated classes have been taken out from the figure, and the remaining ones
renamed, such as e.g. chainletter-scary was renamed to c-scary, question-conversation to
q-conversation and statement-openurl to s-openurl, with the aim to reduce the size of the
matrix and the text length of each class label, in order to increase readability.
67
5. Evaluation
Figure 5.5: Confusion matrix normalized over the columns (precision), obtained for the
random forest classifier applied on the original data set using the TF-IDF embedding
as text representation method, with emojis being removed. Less populated classes
have been taken out from the figure, and the remaining ones renamed, such as e.g.
chainletter-scary was renamed to c-scary, question-conversation to q-conversation and
statement-openurl to s-openurl, with the aim to reduce the size of the matrix and the
text length of each class label, in order to increase readability.
68
5.5. Summary
Figure 5.6: Confusion matrix representing total counts over the entire population,
obtained for the convolutional neural network classifier applied on the augmented
data set with the Word2Vec embedding as text representation method, and emojis
represented as symbols. Less populated classes have been removed from the figure,
and the remaining ones renamed, such as e.g. chainletter-scary was renamed to c-scary,
question-conversation to q-conversation and statement-openurl to s-openurl, with the
aim to reduce the size of the matrix and the text length of each class label, in order to
increase readability.
69
5. Evaluation
Figure 5.7: Confusion matrix normalized over the columns (precision), obtained for the
convolutional neural network classifier applied on the augmented data set with the
Word2Vec embedding as text representation method, and emojis represented as symbols.
Less populated classes have been removed from the figure, and the remaining ones
renamed, such as e.g. chainletter-scary was renamed to c-scary, question-conversation to
q-conversation and statement-openurl to s-openurl, with the aim to reduce the size of the
matrix and the text length of each class label, in order to increase readability.
70
5.5. Summary
Figure 5.8: Confusion matrix representing total counts over the entire population,
obtained for the transformer based classifier applied on the augmented data set
with the BERT embedding as text representation method, and emojis being represented
as symbols. Furthermore, less populated classes have been taken out from the figure,
and the remaining ones were renamed, such as e.g. chainletter-scary was renamed to
c-scary, question-conversation to q-conversation and statement-openurl to s-openurl, with
the aim to reduce the size of the matrix and the text length of each class label, in order
to increase readability.
71
5. Evaluation
Figure 5.9: Confusion matrix normalized over the columns (precision), obtained for the
transformer based classifier applied on the augmented data set with the BERT
embedding as text representation method, and emojis being represented as symbols.
Furthermore, less populated classes have been taken out from the figure, and the remaining
ones were renamed, such as e.g. chainletter-scary was renamed to c-scary, question-
conversation to q-conversation and statement-openurl to s-openurl, with the aim to
reduce the size of the matrix and the text length of each class label, in order to increase
readability.
72
5.5. Summary
Figure 5.10: Confusion matrix representing total counts over the entire population,
obtained for the transformer based classifier applied on the augmented data set
with the BERT embedding as text representation method, and emojis being removed
from the text. Furthermore, less populated classes have been taken out from the figure,
and the remaining ones were renamed, such as e.g. chainletter-scary was renamed to
c-scary, question-conversation to q-conversation and statement-openurl to s-openurl, with
the aim to reduce the size of the matrix and the text length of each class label, in order
to increase readability.
73
5. Evaluation
Figure 5.11: Confusion matrix normalized over the columns (precision), obtained for the
transformer based classifier applied on the augmented data set with the BERT
embedding as text representation method, and emojis being removed from the text.
Furthermore, less populated classes have been taken out from the figure, and the remaining
ones were renamed, such as e.g. chainletter-scary was renamed to c-scary, question-
conversation to q-conversation and statement-openurl to s-openurl, with the aim to
reduce the size of the matrix and the text length of each class label, in order to increase
readability.
74
5.5. Summary
Figure 5.12: Confusion matrix representing total counts over the entire population,
obtained for the transformer based classifier applied on the experimental data set
with the BERT embedding as text representation method, and emojis represented as
text. Furthermore, less populated classes have been taken out from the figure, and the
remaining ones were renamed, such as e.g. chainletter-scary was renamed to c-scary,
question-conversation to q-conversation and statement-openurl to s-openurl, with the
aim to reduce the size of the matrix and the text length of each class label, in order to
increase readability.
75
5. Evaluation
Figure 5.13: Confusion matrix normalized over the columns (precision), obtained for the
transformer based classifier applied on the experimental data set with the BERT
embedding as text representation method, and emojis represented as text. Furthermore,
less populated classes have been taken out from the figure, and the remaining ones were
renamed, such as e.g. chainletter-scary was renamed to c-scary, question-conversation to
q-conversation and statement-openurl to s-openurl, with the aim to reduce the size of the
matrix and the text length of each class label, in order to increase readability.
76
5.5. Summary
Figure 5.14: Multilabel confusion matrix representing total counts over the entire popu-
lation, obtained for the transformer based classifier applied on the multilabel data
set with the BERT embedding as text representation method, and emojis represented as
text.
77
5. Evaluation
Figure 5.15: Multilabel confusion matrix normalized over the columns (precision), ob-
tained for the transformer based classifier applied on the multilabel data set with
the BERT embedding as text representation method, and emojis represented as text.78
CHAPTER 6
Conclusion
6.1 Summary
In this work, we have (i) analyzed the problem and the chatbot solution currently
in operation, (ii) we have evaluated several machine learning approaches to find the
best performing one, (iii) we have examined the influence of emojis on the classification
performance, and (iv) we have implemented a German language chatbot using open source
technologies. In order to investigate the performance of our system, we have conducted
an evaluation comprising a quantitative and a qualitative part. In the quantitative part,
we have evaluated and compared 120 different approaches based on machine learning,
where we have combined a variety of algorithms, neural network architectures and text
embedding methods. By applying transfer learning based on the BERT language model,
we were able to achieve a classification performance of 0.90 for both metrics F-Score
and accuracy. In the qualitative part, we have compiled a questionnaire regarding the
expected and actual results concerning system behavior and performance. The feedback
received through the questionnaire has been very positive, and it showed that the provided
system has met customer expectations. Furthermore, it showed that we were able to
improve the perceived quality of intent recognition and chain letter detection. However,
throughout this work we did not perform a quantitative comparison with the currently
used system, since we did not have access to the machine learning model, and furthermore
did not want to burden the system in operation with several thousand requests.
6.2 Research Questions
Here we provide our results with regard to the research questions, which we have stated
in section 1.3:
Q1. How do classic machine learning algorithms and conventional deep neural networks
compare to approaches based on fine-tuned, pre-trained language models?
79
6. Conclusion
A1. Transformer-based, pre-trained language models have clearly performed best. How-
ever, the difference in performance is mostly visible when the dataset is large. For instance,
in our original dataset which exhibits some very unevenly and sparsely populated classes,
there is a noticeable difference that ranges between 0.01 and 0.04 in performance (averages
around 0.02), which is the F-Score difference between the particular best performing
algorithm and emoji handling combination. In the augmented dataset, on the other hand,
the performance difference ranges between 0.03 and 0.05 (averages around 0.04). The
reason for this behaviour is the larger amount of instances in less populated classes, thus
having a more balanced dataset which helps to prevent the model from becoming biased.
Due to this more uniform distribution, the model no longer favours majority classes
simply because they contain more data. This is described in section 5 and illustrated
throughout tables 5.1 and 5.12, as well as in tables 5.20 and 5.21.
Q2. To what extent do emojis contribute to the overall classification performance?
A2. Surprisingly, emojis do not contribute to the classification performance at all. In
several combinations, no clear pattern of influence has been observed. As described in
section 5.5, the emoji handling method does not influence the classification performance,
but the combination of the utilized learning approach and the dataset size do. Those two
factors together exhibit a noticeable influence. As is illustrated throughout tables 5.20
and 5.21, the learning approach and the dataset size contribute together around 0.08 to
the overall F-Score (that is, classic ML and emojis as symbols on the original dataset vs.
transfer learning and emojis as symbols on the augmented dataset, 0.80 vs. 0.88).
Q3. Is it possible to improve the existing chatbot with open-source technology?
A3. Yes, this is clearly possible, since we have provided a new implementation of the
chatbot, based on open-source technology.
6.3 Future Work
Throughout this work, the possibility of utilising large language models (LLMs) for
conversations (e.g. such as Bard [Goo23] or ChatGPT [Ope22]) - at least to respond
to openly formulated messages - has not been explored thoroughly enough. We have
made experiments with the mT5 [XCR+21] and GPT-2 [RWC+19] large language models,
as described in section 3.3.1, but did not seriously pursue the approach since results
obtained with our limited dataset did not seem promising at all. However, it would
be very interesting to further explore this route as recent developments have shown
that LLMs can be utilized to generate high quality data. In [TGZ+23] and [WKM+22],
the authors report that synthetic data generated by large language models has very
successfully been utilized for fine-tuning. Hence, LLMs could also be utilized to generate
messages belonging to categories which are confused with each other, such as the question-
conversation, question-advicenecessary and none categories, and thus help to improve
the performance of our classifier based on the BERT language model. Furthermore, it
would be very interesting to experiment with models trained on a variety of tasks, as
80
6.3. Future Work
well as with pre-trained multilingual models in order to see which results they would
deliver after being fine-tuned either on our particular dataset, or on our dataset being
even more enhanced by some other LLM. Another approach that we did not pursue in
this work are Generative Adversarial Networks (GANs). Here, on one hand, it would
be interesting to check whether the existing classifier could be improved by utilizing an
"adversary", and on the other hand to explore how well generating text for "less specific"
categories would work. And finally, collecting more genuine and high-quality data to
achieve a better-balanced class distribution and thereby obtain a more valuable dataset,
could further improve the presented work.
81

List of Figures
2.1 High-level basic architecture of a chatbot, as depicted in [Gal19]. . . . . 5
2.2 Architecture of a chatbot comprising components outlined in this section, as
proposed in [LLS+17] and described and summarized in [Gal19]. . . . . . 7
2.3 CBOW and Skip-gram models, as proposed in [MCCD13a]. . . . . . . . 15
2.4 The sigmoid threshold unit applied to the sum of weighted inputs [Mit97]. 19
2.5 Multilayer perceptron with a single hidden layer composed of three units or
neurons. The input layer receives the data and forwards it to the hidden layer,
from where it is then finally passed to the output layer. The example was
taken from [Mit97]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 The error function for a unit with weights w0 and w1. The depicted arrow
indicates the steepest descent along the error surface towards the minimum
error, as described and depicted in [Mit97]. . . . . . . . . . . . . . . . . . 20
2.7 Local features, such as edges and textures, can be extracted from images.
These are contained in a small window of the original image, as illustrated in
[Cho17]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 In the world of visual perception, a spatial hierarchy of modules exists, which
includes elementary lines or textures that combine into basic objects like
eyes or ears. It eventually culminates in higher-level concepts such as cat, as
illustrated in [Cho17]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.9 Convolution operation, visualised on a simple example from [Raf22]. . . 22
2.10 A recurrent neural network - a network with a loop. Taken from [Cho17]. 23
2.11 A simple RNN, unrolled over time. Taken from [Cho17]. . . . . . . . . . 24
2.12 Internal structure of an LSTM, as depicted in [Cho17]. . . . . . . . . . . 25
2.13 Input features (pixels) in the original representation and the corresponding
attention scores. The higher (brighter) the attention score, the more important
the corresponding pixel in the image. Example was taken from [Cho21]. 25
2.14 Attention scores are computed between the word “station” and every other
word in the sequence. These are then used to weight a sum of word vectors
that becomes the new “station” vector. Example was taken from [Cho21]. 26
2.15 The transformer model architecture [VSP+17] . . . . . . . . . . . . . . . 27
2.16 During training, the source sequence is processed by the encoder and then sent
to the decoder. The decoder looks at the target sequence so far, and predicts
the offset by one step in the future. During inference, one target token is
generated at a time and fed back into the decoder. Taken from [Cho21] . 28
83
2.17 The basic idea is to treat every text processing problem as a “text-to-text”
problem. This allows for reusing the model, loss function and hyperparameters
across a diverse set of tasks. Taken from [RSR+19] . . . . . . . . . . . . 29
3.1 An example of a message contained in the dataset, belonging to the category
chainletter-spiel. As can be seen, a variety of emojis is used to express the
meaning of the message. . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Modules comprising the chatbot. . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Intent recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Answer generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Training workflows for the classic machine learning algorithms, for older
deep neural network architectures such as Convolutional Neural Networks
(CNN) and Long Short Term Memory (LSTM), and for the recent transformer
architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1 An example of a confusion matrix, depicted in four different forms: (a) counts
of each category are reported, (b) ratios of counts divided by the entire
population are reported, (c) ratios of counts divided by the sum of each
column are reported (precision), and finally (d) ratios of counts divided by
the sum of each row are reported (recall). . . . . . . . . . . . . . . . . . . 49
5.2 Confusion matrix representing the total counts over the entire population,
obtained for the random forest classifier applied on the original data
set using the TF-IDF embedding as text representation method, with emojis
represented as symbols. Less populated classes have been removed from the
figure, and the remaining ones renamed, such as e.g. chainletter-scary was
renamed to c-scary, question-conversation to q-conversation and statement-
openurl to s-openurl, with the aim to reduce the size of the matrix and the
text length of each class label, in order to increase readability. . . . . . . 65
5.3 Confusion matrix normalized over the columns (precision), obtained for the
random forest classifier applied on the original data set using the TF-
IDF embedding as text representation method, with emojis represented as
symbols. Less populated classes have been removed from the figure, and
the remaining ones renamed, such as e.g. chainletter-scary was renamed to
c-scary, question-conversation to q-conversation and statement-openurl to
s-openurl, with the aim to reduce the size of the matrix and the text length of
each class label, in order to increase readability. . . . . . . . . . . . . . . 66
84
5.4 Confusion matrix representing the total counts over the entire population,
obtained for the random forest classifier applied on the original data
set using the TF-IDF embedding as text representation method, with emojis
being removed. Less populated classes have been taken out from the figure,
and the remaining ones renamed, such as e.g. chainletter-scary was renamed
to c-scary, question-conversation to q-conversation and statement-openurl to
s-openurl, with the aim to reduce the size of the matrix and the text length of
each class label, in order to increase readability. . . . . . . . . . . . . . . 67
5.5 Confusion matrix normalized over the columns (precision), obtained for the
random forest classifier applied on the original data set using the TF-IDF
embedding as text representation method, with emojis being removed. Less
populated classes have been taken out from the figure, and the remaining
ones renamed, such as e.g. chainletter-scary was renamed to c-scary, question-
conversation to q-conversation and statement-openurl to s-openurl, with the
aim to reduce the size of the matrix and the text length of each class label, in
order to increase readability. . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.6 Confusion matrix representing total counts over the entire population, obtained
for the convolutional neural network classifier applied on the augmented
data set with the Word2Vec embedding as text representation method, and
emojis represented as symbols. Less populated classes have been removed
from the figure, and the remaining ones renamed, such as e.g. chainletter-
scary was renamed to c-scary, question-conversation to q-conversation and
statement-openurl to s-openurl, with the aim to reduce the size of the matrix
and the text length of each class label, in order to increase readability. . 69
5.7 Confusion matrix normalized over the columns (precision), obtained for the
convolutional neural network classifier applied on the augmented data
set with the Word2Vec embedding as text representation method, and emojis
represented as symbols. Less populated classes have been removed from the
figure, and the remaining ones renamed, such as e.g. chainletter-scary was
renamed to c-scary, question-conversation to q-conversation and statement-
openurl to s-openurl, with the aim to reduce the size of the matrix and the
text length of each class label, in order to increase readability. . . . . . . 70
5.8 Confusion matrix representing total counts over the entire population, obtained
for the transformer based classifier applied on the augmented data set
with the BERT embedding as text representation method, and emojis being
represented as symbols. Furthermore, less populated classes have been
taken out from the figure, and the remaining ones were renamed, such as
e.g. chainletter-scary was renamed to c-scary, question-conversation to q-
conversation and statement-openurl to s-openurl, with the aim to reduce the
size of the matrix and the text length of each class label, in order to increase
readability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
85
5.9 Confusion matrix normalized over the columns (precision), obtained for the
transformer based classifier applied on the augmented data set with the
BERT embedding as text representation method, and emojis being represented
as symbols. Furthermore, less populated classes have been taken out from
the figure, and the remaining ones were renamed, such as e.g. chainletter-
scary was renamed to c-scary, question-conversation to q-conversation and
statement-openurl to s-openurl, with the aim to reduce the size of the matrix
and the text length of each class label, in order to increase readability. . 72
5.10 Confusion matrix representing total counts over the entire population, obtained
for the transformer based classifier applied on the augmented data set
with the BERT embedding as text representation method, and emojis being
removed from the text. Furthermore, less populated classes have been
taken out from the figure, and the remaining ones were renamed, such as
e.g. chainletter-scary was renamed to c-scary, question-conversation to q-
conversation and statement-openurl to s-openurl, with the aim to reduce the
size of the matrix and the text length of each class label, in order to increase
readability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.11 Confusion matrix normalized over the columns (precision), obtained for the
transformer based classifier applied on the augmented data set with the
BERT embedding as text representation method, and emojis being removed
from the text. Furthermore, less populated classes have been taken out from
the figure, and the remaining ones were renamed, such as e.g. chainletter-
scary was renamed to c-scary, question-conversation to q-conversation and
statement-openurl to s-openurl, with the aim to reduce the size of the matrix
and the text length of each class label, in order to increase readability. . 74
5.12 Confusion matrix representing total counts over the entire population, obtained
for the transformer based classifier applied on the experimental data
set with the BERT embedding as text representation method, and emojis
represented as text. Furthermore, less populated classes have been taken out
from the figure, and the remaining ones were renamed, such as e.g. chainletter-
scary was renamed to c-scary, question-conversation to q-conversation and
statement-openurl to s-openurl, with the aim to reduce the size of the matrix
and the text length of each class label, in order to increase readability. . 75
5.13 Confusion matrix normalized over the columns (precision), obtained for the
transformer based classifier applied on the experimental data set with
the BERT embedding as text representation method, and emojis represented
as text. Furthermore, less populated classes have been taken out from the
figure, and the remaining ones were renamed, such as e.g. chainletter-scary was
renamed to c-scary, question-conversation to q-conversation and statement-
openurl to s-openurl, with the aim to reduce the size of the matrix and the
text length of each class label, in order to increase readability. . . . . . . 76
86
5.14 Multilabel confusion matrix representing total counts over the entire popula-
tion, obtained for the transformer based classifier applied on the multilabel
data set with the BERT embedding as text representation method, and emojis
represented as text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.15 Multilabel confusion matrix normalized over the columns (precision), obtained
for the transformer based classifier applied on the multilabel data set with
the BERT embedding as text representation method, and emojis represented
as text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
87

List of Tables
3.1 Requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 The original, augmented and experimental datasets, with the enlisted counts
for each category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Labels, the corresponding contexts when they are set, and the counts of rows
labelled with that specific label. . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Preprocessing steps for each embedding method. . . . . . . . . . . . . . . 38
4.1 Feasible combinations of algorithms and embeddings. . . . . . . . . . . . 44
5.1 Accuracy results when emojis are removed from the original dataset. . 51
5.2 F-Score results when emojis are removed from the original dataset. . . 51
5.3 Accuracy results when emojis are represented as symbols in the original
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4 F-Score results when emojis are represented as symbols in the original
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5 Accuracy results when emojis are represented as text in the original
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.6 F-Score results when emojis are represented as text in the original dataset.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.7 Accuracy results when emojis are removed from the augmented dataset. 53
5.8 F-Score results when emojis are removed from the augmented dataset. 53
5.9 Accuracy results when emojis are represented as symbols in the aug-
mented dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.10 F-Score results when emojis are represented as symbols in the augmented
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.11 Accuracy results when emojis are represented as text in the augmented
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.12 F-Score results when emojis are represented as text in the augmented
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.13 Classification report obtained for the random forest classifier, applied
on the original data set with the TF-IDF embedding as the utilized text
representation method, with emojis represented as symbols. . . . . . . . 55
89
5.14 Classification report obtained for the random forest classifier, applied
on the original data set with the TF-IDF embedding as the utilized text
representation method, with emojis being removed. . . . . . . . . . . . . 56
5.15 Classification report obtained for the convolutional neural network classi-
fier applied on the augmented data set with the Word2Vec embedding as
text representation method, and emojis represented as symbols. . . . . . 57
5.16 Classification report obtained for the transformer based classifier, applied
on the augmented data set with the BERT embedding as text representation
method, and emojis represented as symbols. . . . . . . . . . . . . . . . . 58
5.17 Classification report obtained for the transformer based classifier, applied
on the augmented data set with the BERT embedding as text representation
method, and emojis being removed. . . . . . . . . . . . . . . . . . . . . . 59
5.18 Classification report obtained for the transformer based classifier, applied
on the experimental data set with the BERT embedding as the utilized text
representation method, and emojis represented as text. . . . . . . . . . . 60
5.19 Multilabel classification report obtained for the transformer based clas-
sifier applied on the multilabel data set with the BERT embedding as text
representation method, and emojis represented as text. . . . . . . . . . . . 61
5.20 Summarized F-Scores in comparison, applied on the original dataset. . . 64
5.21 Summarized F-Scores in comparison, applied on the augmented dataset. 64
90
Bibliography
[BMR+20] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler,
Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher
Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.
Language Models are Few-Shot Learners, 2020.
[CHL+22] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William
Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma,
Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun
Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robin-
son, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent
Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi,
Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and
Jason Wei. Scaling Instruction-Finetuned Language Models, 2022.
[Cho17] François Chollet. Deep Learning with Python. Manning, November 2017.
[Cho21] François Chollet. Deep Learning with Python, Second Edition. Manning,
2021.
[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding.
CoRR, abs/1810.04805, 2018.
[Gal19] Boris Galitsky. Developing Enterprise Chatbots: Learning Linguistic Struc-
tures. Springer Publishing Company, Incorporated, 1st edition, 2019.
[GBB+21] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe,
Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima,
Shawn Presser, and Connor Leahy. The Pile: An 800GB Dataset of Diverse
Text for Language Modeling. CoRR, abs/2101.00027, 2021.
91
[GBC16] Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, Cambridge, MA, USA, 2016.
[Goo23] Google AI. Bard. https://bard.google.com/, March 2023. [Online;
accessed 07-May-2023].
[HMU07] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to
Automata Theory, Languages and Computation. Pearson Addison-Wesley,
Boston, MA, 3rd edition, 2007.
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory.
Neural Computation, 9(8):1735–1780, 11 1997.
[HSW89] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. Multilayer feed-
forward networks are universal approximators. Neural Networks, 2(5):359–
366, 1989.
[JM09] Dan Jurafsky and James H. Martin. Speech and language processing : an
introduction to natural language processing, computational linguistics, and
speech recognition. Pearson Prentice Hall, Upper Saddle River, N.J., 2009.
[LC10] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
[LHH19] H. Lane, H. Hapke, and C. Howard. Natural Language Processing in
Action: Understanding, analyzing, and generating text with Python. Manning
Publications, 2019.
[LLS+17] Huiting Liu, Tao Lin, Hanfei Sun, Weijian Lin, Chih-Wei Chang, Teng
Zhong, and Alexander I. Rudnicky. Rubystar: A non-task-oriented mixture
model dialog system. CoRR, abs/1711.02781, 2017.
[MCCD13a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient
Estimation of Word Representations in Vector Space, 2013.
[MCCD13b] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. word2vec.
https://code.google.com/archive/p/word2vec/, 2013. [Online;
accessed 22-April-2023].
[Mit97] Tom M Mitchell. Machine Learning, volume 1. McGraw-hill New York,
1997.
[MRS08] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. In-
troduction to Information Retrieval. Cambridge University Press, USA,
2008.
[Ope22] OpenAI. ChatGPT. https://chat.openai.com/, November 2022.
[Online; accessed 01-May-2023].
92
[OWJ+22] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex
Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie
Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and
Ryan Lowe. Training language models to follow instructions with human
feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho,
and A. Oh, editors, Advances in Neural Information Processing Systems,
volume 35, pages 27730–27744. Curran Associates, Inc., 2022.
[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe:
Global vectors for word representation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), pages
1532–1543, Doha, Qatar, October 2014. Association for Computational
Linguistics.
[Raf22] Edward Raff. Inside Deep Learning. Manning, 2022.
[RNSS18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
Improving language understanding by generative pre-training. 2018.
[RSR+19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the
Limits of Transfer Learning with a Unified Text-to-Text Transformer. CoRR,
abs/1910.10683, 2019.
[RWC+19] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. Language Models are Unsupervised Multitask Learners. 2019.
[SDCW19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Dis-
tilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
CoRR, abs/1910.01108, 2019.
[Ski17] Steven S. Skiena. The Data Science Design Manual. 2017.
[TGZ+23] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen
Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford
Alpaca: An Instruction-following LLaMA model. https://github.com/
tatsu-lab/stanford_alpaca, 2023.
[TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-
Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric
Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave,
and Guillaume Lample. LLaMA: Open and Efficient Foundation Language
Models, 2023.
[TvWW22] Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Natural Language
Processing with Transformers. O’Reilly, 2022.
93
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You
Need, 2017.
[WFH11] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical
Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data
Management Systems. Morgan Kaufmann, Amsterdam, 3 edition, 2011.
[WKM+22] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith,
Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning Language
Model with Self Generated Instructions, 2022.
[XCR+21] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou,
Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A Massively
Multilingual Pre-trained Text-to-Text Transformer, 2021.
94