Avatar Control by Automatically Detected Face Interest Points

DIPLOMARBEIT
zur Erlangung des akademischen Grades Diplom-Ingenieur
im Rahmen des Studiums Medieninformatik
eingereicht von Miroslav Byrtus, Matrikelnummer 1328167,
an der Fakultät für Informatik der Technischen Universität Wien
Betreuung: Ao.Univ.Prof. Dr. Horst Eidenberger
Wien, 20. Dezember 2015

Avatar Control by Automatically Detected Face Interest Points

DIPLOMA THESIS
submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur
in Media Informatics
by Miroslav Byrtus, Registration Number 1328167,
to the Faculty of Informatics at the Vienna University of Technology
Advisor: Ao.Univ.Prof. Dr. Horst Eidenberger
Vienna, 20th December, 2015

Erklärung zur Verfassung der Arbeit

Miroslav Byrtus, Vorgartenstraße 67, 1200 Wien

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

Wien, 20. Dezember 2015, Miroslav Byrtus

Danksagung

Vielen Dank für eure Unterstützung. Insbesondere möchte ich Herrn Ao.Univ.Prof. Dr. Horst Eidenberger für die engagierte und professionelle Betreuung danken. Des Weiteren möchte ich mich bei den geduldigen Probandinnen und Probanden für die Zusammenarbeit bedanken. Ganz besonders möchte ich mich bei meiner Familie bedanken, die mir mein Studium und diese Diplomarbeit durch ihre liebevolle und aufrichtige Unterstützung ermöglicht hat. Danke!

Acknowledgements

Many thanks for your support. I would especially like to thank Ao.Univ.Prof. Dr. Horst Eidenberger for his dedicated and professional supervision. I would also like to thank the participants of the evaluation for their patience and cooperation. I especially want to thank my family, who made my studies and this thesis possible through their loving and sincere support. Thank you!

Kurzfassung

Das Entwerfen von Systemen, die fähig sind, mit Menschen zu interagieren, ist eine aufwändige Aufgabe. Ein wichtiger Aspekt dieses Problems ist es, die menschlichen Emotionen zu verstehen und auf diese auf menschliche Weise zu reagieren. Kompliziert ist auch die Tatsache, dass Menschen selbst Probleme haben, Emotionen richtig zu erkennen. Derzeit gibt es zahlreiche robuste und gut funktionierende Systeme, die Gesichter erkennen sowie Augen, Nase und Mund lokalisieren können. Hier fehlt aber die sogenannte Meta-Information in Form einer ausführlichen Beschreibung des Gesichts, die ein tieferes Verständnis des Ausdrucks bringen kann. Diese Information sollte nicht unterschätzt werden, denn Gesichtsausdrücke beinhalten eine große Menge an Information der non-verbalen Kommunikation.
Die Gesichtsausdrücke spielen in der menschlichen Kommunikation eine wichtige Rolle, denn eine große Informationsmenge wird auch durch die non-verbale Kommunikation übermittelt. Ein System, das die menschlichen Emotionen automatisch erkennen kann, wäre für Bereiche wie Human-Computer Interaktion, Psychologie, Soziologie etc. hilfreich. Ein solches System würde eine automatische Analyse von Stress, Höhenangst und Aggressivität ermöglichen. Außerdem würde es auch für die Überwachung nutzbar sein. Das Ziel dieses Projektes ist es, ein robustes System zu entwerfen und zu implementieren, das die Emotionen in Gesichtern erkennen und analysieren kann. Das System wird voll automatisiert, sodass der Benutzer keine weiteren Einstellungen tätigen muss, um das System zum Laufen zu bringen. Die erwartete Ausgabe ist eine textuelle Beschreibung der Emotion. Die analysierten Gesichtsparameter werden zudem an die Animationskomponente geschickt, wo die erkannte Emotion in Form einer Animation nachgespielt und angezeigt wird.

Abstract

Designing systems that are able to interact with people is a complex process. An important aspect of this problem is understanding human emotions and responding to them in a human way. The fact that people themselves often have problems in recognizing emotions properly makes the task even more difficult. There are currently numerous robust and well-functioning systems that can recognize human faces and locate the eyes, nose and mouth. However, these systems miss the so-called meta-information in the form of a detailed description of the face, which can lead to a deeper understanding of facial expressions. This information should not be underestimated, since facial expressions contain a large amount of non-verbal information. Facial expressions are important in human communication because much information is transmitted through non-verbal communication.

A system that can automatically detect human emotion would be useful in areas such as human-computer interaction, psychology and sociology. Such a system would enable automated analysis of stress, vertigo or aggression levels. Moreover, it would also be useful in monitoring public spaces, resulting in higher security.

The aim of this project is to design and implement a robust system that can recognize and analyze emotions from human faces. The system should be fully automated so that the user does not need to set up any parameters in order to make the system run correctly. The expected output is a textual description of the emotion. The analyzed face parameters are also forwarded to the animation component, where the facial expression is animated on an avatar.

Contents

Kurzfassung
Abstract
List of Figures
List of Tables
1 Introduction
1.1 Idea
1.2 Motivation
1.3 Challenges
1.4 Overview
2 State of the Art
2.1 Face Analysis
2.2 Face Animation
2.3 Related Work
3 Project Description
3.1 Goals
3.2 Challenges
3.3 System Design
4 Implementation
4.1 Overview
4.2 Face analysis
4.3 Face rendering
4.4 Technical Details
5 Evaluation
5.1 Approach
5.2 Results
6 Conclusion
Bibliography

List of Figures

2.1 Simple rectangle features used in the work [VJ01].
2.2 Features selected by AdaBoost. The first feature notices the lightness difference between the eyes and the cheeks. The second notices the difference in the eye region caused by the nose being between the eyes. [VJ01]
2.3 Rotated features introduced in the publication [LM02].
2.4 A minimal landmark configuration that is enough to match the facial shape, but is not enough for deeper analysis, such as emotion recognition [LTC97].
2.5 Another configuration where all face parts are annotated in detail, so that the shape can be analyzed [LTC97]. This setting should be good enough to recognize emotion from the fitted model.
2.6 A set of different annotation schemes [SRP10] with different levels of detail. All of the schemes are detailed enough for an emotion recognition task, with different levels of detail and precision.
2.7 The 66 points of interest system used in [DCLR13].
2.8 Facial regions of interest for measuring the heart rate from the face, introduced in [DCLR13].
2.9 Example of six action units used in the study of Bartlett et al.: AU 1 - inner brow raiser, AU 2 - outer brow raiser, AU 4 - brow lower, AU 5 - upper lid raiser, AU 6 - cheek raiser, AU 7 - lid tightener [BVS+96]
2.10 25 Action Units of the extended Facial Action Coding System introduced in the work of Cosker et al. in 2010 [CKH10]. Every Action Unit is a muscle that is used for facial expression. Some of the expressions shown can barely be differentiated from each other by the human eye, which makes this scheme unusable for the proof of concept.
2.11 Six basic expressions described by Ekman and Friesen in 1971 [FACS Ekman]. These six basic expressions were used in the system designed within the scope of this thesis.
2.12 A picture of Sponge Bob in the SVG format. On the right-hand side is the deformed face. The deformation was not applied to the graphics directly but to the pivot points, which were moved and the angles adjusted. The result of this change applied to the SVG file is a deformation simulating an angry expression.
2.13 3-layered Bayesian model used in the classification method in the publication [SRDW09].
3.1 The logical parts of the project divided into stages, showing which libraries and tools were used in each stage.
4.1 Flowchart showing how the data is processed in the prototype. The starting point is the Unity box on the left side. It requests data from the C++ program containing the FaceTracker library through an Export API (application programming interface) call. The FaceTracker library loads the pre-trained ASM model and fits it to the image captured by the web-cam. The C++ program that is hosting FaceTracker then sends the image further on to the Emotion Classifier, which classifies the emotion of the face in the image and returns the label of the recognized emotion. Both the fitted ASM model and the emotion label are then sent back to Unity, where the data is visualized.
4.2 This figure shows the repositioning of face features when capturing the user's face with the same expression from different angles.
4.3 This figure shows the results of testing the Di-linear AAM fitting. Other fitting algorithms implemented in this library were tested on both Windows and OSX operating systems, attaining similar results.
4.4 This screenshot shows the exactness of the fitting algorithm implemented in the FaceTracker library. All features were located correctly.
4.5 Two frames with the same face pose but with different results of the fitting process. Such differences happen with each loop, causing strong shaking of the model, making the library unusable for our prototype.
4.6 STASM library testing. It is impossible to use this library in a real-time application running on common hardware. The precision of the fitting algorithm implemented in this library is also too low. Notice the inaccuracy of the located features.
4.7 The Holmen advanced head rig is a 3D rigged model downloaded in the .blend file format. It is a highly detailed model with a spectacular texture. However, the texture cannot be seen on this screenshot since the Unity3D Editor was not able to import it.
4.8 The quality of this Holmen Face Rig model is the same as the quality of the previous one shown in Figure 4.7, but the superiority of the loaded material makes the model look more realistic. Both models are equipped with facial muscle joints, making them suitable for our prototype.
4.9 A screenshot of the 3D models shown in Figures 4.7 and 4.8, opened in Blender before being imported into the Unity3D Editor. Notice the texture difference. However, we wanted to use the model in a real-time application made in Unity3D, which would be too complicated to render in such a level of detail on commonly used hardware.
4.10 The armature of the Holmen Advanced Rig shows how joints can be modeled in the Blender modeling tool.
4.11 Data flow chart showing the data flow between the Unity C# script and the C++ plug-in.
5.1 The results of emotion classification.
5.2 Overall scores of the animation's rating (see Table 5.2).
5.3 Participant H: The happy expression classified incorrectly as the neutral expression. The ASM debug points were matched correctly, but intentionally moved to the top left corner so that the real expression can be seen clearly.
5.4 Figure showing how FaceTracker fits the face of Participant F, who sports a beard.
5.5 Participant A: Inaccuracy in the animation of the avatar's fearful expression.

List of Tables

5.1 The results of the facial expression categorization evaluation. Tests were marked as incorrect if the emotion was not recognized correctly on the participant's natural happy/sad/fearful expression, and if the participant had to make additional effort in order for the emotion to be classified properly.
5.2 Avatar animation scores based on participants' perceptions. 1: bad - 5: excellent. Participant A gave a low rating of the fearful animation since she was not able to fit the expression properly at all (see Figure 5.5).

CHAPTER 1
Introduction

1.1 Idea

The goal of this thesis is to propose a system for deep visual face analysis. As the proof of concept, a prototype application for recognizing and animating user facial emotions in real-time was implemented. This ushers in a new level of image analysis that can exist alongside advanced scene analysis and can be used, for example, for an emotion-driven human-computer interface. Due to its sequential input, the content and responses can be adapted over time according to the user's emotional responses. Some actions can also be automated, or at least predicted, according to the user's emotions.

As a side effect of emotion recognition, the user's facial data can be used for an avatar animation. Since the face analysis precisely describes the user's face, it can also replicate the emotion precisely. In this way the user's body language can be credibly animated, while the user's real face remains anonymous.

Even though the system is to be developed and tested on Europeans only, the idea is to create a robust solution that will work with any race, age or gender without limitations. Assuming minimal lighting conditions are met, the proposed solution will be able to work fluently on a common personal computer with no need for special hardware. The user will be captured by a web-cam and the data will be analyzed with algorithms that even modern mid-end laptops will be able to handle.

1.2 Motivation

The motivation for this work is to enable the computer to understand and respond to a human-specific expression such as facial emotion. With this capability, programs will be able to reach a new dimension of semantic information. Emotional expressions happen spontaneously over time. In most cases people do not think explicitly about what triggers an honest reaction in them. If a machine could evaluate facial expressions automatically, the user's response time could be decreased in some scenarios, with fewer user actions required.

Body language is an important part of interpersonal communication, especially the expressions of the face and hands. A system capable of analyzing the user's facial emotion and animating it on an avatar enables the use of this kind of body language over the Internet, thereby enriching the expressive power of virtual communication without sacrificing the user's anonymity.

A novel human-computer interface is another aspect of the motivation.
Currently, the usual setting comprises the user sitting in front of the computer and seeing some content to which he or she is able to respond, for example by sending a response with the keyboard. Between these two points, the user reacts spontaneously, and facially, to the content. The proposal comprising this thesis could also be used for this kind of task.

1.3 Challenges

The most challenging part of this work is to describe the face so precisely that the individual facial parts, the eyes, mouth etc., can be analyzed separately according to their form and position relative to other facial parts. Such analysis requires a system of face interest points to be designed, describing the form and the position of every face feature of an appropriately fitted image.

There are several published approaches and methods for the extensive description of human faces, but there is as yet no ideal approach for the particular task of emotion analysis. None of the published algorithms and approaches is both sufficiently precise and applicable in a real-time application, a fact that makes the choice of approach a challenging task. The more precise the method, the higher the computational power needed. This work therefore searches for the best ratio between precision and performance. It also searches for existing methods, algorithms and libraries that can be combined in order to achieve some of the functionality. These are listed and compared in the following chapters. Following this, the most suitable combination will be chosen and the rest of the functionality will be implemented.

Another technical challenge is to use only open-source libraries, which can be adapted to the functionality required for this thesis. The preferred programming language is C/C++, so that communication between the recognizer and the animator in the final product can be implemented optimally without any limitations. Even if the final product were to be compiled for usage under the Microsoft Windows operating system, the source code ought to be platform-independent.

1.4 Overview

This thesis is organized in two parts: research and implementation. The former investigates approaches that could potentially be used in the prototype. The precision and computational complexity of the chosen methods are described and compared with each other. In the latter part, we chose the best of the evaluated methods and based the prototype implementation on it. The prototype was also tested and evaluated among a small group of people. The project is described, the results are discussed and future work is proposed.

The description of the prototype also consists of multiple sections. Firstly, the user's face has to be captured, localized and described so precisely that the positioning and form of certain face features can be further analyzed. In order to achieve this, a system of significant points on the face had to be designed. These points will then be located on the face to create a face map that helps to analyze the face further.

The prototype will only work correctly if a set of minimum requirements is met. The computer needs to be able to access the web-cam, which has to capture the whole frontal face of the user. The face has to be captured in such a way that the visual information is fine enough to clearly distinguish the individual face features. Assuming that the minimal requirements are met, the significant points mapping is applied and fitted onto the visual image.
This mapping is then analyzed further and compared to a set of pre-trained emotions, predicting the actual emotional expression of the perceived face. This is executed in a loop over time, so that a real-time sequential emotion prediction is reached. From this mapping and prediction, an avatar face together with a textual description is rendered. With the mapping precise enough, processed and applied to the 3D model properly, an almost real emotional expression can be simulated.

CHAPTER 2
State of the Art

Since this thesis is organized in two parts, the State of the Art chapter is also organized accordingly. Firstly, methods related to the emotion recognition part are described, and then the animation part commences. Currently, thanks to the rapidly growing Web and IT technologies, many algorithms and libraries are published that relate to this topic in some way. There are different approaches for locating the face, finding and matching patterns and shapes, and many approaches to machine learning. To attain the functionality required in this thesis, a combination of these methods will be needed, so the chapter deals with all of them.

• In the first step, technology for locating a user's face is required. The simplest way to locate objects in visual images is to use visual trackers. However, for human faces this is not the best way of accomplishing such a task, for two reasons. First, having trackers on the face would be unnatural and would not represent a real environment. Second, to be able to track the face so thoroughly that the emotion could be recognized, dozens of trackers on a small face-sized surface would be needed. This would lead to an unrealistic setting that would not find relevance in the real world. For these reasons, tracker-less methods for locating and describing faces based on image-processing approaches were researched and used. Section 2.1 focuses on this topic more closely.

• The next problem in this thesis concerns the points of interest of a human face. The location of the face itself would not be sufficient for deeper analysis, such as recognizing emotions. This is why the face needs to be located more thoroughly and in greater detail: information about the locations and the shapes of particular face features is needed. For such a thorough, detailed facial description, a system of significant points needs to be defined in order to clearly define the position and shape of particular face features on the image. There are multiple systems using different positionings and point counts, which are described in more detail in the following sections.

• With such a system of points, the fitted point map can be analyzed and compared to pre-defined maps of emotions. How to pre-define a configuration and how to compare and match it to the current one is analyzed in the following sections. Most approaches use annotated face databases for training a model that is able to compare the current map of points to the trained ones, and to tell which trained one is most similar to the current one. This training-classifying method is closest to that used by humans for understanding perceived content. Since the task is to enable machines to recognize a human facial expression in a human way, this method is the most appropriate. If implemented correctly, the prototype will be able to recognize facial emotions in much the same way that humans do.
• There are two ways of classifying emotions: static and dynamic. The static method takes a static image of a human face and recognizes the expression on it. With a properly trained model, a simple static image contains enough information for understanding the emotion on a face. Conversely, dynamic video content contains more information than the static one. According to a study by C. Soladié, H. Salam, C. Pelachaud, N. Stoiber and R. Séguier [SSP+12], the duration of a smile is also important when recognizing an expression. Short smiles can have different meanings than longer ones. In dynamic video recordings, the last facial expression can even be compared to previous expressions, and the difference measured. Based on these differences, and on the difference to the emotionless expression, the strength of an expression can also be measured.

2.1 Face Analysis

This section concerns approaches to facial analysis, which consist of localizing a frontal human face and matching its shape to a model. The section is divided into two subsections, Face Recognition and Face Shape; these two tasks differ in several aspects and there are special methods peculiar to each. The methods are now described and compared, as is choosing the implementation that will best fit our needs. These were then tested and combined in order to create the most optimized base for the prototype solution.

2.1.1 Face Recognition

The first step of the image processing required for reaching the final functionality of the prototype is to localize the face. This is in order to determine whether there is a recognizable face captured that is suitable for further processing. If not, then either there is no one in front of the camera, or the lighting conditions are insufficient for face recognition. In either case the process should be stopped with a warning, since the system cannot work properly unless there is a recognizable face in the image.

Assuming that a bounding rectangle is returned from the face recognition method, the visual information can be reduced by removing any unnecessary part of the image outside the bounding rectangle. In this way, the process can zoom in and focus solely on the face. This brings a further level of optimization to the whole process.

There are several methods for detecting objects in images. One reliable and currently widely used method was introduced by P. Viola and M. Jones in 2001 [VJ01].

Figure 2.1: Simple rectangle features used in the work [VJ01].

Figure 2.2: Features selected by AdaBoost. The first feature notices the lightness difference between the eyes and the cheeks. The second notices the difference in the eye region caused by the nose being between the eyes. [VJ01]

This method was improved in 2002 by the work of R. Lienhart and J. Maydt [LM02], where the set of Haar-like features was extended with rotated features, comprising four edge features, eight line features and two center-surround features, shown in Figure 2.3. As a result, a reliable and robust method was created, which is also directly implemented in the OpenCV library (http://opencv.org/, 21.10.2015), making it a highly suitable method for this thesis.

Figure 2.3: Rotated features introduced in the publication [LM02].
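The detector as shipped with OpenCV can be used with only a few calls. The following minimal sketch loads one of the pre-trained frontal-face cascades distributed with the library and returns the bounding rectangle of the first detected face; the cascade file name, camera index, detection parameters and the OpenCV 3.x-style headers are illustrative choices, not the prototype's actual configuration.

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/videoio.hpp>
#include <vector>

int main()
{
    cv::CascadeClassifier face;
    // Pre-trained frontal-face cascade distributed with OpenCV.
    if (!face.load("haarcascade_frontalface_alt.xml")) return 1;

    cv::VideoCapture cam(0);              // the web-cam used by the prototype
    cv::Mat frame, gray;
    while (cam.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::equalizeHist(gray, gray);     // compensate for poor lighting

        std::vector<cv::Rect> faces;
        face.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));
        if (faces.empty()) continue;      // no recognizable face: warn and skip

        cv::Mat faceROI = gray(faces[0]); // zoom in on the bounding rectangle
        // ... hand faceROI over to the shape-fitting stage ...
    }
    return 0;
}
```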
2.1.2 Face Shape

In the previous Section 2.1.1, the face was only localized and a bounding rectangle returned. We are thus still limited to pure visual data of the user's face. It would clearly not be the most optimized way to analyze emotion using only visual content. This is why the present section discusses how to reduce the amount of data while describing the face far more comprehensively. In general, for understanding emotion, the shapes and positions of particular face features are sufficient. The goal of this section is therefore to find a method that can provide us with this data.

To predict emotions directly from images of human faces would not be impossible, but it is still far from an optimal way of solving the task. There is too much information that is unrelated to the emotion, which would make it complicated, computationally expensive and not robust enough. Since the shapes and positions of face features are enough to define an emotion, the shape information needs to be extracted from the visual information. Thus a more comprehensive description of the shape of a face is needed. In this case this means shape descriptions and localizations of particular face features, such as the mouth, eyes and eyebrows. To understand human facial emotional expressions, the positions and the shapes of these organs are needed. This is the reason why a biometric analysis needs to be performed, and for this a set of reference points is required. The problem with suggesting such a set of points is that they need to be positioned in such a way that, if localized correctly, they clearly indicate the positions and the shapes of the facial organs, enabling the recognition of emotion. Additionally, the number of reference points is also an important aspect: all important information should be captured by as few points as possible.

2.1.3 Active Appearance Models

Active Appearance Models (AAMs) comprise a special image-processing approach that deals with deformable objects. An AAM can be trained for a specific object that varies in form. With such a model, the fitting algorithm comes into play: it fits the trained model to a new, unseen image and returns the parameters that were applied to the trained model in order to match the new image. These parameters describe the actual form of the object.

The method consists of two parts, the parametric model and the fitting algorithm. With the fitting algorithm, the parametric model can be fitted to any unseen image, which allows for deeper analysis and description. Since this method also stores the appearance of the object, the visual information can be combined with the matching shape. However, for the appearance only grey-level information is stored [CET98]. As a result of this, the AAM can also generate synthetic images from the model [SRU+10].

The AAM is a general-purpose tool that can be trained for any kind of object, not only the human face. Having the shape of the whole object and its interior parts, a deep object analysis can begin, resulting in an extensive object description. The shape information comprises a set of parameters to be applied to the model in order to fit the new image. With the resulting description, information concerning the deformation of the face and its parts can be collected and used further. This is why this approach is so interesting and relevant to this thesis. Since a human face is a deformable object, the AAM can be trained for it and returns exactly what is expected: the actual shape of the human face that is required for further analysis and emotion recognition. Such a description of form is useful not only for the emotion recognition of the human face, but can also be used in medicine, where particular deformations of cells or organs can be used to indicate a disease.
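For reference, the linear formulation of Cootes et al. [CET98] can be summarized as follows: both the shape and the grey-level appearance are modeled as a mean plus a linear combination of the main modes of variation obtained by principal component analysis of the training set,

x = \bar{x} + P_s b_s , \qquad g = \bar{g} + P_g b_g ,

where x is the vector of concatenated landmark coordinates, \bar{x} the mean shape, P_s the matrix of shape modes and b_s the shape parameters; g, \bar{g}, P_g and b_g are the corresponding quantities for the shape-normalized grey-level appearance. Fitting an unseen image then amounts to finding the parameter values, together with a global pose transformation, that minimize the difference between the image and the appearance synthesized from the model.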
There are already published papers that deal with the analysis of human facial expressions. For example, the publications of Dhall [Dha13] and García Bueno et al. [GBGFMB12] deal with expression analysis using AAMs on human faces. The latter, the work of García Bueno et al., also uses an AAM to locate face features, which are then classified using neuroevolution based on neural networks and a differential evolution algorithm [GBGFMB12] in order to predict a user's emotion from facial expressions. Studies by Ashraf et al. [ALC+07] and Hammal et al. [HC12] have shown the use of AAMs in the task of recognizing pain in human facial expressions. Pain is one of the expressions that can also be recognized and measured from pure visual content. These publications were used as proof that this method is suitable for the problem of recognizing emotions from facial expressions. Other expressions can also be recognized using AAMs, such as happiness, fear, sadness, and others. Further to this, in 2013 Datcu et al. [DCLR13] proposed a solution showing the use of an AAM for heart rate analysis derived from facial features.

In the previous paragraphs the AAM approach was suggested as an ideal solution for preparing data for the final emotion analysis. The method is robust, fast, and offers a low failure rate, but it also has its drawbacks. The method encounters heavy problems when images are partially occluded. If important features are occluded, the fitting algorithm becomes lost and fails to match the other features. Thus if the object is even partially occluded, the algorithm usually does not fit the model. When working with human faces especially, this problem is a major one. Here occlusion often appears, particularly with people wearing glasses, with beards, long hair or make-up. To a human this does not at first appear to be occlusion, because for humans it is already common. However, the machine sees it as occlusion, since the spectacle frame or the beard occludes the face features. Even heavy make-up can cause a disorder in the appearance of the face that can lead to incorrectly located features. In these cases the algorithm cannot recognize the object correctly, leading to an incorrect match. The machine cannot handle all these variations and special exceptions during the training phase. Nevertheless, attempts are made to counter occlusion and to ensure the method is robust. In 2010, Storer et al. [SRU+10] proposed in their article a robust AAM fitting strategy as an improvement to counter occlusions. The idea was based on and inspired by the publications of Nguyen et al. [NLEDlT08] and Xiaodong Jia et al. [JG10]. Both publications deal with the usual facial occlusions. The former proposed a method for image-based shaving [NLEDlT08], which removes beards from images of human faces using an image-processing approach. Beards can occlude up to a third of the face and can also visually change the shape of the jaw. The latter study dealt with occlusion caused by spectacles [JG10], which is also a common occlusion in the real world. These are two of the most common occlusions of a human face, and they comprise a good basis for such an enhancement.

As mentioned above, an AAM consists of a model and a fitting algorithm. Since the model has to be trained for the object of interest, the training stage is also mentioned here. For this, annotated images of the objects of interest are needed.
The annotations are points of interest that, if connected in the correct way, create the contours of the whole object and its parts. For facial expression recognition, annotated faces with different expressions are needed. How many points are required and at which positions is the topic of the next section. There are systems that use different numbers and different positions for these annotations. In general they are very similar, but still slightly different, according to the kind of data sought; see Figures 2.4, 2.5 and 2.6.

The exact flow was explained in the work of Kohli et al. [KPG11]. The shape is defined by landmarks localized on the object. The trained model contains the mean shape and mean appearance of the object. This mean shape is then used by the fitting algorithm, which applies parameters to the mean shape when trying to fit the new image. A similar approach is used for extracting the appearance part; however, this is based on the normalized mean grey-level appearance instead of the shape. The collected shape and grey-level parameters are combined, resulting in the shape and appearance model of the newly fitted object.

Figure 2.4: A minimal landmark configuration that is enough to match the facial shape, but is not enough for deeper analysis, such as emotion recognition [LTC97].

Figure 2.5: Another configuration where all face parts are annotated in detail, so that the shape can be analyzed [LTC97]. This setting should be good enough to recognize emotion from the fitted model.

Figure 2.6: A set of different annotation schemes [SRP10] with different levels of detail. All of the schemes are detailed enough for an emotion recognition task, with different levels of detail and precision.

2.1.4 Active Shape Models

Active Shape Models (ASMs) are the predecessors of Active Appearance Models. They differ in that no appearance part is obtained. Since the appearance part is avoided, there is less data to be processed, which makes the fitting algorithm run faster. On the other hand, according to the study of Cootes et al. [CET98], discarding the appearance data also makes this approach less robust than the AAM.

The first examples of usage were introduced in 1995 by T.F. Cootes et al. [CTCG95], where Active Shape Models were used for fitting the shapes of resistors, hands and hearts. In this study the process of aligning the training set was demonstrated. They repeatedly took a pair of shapes and compared them to each other, trying to match one shape to the other by translating, rotating and scaling it so that it fitted the second shape as exactly as possible. The difference between the two shapes was calculated as the sum of the distances between the corresponding point locations. The bigger the difference, the smaller the weight assigned to the particular shape.

The Active Shape Model also lacks the possibility of synthesizing new, untrained images from trained ones, as the AAM can. Although this difference makes the method less robust, it also makes it faster, since there is less data to be processed. It depends upon the task itself whether computational costs are more important than precision and robustness. If neither synthesis nor the appearance part is needed, then it is in the developer's hands to decide which method to choose. In this thesis, the appearance part is not required at all, although the robustness and the precision of the fitting algorithm are essential for the task. This is why libraries for both Active Shape and Active Appearance Models were tested and compared, so that the best method could be chosen. Since the synthesis of new faces is useless for emotion recognition, Active Shape Models have the advantage of speed here. The proof of concept needs to run in real-time, making speed an important requirement. On the other hand, Active Appearance Models still remain more robust and precise than Active Shape Models, which is an advantage in every task. The actual testing on hardware with trained AAM and ASM models shows the comparison, which is concluded in the next sections.
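A minimal sketch of the pairwise alignment and difference measure described above is given below. It assumes that both shapes have the same number of corresponding points and uses plain structs to stay self-contained; the least-squares scale and rotation follow the standard closed-form solution, and the returned sum of point distances is the quantity from which the training weights are derived.

```cpp
#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Aligns shape b to shape a (translation, scale, rotation in the least-squares
// sense) and returns the remaining sum of point-to-point distances.
// Both vectors are taken by value because local copies are modified.
double alignedShapeDifference(std::vector<Pt> a, std::vector<Pt> b)
{
    const size_t n = a.size();  // both shapes must have the same point count

    // 1. Remove translation: centre each shape on its centroid.
    auto centre = [n](std::vector<Pt>& s) {
        double cx = 0, cy = 0;
        for (const Pt& p : s) { cx += p.x; cy += p.y; }
        cx /= n; cy /= n;
        for (Pt& p : s) { p.x -= cx; p.y -= cy; }
    };
    centre(a);
    centre(b);

    // 2. Least-squares scale and rotation mapping b onto a:
    //    alpha = s*cos(theta), beta = s*sin(theta).
    double sxx = 0, sxy = 0, norm = 0;
    for (size_t i = 0; i < n; ++i) {
        sxx  += a[i].x * b[i].x + a[i].y * b[i].y;
        sxy  += a[i].y * b[i].x - a[i].x * b[i].y;
        norm += b[i].x * b[i].x + b[i].y * b[i].y;
    }
    const double alpha = sxx / norm, beta = sxy / norm;

    // 3. Sum of distances between corresponding points after alignment.
    double diff = 0;
    for (size_t i = 0; i < n; ++i) {
        const double tx = alpha * b[i].x - beta * b[i].y;
        const double ty = beta  * b[i].x + alpha * b[i].y;
        diff += std::hypot(a[i].x - tx, a[i].y - ty);
    }
    return diff;  // larger difference -> smaller training weight
}
```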
2.1.5 Face Databases

Since we use a classifier that needs to be trained, some training data is needed. Training data is clearly an essential part of the whole classification process. The quality of the results depends not just on the training algorithm but also on the quality and amount of training data. The better the training, the bigger the chance of avoiding classification mismatches. We need to train two models for the prototype, so we also need two different training data sources.

The first one will be used in the Active Shape or Active Appearance Model. This will be important for properly matching the shape to the face. Since third-party libraries will be used for the ASM/AAM, there is a probability that there are already some pre-built models of human faces that work well. For building a proper ASM/AAM model, thousands of faces need to be trained. For such purposes there are face databases published on the Internet. In order to test the solution thoroughly, several were tested in the implementation part. Since human faces differ greatly, face databases also differ in many aspects. Besides the differences in human faces, they also differ in the angle of the captured faces, image quality and annotations, and some of them even capture faces multiple times with diverse facial expressions.

The training algorithm of the ASMs and AAMs has a special requirement for the training images of the objects of interest: it expects annotated images as input. Usually the annotations are stored in separate files that describe the positions of the face interest points. When choosing or creating a face database, the face interest points system should also be considered. The count and the location of the points need to be declared before the training process actually starts. Since annotating images manually takes too much time, we looked for an already annotated face database with a fitting annotation system. Additionally, a small test database was created for testing the fitting algorithm on one's own face with our own annotation system. Testing the model with the same face that the model was trained with is the best-case scenario, although this would not happen in a real setting.

The following databases were tested and used in this thesis: BioID (https://www.bioid.com/About/BioID-Face-Database, 21.10.2015), IMM (http://www.imm.dtu.dk/~aam/datasets/datasets.html, 21.10.2015), FRANCK (http://personalpages.manchester.ac.uk/staff/timothy.f.cootes/data/talking_face/talking_face.html, 21.10.2015) and MUCT (http://www.milbo.org/muct/, 21.10.2015). Additionally, as mentioned above, a small local database was also created and tested, containing five to ten annotated images of one face. Since this would not be sufficient input for general usage, it was tested on the same face as in the database, and used for testing and debugging purposes only.

The second database needed is a database of emotional expressions linked to the shape ASM/AAM model. With this model trained, the emotion can be classified. In comparison to the first type of face database, this one only contains additional information about the captured expression.
Since there is no face database containing this labeling, it had to be created manually. There are face databases capturing the same person with different emotional expressions, but the labeling still needs to be done manually.

Face databases are usually very large and the training process takes a lot of time. In this case the search for a compact face database and a fast training algorithm is not strictly necessary. The training process is executed only once, at the beginning of the whole process, in order to prepare the model for the prototype; the model is then reused with every prototype run. The face database is not deployed with the prototype; only the model is. Because of this, the hundreds of megabytes and an execution time counted in minutes or even hours are not significant. The important thing is the model size, since the model has to be loaded with every program start and used in real-time. We focused on this aspect during the implementation phase. Since no serious problems are expected and it is too complex a matter to pre-calculate the size and the behavior of the model, it was tested directly while implementing the prototype.

2.1.6 Face Description

In this section we will examine the topic of describing human faces. The goal of describing human faces is to extract contextual information from purely visual representations of the human face. There is much information obtainable from the face itself, such as a person's race, gender or approximate age. Further, images of faces can be compared for matching pictures of the same person, so that even identity can be tested by a facial description.

The work of Kohli et al. in 2011 [KPG11] describes an approach for age estimation from human faces. Here Active Appearance Models were used in combination with Dissimilarity Based Classifiers [KPG11]. For this task the appearance component is important when differentiating between a child and an adult. In this task the AAM is the clear choice, since with Active Shape Models the solution would miss the appearance component. Conversely, for the description of an emotional expression, the shape of the face features is the key, and the appearance component is not essential.

Another paper [DCLR13] proposed a method for measuring the heart rate from a facial image. Here the Active Appearance Model was used to recognize small movements and shape deformations of facial regions. A system consisting of 66 face interest points was used in this work (see Figure 2.7).

Figure 2.7: The 66 points of interest system used in [DCLR13].

In addition to the specific characteristics of race, gender, age or identity, more abstract contextual information can also be extracted from the image, such as a description of the emotional facial expression. This is much more complicated than the recognition of the specific characteristics, since the meaning of an expression can be misunderstood even by a human. However, recognizing the main emotional expressions in a robust, trustworthy way should still be possible. This aspect is the topic of the present work, and is described and implemented within the scope of the thesis.

Figure 2.8: Facial regions of interest for measuring the heart rate from the face, introduced in [DCLR13].
We propose to work with the positions and movements of face features directly, since they indicate the expression in the most accurate way. For this, a suitable annotation (interest points system) for finding the contours of all the important face features had to be found, and then specially adapted for facial expression analysis. This is described in the following subsection, Emotion Description.

For extracting the abstract contextual information from the face, the shape information is needed. For tasks such as race, gender and identity recognition, the shape is not very important, or even of no use. However, for tasks like emotion recognition it is essential and contains all the information needed for further processing. As described in the previous sections, Active Shape and Active Appearance Models are useful for obtaining shape information, although identity recognition is not possible from shape only. This means that when the pure shape information is extracted from the visual information, identifying the human becomes impossible. Since the user remains unidentifiable after extracting the shape from the image, the method can also be used in communications in which the user can use his or her facial expressions while remaining anonymous. Another domain of use is visual speech recognition, in which the shape of the mouth can be used for recognizing what people are saying. If the shape of the mouth is described precisely, the differences between the pronunciations of various vowel and consonant sounds can be trained and classified.

Emotion Description

This subsection deals with the idea of describing emotional expressions from an image of a human face. As mentioned above, the basis for the recognition is the shape obtained from the image. This shape is then analyzed and fitted to a pre-trained model of facial expression shapes. The fitting process can be performed in various ways. One approach could be to compare static shapes as a whole directly with the pre-trained face models. Another would be to segment the face shape into facial parts of interest and to fit them separately, combining the results into an overall expression description. A quite different approach would be to use the differences between several recent shapes in a video stream instead of simply matching static shapes. Shapes can be saved in a buffer containing a number of recent expressions, where the shape and time differences can be compared. This seems like an approach containing much richer information, but it should also be considered whether the additional information would really have any advantages over working with static images. Since emotion is recognizable from static photos of humans, working with the differences over time is considered to be of little or no additional use.

For the listed approaches, several tools are required. First, a method for extracting the facial shape and contours is required. An example has already been described in the sections above: an Active Shape or Appearance Model is a suitable solution. This step is essential for reducing the data from a huge amount of visual data to a much smaller amount of face shape data, which can be processed much faster and more easily. Second, a classifier that can fit the shape to the emotion is needed. The shape data can be easily classified with any classifier, since it is merely a small set of two-dimensional coordinates that every classifier should be able to handle well.
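As a sketch of how small this input really is, the following function flattens a fitted set of landmarks into a single feature row, assuming the 66-point interest-point system shown in Figure 2.7. The normalization used here, centering on the centroid and dividing by the root-mean-square radius, is one possible way to remove translation and scale before classification; it is an illustrative choice, not the prototype's exact pre-processing.

```cpp
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

// Converts fitted landmarks into one classifier sample row (CV_32F).
cv::Mat shapeToFeatureRow(const std::vector<cv::Point2f>& pts)
{
    // Centroid of the landmark set.
    cv::Point2f c(0.f, 0.f);
    for (const auto& p : pts) c += p;
    c *= 1.0f / static_cast<float>(pts.size());

    // Root-mean-square distance from the centroid, used as the scale factor.
    float rms = 0.f;
    for (const auto& p : pts) {
        const cv::Point2f d = p - c;
        rms += d.x * d.x + d.y * d.y;
    }
    rms = std::sqrt(rms / pts.size());

    // Flatten to (x0, y0, x1, y1, ...) with translation and scale removed.
    cv::Mat row(1, static_cast<int>(pts.size()) * 2, CV_32F);
    for (size_t i = 0; i < pts.size(); ++i) {
        row.at<float>(0, static_cast<int>(2 * i))     = (pts[i].x - c.x) / rms;
        row.at<float>(0, static_cast<int>(2 * i) + 1) = (pts[i].y - c.y) / rms;
    }
    return row;  // one sample row for training or prediction
}
```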
Facial Action Coding System

The Facial Action Coding System (FACS) is a system describing facial expressions with Action Units (AUs), introduced by Ekman and Friesen in 1978 [Ekm78] (see also http://ocw.mit.edu/courses/brain-and-cognitive-sciences/9-00sc-introduction-to-psychology-fall-2011/emotion-motivation/discussion-emotion, 21.10.2015). The Action Units used in that work are facial muscles that cause facial movements, resulting in facial expressions.

The Facial Action Coding System was developed to work primarily with video sequences. This system was developed especially for psychologists to observe human facial expressions. The idea was to mark Action Units manually on the video images instead of using an automated tool for fitting AUs to the image. An extended system of 46 distinct Action Units was developed in the FACS article published by Bartlett et al. [BVS+96] in 1996.

Our solution is inspired by this system. Instead of using Action Units based on facial muscles, the Action Units in this thesis were based directly on face features visible to the human eye, such as the mouth, the eyebrows etc. Additionally, not all the Action Units proposed in the publication were used; the proof of concept comprises working with the six basic facial emotion expressions (see Figure 2.11).

Figure 2.9: Example of six Action Units used in the study of Bartlett et al.: AU 1 - inner brow raiser, AU 2 - outer brow raiser, AU 4 - brow lower, AU 5 - upper lid raiser, AU 6 - cheek raiser, AU 7 - lid tightener [BVS+96]

The problem of emotion expression recognition is the fact that even people often misunderstand human expressions. There are many slightly differing expressions that cannot always be classified with confidence. Using facial muscles as Action Units is anatomically precise and correct, but unusable for our proof of concept. That is why we used different AUs, as mentioned above (see Figure 2.10).

Figure 2.10: 25 Action Units of the extended Facial Action Coding System introduced in the work of Cosker et al. in 2010 [CKH10]. Every Action Unit is a muscle that is used for facial expression. Some of the expressions shown can barely be differentiated from each other by the human eye, which makes this scheme unusable for the proof of concept.

Even though the FACS contains multiple Action Units that can be combined, and offers a huge number of different expressions with a high level of detail, in this thesis a simplified version (Figure 2.11) was used in both the facial analysis (emotion recognition) and the animation.

Figure 2.11: Six basic expressions described by Ekman and Friesen in 1971 [FACS Ekman]. These six basic expressions were used in the system designed within the scope of this thesis.
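For the classification discussed in the next section, a fixed label set is needed. The sketch below assumes the six basic categories of Ekman and Friesen plus a neutral class used for calibration; the exact naming and numbering are illustrative assumptions, not taken from the thesis.

```cpp
// Emotion labels assumed for the classifier: the six basic expressions of
// Ekman and Friesen plus a neutral class. The numbering is illustrative.
enum Emotion {
    EMOTION_NEUTRAL = 0,
    EMOTION_HAPPINESS,
    EMOTION_SADNESS,
    EMOTION_ANGER,
    EMOTION_FEAR,
    EMOTION_DISGUST,
    EMOTION_SURPRISE
};
```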
2.1.7 Classification Methods

In the previous sections we discussed ways of describing the face and defining emotions. That was the preparation for the present section, in which methods for classifying the emotions are described. Since the facial image as pure visual information has already been pre-processed and a shape of approximately 60 points has been extracted, the emotion can now be classified directly from the shape. The small number of points reduces the computational cost of the classification process. The simplicity of the shape model helps the classifier to work fast and to be sufficiently robust.

Since the classifiers need to be trained in order to be able to classify the input, both the training and the evaluation phase are discussed in this section. For the training process, the above-mentioned face databases were used. Before choosing the right classifier for the prototype, both the training and the evaluation processes of the particular classifiers were tested and observed. In the training process, the size of the trained model is the most important aspect to observe and to compare, since the model needs to be loaded each time the classifier is initialized. In our proof of concept, this happens each time the application starts. The bigger the trained model is, the longer it takes to load in the evaluation phase. The computational costs of the training phase are negligible, since the time and complexity of the training algorithm do not affect the evaluation algorithm, and training has to be executed only while building the system. It clearly needs to be executed several times while testing, but the time spent on developing and testing was not considered a drawback.

Conversely, the computational costs are assumed to be one of the most important aspects of the evaluation algorithm, which is executed for each frame, making speed essential for an application running in real-time. In addition to the speed of the algorithm, a low failure rate is also considered to be a main aspect. These two factors need to meet our requirements, since a classifier is useless if it either runs at heavy computational costs or returns results with a high failure rate.

There are several machine learning methods already implemented in the OpenCV library, making it easy to test and compare different classifiers within one library. According to the OpenCV documentation (http://docs.opencv.org/modules/ml/doc/ml.html, 21.10.2015), the following classifiers are available in the Machine Learning Library (MLL): Statistical Models, Normal Bayes Classifier, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Boosting, Gradient Boosted Trees, Random Trees, Extremely Randomized Trees, Expectation Maximization, Neural Networks and MLData. Although the Support Vector Machines classifier was originally designed to work with and predict two classes only, progress in the implementation has resulted in a multi-class classifier being available in the OpenCV library. The OpenCV implementation is based on the C/C++ LibSVM library developed by Chang and Lin [CL11].

Since we are classifying a shape model consisting of 66 points containing X and Y coordinates only, the prediction itself is simple. This is why we looked for the simplest and fastest classifier with the smallest trained model, rather than a large and complicated classifier that would be excessive for the small quantity of data already prepared in the previous steps. For these reasons the Support Vector Machines classifier was chosen.
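A minimal sketch of how such a classifier can be trained and queried is shown below, using the cv::ml::SVM interface of OpenCV 3.x (the older CvSVM interface is analogous). The feature rows are the flattened landmark vectors described earlier and the labels are integer emotion ids, one per training face; the kernel choice, termination criteria and model file name are illustrative, not the parameters actually used in the prototype.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>

// Trains a multi-class SVM on flattened landmark rows and stores the model
// so that it can be reloaded once at every program start.
cv::Ptr<cv::ml::SVM> trainEmotionSvm(const cv::Mat& samples /* CV_32F, one row per face */,
                                     const cv::Mat& labels  /* CV_32S, one emotion id per row */)
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);     // multi-class classification
    svm->setKernel(cv::ml::SVM::LINEAR);  // linear kernel keeps the model small and fast
    svm->setTermCriteria(cv::TermCriteria(cv::TermCriteria::MAX_ITER, 1000, 1e-6));
    svm->train(samples, cv::ml::ROW_SAMPLE, labels);
    svm->save("emotion_svm.yml");
    return svm;
}

// Predicts the emotion id for a single normalized landmark row.
int classifyEmotion(const cv::Ptr<cv::ml::SVM>& svm, const cv::Mat& featureRow)
{
    return static_cast<int>(svm->predict(featureRow));
}
```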
Face Segmentation

As a subtopic of the classification section, face segmentation describes how to apply the classifier to the face shape. With the classifier implementation ready for use, a proper way of training and classifying objects needs to be found. One of the decisions that needs to be made is whether to train and classify the face shape as a whole, or to segment the face into face areas first and classify them separately. In a real scenario, humans recognize emotion from the overall facial expression. This is why the classification of the face shape as a whole is the obvious first approach. But is such classification also possible with a machine-learning approach? Classifying the overall shape should be straightforward, since not only the face features matter, but also the relative positions of and the angles between them.

It must also be considered that the ASM/AAM approach does not work 100% of the time. Sometimes, if the shape is not matched correctly at a particular place, some segments can corrupt the classification of the emotion: for example, if the mouth indicates happiness while the eyebrows indicate sadness. If working with a non-segmented face shape, such small inconsistencies should not corrupt the overall facial analysis. That makes the choice of classifying the whole face as one object more appropriate than segmenting it into smaller pieces.

Some face features, for example the mouth, can often reveal the emotion alone: the happy and the sad emotions can be recognized from the mouth only, with the other features hidden. For the recognition of an angry emotion, however, the mouth alone is not enough. The mouth shape of an angry emotion is not specific enough; the same shape can be observed in other emotions. In this case the shape of the eyebrows is also needed to reveal the emotion. This is why, if a segmented face is used for the recognition, a system for describing and understanding the relations between the shapes and positions of the face features is needed. This would make the process of classifying emotion more complicated, less robust and slower, and would not be the ideal solution. For this reason, too, the face will be classified as a whole. The reliability of the result was observed and is described in the following sections.

Cultural differences should also be considered when thinking about classifying facial expressions. From a European perspective, the mouth is the most significant feature when it comes to recognizing a facial expression, while in Asian cultures the eyes have greater significance. However, when classifying the face as a whole, the overall expression should, despite this difference, result in the same emotion in both cultures. When classifying segmented faces, a priority should be assigned to each feature in order to reflect the human way of understanding facial expressions.
There are several animation methods listed below, but the main discussion concerns the differences between two- and three-dimensional animation, since these technologies differ substantially. The advantages and disadvantages of each method are listed below.

2.2.1 2D Animation

The easiest way to animate emotion in this thesis was to directly visualize the debug information of either the fitted Active Shape Model or the fitted Active Appearance Model. Since this model contains enough information to recognize emotion, the fitted points can be drawn as a visualization of the face and the emotion should still be recognizable by the human eye. However, this is not a real animation of a human expression, since the level of detail is as low as possible. Nevertheless, this approach was used for debugging and testing because of its simplicity and the low quantity of data involved, which makes it perfect for such usage.

One method for animating the data adequately could be to use Scalable Vector Graphics (SVG). The visual information here is saved as a set of vectors instead of a mesh of pixels. Thus, with the shape of the face recognized from the facial analysis, the vectors that build the SVG image can easily be adapted by translation or rotation in order to simulate the mapped shape. In other image formats, for example the BMP format, the visual data is saved as a set of pixels carrying visual information for each pixel; any deformation there is too complex, making such formats unusable for the task.

Figure 2.12: A picture of Sponge Bob in the SVG format. On the right-hand side is the deformed face. The deformation was not applied to the graphics directly but to the pivot points, which were moved and the angles adjusted. The result of this change applied to the SVG file is a deformation simulating an angry expression.

Flash is an alternative for animating deformable objects, with ActionScript scripting the animation. However, in comparison with SVG, Flash is too complicated to use. We would need a special development environment to support external applications affecting objects inside the environment, while the SVG file format is open and can be adapted as a plain text file.

2.2.2 3D Animation

Animations in 3D are much more complex than those in 2D. They contain more details, can be seen from multiple viewpoints, the lighting can be adjusted and set, and much more. All these details create more realism, but this also has a disadvantage in the form of computational costs. A 3D animation also has different requirements, different approaches and different environments to work with.

To be able to implement such a system, a working 3D editor is needed. This editor would need to be capable of loading rigged 3D models and of offering an API with which the 3D model can be accessed and manipulated from outside. In order to animate the expression precisely and in an optimized way, a rigged model is used. With this there is no need to deform the mesh directly; all that is required is translation and rotation of the joints to deform the face in the desired way.

The Unity Editor meets all the requirements of a suitable 3D editor. It can load rigged models in different file formats, such as .blend or .max. However, the rig information is not always imported correctly. This can happen with both file formats, so we can assume that the problem is on the Unity side. However, importing rigged .blend models tends to fail less often.
The import functionality is essential here, since the Unity Editor does not offer an environment for advanced 3D modeling. Therefore the only way is to create the avatar with an external modeling tool such as Blender or 3DS MAX and export it for import into the Unity Editor. For modeling in 3DS MAX there is a special plug-in called Morph-O-Matic, made especially for 3D rigs and animations of faces. However, the library is not free, and since 3DS MAX does not offer the functionality required for our prototype listed above, the model would still need to be exported into the Unity Editor. Unfortunately, there is a loss of quality while exporting and importing, defeating the purpose of using this library for the task.

A further consideration when choosing an environment is the capability of the API offered by the editor. Since the proof of concept for analyzing facial expressions was written in C++, an interface for cooperation between the editor and C++ applications is essential for the prototype. The amount of shared data is not huge: only around 60 coordinates as well as emotion flags need to be transferred. However, since this needs to occur in each frame, the interface has to operate at sufficient speed.

One important aspect when animating facial emotions is the neutral face expression. This is the user's facial expression without any emotion shown, with all the muscles in their relaxed position. This expression is important, especially in the animation process. Every human has different facial proportions that need to be fitted to a general model that is the same for every face. For this reason the animation model needs to be calibrated to the actual user's facial proportions. This has to be done with the user's emotionally blank facial expression so that no emotion is traced into the calibration. Take the distance between the eyes and the eyebrows as an example. Since the animation works with the shapes of the features and the distances between them, these need to be calibrated. If user A has a one-centimeter distance between the eyes and the eyebrows and user B has a two-centimeter distance, then the emotion animated on the same model would appear as if user B shows some emotion even in his or her neutral expression. This is why these distances need to be normalized for the animation model. Only then is it clear whether an eyebrow is really raised or not. The same needs to be done with, for example, the mouth, where the calibration of the mouth's width is essential for differentiating between the neutral expression and a smile.

In this thesis, animation with a high level of detail was anticipated. This is why the 3D animation of an avatar was chosen. The avatar is a 3D model of a human head whose face is deformed over time in order to simulate the emotion. The deformation is based on the positions of the face features obtained from the ASM/AAM model, normalized according to the neutral expression and applied to the head model. This approach results in the most realistic animation of all the methods listed in this section.

2.3 Related Work

Before commencing this thesis, related work was researched in order to gain an overview of existing approaches. Experiments and attempts were observed in various publications and the experiences from their successes and failures were gathered.
The publications did not have to be associated with our topic directly, but they did need to use at least some of the methods that, with some modifications, could be adapted to our needs and used for our purposes. The parts of the publications listed in this section served as inspiration for building the basic functionality of the prototype. Earlier publications show different approaches to dealing with topics similar to our own.

The work published by Michel P. and El Kaliouby R. [MEK03] (2003) deals with the classification of six basic expressions. 22 facial features were located and classified with Support Vector Machines. Unfortunately, they only attained a total accuracy of 60.7% when training and testing with six users.

Another approach was used by Azcarate A. et al. in their work [AHvdSV05] in 2005. A face detector based on the Haar classifier and extended with piecewise Bezier volume deformation was used for detecting faces and locating face features. In the second stage, Naive Bayes and Tree-Augmented Naive Bayes classifiers were used for emotion classification. The Cohn-Kanade face database containing seven expressions (six basic expressions and one neutral expression) was used as a resource for training and testing the algorithm. In this work approximately 65% accuracy was attained in tests in which the test set contained samples of different people than the training set.

In 2007, Datcu D. et al. published an interesting article [DR07] using Active Appearance Models as a basis for emotional expression analysis. This approach can extract important face features that can be analyzed further. The work used Support Vector Machines for classifying expressions in still pictures and video sequences. In the case of video sequences the attained accuracy was 79.62-88.67%.

The publication of Ari I. et al. [AUA08] (2008) aimed at using face feature analysis in sign language, showing yet another method for a face interest point system. Face features were located with an Active Shape Model tracker using 116 points, and classified with Support Vector Machines.

In addition to the frequently used Support Vector Machines, there are also other methods for classifying face features fitted to an Active Shape Model. The article by Milborrow S. and Nicolls F. [MN14b] proposed, in 2014, a solution with SIFT descriptors matched by Multivariate Adaptive Regression Splines (MARS). However, this work focuses on computational efficiency rather than the failure rate.

Nedkov S. and Dimov D. used in their work [ND13] Action Units from the Facial Action Coding System introduced by Ekman P. and Friesen W. V. [Ekm78]. The facial dynamics of emotional expressions were analyzed and classified with a Linear Discriminant Analysis. Their experiments showed 75% precision of the classification results.

A different method was introduced in the work of Sun X. et al. in 2009 [SRDW09], where the visual muscle activity was analyzed by vector flows. The muscle activity was then classified according to the six basic facial emotions proposed by Ekman P. [Ekm78]. For the classification process, a Bayesian network was trained on the Cohn-Kanade face database. For locating face features Active Appearance Models were used, and the classification of Action Units reached an overall rate of 90%.

Figure 2.13: 3-layered Bayesian Model used in the classification method in the publication [SRDW09].
The publications listed above offer a comprehensive overview of possible approaches for dealing with the problem of the emotional analysis of human facial expressions. Each method has its benefits but also its drawbacks, which is why no perfect solution has been found. Nevertheless, these articles helped to build the basis of this thesis, since the approaches and experiments described in them influenced the choice of methods to use.

CHAPTER 3 Project Description

This chapter describes the functional part of the project. In addition to the goals and the anticipated challenges, the structure and the implementation process are also described here.

3.1 Goals

The main goal of this project is to implement a robust system capable of recognizing emotion in the human face. The captured image of the user's face is analyzed, the facial expression is animated on an avatar and a textual description of the recognized emotion is visualized. The following goals were set:

1. Robustness. The prototype needs to be robust and resilient against poor input and conditions. If the minimal lighting conditions are not met, then no face is recognized and there is no output. The user's input possibilities are limited to facial expressions and head positions only, which is why there is no possibility of inputting wrong data and crashing the system. However, as mentioned in previous chapters, the face will be recognized properly only if it is facing the camera directly. Otherwise there is a high probability of mismatching the face.

2. Fully Working Prototype. The goal is to implement a fully working prototype that proves the concept by showing a working solution to the problem. The whole product was developed in a way that allows any other developer to easily adapt the code to his or her needs. In this way the prototype can be enhanced, or it can serve as a testing ground for the methods used in the implementation.

3. Performance. Another goal was to reach optimal performance, making the prototype viable on commonly used laptops and computers without high computational power, and using only common web-cam devices.

4. Automation. The system also needs to be easy to use and fully automated, so that the user does not have to set up any specialized settings or run parameters in order to make the product work properly.

3.2 Challenges

The project faces several challenges:

1. Face Interest Point System. The first complex problem is the system of face interest points. This needs to be defined well so that the emotion can be optimally read from the system of points, and an algorithm that fits the points to the facial images is also required. The optimal count and location of points is also important, enabling the system to run in real time while matching the face with sufficient precision. The shapes of human faces differ extensively, so the system must be able to fit a wide range of faces properly. During facial expressions the shape of the face deforms greatly, so the algorithm needs to be able to match the face in various deformations.

2. Expression Interpretation. The second challenge is the problem of recognizing emotion from facial expressions captured by the web-cam. Besides the technical complexity, the diversity in how people understand facial expressions is also a significant aspect. There are many facial expressions whose meaning can be understood differently by different people, because emotion is not always self-evident and is contingent upon personal experience.
People from different cultures especially, or people of different ages, understand certain facial expressions differently from one another. If there were emotions present in the face that could not be unambiguously recognized by other humans, then it would be impossible to build a system of interpretation for all emotions that works perfectly all the time.

3. Libraries. Since one of the main non-functional requirements for the prototype was that it should be able to run on a personal computer without any special hardware or computational power, we had to build an optimized prototype that is precise enough on the one hand but operates fast enough on the other. The problem with existing libraries is that they are either very precise but not able to run in real time on a common computer, or they are fast with low computational costs but not robust enough and therefore unreliable.

We had to overcome all the challenges listed above in order to create a valuable prototype.

3.3 System Design

3.3.1 Overview

Before the implementation started, we searched for robust, well-working libraries that could be used in the prototype. Multiple implementations of different methods were found and tested; the results are described in Section 4. From these tests we selected the most suitable libraries and used them in the implementation. The following diagram (Figure 3.1) shows how the particular libraries and modules work together and how the input data flows through them.

The user's face is first captured by the web-cam, which is controlled by the OpenCV library. The captured image is then sent to the external ASM/AAM library, where the ASM/AAM model is fitted to the user's face. In our prototype we used the FaceTracker ASM library. The fitted model is required in both the emotion classification and the rendering part; therefore it is sent to both of them. For memory-saving purposes we do not create a copy of the model; rather, the model is first used in the classification code and then reused for avatar rendering. These parts do not influence each other, and they are therefore split in the diagram.

Figure 3.1: The logical parts of the project divided into stages, showing which libraries and tools were used in each stage.

In order to enable the application to analyze the input data, we need to train and prepare two models: one for fitting the points of interest to the image and one for the emotion analysis. The training code we used for building both of these models was provided by the libraries used in the implementation. There was no need to build our own training algorithms, as some libraries are even equipped with fully working pre-trained models that can be used directly. Nevertheless, a face database is needed for the training phase, as already mentioned. In this thesis we tested several of them when implementing the prototype; details can be found in Section 4. As a result of the training phase, an ASM/AAM model for fitting face features and an SVM model for classifying emotions were generated. These models are loaded at runtime and used in a loop over time, which processes the input data.

We were constantly testing the system while implementing or trying out new approaches in order to see rapid results. We were able to create automated tests for the emotion classification only, preparing images of captured emotional expressions together with a label of the emotion so that the results could be compared algorithmically.
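A minimal sketch of how such an automated test run could be organized is given below. The structure and function names are hypothetical, and the classify callback stands in for the complete pipeline (ASM/AAM fitting followed by SVM prediction); it is meant to show the idea of comparing predicted and expected labels, not the exact test code of the prototype.

    #include <opencv2/core/core.hpp>
    #include <opencv2/highgui/highgui.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    // One labelled test case: an image of a facial expression and the emotion the subject intended to show.
    struct LabelledImage {
        std::string path;
        int         expectedEmotion;   // same integer codes as used for training the classifier
    };

    // Runs every labelled image through the supplied pipeline and reports the fraction classified correctly.
    double runClassificationTests(const std::vector<LabelledImage>& cases,
                                  int (*classify)(const cv::Mat& image))
    {
        int correct = 0;
        for (size_t i = 0; i < cases.size(); ++i) {
            cv::Mat image = cv::imread(cases[i].path);
            if (image.empty())
                continue;                                   // skip unreadable files instead of aborting
            if (classify(image) == cases[i].expectedEmotion)
                ++correct;
        }
        double accuracy = cases.empty() ? 0.0 : static_cast<double>(correct) / cases.size();
        std::cout << correct << " of " << cases.size() << " test images classified correctly" << std::endl;
        return accuracy;
    }

Because the expected label is stored next to each image, such a test run can be repeated after every change to the training data or the classifier parameters.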
Testing the animation is more complicated, since it strongly depends on the user's perception, which cannot be tested automatically. This is why we tested it by observing users using the prototype. We tested it while implementing the prototype in order to refine it, and when the project was finished we also performed tests of the final product with a small group of people. The final test results are described in Section 5. The following sections describe the input and the output of the prototype.

3.3.2 Input

As soon as the project was prepared, it could begin to read and analyze images of human faces. The user's face is captured by the web-cam and the visual information is processed further. The fitting algorithm of an ASM/AAM library is applied, fitting the face features trained in the model to the captured image. The facial features define the shape of the face, so that once they are localized the shape analysis can begin. The trained SVM model is loaded by the SVM classifier implemented in OpenCV, which is based on libSVM. The implementation works robustly, but there are still some minimal requirements to be met. The lighting conditions need to be sufficient to recognize the face and its features in the captured image. The user must sit in front of the web-cam in such a way that his or her whole face is captured. If the face is only partially present or captured only from a side view, the emotion cannot be recognized reliably, since in such cases both the fitting algorithm and the emotion classifier become stuck at their minimum.

3.3.3 Output

Having processed the input data and obtained the results, the prototype needs to present the data to the user. The data collected from the input analysis is visualized on a 3D avatar and a textual description of the classified facial expression is displayed. The 3D avatar represents the facial expression of the user and is adjusted so that the avatar's face features correspond to the user's, creating a mirror effect of the user's facial expressions. This visualization is useful only when the program is running in real time. However, the textual description of the classified facial expression displayed alongside the face does not need to be updated in every frame in order to work and to look realistic, hence some latency here is acceptable.

CHAPTER 4 Implementation

4.1 Overview

This chapter describes the overall implementation of the prototype, including the process of testing and comparing technologies and libraries before the implementation could commence. Problems and their solutions are also dealt with in this chapter. A graphical structure of the project can be seen in the following flowchart (Figure 4.1), and a detailed technical view of the communication between the individual parts can be found at the end of the chapter.

4.2 Face analysis

Deep facial analysis is the main problem of this thesis. The first challenge was to define and localize those face features important for emotion recognition. There are various libraries that can describe the shape of a deformable object and that are therefore also suitable for the task of describing the human face. An efficient tool for describing the shape of the human face and localizing its features is essential for facial expression analysis. Finding the optimal library for this task is also challenging, since published libraries are mainly either fast but imprecise, or precise but too slow for real-time usage.

Figure 4.1: Flowchart showing how the data is processed in the prototype.
The starting point is the Unity box on the left side. It requests data from the C++ program containing the FaceTracker library through an Export API (application programming interface) call. The FaceTracker library loads the pre-trained ASM model and fits it to the image captured by the web-cam. The C++ program hosting FaceTracker then sends the image on to the emotion classifier, which classifies the emotion of the face in the image and returns the label of the recognized emotion. Both the fitted ASM model and the emotion label are then sent back to Unity, where the data is visualized.

However, there are many published libraries that use different algorithms, different face point systems and different face databases. Since we require a fast and precise library, we selected the best methods available and tested them. The results of these tests are analyzed in this chapter. There are two main aspects to be considered concerning facial analysis.

• The first is the shape of the face features, since, for example, the shapes of the mouth or the eyelids are important in emotion recognition. In this thesis shapes are trained and classified by Support Vector Machines. We first trained an SVM model of facial feature shapes that could be used for classifying new shapes by comparing them to the pre-trained shapes of emotions.

• The second aspect is the location of the face features, because not only the shape of a feature is important: the angle and distance to other features also have their significance. Here we can take the eyebrows as an example. Eyebrows do not substantially change their form; usually it is the angle and position that are significant. Anger, surprise and fear depend heavily upon eyebrow positions and angles. Raised or lowered eyebrows also have their particular meanings. However, since the user can freely move the face and thus scale or rotate it, the location of the eyebrows cannot be used directly. The face needs either to be normalized and centered or some pivots need to be used. We have chosen to implement a solution using pivots. For measuring raised or lowered eyebrows the absolute position cannot be used, so the position relative to a pivot is needed. Ideally, the pivot is a static, non-deformable feature. We have chosen the nose in the implementation, since it is located in the centre of the face and only small deformations and movements are possible. When measuring the relative positions of the features on a freely movable face, scaling should also be considered. Assuming that the user is looking directly at the camera, he or she can still move forwards and backwards. When moving towards the camera, the overall face becomes larger and so do the relative distances. This is why we took multiple pivots and recalculated the relative positions of the facial features according to the distances between them.

A Support Vector Machine was trained and used for classifying the localized face features. The problem here was again the user's movement, where the face is rotated and scaled, which changes the positions of the face features (see Figure 4.2). The classifier cannot handle such position changes itself, so a solution for this problem had to be found. We took three pivot face features that are not deformable and do not change their positions. As mentioned above, the first was the nose, a static feature in the centre of the face. The left and right cheekbones were chosen as the second and third references; a sketch of how these pivots can be used for normalization is given below.
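The sketch shows one possible form of this pivot-based normalization: the shape is translated so that the nose becomes the origin, rotated so that the cheekbone line is horizontal, and scaled by the distance between the cheekbones. The helper name and the point indices are hypothetical (they depend on the numbering of the 66-point model), so this is an illustration of the idea rather than the exact code used in the prototype.

    #include <opencv2/core/core.hpp>
    #include <cmath>
    #include <vector>

    // Removes translation, rotation and scale from a fitted shape using three pivot features,
    // so that only the deformation caused by the facial expression remains for the classifier.
    std::vector<cv::Point2f> normalizeShape(const std::vector<cv::Point2f>& shape,
                                            int noseIdx, int leftCheekIdx, int rightCheekIdx)
    {
        const cv::Point2f nose  = shape[noseIdx];
        const cv::Point2f left  = shape[leftCheekIdx];
        const cv::Point2f right = shape[rightCheekIdx];

        // Scale reference: the distance between the cheekbones is roughly constant for a given face.
        const float dx = right.x - left.x, dy = right.y - left.y;
        const float scale = std::sqrt(dx * dx + dy * dy);
        if (scale <= 0.0f)
            return shape;                       // degenerate fit, leave the shape untouched

        // Rotation reference: the angle of the cheekbone line against the horizontal axis.
        const float angle = std::atan2(dy, dx);
        const float c = std::cos(-angle), s = std::sin(-angle);

        std::vector<cv::Point2f> normalized(shape.size());
        for (size_t i = 0; i < shape.size(); ++i) {
            const cv::Point2f p = shape[i] - nose;                   // translate: nose becomes the origin
            normalized[i] = cv::Point2f((c * p.x - s * p.y) / scale, // rotate back, scale to unit cheek distance
                                        (s * p.x + c * p.y) / scale);
        }
        return normalized;
    }

In the same spirit, the normalized coordinates of the neutral expression can be stored once during calibration and subtracted later, so that only the deviation from the neutral face enters the classification.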
According to the relative positions of these references, the rotation and scaling of the face can be calculated, which enables the face to be rotated and scaled back to the initial setup. To make the classification process even more robust, we ignored the contour features, which are unimportant for emotion classification.

Figure 4.2: This figure shows the repositioning of face features when capturing the user's face with the same expression from different angles.

As a result, only the eyebrows, eyes, nose and mouth enter the classification. The chin and the cheeks were omitted in the classification phase since they do not contain information important for emotion recognition. They nevertheless still need to be localized, since they are important points for the normalization of the model. With this approach we attained an optimized and robust way of classifying emotion.

4.2.1 Static vs. dynamic analysis

The method of processing the input images is an aspect that should also be considered. Input images can either be processed independently, one after the other, or a certain relation between separate inputs in the stream captured by the web-cam could be investigated. As already mentioned, emotion can be recognized either statically or dynamically. The study [SSP+12] examined whether the duration of smiling is also significant. However, we analyzed static input when recognizing emotions in our prototype. Measuring the duration of a happy expression was beyond the scope of this thesis, although it would be easy to extend the prototype with this functionality in future work if somebody should wish to research this topic. Nevertheless, the deeper analysis of emotion remains more pertinent to psychologists than to ourselves.

Some libraries tested in the scope of this thesis also use facial tracking over time as a performance enhancer. This is a clever approach that results in improved performance while not losing precision. FaceTracker, the library used in the final version of the prototype, works in this way. The library is unusably slow when it loses its tracking and starts searching for the face. As soon as it locates a face, the performance increases rapidly. This is one of the reasons why we chose this library.

4.2.2 ASM / AAM Libraries

For implementing the prototype, we chose to use a library that implements ASM/AAM instead of writing our own implementation. There are many different ASM and AAM implementations that could be used for the prototype. Before we began work on the implementation we tested the existing libraries to see if they were usable for our purpose, and compared them so that we could choose the best one. The main properties that we tested are performance, precision, the models offered by the library, the complexity of integration into our project, and the complexity of the training process. An open-source solution was advantageous, since the code of an open-source solution can be directly adapted to our needs. We tested the following ASM and AAM libraries:

DeMoLib

DeMoLib (http://staff.estem-uc.edu.au/roland/research/demolib-home/, accessed 21.10.2015) is a complex implementation of Active Shape and Appearance Models created as the PhD project of Dr. Jason Saragih and supervised by Dr. Roland Goecke. It was later extended by other students of the University of Canberra. It is a cross-platform library written in C++, the sources of which are available for non-profit academic purposes only.
The AAM implementation is based on the work of [CET98] and offers various fitting algorithms, such as Di-linear, Project-out inverse compositional, Simultaneous inverse compositional, Robust inverse compositional, 2D+3D inverse compositional, Original fixed Jacobian, Linear iterative discriminative and Nonlinear iterative discriminative. The library also implements Active Shape Models. However, we did not test these, since the required connectivity file could be found neither in the library nor in any of the face databases. There is no description of how the file should be structured, so we were unable to create it.

Figure 4.3: This figure shows the results of testing the Di-linear AAM fitting.

The other fitting algorithms implemented in this library were tested on both the Windows and OSX operating systems, attaining similar results. The implemented fitting algorithm is fast, but the testing showed that the library is not stable and does not work precisely enough (see Figure 4.3). Therefore it was not used in the prototype.

Vision Open Statistical Models

The Vision Open Statistical Models (VOSM) library (http://www.visionopen.com/downloads/open-source-software/vosm/, accessed 21.10.2015) is another cross-platform, open-source implementation of Active Shape and Appearance Models. VOSM offers 1D profile ASM, 2D profile ASM, Direct Local Texture Constrained (LTC) ASM, Basic AAM, Inverse compositional image alignment AAM (ICIA) and Inverse additive image alignment AAM (IAIA). The library also comes with settings and run parameters prepared for training models on the BioID, IMM, FRANCK, AGING (http://agingmind.utdallas.edu/facedb, accessed 21.10.2015), XM2VTS (http://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb/, accessed 21.10.2015), EMOUNT (http://www.emount.com.cn/) and JIAPei (http://visionopen.com/cv/databases/JIAPEI/, accessed 21.10.2015) face databases. Even our own face database could be used with this library, which makes wide training options possible. However, the problem with this library is its performance. The implementation runs at high computational costs, making it unable to run in real time on a standard personal computer. This makes it unusable for our prototype.

Face Tracker

FaceTracker (http://jsaragih.org/, accessed 21.10.2015) is a library that implements Active Shape Models and uses them for human face tracking. It was written by Dr. Jason Saragih, the author of the above-mentioned DeMoLib library. In comparison to DeMoLib, FaceTracker is much smaller, since it offers only one implementation of ASM. The library is optimized, runs well in real time and contains precise pre-trained models that are about 1MB in size, which makes it a perfect candidate for our prototype. The testing showed that the library is stable and provides precise results (see Figure 4.4).

Figure 4.4: This screen-shot shows the exactness of the fitting algorithm implemented in the FaceTracker library. All features were located correctly.

As an optimization, the library continuously tracks the user's face. As soon as the face is lost, the library attempts to re-initialize the tracking algorithm. We recognized this re-initialization in the prototype and used it to re-initialize the animation part as well. The only drawback of this library is the missing training part. It was also implemented by Dr. Jason Saragih, but the source code is no longer available since it is the property of Carnegie Mellon University (http://www.cmu.edu/index.shtml, accessed 21.10.2015).
Because of this there is no possibility to train our own models. However, the pre-trained model works as desired, so we tolerate this drawback. There is just one small problem with fitting a face whose mouth is wide open, but since this is an unnatural expression it is not a serious problem.

AAM / ASM Library

Yao Wei published in his Github repository (https://github.com/greatyao, accessed 21.10.2015) his implementations of Active Shape and Appearance Models, providing source code for both training and evaluation. It was interesting to observe these libraries working. The precision of the fitting algorithm is not bad, but still not accurate enough. There is also one serious problem: when working with dynamic images the fitted model shakes from frame to frame, even when the face does not move or change. With each fitting loop the model is slightly different, and this causes a shaking effect (see Figure 4.5). This makes the library unusable for the avatar-rendering part, since it would damage the whole animation.

STASM

STASM (http://www.milbo.users.sonic.net/stasm/, accessed 21.10.2015) is a C++ library based on the Active Shape Model introduced by Tim Cootes. The library was developed and introduced by S. Milborrow and F. Nicolls [MN14a]. However, this library was designed for use with static images only. We adjusted the library to work with dynamic input from a web-cam, but the fitting algorithm is too slow for use in a real-time application.

Figure 4.5: Two frames with the same face pose but with different results of the fitting process. Such differences happen with each loop, causing strong shaking of the model, making the library unusable for our prototype.

Figure 4.6: STASM library testing. It is impossible to use this library in a real-time application running on common hardware. The precision of the fitting algorithm implemented in this library is also too low. Notice the inaccuracy of the located features.

Others

We also wanted to test the Candide (http://www.icg.isy.liu.se/candide/, accessed 21.10.2015) and AAM-API (http://www.imm.dtu.dk/~aam/aamapi/, accessed 21.10.2015) libraries, but the outdated code and lack of good documentation made the compilation of these libraries impractical.

4.3 Face rendering

As already mentioned, a 3D avatar is animated as a visualization of the output. For this purpose we used the Unity3D Engine with C# scripts handling the avatar animation. The Unity3D Editor provides options for importing Blender's .blend file format (http://www.blender.org, accessed 21.10.2015). There are several forums offering many freely downloadable 3D models in the .blend format, some of which are already rigged. The advantages of using ready-rigged 3D models were described in Section 2.2.2. However, the import of Blender files into the Unity3D Editor does not always handle the rigs well, causing many models to fail to import correctly. We finally located two models that could be imported correctly and we used them both while implementing the prototype.

Figure 4.7: The Holmen advanced head rig (http://www.blendswap.com/blends/view/48717, accessed 21.10.2015) is a 3D rigged model downloaded in the .blend file format. It is a highly detailed model with a spectacular texture. However, the texture cannot be seen in this screen-shot since the Unity3D Editor was not able to import it.

As shown in Figures 4.7 and 4.8, the Blender models lose quality when imported into the Unity3D Editor (see Figure 4.9).
Figure 4.8: The quality of this Holmen Face Rig model (http://www.blendswap.com/blends/view/2389, accessed 21.10.2015) is the same as that of the model shown in Figure 4.7, but the superior quality of the loaded material makes the model look more realistic.

Both models are equipped with facial muscle joints, making them suitable for our prototype. Since the joints of different rigs are positioned differently to match their particular model, our script settings also have to be adjusted when changing the 3D model. The Unity3D Editor offers a user interface for setting the parameters of the C# scripts used in the project. This makes adjusting the script variables to a new 3D rigged model easier, with no need to change the source code.

When working with a purely non-rigged 3D model, animating facial expressions becomes too difficult. In this case the mesh needs to be deformed manually, which means synchronizing and transforming multiple vertices into the desired deformation. Since this approach would be too complex, we worked with rigged models (see Figure 4.10). We found that a suitable approach was to use the rig for deforming the mesh in order to deform the face and simulate a facial expression.

A system for mapping the coordinates of face features onto the 3D model had to be implemented. The input from ASM/AAM is a list of coordinates of the fitted model, which have to be converted before being applied to the 3D model rig. The conversion was implemented in a C# script in Unity3D, which first maps features to joints and then recalculates the translations that need to be applied to the particular joints. Since the ASM/AAM model has different proportions than the 3D model, the ASM/AAM feature locations cannot be applied directly but have to be recalculated into joint translations, causing feature shifts on the 3D model. We calculate the ratio between the ASM/AAM model and the 3D model proportions at the initial stage, while capturing the neutral expression. We first calculate the shifts between the current and the neutral feature locations in the ASM/AAM model, which are then applied to the 3D model joints in their neutral positions. If the user's face is lost, the neutral expressions of both the ASM/AAM library and the 3D model are re-initialized to avoid any mismatch between them.

Figure 4.9: A screenshot of the 3D models shown in Figures 4.7 and 4.8, opened in Blender before being imported into the Unity3D Editor. Notice the texture difference. However, we wanted to use the model in a real-time application made in Unity3D, and rendering such a level of detail would be too complicated on commonly used hardware.

Sometimes, when the captured face moves too fast, FaceTracker returns incorrect data, causing a drift of the avatar's joints. In such cases the system should be re-initialized in the same way as when the user's face is lost. However, recognizing this occurrence is much more complicated than recognizing the face loss itself, since FaceTracker is not aware that it is delivering incorrect results. This is an open problem of the implemented prototype.

Figure 4.10: The armature of the Holmen Advanced Rig shows how joints can be modeled in the Blender modeling tool.

4.4 Technical Details

In order to make these components work together, interfaces for the communication between the libraries and tools had to be implemented.
The most complicated part was the data exchange between the OpenCV program written in C++ and the Unity3D project with scripts written in C#. For this purpose we used an Export API in the C++ application, so that its methods can be used from other applications. With the Export API we are able to run the program as a plug-in in Unity3D. The Unity3D project needs the data from the SVM classifier and the features located by the ASM/AAM library, so that the recognized emotion can be shown and the features applied to the avatar. C# scripts written in MonoDevelop were used for retrieving data from the plug-in and applying it to the avatar. The data exchange between the C# Unity3D scripts and the C++ plug-in was done by marshalling. Since the OpenCV::Mat data structure is not implicitly supported, the data could not be transferred directly. The C# script sends two float arrays to the plug-in, into which the OpenCV matrix is partitioned. These filled arrays are then reassembled in the C# script (see Figure 4.11). In addition to the OpenCV::Mat representation of the fitted ASM/AAM model, the classified emotion also needs to be transferred. Here only a predefined label is transferred, which lowers the amount of data sent between the plug-in and the Unity3D project. The showCamera boolean parameter is a flag for turning debug options on and off.

Figure 4.11: Data flow chart showing the data flow between the Unity C# script and the C++ plug-in.

    EXPORT_API int getShape(
        float** pointsX, float** pointsY,   // marshalling of an OpenCV::Mat
        int* classified_label,              // classified emotion label
        bool showCamera)

The C# script can call the above method as follows:

    [DllImport("FACE_TRACKER")]
    private static extern int getShape(
        ref IntPtr pointsX, ref IntPtr pointsY,
        ref int classified_label,
        bool showCamera);

All of the libraries used were downloaded in the form of source code that first had to be built. For this we used the CMake tool (http://www.cmake.org/, accessed 21.10.2015). As a build target we used Microsoft Visual Studio 2010 or 2013 when working on Windows, and Xcode 6.x when working on OSX. These environments were used to build the dynamic library that is used as a plug-in in the Unity3D project. As mentioned above, the Unity3D project was developed in the Unity3D Editor and the scripts used in this project were written in the MonoDevelop editor, since these two editors cooperate well. One of the goals of this thesis was to write a prototype that does not need any special hardware to run fluently in real time. Common integrated or USB web-cams turned out to be satisfactory for capturing the user's face.

CHAPTER 5 Evaluation

5.1 Approach

After finishing the implementation phase, the evaluation began. We evaluated the implemented prototype in both a quantitative and a qualitative manner. We focused on two main goals in the evaluation phase.

• We wanted to evaluate the emotion classification part to find out how precisely it works and how it behaves with various faces.

• The second goal was to evaluate the animation part. We wanted to find out how accurately the animation matches users' expressions and whether it animates diverse faces well.

The evaluation phase was divided into two parts. We evaluated the emotion recognition part and the animation part separately, since they had to be processed in different ways.

• The results of the emotion classification were processed quantitatively. The individuals tested alternated their facial expressions, informing us whether the recognized emotion matched their intention.
Counting correctly classified facial expressions gave us an overview of how well the prototype works.

• Secondly, we wanted to know how people perceive the animation of their expressions on the given avatar. Here individuals were asked to rate the avatar animation with points from 1 to 5; the higher the score, the better the animation. We did not adjust the parameters of the avatar to individual faces, since we wanted to check the basic prototype without fine tuning.

We executed the evaluation of the prototype as described above on a sample of 10 individuals. We wanted the participants to be as diverse as possible, and therefore created three age groups, each with male and female members. The first group consisted of participants of around 20 years of age, the second of around 30-40 and the last of around 60. Three female and one male participant took part in the first group, one female and two males in the second group and two females and one male in the third. Their task was to keep switching between three different facial expressions: happy, sad and fearful, until they became familiar with the prototype. In addition to this, after the testing we discussed with the participants their overall impressions of the prototype. We wanted to know if they found it useful, smooth and precise enough, and whether they had any ideas for improvements. We also observed the precision of the ASM fitting algorithm in debug mode while the participants were testing the prototype, in order to see if the model matched properly. As soon as the model mismatched an individual's face, we stopped him or her and reinitialized the model by covering the web-cam for a second. The results are provided in Section 5.2.

5.2 Results

The results shown in Table 5.1 (graphically illustrated in Figure 5.1) show that emotion recognition works well: 24 of the 30 tested expressions were classified correctly. However, the emotion is not recognized correctly all of the time; Figure 5.3, for example, shows the incorrectly recognized happy expression of Participant H. Here, the happy expression was recognized as the neutral one. The participant did not look directly into the web-cam and was captured at an angle, which can influence the classification.

Expression                         HAPPY      SAD      FEAR
Participant A                      correct    correct  correct
Participant B                      incorrect  correct  incorrect
Participant C                      correct    correct  correct
Participant D (without glasses)    incorrect  correct  correct
Participant E                      correct    correct  correct
Participant F                      correct    correct  correct
Participant G                      incorrect  correct  incorrect
Participant H                      incorrect  correct  correct
Participant I (without glasses)    correct    correct  correct
Participant J                      correct    correct  correct

Table 5.1: The results of the facial expression categorization evaluation. Tests were marked as incorrect if the emotion was not recognized correctly from the participant's natural happy/sad/fearful expression and the participant had to make additional effort in order for the emotion to be classified properly.

Animation Rating   HAPPY  SAD  FEAR
Participant A      5      4    2
Participant B      4      5    3
Participant C      3      4    3
Participant D      5      5    4
Participant E      4      4.5  3.5
Participant F      4      5    4
Participant G      4      4    3
Participant H      5      5    4
Participant I      5      5    5
Participant J      5      5    4

Table 5.2: Avatar animation scores based on participants' perceptions. 1: bad - 5: excellent. Participant A gave a low rating to the fearful animation since she was not able to fit the expression properly at all (see Figure 5.5).

Figure 5.1: The results of the emotion classification.

Figure 5.2: Overall scores of the animation rating (see Table 5.2).
At the beginning, Participant E was unable to have his expressions classified correctly at all. It transpired that the problem was the distance between him and the web-cam, which apparently negatively influenced the classification. His expressions were classified correctly as soon as he moved closer to the camera. Another problem with the emotion classification was caused by an incorrectly matched ASM model. Participants D and I were wearing spectacles, and this interfered with FaceTracker, so they removed them. This ASM/AAM fitting drawback was described in Section 2.1.3. As also described in Section 2.1.3, the ASM/AAM fitting algorithm encounters problems when fitting a face with a beard. This occurred with Participant F, who was sporting a beard. Despite FaceTracker not matching the face well, the emotion classifier was able to classify the emotion correctly.

Figure 5.3: Participant H: The happy expression classified incorrectly as the neutral expression. The ASM debug points were matched correctly, but intentionally moved to the top left corner so that the real expression can be seen clearly.

Figure 5.4: Figure showing how FaceTracker fits the face of Participant F, who sports a beard.

Table 5.2 (graphically illustrated in Figure 5.2) shows the participants' ratings of the animation. The animation of the fearful expression received a lower rating, since the rig of the avatar's mouth is not flexible enough to simulate a wide-open mouth (see Figure 5.5).

Figure 5.5: Participant A: Inaccuracy in the animation of the avatar's fearful expression.

During the free discussion, participants complained about losing the face tracking and the need to re-initialize the model by covering the web-cam for a second. This is a known issue of FaceTracker that does not happen often and strongly depends on the lighting conditions. Many participants noted that the sensitivity of the emotion classifier was too low. They remained stuck in the sad classification until they unnaturally over-expressed a different emotion. This was caused by the small training sample with large differences between the emotions, and can easily be corrected by training the emotion classifier with a larger sample of facial expressions.

We noticed that participants tended to move and rotate their heads while working with the computer. Both rotating and moving the head affect the prototype in some way: head movements affect the avatar's animation, and rotation affects the emotion classification. We also observed that participants tended to move their heads noticeably towards or away from the web-cam when expressing fear. Since this is natural behavior in a real setting, it should be considered before the prototype can be used in practice.

Finally, all the participants said that it was fun to play with the prototype and that they had not tried anything like it before. They were doubtful about its usage in chat-like applications, while the idea of automatically analyzing the user's reaction when watching advertisements was considered a potentially useful replacement for questionnaires in cases where the user is aware of being observed.

CHAPTER 6 Conclusion

In this work we managed to build a fully working prototype application for recognizing and animating user facial emotions in real time. We compared state-of-the-art technologies and chose the most suitable ones. We first focused on the recognition of face features and considered possible ways of solving the problem.
Active Shape and Active Appearance Models were chosen for this task. We tested and compared published ASM/AAM implementations, of which FaceTracker was considered the most suitable ASM library. Secondly, with the face features recognized, we focused on expression analysis. We used Support Vector Machines for the facial expression classification and also created a training interface that was used for building the SVM model. The final step was to animate the facial expression on an avatar. Since this part was designed in the Unity Engine, we used the face feature recognition and expression classification part as a plug-in in the Unity project. Here, the face features were animated on an avatar, and a textual description of the classified emotion was also displayed.

The evaluation showed that both the emotion recognition and the avatar animation parts work satisfactorily. However, certain drawbacks emerged, such as the low sensitivity of the emotion classifier and the low flexibility of the avatar's mouth, and improvements could be made to these in future work. The following list summarizes the drawbacks and suggests improvements:

• Evaluation participants found the recognition not sensitive enough. In order to be classified correctly by the prototype, they had to show their expressions more intensely than they would in real-life scenarios. In the case of one participant the sensitivity problem was corrected by moving closer to the camera.

• We observed that users tended to move their heads frequently, affecting both the emotion classification and the avatar animation in a negative way. For future work, the movement and the rotation could be recognized and compensated for so that the system is no longer influenced in this way.

• Another problem that we observed while testing is that sometimes the FaceTracker library is unable to match the model correctly to the face. We re-initialize the model automatically in code when FaceTracker loses the face entirely. In the case of a mismatched face, however, we were not able to automatically recognize the failure and re-initialize the model. This issue should also be dealt with in future work.

• Evaluation participants were sometimes not satisfied with the avatar's mouth animation. The chosen avatar does not have a mouth flexible enough to open wide. This could be corrected either by adjusting the avatar being used or by using another one with a higher level of detail.

These are enhancements to the built prototype that could enable it to work more precisely and to look better. With these enhancements in place, the prototype would be ready for production. For future work we recommend training the SVM model with a larger training sample in order to render the emotion classification both more precise and more sensitive. The avatar could also be enhanced by rigging it more precisely and optimizing it better to the ASM model. For testing the prototype, both the source code (https://github.com/mirobyrtus/fips/tree/master/source, accessed 10.11.2015) and a Windows executable (https://github.com/mirobyrtus/fips/tree/master/executable/stabilized, accessed 10.11.2015) are provided. Although the executable is provided for the Windows operating system only, the platform-independent source code can also be built on Linux and OSX.

Bibliography

[AHvdSV05] Aitor Azcarate, Felix Hageloh, Koen van de Sande, and Roberto Valenti. Automatic facial emotion recognition. Universiteit van Amsterdam, 2005.

[ALC+07] Ahmed Bilal Ashraf, Simon Lucey, Jeffrey F.
Cohn, Tsuhan Chen, Zara Ambadar, Ken Prkachin, Patty Solomon, and Barry J. Theobald. The painful face: Pain expression recognition using active appearance models. In Proceedings of the 9th International Conference on Multimodal Interfaces, ICMI '07, pages 9–14, New York, NY, USA, 2007. ACM.

[AUA08] I. Ari, A. Uyar, and L. Akarun. Facial feature tracking and expression recognition for sign language. In Computer and Information Sciences, 2008. ISCIS '08. 23rd International Symposium on, pages 1–6, Oct 2008.

[BVS+96] Marian Stewart Bartlett, Paul A. Viola, Terrence J. Sejnowski, Beatrice A. Golomb, Jan Larsen, Joseph C. Hager, and Paul Ekman. Classifying facial action. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 823–829. MIT Press, 1996.

[CET98] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models. In IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 484–498. Springer, 1998.

[CKH10] Darren Cosker, Eva Krumhuber, and Adrian Hilton. Perception of linear and nonlinear motion properties using a facs validated 3d facial model. In Proceedings of the 7th Symposium on Applied Perception in Graphics and Visualization, APGV '10, pages 101–108, New York, NY, USA, 2010. ACM.

[CL11] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, May 2011.

[CTCG95] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.

[DCLR13] Dragos Datcu, Marina Cidota, Stephan Lukosch, and Leon Rothkrantz. Noncontact automatic heart rate analysis in visible spectrum by specific face regions. In Proceedings of the 14th International Conference on Computer Systems and Technologies, CompSysTech '13, pages 120–127, New York, NY, USA, 2013. ACM.

[Dha13] Abhinav Dhall. Expression analysis in the wild: From individual to groups. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, ICMR '13, pages 325–328, New York, NY, USA, 2013. ACM.

[DR07] Dragoş Datcu and Léon Rothkrantz. Facial expression recognition in still pictures and videos using active appearance models: A comparison approach. In Proceedings of the 2007 International Conference on Computer Systems and Technologies, CompSysTech '07, pages 112:1–112:6, New York, NY, USA, 2007. ACM.

[Ekm78] P. Ekman and W. V. Friesen. The facial action coding system: A technique for measurement of facial movement. Consulting Psychologists Press, 1978.

[GBGFMB12] Jorge García Bueno, Miguel González-Fierro, Luis Moreno, and Carlos Balaguer. Facial gesture recognition using active appearance models based on neural evolution. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI '12, pages 133–134, New York, NY, USA, 2012. ACM.

[HC12] Zakia Hammal and Jeffrey F. Cohn. Automatic detection of pain intensity. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, ICMI '12, pages 47–52, New York, NY, USA, 2012. ACM.

[JG10] Xiaodong Jia and Jiangling Guo. Eyeglasses removal from facial image based on phase congruency. In Image and Signal Processing (CISP), 2010 3rd International Congress on, volume 4, pages 1859–1862, Oct 2010.

[KPG11] Sharad Kohli, Surya Prakash, and Phalguni Gupta.
Age estimation using active appearance models and ensemble of classifiers with dissimilarity-based classification. In Proceedings of the 7th International Conference on Advanced Intelligent Computing, ICIC'11, pages 327–334, Berlin, Heidelberg, 2011. Springer-Verlag.

[LM02] R. Lienhart and J. Maydt. An extended set of haar-like features for rapid object detection. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, pages I–900–I–903 vol.1, 2002.

[LTC97] A. Lanitis, C.J. Taylor, and T.F. Cootes. Automatic interpretation and coding of face images using flexible models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19(7):743–756, Jul 1997.

[MEK03] Philipp Michel and Rana El Kaliouby. Real time facial expression recognition in video using support vector machines. In Proceedings of the 5th International Conference on Multimodal Interfaces, ICMI '03, pages 258–264, New York, NY, USA, 2003. ACM.

[MN14a] S. Milborrow and F. Nicolls. Active Shape Models with SIFT Descriptors and MARS. VISAPP, 2014.

[MN14b] Stephen Milborrow and Fred Nicolls. Active shape models with sift descriptors and mars. VISAPP, 1(2):5, 2014.

[ND13] Svetoslav Nedkov and Dimo Dimov. Emotion recognition by face dynamics. In Proceedings of the 14th International Conference on Computer Systems and Technologies, CompSysTech '13, pages 128–136, New York, NY, USA, 2013. ACM.

[NLEDlT08] Minh Hoai Nguyen, Jean-Francois Lalonde, Alexei A Efros, and Fernando De la Torre. Image-based shaving. Robotics Institute, page 141, 2008.

[SRDW09] Xiaofan Sun, Leon Rothkrantz, Dragos Datcu, and Pascal Wiggers. A bayesian approach to recognise facial expressions using vector flows. In Proceedings of the International Conference on Computer Systems and Technologies and Workshop for PhD Students in Computing, CompSysTech '09, pages 28:1–28:6, New York, NY, USA, 2009. ACM.

[SRP10] Amrutha Sethuram, Karl Ricanek, and Eric Patterson. A comparative study of active appearance model annotation schemes for the face. In Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP '10, pages 367–374, New York, NY, USA, 2010. ACM.

[SRU+10] Markus Storer, Peter M. Roth, Martin Urschler, Horst Bischof, and Josef A. Birchbauer. Efficient robust active appearance model fitting. In Alpesh Kumar Ranchordas, João Madeiras Pereira, Hélder J. Araújo, and João Manuel R.S. Tavares, editors, Computer Vision, Imaging and Computer Graphics. Theory and Applications, volume 68 of Communications in Computer and Information Science, pages 229–241. Springer Berlin Heidelberg, 2010.

[SSP+12] Catherine Soladié, Hanan Salam, Catherine Pelachaud, Nicolas Stoiber, and Renaud Séguier. A multimodal fuzzy inference system using a continuous facial expression representation for emotion detection. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, ICMI '12, pages 493–500, New York, NY, USA, 2012. ACM.

[VJ01] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–511–I–518 vol.1, 2001.