Automatic human-head and shoulder segmentation of frontal-view face images

DIPLOMA THESIS

submitted in partial fulfillment of the requirements for the degree of

Diplom-Ingenieur

in

Visual Computing

by

Robin Melán, B.Sc.
Registration Number 1029201

to the Faculty of Informatics at the TU Wien

Advisor: O. Univ. Prof. Dr. Walter Kropatsch

Vienna, 1st January, 2018

Robin Melán    Walter Kropatsch

Declaration of Authorship

Robin Melán, B.Sc.
Oberzellergasse 3/17/12, 1030 Wien

I hereby declare that I have written this thesis independently, that I have fully acknowledged all sources and aids used, and that I have clearly marked all passages of this work, including tables, maps and figures, that were taken from other works or from the Internet, whether verbatim or in substance, as borrowed material with an indication of the source.

Vienna, 1st January, 2018
Robin Melán

Kurzfassung

Image segmentation is one of the fundamental topics in pattern recognition and image processing. The specific problem of automatic head, face and shoulder segmentation was identified only recently and has lately been gaining importance for a number of applications. Above all, an automatic extraction of a person from an undefined, complex background would be useful for any profile picture intended for use in documents. Such an application would also bring improvements to automatic face recognition, and would furthermore be of interest in the e-government domain and the commercial sector. In this thesis we address the problem of automatic head, face and shoulder segmentation of frontal-view images with an undefined, complex background by presenting a methodology composed of individual subtasks. These subtasks can be considered independently of one another and consist of a face skin color detection, a comparison of two superpixel algorithms, and the study of hair and clothing characteristics for our hair and shoulder segmentation. We evaluate our methods and present competitive results for each subtask.

Abstract

Object segmentation is one of the basic issues in image processing and computer vision.
However, human-head and shoulder segmentation in particular is a topic that was introduced only recently and is gaining importance for a wide range of computer vision applications, such as testing compliance for ID document issuing, improving images for facial recognition, or use in the upcoming e-government self services and the commercial sector. In this thesis we address the problem of automatic human-head and shoulder segmentation of frontal-view face images from non-uniform complex backgrounds and propose an approach composed of different subtasks. These subtasks can be viewed individually and consist of a novel face skin silhouette detection approach based on supervised classification learners, a study of two state-of-the-art superpixel algorithms in relation to the specified problem statement, and a novel hair and shoulder segmentation approach. We discuss and evaluate our methods and present competitive results for each subtask.

Contents

Kurzfassung
Abstract
Contents
1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Contribution
  1.4 Structure of the Thesis
2 Preliminary
  2.1 Previous Work on Human-Head and Shoulder Segmentation
  2.2 Previous Work on Skin Detection
  2.3 Previous Work on Hair Segmentation
  2.4 Basic Methodological Strategy
3 Face Skin Silhouette Detection
  3.1 Technical Specification
  3.2 Face Skin Detection based on Classification Learners
  3.3 Results and Evaluation
  3.4 Discussion
4 SLIC and GS04 Superpixel Comparison
  4.1 Technical Specification
  4.2 Recall: SLIC (Simple Linear Iterative Clustering)
  4.3 Recall: GS04 (Efficient Graph-Based Image Segmentation)
  4.4 Comparing SLIC and GS04
  4.5 Discussion
5 Hair and Shoulder
  5.1 Technical Specification
  5.2 Hair and Shoulder Segmentation
  5.3 Results and Evaluation
  5.4 Discussion
6 Conclusion
List of Figures
List of Tables
List of Algorithms
Bibliography

CHAPTER 1
Introduction

In this thesis, we focus on a particular segmentation task: extracting the human-head and shoulder boundary of static frontal-view face images from an arbitrary complex background. The automatic detection and segmentation of human subjects in static images is still a challenge, due to several real-world factors such as illumination conditions, shadows, occlusions and background clutter, unnatural skin tones, different ethnic groups, color saturation, and the absence of any prior knowledge about the person, their environment, or the background of the image.
Further challenges are image quality, image noise and resolution, as well as issues related to the dynamics of the human being, such as the great variety of poses, appearances and shapes [20]. Automatic people detection and segmentation in general can be widely used in many computer-vision-based applications, including photo analysis, surveillance systems, people counting, robotics, natural user interfaces and editing [21]. The particular problem of human-head and shoulder segmentation can be of interest for a wide range of applications.

1.1 Motivation

As previously mentioned, the idea of having such a segmented portrait image of a person's head and shoulders with a uniform background can be of interest for many different kinds of applications and areas.

ISO/IEC 19794-5 Information technology – Biometric data interchange formats – Part 5: Face image data [19] is the fifth of eight parts of the ISO (International Organization for Standardization) standard ISO/IEC 19794, published in 2005 and adopted by the International Civil Aviation Organization (ICAO). It describes interchange formats for several types of biometric data, more specifically defining a standard scheme for codifying data describing human faces to be used correctly by facial recognition systems. Modern biometric passport photos should comply with this standard [18].

Many organizations and public authorities have already started enforcing its directives, and several software applications have been produced, like the ICAO Portrait Checker Module (Manual Biometrics: ICAO Portrait Checker technical description, Biometrics Center Atos, June 2011), to automatically test compliance with the specification. These software applications provide automatic checks on head pose, head position, occlusion, expression, eye visibility, illumination artifacts caused by glasses, and illumination and color conditions, specifically color saturation (over- or underexposure) and unequal light distribution on the face. Whether the ICAO requirement for a uniform background is met still has to be examined manually by an official. So to issue a passport or any other kind of document (e.g. driving license, credit card, personal identity card), public authorities or corporations first inspect the submitted profile picture manually for background uniformity and then check the other requirements with ICAO software to verify that all remaining criteria are met.

The uniform background is important for subsequent computer face recognition algorithms, because the first step is to compute facial features by extracting landmarks from the subject's face and then compare them with a face database. The uniform background improves the identification rate in the face recognition process, since false detections of facial landmarks in the background are reduced or even eliminated.

Another problem occurs for countries and their public authorities or corporations that do not yet follow the ISO/IEC 19794 norm. In the past they accepted images that did not meet these criteria, such as face images with arbitrary backgrounds or even scanned pictures. Reusing these images as references for automatic face recognition software requires human-head and shoulder segmentation as well.

Other possible cases where such a solution is needed lie in the upcoming e-government sector. An application, like a passport, personal identity card or any other kind of document renewal, could be filed online by uploading a digital photo (e.g.
taken with a webcam), which would first be segmented with the segmentation algorithm proposed in this master thesis and then checked by an ICAO checker. Similarly to the idea of e-government self service, software including the proposed methodology could be created for the event management area, issuing identification badges (e.g. VIP, guest pass, host badge) that contain a photograph for the registration process. To facilitate the manual verification, the background would be removed.

In future work this idea could be extended to applications in the law enforcement sector, identifying people who are on a blacklist via surveillance cameras, a 1:N verification. The video stream could be split into frames, choosing the one with optimal face detection. The further removal of the image background would increase the success rate of face recognition algorithms, reducing the considerably large hit lists caused by low resolution and quality. Similarly, this idea could be used in the scope of commercial applications using white lists, e.g. in shopping centers, recognizing noted customers as they enter and notifying them about offers, news, etc.

1.2 Problem Statement

The conditions under which a still portrait image is recorded can be very diverse and variable. Considering the problem from a computer vision perspective, finding an accurate head and shoulder silhouette in static images is difficult due to several real-world factors such as different illumination conditions, which include shadows, unnatural skin tones and saturated colors (e.g. too dark, too light), as well as occlusion, background clutter, image quality, image noise, low resolution, the variety of head poses, appearances and shapes, and the randomly textured background.

The aim of this master thesis is to study the problem of automatic human-head and shoulder segmentation of frontal-view face images and to analyze the proposed methodological strategy. We provide an approach which detects and segments out the arbitrary background so it can be replaced with a uniform background, to achieve the expected result shown in Figure 1.1.

Figure 1.1: (1) Input image. (2) Ground truth output image.

For this we define some technical specifications and conditions for the input image, which will play a role in the individual topics studied in the main section of this thesis. Some of these conditions follow the data requirements stated in the ISO/IEC 19794-5 standard [18].

• Pose: The rotation of the head shall be less than ±5 degrees from frontal in each direction (roll, pitch and yaw). The pose is known to strongly affect the performance of automated face recognition systems.

• Occlusion: As we define our interest in frontal-view face images where the person is looking towards the camera, we exclude any sort of occlusion of the person's face, especially of facial landmarks like the eyes, nose and mouth.

• Eyeglasses lighting artifacts: If the person wears glasses, they shall be clear and transparent so that the eyes are clearly visible. Furthermore, no lighting artifacts or flash reflections on the glasses shall appear in the image.

• Over- or underexposure: The gradations in texture of the person's face, hair and clothes shall be clearly visible. In this sense, the pictures will be within a range of saturation without over- or underexposure.
On the one hand, if the exposure is too long or the lens aperture is opened too widely, the image is overexposed and too bright. On the other hand, if the exposure is too short or the lens aperture too small, the image is underexposed and therefore too dark.

• Focus and depth of field: The subject's captured image shall always be in focus from nose to ears and from chin to crown.

• Unnatural color: The illumination shall produce a face image with natural-looking flesh tones when viewed in typical examination environments.

• Resolution: The resolution of a frontal-view face image, which consists of the face, hair and outline of the shoulders of the subject, must be at least 420×525 pixels. Moreover, the eye distance has to be at least 90 pixels. These requirements ensure a certain face and hair quality from which meaningful information can be extracted, and they also make it possible to use the output portraits as profile pictures in documents according to the ISO/IEC 19794-5 standard [18] if desired. Artificial changes of color representation, contrast, focus and intensity, e.g. for the purpose of beauty enhancement, are not considered.

• No baldness: For future work we leave open the possibility of handling bald and semi-bald people; in our proposed method we strictly consider people with hair.

• Aesthetic look: The goal is to achieve an output result which allows a correct face and shoulder segmentation with a detailed and aesthetic representation of the hair. Since these results could end up on documents, which are often checked manually by an officer for identification, aesthetics play an important role as well. In the context of this thesis, an aesthetic result of the hair and shoulder structure means that the segmented hair and the outline of the shoulders have to look natural to the eye of the beholder; it is, for instance, not important for every single strand of hair to be segmented correctly.

• Background complexity: Our interest lies in handling a human-head and shoulder segmentation where the background is not uniform but complex, including indoor/outdoor scenes and shadows, while a contrast between the person's face, hair and clothes and the surrounding background is given (e.g. black hair on a black background is excluded in this thesis).

• Biometric features: It is important to mention that all biometric features of the person remain unchanged. No modifications are made to the person's properties in the image.

1.3 Contribution

Our contributions in this thesis are the following:

• Proposing a methodology for human-head and shoulder segmentation of frontal-view face images by splitting the complex problem into different subtasks and proposing an approach to solve each subtask. These approaches combined give a possible methodology for this particular problem statement, but they can also be considered individually and applied independently to other tasks with similar technical requirements.

• Introducing a novel skin detection approach based on classification learners, extending the training set of the classifiers by adding an automatically labeled subset of pixels extracted from the test image.

• Comparing two very different superpixel algorithms which oversample the image, one adhering to the image topology and the other producing compact, similarly sized homogeneous regions, to simplify the extraction of features.
• Characterizing hair, clothes and the background in an oversampled image by color, texture and the superpixels' relative positions, generating hair, shoulder and background models without any prior knowledge about the person or the background complexity of the image.

• Labeling hair regions that are disconnected due to occlusion as one class.

• Maintaining the biometric features of the person unchanged to allow a subsequent identity check.

1.4 Structure of the Thesis

The remainder of this report is organized as follows: Chapter 2 gives a brief description of the state of the art in the literature concerning the primary topic of human-head and shoulder segmentation. Additionally, the preliminaries on skin detection and hair segmentation are reviewed, which gives a brief overview of the subtasks of our proposed method for automatic human-head and shoulder segmentation. At the end of Chapter 2 we provide our Basic Methodological Strategy, where the different components of our algorithm are set in relation to each other. In Chapter 3 the first component, Face Skin Silhouette Detection, is described and evaluated. In Chapter 4 two superpixel algorithms are compared and evaluated, since oversampling the input image is a crucial step of our proposed algorithm. Chapter 5 describes the last component of our method, followed by an evaluation and discussion of the results. Chapter 6 concludes this master thesis and closes with an outlook on future work.

CHAPTER 2
Preliminary

Most literature regarding this topic treats face detection, face recognition, fake face detection, tracking and feature extraction, thus focusing more on the face alone rather than also on the hair and the rest of the body. The specific problem of automatic human-head and shoulder segmentation is relatively new and was first introduced by Xin et al. [64] in 2011. Since then, a few papers have been published, which are described in the subsequent Section 2.1. In our approach, we split the problem into different subtasks, such as face skin detection and hair and shoulder detection, which together compose the human-head and shoulder segmentation. However, if these subtasks are viewed independently and given certain preprocessing steps, they can be applied to different applications as well. Therefore, in this chapter we describe the current state of the art on human-head and shoulder segmentation in Section 2.1, followed by the state of the art on skin detection in Section 2.2 and on hair segmentation in Section 2.3. The chapter closes with a short description of the Basic Methodological Strategy of our approach in Section 2.4, giving an understanding of how these subtasks are combined with each other to achieve a complete strategy for human-head and shoulder segmentation.

2.1 Previous Work on Human-Head and Shoulder Segmentation

The discussion on object segmentation is one of the basic issues in image processing and computer vision. Extensive studies on segmentation algorithms are presented in the literature. Rother et al. [45] presented the GrabCut algorithm to address the problem of interactive extraction of a non-articulated foreground object in a complex environment, whose background cannot be trivially subtracted. The disadvantages are the need for user interaction and the fact that only color differences are considered. Recently, another object detection, recognition and segmentation algorithm based on deep learning has been proposed by Facebook AI Research (FAIR) (https://research.fb.com/learning-to-segment/).
Introducing the DeepMask [42] segmentation framework coupled with the SharpMask [43] segment refinement module, they enable the machine vision system to detect and delineate every object in an image. The final stage of recognizing and classifying the object is done by a convolutional network [61]. Segmentation of articulated objects, more specifically human segmentation, is important in dynamic images (video) as well as in still images. Shu [53] addresses the problem of human detection and tracking in surveillance videos. Thombre et al. [56] use a simple image segmentation technique for human detection and a Kalman filter for human tracking.

Human-head and shoulder segmentation on still images has received considerably less attention. Xin et al. [64] introduced a first approach dealing with this issue in 2011. In their approach, an iterative shape-mask-guided graph cut algorithm with sketch constraints is applied to the image graph oversampled with the watershed algorithm [59] to obtain a border that segments the head and shoulders from the background. The limitation stated by the authors is that the algorithm fails when the ground truth is far away from the trained shape masks. Furthermore, the shape sketch constraint suggests that the approach deals only with a particular hair style. In Bu et al. [7] the same authors tried a different approach, using structural patch tiling to guide the human-head and shoulder segmentation. They apply a local structure classifier trained by a random forest to the input image in a sliding-window manner and then construct a Markov Random Field (MRF) to build a probabilistic mask from the responses collected in the previous stage. In their comparison they outperform GrabCut [45] and achieve results similar to their previously published paper [64].

Jacques et al. [21] propose a head-shoulder contour estimation model for human figures in still images captured in a frontal pose. The contour estimation is guided by a learned head-shoulder shape model, initialized automatically by a face detector. A graph with an omega-like shape is generated around the detected face, and the estimated head-shoulder contour is a path in the graph with maximal cost. The authors improved their human contour estimation in Jacques and Musse [20] through clusters of learned shape models. However, it is important to emphasize that Jacques et al. [21] and Jacques and Musse [20] try to segment the most omega-like head-shoulder contour, focusing on a well-known shape/feature of the human body [20], which means that this contour could include the person's hair but does not necessarily have to.

The publication by Sangüesa et al. [47] is based on the previously mentioned GrabCut [45] algorithm. By computing an initial estimation of the foreground, they avoid manual interaction as input for GrabCut and run the algorithm for a certain number of iterations. They focus on passport images, which require an almost pixel-perfect segmentation in order to be a valid photo. Their evaluation lacks non-uniform backgrounds, so the method was not tested on such challenging scenarios [47].

The last and most recent publication was by Deng and Wu [11] in 2016; they present a learning-based method for robust head-and-shoulder segmentation in applications where the person in the query image is known a priori and the background is non-uniform.
In this case, prior knowledge and a portrait image of the person are required to create the head-shoulder object (HSO) and train the method before predicting on new images of the same subject.

Most of these proposed approaches concentrate only on color information. In this master thesis we split the problem into different subtasks and, especially for the hair segmentation, consider texture information as well. Furthermore, the challenge of having no prior knowledge about the person in the image or the background complexity increases the difficulty of the problem we focus on in this thesis.

2.2 Previous Work on Skin Detection

Saxen and Al-Hamadi [48] categorize skin segmentation algorithms into threshold-based, model-based and region-based methods.

Threshold-based methods are the simplest and most frequently used human skin detection methods, in which a fixed decision boundary is defined [55]. For each color space component, single or multiple ranges of threshold values are defined. The pixel values of the input image that fall within those predefined ranges are labeled as skin pixels; all others are labeled as non-skin. Liensberger et al. [34] apply a combination of YCbCr, normalized RGB and RGB for skin detection in their online video annotation. One of the drawbacks of working in the RGB color space is that luminance and chrominance cannot be separated. The RGB components are highly correlated, so changing the luminance of a given skin patch affects all three components.

Transforming from RGB into any of the orthogonal color spaces is a linear transformation [13]. All these color spaces separate the illumination component (Y) from the two orthogonal chrominance components (UV, IQ, CbCr). Therefore, unlike in the RGB color space, the location of the skin color in the chrominance components is not affected by changing the intensity of the illumination [13]. The simplicity of the transformation and the invariance properties have made these color spaces widely used in skin detection applications [50, 15].

Perceptual color spaces are represented by HSI, HSV/HSB and HSL. They separate three components: hue (H), saturation (S) and brightness, also called intensity, value or lightness (I, V or L). These color spaces are deformations of the RGB color cube and are computed by a non-linear transformation. The boundary of the skin color class is specified in terms of hue and saturation; the brightness component (I, V or L) is often dropped to reduce the illumination dependency of skin color. Shaik et al. [50] as well as Platzer et al. [44] used these color spaces in their skin detection approaches.

Commonly used model-based methods in the literature are Gaussian classifiers or Gaussian Mixture Models (GMMs), which try to approximate the skin-color distribution [25]. Greenspan et al. [16] show that a mixture of Gaussians is a robust representation that can accommodate large color variations, as well as highlights and shadows. They trained a GMM with two components, where one component captures the distribution of the skin color while the other captures the distribution of the highlighted regions of the skin. Lee and Yoo [30] compare the performance of a single Gaussian model (SGM) with a GMM of six components. Under controlled illumination conditions, the skin colors of different individuals cluster in a small region of an orthogonal color space. Hence, under these conditions the skin color distribution can be modeled by an elliptical Gaussian joint probability density function (pdf).
Once other image conditions have to be considered, an SGM is not sufficient and GMMs with multiple components have to be used. The key idea behind using multiple components is that different parts of the face are illuminated differently and can be modeled by different components [25].

Lü and Huang [35] propose a skin detection method based on a cascaded adaptive boosting (AdaBoost) classifier, which consists of minimum-risk-based Bayesian classifiers and models in different color spaces such as HSV (hue, saturation, value), YCgCb (brightness, green, blue) and YCgCr (brightness, green, red). Ma et al. [36] proposed the Semantically Constraint Skin Detection (SCSD) method based on random forests. The semantic constraint is based on the dependence between skin pixels and human body parts, to limit the influence of skin-like background pixels. Khan et al. [29] compare their random-forest-based skin detection approach with other classification learners like Bayesian networks, multilayer perceptrons, SVM, AdaBoost, Naive Bayes and RBF networks.

The third possibility is to incorporate spatial information, using a region-based methodology. A common region-based method used for skin segmentation is region growing [48]. The problem with region growing is the need for seed points. Abdullah-Al-Wadud et al. [2] use a color distance map, and based on this map they generate some skin as well as non-skin seed pixels, which they then grow to capture the appropriate regions. With this approach they do not generate many noisy segments and do not need any prior training. Saxen and Al-Hamadi [48] propose a region growing approach that computes the seed points with a Bayesian approach. Khan et al. [28] propose a skin segmentation approach using graph cuts. They model the skin segmentation as a min-cut problem on a graph defined by the image color characteristics and a universal seed, to overcome the potential lack of successful seed detections. The advantage of their approach is that it is based only on sampled skin training data, making it robust to unseen backgrounds.

In this thesis we present a model-based supervised classification approach based on independent decision trees and weighted kNN. We include high-level information of the query image in the training set and show how this improves the skin detection. The data used as training and testing sets were transformed from the RGB color space into the orthogonal color space YCbCr, of which the two chrominance channels Cb and Cr represent the feature space.

2.3 Previous Work on Hair Segmentation

Hair plays a significant role in the overall appearance of an individual, and many computer vision tasks can benefit from segmented hair. For instance, it provides an important clue for gender classification, since the hair styles (including facial hair) of males and females are generally different. Hair can often also facilitate the automatic age estimation of a person, since hair volume, density and color gradually change (or disappear in the case of baldness) with increasing age, especially for old men and women [62]. For humans, hair is a major cue for face recognition as well [65]: changes in hairstyle or facial hair can mislead the observer in recognizing faces, suggesting that it could be advantageous to use hair information in recognition as a useful cue for identification, or at least for narrowing down possible matches.
However, hair appearance and attributes can easily be changed and should therefore be treated with caution when it comes to identification. Wang et al. [63] describe another possible application: people wanting to see whether or not some hair style suits them. With the rapid development of the Internet, online makeup has become more popular, and a good hair style identification or search tool is necessary, which makes hair segmentation essential. Shen et al. [51] present an approach for automatic facial caricature synthesis, where an accurate detection and representation of the hair region is one of the key components as well. Another interesting application is AutoHair, introduced by Chai et al. [8], which reconstructs a 3D hair model from a 2D image. Even in the beauty industry, Levinshtein et al. [32] addressed the topic of live hair color augmentation with augmented reality.

Detecting and segmenting hair in images represents a significant challenge due to the diversity of hair patterns and background variability [62], and very different approaches have been presented in the literature.

Wang et al. [62] propose a two-tier Bayesian method for hair segmentation. In the first tier, a Bayesian model integrating hair occurrence prior probabilities computes the initial hair seed selection; in the second tier, these seeds are used to build a hair-specific Gaussian model. The algorithm is finalized with Mean Shift results to remove holes and spread hair regions. Relying on Mean Shift to fill holes and spread hair regions for the final segmentation could add superpixels that contain little hair information and large background areas if the superpixel granularity is not high enough or the boundary coherence is not correct. Furthermore, depending only on color information narrows the set of images to those with simpler backgrounds.

Aarabi [1] proposes an automatic hair segmentation method that extracts, in a multi-step process, various information components from an image, including background color, face position, hair color, skin color and a skin mask obtained with region growing. Regions far away from the face are considered to define a background color likelihood histogram. For the hair detection, an initial guess of the hair information is obtained by taking narrow strips above the face and narrow strips at the sides of the face to define the hair color likelihood histogram. With these two likelihoods the remaining pixels are classified. To improve the hair detection, cleanup post-processing is used to remove eyebrows, eyes, island region patches, and strands or segments that point upwards. Since the approach assumes hair above the face and at its sides, it is limited to datasets in which the subjects have long hair, at best with a celebrity-like hairstyle, as their dataset results show. Their description of how the background regions are defined ("far away from the face") is vague, which leads to the assumption that a uniform background is expected.

Wang et al. [63] propose a learning approach for hair segmentation called the Compositional Exemplar-based Model (CEM). CEM generates a probabilistic mask in a divide-and-conquer manner, which can be divided into a decomposition and a composition stage. In the first stage, a strong ranker based on a group of weak semantic similarity features is learned.
In the second, composition stage, a neighbor label consistency constraint reduces the ambiguity between data representation and semantic meaning and then recomposes the hair style using the alpha-expansion algorithm. The final segmentation result is obtained by dual-level Conditional Random Fields. The approach shows difficulties, with hair being confused with the background, when shadows occur or the color contrast to the background is low. Moreover, an exact result for the test image can only be guaranteed if its hair characteristics exist in the training library.

Rousset and Coulon [46] were, to the best of our knowledge, the first to introduce frequency information into the process of hair segmentation. Their algorithm is divided into two steps: first, a raw segmentation based on frequency and color analysis places markers in hair regions; second, a matting process is used to obtain the final hair mask. The crucial limitation of this approach is that it is bound to a particular set of images in which the background is not too high-frequency, so that the threshold for their frequency map holds for the hair; otherwise, no hair regions are found in the frequency map. Moreover, markers that are too small could lack sufficient color information and lead to a bad estimation of the alpha matte.

Ahn and Kim [4] propose a semi-supervised, spectral-clustering-based multiple segmentation approach for face and hair region labeling, introducing texture information to improve the distinction between object classes. For the training dataset, they generated superpixels with the watershed algorithm [59] on frontal-view face images to extract color features (in the Lab color space) and texture features with the Leung-Malik filter bank [31].

Liang et al. [33] provide a hair segmentation solution combining the outputs of a color camera and a depth camera. With the additional depth map, a face mesh is computed with which the skin and hair of the head region can be segmented. However, an additional depth camera is needed here, which in practice is not commonly used and is expensive.

In this thesis we present in Chapter 5 a method to detect and segment hair and shoulders
automatically, using the results of the Face Skin Silhouette Detection (see Chapter 3) and the oversampled image (see Chapter 4) to build hair, shoulder and background models based on color, texture and location for the individual image, in order to classify the remaining unknown superpixels in between. Similar to previously mentioned approaches in the literature, such as [46], [62], [1], [24], [65], we concentrate on frontal-view face images where the subject is neither bald nor semi-bald.

2.4 Basic Methodological Strategy

Our methodological approach to accomplish the expected result of a background reduction in a digital frontal-view face image comprises the following subtasks, visualized in Figure 2.1.

Following the enumeration of the subtasks in Figure 2.1, at first the eyes in the input image, Figure 2.1(1), are detected to place the control points for an Active Contour Model (ACM) [27] in Figure 2.1(2). The shape mask identified by the blue contour line in Figure 2.1(2) is the initial mask for the ACM. The purpose of this rough separation into foreground and background is to segregate most of the background out of the image, resulting in an incomplete background mask (3a) and a foreground mask (3b) with spurious segments including the subject. For images with a salient object and a simple, noiseless and uniform background, such a segmentation through the ACM could be sufficient to produce an acceptable segmentation result; but for our problem statement of dealing with images that are complex regarding foreground and background, this procedure leads to a foreground mask with erroneous regions. From these rough background and erroneous foreground masks, automatically labeled data is extracted to improve the performance of our face skin classifier (Figure 2.1(4)). The skin detection (described in Chapter 3) determines pixels which are certainly part of the foreground. To evaluate this subtask of our face skin detection algorithm, we compare quantitative results with other skin detection strategies, measuring performance in terms of true positive rate and false positive rate, and compare qualitative results with existing explicit thresholding skin detection methods.

With the results provided by the ACM and our skin detection algorithm, the image is subdivided into a trimap: pixels which are certain to be background (6a), pixels which are detected as skin (and all areas surrounded by skin, like eyes, eyebrows and mouth) classified as foreground (6c), and pixels in between labeled undefined (6b). To classify this last, undefined area of pixels correctly, the image is oversampled into superpixels in subtask (5), to extract per-superpixel color and texture information and the relative locations of the superpixels in the image. With this per-superpixel information, hair, shoulder and background models can be characterized, to determine in the final stage whether the remaining superpixels correspond to the foreground or the background. The final boundary mask yields the result shown in Figure 2.1(7).

To evaluate the superpixel segmentation, we compare two state-of-the-art algorithms, SLIC (Simple Linear Iterative Clustering) by Achanta et al. [3] and GS04 (Efficient Graph-Based Image Segmentation) by Felzenszwalb and Huttenlocher [14], discussing the granularity with respect to our problem statement (see Chapter 4). The final stage of the methodology, hair and shoulder segmentation, is described in Chapter 5; to assess the resulting human-head and shoulder segmentation, quantitative as well as qualitative evaluations were conducted, measuring the consistency of the segmentation results with manually labeled ground truth in terms of the overlap ratio. This evaluation criterion quantifies the error of labeled foreground containing background information and vice versa.

Figure 2.1: Overview of our approach: (1) Input image. (2) Detected eyes and placed control points for an initial mask (blue line). (3*) ACM results: (3a) foreground and (3b) background. (4) Face skin detection. (5) Superpixel segmentation. (6) Trimap: decision procedure for ambiguous pixels in the undefined area, considering their color, texture and location information. (7) Output image.

CHAPTER 3
Face Skin Silhouette Detection

Skin detection in general is the process of finding skin-colored pixels and regions in an image or a video [13]; it is the process of separating skin and non-skin pixels. Skin detection is an important feature for several computer vision applications, such as face detection and tracking [9] or hand detection and tracking [12], which is widely used for navigation and object manipulation in Virtual Reality (VR) or Augmented Reality (AR) [23].
Other examples are the retrieval of humans from databases and the Internet, automatic annotation, archival and retrieval [34], and content filtering for parental control software or criminal investigations [44]. In this master thesis we concentrate on frontal-view face images and look at skin detection as a preprocessing step of head and shoulder segmentation. Therefore, our main interest lies in the classification of skin pixels around the silhouette of the face, neck or possible shoulders of humans.

3.1 Technical Specification

For computer vision systems, skin detection is prone to many challenges and is still an open problem, while for the human visual system skin detection is easy. Spillmann and Werner [54] describe human perception with the example of seeing a blue ball, where we can all agree that the ball is perceived as blue as a whole, and not as a ball having blue patches and patches of other colors produced by differences in illumination. Furthermore, the human visual system can dynamically adapt to varying illumination conditions, so it can preserve the actual color of the object [25]. In the literature this is called color constancy or chromatic adaptation.

Most of the literature on human skin detection has focused on using color information, which can be a challenging task, as the skin color in images is sensitive to various factors such as illumination, camera characteristics, ethnicity and skin-like backgrounds [25].

We define the technical specifications for the input image as follows; they will be referenced in the subsequent sections of this chapter:

1. In this thesis we focus on frontal-view face images, which means that the pose of the person's head and shoulders has to face the camera, so the rotation of the head shall be less than ±5 degrees from frontal in each direction (roll, pitch and yaw).

2. Based on human anatomy we know that, in general, a human being has two eyes, one nose, one mouth and two ears. Hence, in a static image where the person is looking towards the camera in a frontal-view pose, these facial features will be visible, except for the ears, which could be occluded by hair. As an additional requirement, the person's eyes must be open; closed or covered eyes are not accepted.

3. Since face skin detection is considered here as a preprocessing step, the focus lies on finding a correct silhouette of the subject's face and neck. If the pixels around the silhouette are correctly classified, the remaining pixels inside the silhouette can be labeled as foreground and everything outside as background.

3.2 Face Skin Detection based on Classification Learners

We propose a novel skin detection algorithm based on classification learners. In pattern recognition, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. In the literature there are a number of algorithms, including [37]: linear classifiers, Support Vector Machines (SVM), kernel estimators like k-Nearest Neighbors (kNN), boosting (meta-algorithms), decision trees and neural networks (NN). These algorithms were studied during this master thesis; for further information on why we decided on weighted kNN and decision trees for this particular problem, we refer the reader to our published technical report [39]. The novelty of the proposed approach lies in our improvements to the training set of the kNN and decision tree classifiers (see Section 3.2.3).
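As a concrete illustration, the following MATLAB sketch trains the two learners recalled in the next two subsections on CbCr features and classifies the pixels of a query image. This is not the thesis' original implementation: the file names are placeholders, and the hyperparameters correspond to the choices stated in Sections 3.2.1 and 3.2.2.

% Sketch only (placeholder inputs). Requires the Image Processing
% and the Statistics and Machine Learning Toolboxes.
rgb = imread('training_image.png');            % placeholder file name
ycc = rgb2ycbcr(rgb);
X   = double(reshape(ycc(:,:,2:3), [], 2));    % drop luminance Y, keep [Cb Cr]
gt  = imread('training_mask.png') > 0;         % placeholder binary ground-truth mask
y   = gt(:);                                   % skin / non-skin labels

treeMdl = fitctree(X, y, 'MaxNumSplits', 100);   % decision tree (Section 3.2.1)
knnMdl  = fitcknn(X, y, 'NumNeighbors', 10, ...  % weighted kNN (Section 3.2.2)
    'DistanceWeight', 'squaredinverse');

% Pixel-wise classification of a query image
qcc  = rgb2ycbcr(imread('query_image.png'));     % placeholder file name
Xq   = double(reshape(qcc(:,:,2:3), [], 2));
skin = reshape(predict(treeMdl, Xq), size(qcc,1), size(qcc,2));

The FSCL variants of Section 3.2.3 then simply append the automatically labeled pixels extracted from the query image to X and y before training.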
3.2.1 Recall: Decision Trees

Decision trees are characterized by fast prediction speed (fast: about 0.01 s; medium: about 1 s; slow: about 100 s), small memory usage (small: about 1 MB; medium: about 4 MB; large: about 100 MB) and being easy to interpret. A disadvantage can be that they have low predictive accuracy and tend to overfit if the depth of the splits is not pruned to a maximum number of splits [22]. We decided on a decision tree with a maximum of 100 splits, which could lead to overfitting on the training set. Since we improve our classification learner with information from the input image itself, as described in Section 3.2.3, a detailed decision tree is better suited to classify the remainder of the input image correctly. In the following evaluation we refer to this methodology as tree.

3.2.2 Recall: Weighted k-Nearest Neighbor (kNN)

Nearest-neighbor classifiers are characterized by slow to medium prediction speed, medium memory usage and being harder to interpret than decision trees. They typically have good predictive accuracy in low dimensions. As dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point, which might lower the prediction accuracy. In the k-Nearest Neighbor (kNN) algorithm, a query point is categorized based on its k closest neighbors in the training examples. In weighted kNN, the distances to the neighboring points are weighted. Choosing a high number of neighbors can make fitting time-consuming. For the evaluation in Section 3.3, 10 neighbors and a squared-inverse distance weight were used. This methodology is referenced as kNN.

3.2.3 Face Skin Classification Learner (FSCL)

To improve the performance of the classification learners, the training data is extended with automatically extracted sample information from the query image. A series of preprocessing steps is performed on the input image to extract pixel information to be included in the training set.

Figure 3.1: Overview of the preprocessing steps: (1) Input image. (2) Detected eyes and placed control points for the initial ACM mask (blue line). (3*) ACM results: (3a) foreground and (3b) background. (4) Extracted skin pixels (in color). (5) Zoom into the selection of the extracted skin information.

Following the enumeration of the subtasks in Figure 3.1, at first the face and eyes in the input image, Figure 3.1(1), are detected by Viola-Jones [60]. Viola-Jones requires fully visible, frontal, upright faces. Thus, in order to be detected, the entire face must point towards the camera and should not be tilted to either side. These requirements are met, since this thesis focuses on frontal-view face images: as described in technical specification 1, the person's face is captured in a frontal-view pose, from which technical specification 2 follows trivially, with all facial features visible. With the location and dimension of the face and eyes, the control points for an Active Contour Model (ACM) [27] in Figure 3.1(2) are placed. The shape mask identified by the blue contour line in Figure 3.1(2) is the initial mask for the ACM. The purpose of this rough separation into foreground and background is to segregate most of the background out of the image, resulting in an incomplete background mask (3a) and a foreground mask (3b) with spurious segments including the subject.
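A minimal MATLAB sketch of this eye detection and rough foreground/background separation is given below. It is not the thesis' original code: the file name is a placeholder, the rectangular initial mask grown around the detected eye pair is a simplification of the control point placement described above, and MATLAB's activecontour (a Chan-Vese active contour) is used as a stand-in for the ACM of [27].

% Sketch only (placeholder file name, simplified initial mask).
% Requires the Computer Vision and Image Processing Toolboxes.
I      = imread('portrait.png');                     % placeholder file name
detEye = vision.CascadeObjectDetector('EyePairBig'); % Viola-Jones cascade [60]
bbox   = step(detEye, I);                            % [x y w h] of the eye pair

% Simplified initial mask: a box grown around the detected eye pair
x = bbox(1,1); y = bbox(1,2); w = bbox(1,3); h = bbox(1,4);
mask = false(size(I,1), size(I,2));
mask(max(1, round(y - 2*h)) : min(end, round(y + 8*h)), ...
     max(1, round(x - w))   : min(end, round(x + 2*w))) = true;

% Rough separation, standing in for the ACM step
fgMask = activecontour(rgb2gray(I), mask, 300, 'Chan-Vese');  % foreground mask
bgMask = ~fgMask;                                             % background mask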
The last preprocessing step is the extraction of human skin information from the query image, shown in Figure 3.1(4). With the same Viola-Jones algorithm [60], but different Haar-like features, the nose of the person is found in the image underneath the eye location. With this information, skin pixels are extracted from the region between the eye location and the nose bounding box, represented as the two skin boxes in Figure 3.1(4). In the following evaluation we refer to these two improvements of the classification learners as tree-FSCL and kNN-FSCL.

3.2.4 Color Space YCbCr for Skin Detection

As the color space for training both classification learners, the orthogonal color space YCbCr was chosen, since orthogonal color spaces like YCbCr separate the illumination component (Y) from the two orthogonal chrominance components (CbCr). Unlike in the RGB color space, the location of the skin color in the chrominance components is not affected by changing the intensity of the illumination [13]. According to Elgammal et al. [13], the skin colors of different ethnic groups almost co-locate in the chrominance channels.

Observing the histograms of the publicly available UCI database (see the detailed description in Section 3.3.1), which contains skin and non-skin pixels, once in the RGB color space (Figure 3.2) and once in the YCbCr color space considering only the chrominance components Cb and Cr (Figure 3.3), it can be seen that in RGB the non-skin pixels overlap completely with the skin pixels, making correct classification harder. In the YCbCr color space, only a smaller overlap can be observed for this particular database, which makes it better suited for correct classification. Figure 3.4 shows the incorrectly classified pixels after training a decision tree with the UCI dataset, once in the RGB color space and once in the YCbCr color space considering only the chrominance components; the false positives are shown in orange and the false negatives in blue. On the UCI database, the decision tree trained in the CbCr color space misclassifies 0.14% fewer pixels than the decision tree trained in the RGB color space.

Figure 3.2: 1-D histograms of skin vs. non-skin pixels of the UCI database in the RGB color space.

Figure 3.3: 1-D histograms of skin vs. non-skin pixels of the UCI database considering the chrominance components of the YCbCr color space.

Figure 3.4: Incorrectly classified pixels of the decision tree. Left: result of the decision tree trained in the RGB color space. Right: result of the decision tree trained in the YCbCr color space. The false positives are shown in orange and the false negatives in blue.

3.3 Results and Evaluation

The proposed approaches based on the classification learners decision tree and weighted kNN, and their improvements, were implemented in MATLAB (https://de.mathworks.com). In the following qualitative and quantitative evaluation, we compare our proposed approaches tree and kNN and their improvements tree-FSCL and kNN-FSCL with skin detection based on explicit thresholding in the YCbCr color space [13] (thresholdYCbCr), the HSV color space [15] (thresholdHSV) and the RGB color space [15] (thresholdRGB). In a further analysis, only the silhouette of the ground truth is considered as the evaluation criterion (see Figure 3.5). As described in technical specification 3, the region of interest of our facial skin detection lies on the silhouette of the person's face and neck.
If the pixels around the silhouette are correctly classified, then everything inside the silhouette can be labeled as face and everything outside the silhouette as background.

Figure 3.5: Focus of the skin detection evaluation on the silhouette of the person. (1) Original image. (2) Ground truth. (3) New silhouette ground truth.

Representative sample images from different databases were selected to demonstrate the performance and limitations of the proposed approach. For the quantitative evaluations, the segmentation results of the approaches were compared against the ground truth and the silhouette ground truth (see Figure 3.5). In the context of skin classification,

• true positives are skin pixels that the classifier correctly labels as skin.
• true negatives are non-skin pixels that the classifier correctly labels as non-skin.
• false positives are non-skin pixels that the classifier erroneously labels as skin.
• false negatives are skin pixels that the classifier erroneously labels as non-skin.

The goal of a good classifier is to have low false positive and false negative rates. As in any classification problem, there is a trade-off between false positives and false negatives [13]. With a soft class boundary, the false negative rate is low and the false positive rate is high, which results in a high recall value. With a tighter class boundary, the false negatives are high and the false positives low, which normally results in a higher precision value. The exact definitions of the measures used in the following evaluation are recalled below. For a more detailed evaluation and survey of results, we refer the reader to our published technical report [39].

3.3.1 Databases

Experiments are conducted using the following public datasets, all of which except the last one provide ground truth. The databases were transformed from the RGB color space into the orthogonal color space YCbCr, of which the two chrominance channels Cb and Cr represent the two-dimensional feature space. It is important to mention that the primary focus is on images where the face can easily be found with state-of-the-art face detection algorithms, so the subject in the image is not occluded and the face and shoulders are facing the camera.

• UCI [6]: collected by randomly sampling B, G, R values from face images of various age groups (young, middle, old), ethnic groups (white, black and Asian) and genders, obtained from the FERET and PAL databases. The dataset provides ground truth and contains 245,057 pixel entries (50,859 skin and 194,198 non-skin).

• Pratheepan [55]: collected randomly from Google; the images are captured with a range of different cameras, using different color enhancement, under different illuminations, with variation in age (young, middle), ethnic groups (white, Asian) and genders. The database provides ground truth and contains 32 face images.

• CALTECH (collected by Markus Weber at the California Institute of Technology, http://www.vision.caltech.edu/html-files/archive.html): this frontal face dataset captures 27 people under different light conditions, facial expressions, ethnic groups (mostly white and Asian), genders and complex backgrounds. It provides images under different conditions with a complex background, where the orientation of the head and shoulders faces the camera according to the criteria defined in this thesis. The database does not provide any ground truth. Therefore, ground truth was generated manually for a small set of images, and those samples were used in the qualitative evaluations.
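Before turning to the evaluation, we recall the standard definitions of the measures reported in Tables 3.1 and 3.2, derived from the four counts introduced above:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \mathrm{TPR} = \frac{TP}{TP + FN},
\]
\[
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]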
3.3.2 Evaluation of FSCL

In this subsection we discuss quantitative and qualitative results concerning the proposed FSCL approach and compare it with state-of-the-art algorithms. For the evaluation we use the UCI database as the training set for the classification learners tree and kNN. As described in Section 3.2.3, tree-FSCL and kNN-FSCL include information from both the input image and the UCI database in the training phase. Both quantitative results in Tables 3.1 and 3.2 are obtained with Pratheepan as the testing set. In Table 3.1, the complete provided ground truth is considered. In Table 3.2, the results concern only the correct classification around the silhouette of the subject's skin region. Some qualitative results are provided in Figure 3.6.

The results concerning the complete skin ground truth are shown in Table 3.1. The best performance regarding accuracy, precision and F1 measure is achieved by our tree-FSCL. All classification learners outperform the explicit thresholding methods on the Pratheepan testing set. The explicit thresholding methods thresholdYCbCr and thresholdRGB tend to classify more pixels as skin in general, leading to a high number of true positives but also of false positives. This can also be observed in the precision value, which takes the false positives into account in its calculation.

Approach         Accuracy  Precision  Recall/TPR  FPR    F1
tree-FSCL        0.934     0.852      0.848       0.052  0.841
kNN-FSCL         0.926     0.818      0.869       0.067  0.831
tree             0.908     0.796      0.842       0.080  0.797
kNN              0.910     0.794      0.860       0.083  0.807
thresholdYCbCr   0.690     0.348      0.774       0.356  0.450
thresholdHSV     0.738     0.319      0.419       0.215  0.323
thresholdRGB     0.695     0.330      0.657       0.320  0.409

Table 3.1: Evaluation on the testing database Pratheepan concentrating on the complete ground truth.

The results of Table 3.2 evaluate the classification only around the silhouette ground truth. Our proposed classification learners do not outperform the explicit thresholding methods in recall and F1-score, even though the same Pratheepan testing database was used for both evaluations (Tables 3.1 and 3.2). Recall is higher for thresholdYCbCr and thresholdRGB because both find more skin pixels, but as a drawback they also categorize a large number of background pixels as skin (see FPR). This could have a great negative impact on the further process of segmenting the background out from the person (as can be observed more clearly in the qualitative results in Figure 3.6). Regarding accuracy and precision, the supervised classification learner based on a decision tree, tree-FSCL, outperforms the other algorithms.

To give a further comparison: in the latest survey of skin-color modeling and detection methods by Kakumanu et al. [25], the authors compare skin detection strategies and their performance in terms of the true positive rate (TPR) and the false positive rate (FPR). Obviously, it is difficult to compare these different published methodologies, since there is no uniform benchmark dataset for skin detection as there is for general image segmentation and boundary detection (the Berkeley Segmentation Dataset and Benchmark [38]). Therefore, we have to keep in mind that the results listed in that survey each concern their own dataset with a respective ground truth.
Approach        Accuracy  Precision  Recall/TPR  FPR    F1
tree-FSCL       0.797     0.799      0.778       0.205  0.772
kNN-FSCL        0.788     0.771      0.801       0.245  0.770
tree            0.764     0.757      0.778       0.264  0.743
kNN             0.765     0.753      0.793       0.274  0.751
thresholdYCbCr  0.698     0.640      0.914       0.515  0.745
thresholdHSV    0.720     0.754      0.600       0.181  0.644
thresholdRGB    0.775     0.732      0.882       0.333  0.789

Table 3.2: Evaluation on the testing database Pratheepan concentrating on the silhouette as ground truth.

The best performing algorithms among the quantitative results listed in that survey reach around 88.5%–99.4% TPR at 10%–15.5% FPR. In our evaluation on the Pratheepan dataset with the complete ground truth (see Table 3.1), tree-FSCL reaches a TPR of 84.8%, a small margin below the state-of-the-art results reported in the survey, and an FPR of 5.2%, which is better than those results.

For the qualitative examples illustrated in this thesis we selected images with a variety of skin tones, backgrounds and illuminations to give a good representation of the tested samples. In the first and second examples the face is illuminated from the side, causing a shadow in the background and patches of different skin tone in the face of the subject. For the simple explicit thresholding algorithms these areas are difficult to distinguish and classify correctly; in these two samples the classification learners perform better. Furthermore, comparing tree-FSCL with tree, a noticeable improvement can be seen in the reduction of false positives. In the third example the results of the classification learners are very similar, still outperforming the explicit thresholding methods.

It can be concluded that our novel supervised skin classifier improves results significantly when dealing with complex backgrounds, different ethnicities and different illumination conditions. The simple explicit thresholding methods and the classification learners tree and kNN have problems distinguishing between skin-like pixels in the background and actual skin pixels of the person, since no contextual information is available. All three examples demonstrate the typical behavior of thresholdYCbCr and thresholdRGB of classifying more pixels as skin, leading to high true positive and false positive rates.

Figure 3.6: Qualitative examples: image (1) is from the CALTECH database and images (2), (3) from Pratheepan. White pixels are skin, black non-skin; around the silhouette, green represents all true positives (TP) and true negatives (TN) and red all false positives (FP) and false negatives (FN).

Too much or too little light during exposure makes images brighter or darker, respectively, changing the natural tone of skin. A color space such as YCbCr makes it possible to compensate for this problem by splitting color into luminance and chrominance components. Figure 3.7 shows an example of an over- and an underexposed image, where the thresholdYCbCr results are spurious, missing most of the skin pixels. With the plain classification learner tree the results are even worse, but after adding high-level information (skin and background pixels of the input image) in tree-FSCL the results improve, although erroneous regions remain.

Figure 3.7: Examples of an under- and an overexposed image where the results of thresholdYCbCr and tree fail completely. Tree-FSCL improves the skin detection, but not sufficiently.
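For comparison, an explicit thresholding baseline of the thresholdYCbCr kind can be sketched in a few lines. The chrominance ranges below are commonly cited default values from the skin detection literature, not necessarily the exact thresholds evaluated in this chapter:

import numpy as np
from skimage import color

def threshold_ycbcr(rgb_image, cb_range=(77, 127), cr_range=(133, 173)):
    # classify a pixel as skin if both chrominance values fall inside
    # fixed intervals; the luminance channel is ignored on purpose
    ycbcr = color.rgb2ycbcr(rgb_image)
    cb, cr = ycbcr[..., 1], ycbcr[..., 2]
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))

Because only Cb and Cr are tested, such a rule is insensitive to moderate exposure changes, but, as the examples above show, it also accepts any skin-like chrominance in the background.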
3.4 Discussion

In this master thesis we concentrate on frontal-view face images and treat skin detection as a preprocessing step of an automatic human-head and shoulder segmentation. Our main interest lies in the classification of facial skin pixels around the silhouette of the face and neck, since everything inside the face skin silhouette can then be labeled as foreground (skin) as well. We present a novel model-based approach using classification learners with supervised learning, called Face Skin Classification Learner (FSCL). The proposed solution is based on independent pixel classifiers, namely weighted kNN and decision trees. Both classifiers are trained from automatically labeled data, extended by Viola-Jones eye and nose detectors and an Active Contour Model (ACM) to extract sample pixels of both the skin and the non-skin class. Evaluations on multiple datasets with frontal-view face images were discussed, and the results were compared with explicit thresholding methods. Furthermore, we discussed the skin detection strategies summarized in the survey report by Kakumanu et al. [25], which measures performance in terms of the true positive rate (TPR) and false positive rate (FPR). The evaluation shows improvements over several baselines and is above the average of the best performing state-of-the-art algorithms regarding the FPR.

In our particular case a low FPR is more important than a low false negative rate, since missing skin pixels can be compensated in the following subtasks of our methodology, either through the oversampling of the image into superpixels or through the morphological closing and hole filling. Background pixels falsely classified as foreground, in contrast, are not removed by any post-processing step. Including information of the input image in the training set and applying FSCL to the remainder of the image reduces the false positive detections significantly, and the classification results around the silhouette become more reliable. Since we consider color as the only information, difficulties arise when unnatural skin tones occur through shadows, over- and underexposure or color bleeding (the colored reflection of indirect light from a nearby object).

CHAPTER 4 SLIC and GS04 Superpixel Comparison

Achanta et al. [3] describe superpixel algorithms as grouping pixels into perceptually meaningful atomic regions which can be used to replace the rigid structure of the pixel grid. According to this definition, the idea of superpixels is to capture image redundancy, provide convenient primitives from which image features can be extracted, and reduce the complexity of subsequent image processing tasks. In our case this is a relevant pre-processing step that enables the extraction of initial hair, shoulder and background information to build a representative model for the particular image. For each superpixel, the color, the texture and the relative position to other superpixels are stored to further analyze whether it is part of the background or the foreground. Algorithms for generating superpixels can be broadly categorized as either graph-based or gradient-ascent-based methods. In graph-based methods each pixel is treated as a node, and the similarity between two neighbors defines the edge weights.
Well-known graph-based algorithms are the Normalized Cuts algorithm by Shi and Malik [52] and GS04 by Felzenszwalb and Huttenlocher [14]. Gradient-ascent-based algorithms start with a rough clustering of pixels and iteratively refine the clusters until some convergence criterion is met to form superpixels. Examples are Mean Shift by Comaniciu and Meer [10], Quick Shift by Vedaldi and Soatto [58], the watershed approach by Vincent and Soille [59] and, most recently, SLIC (Simple Linear Iterative Clustering) by Achanta et al. [3]. For our purpose we chose to work with SLIC, since the current literature regards it as state of the art [5], and with GS04, since it takes the topology into account, creating superpixels of very irregular sizes and shapes, at a computational cost similar to SLIC [3]. In the following sections we briefly recall both superpixel algorithms and compare them in relation to our problem statement.

4.1 Technical Specification

The desired properties of superpixel algorithms are as follows [3]:

1. The idea of superpixels is to group pixels into meaningful regions, condensing redundant and homogeneous regions into superpixels in order to compute image features and to greatly reduce complexity. Most importantly, therefore, the superpixels should adhere well to image boundaries.

2. When a superpixel algorithm is used as a pre-processing step to reduce complexity, the algorithm should be fast, memory efficient and simple to use.

4.2 Recall: SLIC (Simple Linear Iterative Clustering)

Among all superpixel algorithms, the simple linear iterative clustering (SLIC) method is widely adopted due to its practicality and performance (see the evaluation results in Figure 4.3). SLIC is an adaptation of the k-means algorithm, generating compact superpixels of regular size by clustering pixels located close to each other based on their color similarity and spatial proximity. For this it uses a five-dimensional labxy space, where lab represents the pixel color in the CIELAB color space, which is considered both device independent and suitable for color distance calculations, and xy represents the pixel position. SLIC achieves linear complexity, in contrast to the original k-means algorithm, by limiting the search region to a constant distance around each cluster center instead of comparing every pixel with every cluster center. Results of SLIC can be observed in Figure 4.1; the only parameter is the total number of superpixels for a particular image.

(a) image_0001 from CALTECH Database (b) image_0143 from CALTECH Database

Figure 4.1: Results of the SLIC algorithm with different numbers of superpixels N = {3000, 1500, 500}.

4.3 Recall: GS04 (Efficient Graph-Based Image Segmentation)

GS04 [14] is based on selecting edges from a graph, where each pixel corresponds to a node, neighboring pixels are connected by undirected edges, and the weight on each edge measures the dissimilarity between the pixels. Two quantities are compared at the boundary of two regions: one based on intensity differences across the boundary and the other based on intensity differences between neighboring pixels within each region. The intensity differences across the boundary of two regions are perceptually important if they are large relative to the intensity differences inside at least one of the regions.
To control the size of the components, a constant parameter k is defined beforehand by the user. A larger k causes a preference for larger components; however, k is not a minimum component size. Smaller components are allowed when there is strong evidence for a boundary, i.e. a sufficiently large difference between neighboring components. As a consequence, GS04 offers no explicit control over the number of superpixels or their compactness, which can be observed in the total number of resulting superpixels in Figure 4.2. In contrast to SLIC, GS04 treats both fine detail and larger structures as perceptually important, producing superpixels with very irregular sizes and shapes. Figure 4.2b shows that the segmentation preserves small regions such as hair strands, which are not preserved by the regular atomic superpixels generated with SLIC.

(a) Resulting number of superpixels for image_0001 from the CALTECH Database: 2450 (top), 1905 (middle) and 871 (bottom). (b) Resulting number of superpixels for image_0143 from the CALTECH Database: 2374 (top), 1672 (middle) and 665 (bottom).

Figure 4.2: Results of the GS04 algorithm with different k values: 50 (top), 100 (middle) and 500 (bottom).

4.4 Comparing SLIC and GS04

The most important desired property of superpixel algorithms is adherence to boundaries (see technical specification 1). How well a method adheres to contours can be measured with two quantitative evaluations: boundary recall and under-segmentation error. The first describes the fraction of the ground truth edges that fall within at most two pixels of a superpixel boundary. The second measures the amount of superpixel "leak" for a given ground truth region. The third important quantitative evaluation is speed: when superpixels are used as a pre-processing step, one of the desired properties is to reduce computational complexity (see technical specification 2). Superpixels should be fast to compute, memory efficient and simple to use, so that they improve the following steps of a method.

In the publication of Achanta et al. [3] a number of superpixel algorithms were compared regarding these three criteria on the Berkeley database [38]. The presented results show that the superpixels generated by SLIC and GS04 demonstrate the best boundary recall performance, which means that very few true edges are missed. Regarding the under-segmentation error SLIC outperforms the other methods, and regarding the computational time required to generate superpixels SLIC is the fastest, followed closely by GS04, both outperforming the others. The complexity of SLIC is O(N), linear in the number of pixels N irrespective of the number of superpixels, while GS04 has complexity O(N log N). According to Achanta et al. [3], SLIC is also more memory efficient than GS04.

Figure 4.3: Quantitative evaluation measurements from Achanta et al. [3]: the SLIC and GS04 algorithms outperform most other state-of-the-art approaches in (a) boundary recall, (b) under-segmentation error and (c) speed.

When it comes to our problem of a correct human-head and shoulder segmentation, the focus lies on correct boundary detection around the silhouette of the person. The evaluation criterion therefore measures the error made when a superpixel contains both foreground and background information.
This criterion can be measured with the overlap ratio [64]:

\[
\mathrm{overlap} = \frac{Ground \cap Segment}{Ground \cup Segment}
\tag{4.1}
\]

where Ground is the image ground truth and Segment is the result of the superpixel algorithm. This evaluation tells how large the error would be if, hypothetically, every superpixel were correctly assigned and labeled as background or foreground. Experiments are carried out on 30 frontal-view images from the two datasets CALTECH 1 and FEI Face Database 2 with various hair styles and different skin colors as well as backgrounds. It must be kept in mind that manually segmented images such as these contain errors as well, especially because of low contrast around the boundary, single protruding hair strands, or artifacts due to Bayer filter interpolation and demosaicing.

(a) For different numbers of superpixels N. (b) For different values of k; the larger k, the larger the superpixel components can get.

Figure 4.4: Overlap ratio for both superpixel algorithms SLIC and GS04 on 30 images of the CALTECH Database and the FEI Face Database.

Figure 4.4 shows the results of both superpixel algorithms regarding the overlap ratio in relation to the number of superpixels. It can be observed that the more superpixels there are, the better the boundary is preserved, but around N = 2000 for SLIC and k = 100 for GS04 the margin of improvement gets smaller. Even for GS04 at k = 10, forcing smaller components and therefore more superpixels (see Figure 4.5 for the relation between k and the number of superpixels), the overlap ratio is slightly worse compared to k = 50. Allowing large components in the GS04 superpixel segmentation leads to a smaller number of superpixels, as can be observed in Figure 4.5; the larger k is chosen, the smaller the difference in the number of superpixels becomes.

Important in our case is a trade-off between an oversampling of the image fine enough to extract meaningful information per superpixel and adherence to the boundaries around the silhouette of the person. Figure 4.6 demonstrates an example of a too low granularity of superpixels: for SLIC at N = 500 as well as for GS04 at k = 500 the hair and background are incorrectly merged into one superpixel. With a higher granularity, and therefore more superpixels, hair and background are correctly separated into different superpixels.

1 Collected by Markus Weber at the California Institute of Technology: http://www.vision.caltech.edu/html-files/archive.html
2 From the Department of Electrical Engineering in Brazil: http://fei.edu.br/~cet/facedatabase.html

Figure 4.5: GS04 parameter k = {10, 50, 100, 200, 500} to control the size of the components. A larger k causes a preference for larger components and therefore a smaller number of superpixels.

On the one hand, it can be concluded that a too low granularity does not reach the level of boundary adherence we desire for this particular problem, and a more detailed superpixel segmentation is needed. On the other hand, a too high granularity hinders a meaningful extraction of information per superpixel, which is needed for a correct hair and shoulder segmentation in the following. SLIC superpixels guarantee a better comparison of similarity, since SLIC produces compact superpixels of regular size (total number of pixels), whereas GS04 takes the topology into account, producing irregularly shaped superpixels that can follow non-rigid objects such as hair with protruding strands.
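The overlap evaluation can be reproduced compactly. The sketch below uses Python with scikit-image, whose slic and felzenszwalb functions implement SLIC and GS04 respectively; image and ground are assumed to be a loaded RGB image and its boolean silhouette ground truth, and the parameter values are merely illustrative. Each superpixel is labeled by a majority vote of the ground truth, i.e. the hypothetical error-free assignment, before Equation 4.1 is computed.

import numpy as np
from skimage.segmentation import felzenszwalb, slic

def overlap_ratio(ground, segment):
    # Equation 4.1: intersection over union of two boolean masks
    ground, segment = ground.astype(bool), segment.astype(bool)
    return np.sum(ground & segment) / np.sum(ground | segment)

def best_case_segmentation(labels, ground):
    # assign every superpixel to foreground or background by majority vote
    fg = np.zeros(ground.shape, dtype=bool)
    for lab in np.unique(labels):
        region = labels == lab
        fg[region] = ground[region].mean() > 0.5
    return fg

labels_slic = slic(image, n_segments=1500, compactness=10)
labels_gs04 = felzenszwalb(image, scale=50, sigma=0.8, min_size=20)  # scale ~ k
for name, labels in [("SLIC", labels_slic), ("GS04", labels_gs04)]:
    print(name, overlap_ratio(ground, best_case_segmentation(labels, ground)))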
4.5 Discussion

We compared and evaluated the superpixel algorithms SLIC and GS04 with respect to the defined technical specifications and discussed the differences in the oversampled output image. SLIC allows the number of superpixels to be specified, is fast to compute and returns compact superpixels of regular size, which is often desired because their bounded size and small number of neighbors form a more interpretable graph, allowing more locally relevant features to be extracted [3]. However, compactness comes at the expense of boundary adherence, where GS04 shows better results regarding overlap ratio and boundary recall by adhering more closely to the topology. Still, the superpixels from GS04 are very irregularly shaped regions, for which the total number of superpixels cannot be defined beforehand.

Figure 4.6: Example of superpixel granularity and the problem arising when superpixels are too large, merging background and foreground into one superpixel.

CHAPTER 5 Hair and Shoulder

Hair is an important feature of human appearance and at the same time its most variable aspect [65]. Robust and accurate hair segmentation is difficult because of the challenging variation in hair color and shape [63]. Further difficulties stem from the need for a higher resolution than is normally available for a perfect segmentation. In computer vision, hair style analysis and hair segmentation remain an ongoing research issue and are by far not a solved problem [63]. As described in the previous chapters, most publications on skin and hair detection ignore the shoulder region of the person, labeling it as background. In this master thesis we aim at a fully automatic human-head and shoulder segmentation and therefore propose an approach that handles hair as well as a correct shoulder segmentation.

5.1 Technical Specification

Hair has a variety of properties and attributes which make its detection and segmentation challenging. Yacoob and Davis [65] describe hair representation along the following dimensions: length, volume, surface area, dominant color, coloring (e.g. color variations), forehead/outer hairline, density, baldness, symmetry, split location, reflectance/shine, structural alteration (e.g. banded, layered or braided hair), layering arrangement, texture, sideburns, and facial hair cover. Muhammad et al. [40] characterize hair based on different hairstyles: straight, wavy, curly, kinky, braids, dreadlocks, and short. Not only the diversity of hair patterns makes hair detection and segmentation a significant challenge; so does the confusion between hair and a similar background. Both the environmental background and the subject's clothing may share colors or textures similar to the hair, making a separation very difficult [66]. We specify the technical terms and conditions for the input image as follows; they will be referenced in the subsequent sections of this chapter:

1. From human anatomy we know that hair starts growing above the forehead of the person (except for bald and semi-bald people, who are not considered in this thesis) and that the person's shoulders lie below the face, adjacent to it. In a static frontal-view face image, long hair can appear e.g. running down over the shoulders, occluding parts of them as well. If the hair is shorter, hair regions might not be connected, due to occlusion by the ears or the person's face.
2. Besides the challenges of hair color and shape, the image quality also plays an important role in achieving an acceptable hair segmentation. On the one hand a certain resolution of the person is essential, and on the other hand the focus and depth of field should ideally capture the subject sharply from nose to ears and from chin to crown.

3. In this thesis we focus on frontal-view face images, which means that the captured pose of the person's head and shoulders has to face the camera; the rotation of the head shall be less than ±5 degrees from frontal in each direction (roll, pitch and yaw). This leads to the expectation that the person in the static image has visible hair around the forehead from ear to ear and that the shoulders are turned toward the camera.

4. When it comes to facial hair, especially men may additionally have beard stubble, a mustache or small to large beards covering parts of the face, neck or shoulders. Based on human anatomy and the pose of the person toward the camera, it can be concluded that facial hair is either a connected region surrounded by skin or at least attached to the person's face/neck and shoulders.

5. It is important to keep the biometric features of the person unchanged, since the results could be used as profile pictures, which are often checked (manually) for identification.

6. The expected output should correctly extract the person's face and shoulders with a detailed and aesthetic representation of the hair. The aesthetic appeal plays an important role for the eye of the beholder: in the context of this thesis, an aesthetic result of the hair and shoulder structure means that the segmented hair and the outline of the shoulders have to look natural to the beholder; e.g. not every single strand of hair is important.

5.2 Hair and Shoulder Segmentation

After the initial rough segmentation by the ACM and the result of the face skin silhouette detection, we obtain a trimap subdividing the picture into known background, known foreground and an undefined region, which still contains parts of the background that were not removed by the ACM, the person's shoulders and the hair (Figure 5.1). After the skin detection, a morphological closing and hole-filling operation is performed to reduce small noise, close borders and classify everything inside the silhouette as face/neck by filling the holes. Only the connected component at the face location is selected as foreground, which removes possible skin-like regions outside of it.
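The morphological cleanup just described can be sketched as follows, assuming a Python environment with SciPy (the thesis pipeline itself was built in MATLAB; the function names and the structuring element size are illustrative assumptions):

import numpy as np
from scipy import ndimage as ndi

def clean_skin_mask(skin, face_seed, size=5):
    # close small gaps, fill holes, then keep only the connected
    # component that contains the detected face location
    closed = ndi.binary_closing(skin, structure=np.ones((size, size)))
    filled = ndi.binary_fill_holes(closed)
    labels, _ = ndi.label(filled)
    return labels == labels[face_seed]  # face_seed: (row, col) inside the face

Here face_seed must be a pixel known to lie inside the detected face, e.g. a point between the detected eyes.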
Figure 5.1: Generated trimap after the rough ACM segregation and skin detection.

Figure 5.2: Example of the automatically selected superpixels used to generate the corresponding hair, shoulder and background models, which predict the rest of the superpixels in the undefined area.

To classify the remaining regions of the undefined area between the labeled background and foreground, we assume hair to be present adjacent to the facial skin above the forehead, initializing a hair model particular to the image. Similarly, the shoulders are assumed to be present below the detected face, adjacent to it. This presumption follows technical definition 1, based on the human anatomy of hair growing above the forehead and the shoulders lying adjacent below the person's face. To find reliable hair and shoulder seed pixels for a generic hair and shoulder model of the particular image, the result of the oversampled superpixel segmentation is used. The initial areas are set automatically based on the location of the detected skin, considering the adjacent superpixels (see Figure 5.2). Since we focus on frontal-view face images, the head pose plays an important role, as defined in technical specification 3. This leads to the assumption that in general the person's hair (not considering bald and semi-bald people) is visible around the forehead from ear to ear. These hair and shoulder superpixels initialize the hair model and the shoulder model of the particular image: from these superpixels, color and texture information is stored independently in histogram bins to characterize the appearance of the person's hair and clothes, as visualized in Figure 5.2. Additionally, a background model is computed in the same way from the surroundings of the undefined area, storing color and texture information in histogram bins for every background superpixel. These per-superpixel descriptors of each model (hair, shoulder and background) are used to identify hair, shoulder and background superpixels in the remainder of the undefined area. A similarity check is performed, and if the distance between a superpixel's color and texture and a model is small, the superpixel is added to the corresponding model. The models are refined iteratively until no more changes are found. The procedure is summarized in Algorithm 5.1. In the following we call the superpixels from a particular model hair, shoulder or background superpixels, and those from the undefined area undefined superpixels.

We describe hair and shoulder superpixels by their color in RGB color space and their texture through rotation-invariant Local Binary Patterns (LBPs) [41]. Hair [46] as well as clothes [26] have particular texture characteristics, and it is known that the discriminative power to distinguish object classes can be improved by considering texture in addition to color [4]. It is also known that significant texture information can only be extracted if the salient object has a certain resolution and is focused in terms of depth of field (see technical specification 2). In the literature, filter banks such as Leung and Malik (LM) [31] or Schmid (S) [49], which are collections of N filters, are often used for texture classification: the image is convolved with multiple filters, producing a stack of images in which every pixel is a feature vector of size N. Such filter banks, e.g. LM with its 48 filters, can be time consuming; to narrow down the feature vector size, and with it the computational time, the filters with maximum response would have to be evaluated for the particular problem. Since the hair texture in an image can occur at arbitrary rotations and under varying illumination conditions, methods that are not rotationally invariant, such as the LM filter bank [57], are unfavorable for the texture analysis here.
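A possible way to build such per-superpixel descriptors is sketched below with scikit-image, whose local_binary_pattern function supports the rotation-invariant variant ("ror") discussed next; the bin counts, the radius and the coarse binning of the LBP codes are illustrative assumptions, not the exact settings of this thesis.

import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern

def norm_hist(values, bins, rng):
    h, _ = np.histogram(values, bins=bins, range=rng)
    return h / max(h.sum(), 1)  # normalize so that the bins sum to one

def superpixel_descriptors(rgb, labels, color_bins=8, P=8, R=1.0):
    # rgb: 8-bit RGB image, labels: integer superpixel map of equal size
    lbp = local_binary_pattern(rgb2gray(rgb), P, R, method="ror")
    descriptors = {}
    for lab in np.unique(labels):
        region = labels == lab
        color = np.concatenate(
            [norm_hist(rgb[..., c][region], color_bins, (0, 255))
             for c in range(3)])
        # for P = 8 only 36 distinct "ror" codes occur; a fixed binning
        # over [0, 2^P) is a coarse but simple approximation of them
        texture = norm_hist(lbp[region], 36, (0, 2 ** P))
        descriptors[lab] = {"color": color, "texture": texture}
    return descriptors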
Algorithm 5.1: Hair and Shoulder Detection Algorithm

input: hair model S_Hair, shoulder model S_Shoulder, background model S_Bg and undefined superpixel set S_Undefined
output: undefined superpixels classified as hair, shoulder and background
parameter: thresholds to ensure minimum dissimilarity

1   while (size = |S_Undefined|) ≥ 0 do
2       foreach s_u ∈ S_Undefined do
3           s^col_hair     ← DissimilarityColor(s_u, S_Hair, ChiSquare);
4           s^col_shoulder ← DissimilarityColor(s_u, S_Shoulder, ChiSquare);
5           s^col_bg       ← DissimilarityColor(s_u, S_Bg, ChiSquare);
6           s^tex_hair     ← DissimilarityTexture(s_u, S_Hair, CityBlock);
7           s^tex_shoulder ← DissimilarityTexture(s_u, S_Shoulder, CityBlock);
8           s^tex_bg       ← DissimilarityTexture(s_u, S_Bg, CityBlock);
9           s^col ← min(s^col_hair, s^col_shoulder, s^col_bg);
10          s^tex ← min(s^tex_hair, s^tex_shoulder, s^tex_bg);
11          assign s_u to S_Hair, S_Shoulder or S_Bg if condition 5.3, 5.4, 5.5 or 5.6 holds;
12      end
13      if size == |S_Undefined| then
14          break loop and return;   // no undefined superpixel was assigned to hair, shoulder or background
15      end
16  end

LBPs are a simple yet very efficient texture operator which labels the pixels of an image by thresholding the neighborhood of each pixel with the value of the center pixel, interpreting the result over the neighbor set as a binary number. An LBP8 operator, with a neighborhood set of 8, thus produces 256 (2^8) different output values (binary patterns). When the image is rotated, the intensity values correspondingly move along the perimeter of the circle around the observed center pixel, which would result in a different LBP8 value. To achieve rotation invariance, the neighbor set is rotated clockwise as many times as needed so that a maximal number of the most significant bits are 0, which is equivalent to performing a circular bit-wise right shift on the binary number. For an LBP with neighborhood 8 this leads to 36 different values, corresponding to the 36 unique rotation-invariant local binary patterns illustrated in Figure 5.3. These patterns can also be considered feature detectors: e.g. #0 detects bright spots, #8 dark spots and flat areas, and #4 edges. Due to its discriminative power and computational simplicity, the LBP texture operator has become a popular approach in various applications. It can be seen as a unifying approach to the traditionally divergent statistical and structural models of texture analysis [41].

Figure 5.3: From Ojala et al. [41]: the 36 unique rotation-invariant binary patterns that can occur in the eight-pixel circularly symmetric neighbor set. Black and white circles correspond to bit values of 0 and 1 in the 8-bit output of the LBP8 operator. The first row contains the nine 'uniform' patterns.

As can be observed in Algorithm 5.1, we chose two different distance metrics for measuring the dissimilarity of color and texture histograms (pseudocode lines 3-5, DissimilarityColor, and pseudocode lines 6-8, DissimilarityTexture): for color dissimilarity the Chi-Square distance was used, while for texture dissimilarity the City Block distance was more suitable. The Chi-Square distance between two normalized histogram samples x and y is computed as

\[
\chi^2(x, y) = \frac{1}{2} \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{x_i + y_i}
\tag{5.1}
\]

where d is the dimension of the samples, in our case the number of histogram bins, and x_i indicates the i-th feature of the sample x.
The Chi-Square distance is bounded to [0, 1]: 0 means both histogram samples are identical, showing no dissimilarity, and 1 means both histograms are completely different, not intersecting at all. The City Block distance, in the literature also referred to as Manhattan distance, between two normalized histogram samples x and y is computed as

\[
D(x, y) = \sum_{i=1}^{d} |x_i - y_i|
\tag{5.2}
\]

where D is always greater than or equal to zero: the measurement is 0 for identical samples and high for samples that show little similarity. In most cases this distance measure yields results similar to the Euclidean distance (also referred to in the literature as the L2 distance). Note, however, that with the City Block distance the effect of a large difference in a single dimension is dampened, since the differences are not squared.

For illustration purposes we constructed a simplified color histogram dissimilarity check (see the histogram samples in Figure 5.4), for which the Chi-Square χ², City Block D and Euclidean distance L2 between sample x and samples y1, y2 are computed, to demonstrate the importance of choosing the best-suited distance metric. Each superpixel's color feature vector is described by its color histogram bins, and since both SLIC and GS04 oversample the image using color information, such a histogram generally looks similar to sample x in Figure 5.4. Observing the three histogram samples, we would argue that the color histogram x is more similar to y1 than to y2, but only for the Chi-Square distance does this hold. For the texture dissimilarity check the City Block distance is more suitable, since the texture histogram is generally more uniformly distributed.

Figure 5.4: Simplified example color histograms with 8 bins and their distance metric results comparing Chi-Square χ², City Block D and Euclidean distance L2.

Some logical constraints based on human anatomy were defined, such as our initial assumptions of hair growing above the forehead of the person and the shoulders lying adjacent below the face (see technical specification 1). Additionally, we split the image into a hair region and a shoulder region, not allowing a superpixel in the undefined hair region to match in similarity with a shoulder superpixel from the shoulder region. The other way around is possible, since the person's hair can reach down to the shoulders or even occlude parts of them. An undefined superpixel is labeled hair, shoulder or background if the respective condition below holds:

\[
(s^{tex}_u \le \theta^{tex}_h) \wedge (\min(s^{tex}) \in S_{Hair}) \wedge (s^{col}_u \le \theta^{col}_h) \wedge (\min(s^{col}) \in S_{Hair}) \Rightarrow s_u \in S_{Hair}
\tag{5.3}
\]

\[
(s^{tex}_u \le \theta^{tex}_s) \wedge (\min(s^{tex}) \in S_{Shoulder}) \wedge (s^{col}_u \le \theta^{col}_s) \wedge (\min(s^{col}) \in S_{Shoulder}) \wedge (s_u \in ShoulderRegion) \Rightarrow s_u \in S_{Shoulder}
\tag{5.4}
\]

\[
(s^{tex}_u \le \theta^{tex}_h) \wedge (\min(s^{tex}) \in S_{Bg}) \wedge (s^{col}_u \le \theta^{col}_h) \wedge (\min(s^{col}) \in S_{Bg}) \Rightarrow s_u \in S_{Bg}
\tag{5.5}
\]

where s_u is an undefined superpixel and S_Hair, S_Shoulder and S_Bg are the sets of hair, shoulder and background superpixels. According to the first constraint, Formula 5.3, an undefined superpixel is added to the set of hair superpixels S_Hair if its texture dissimilarity s^tex_u is below a certain threshold θ^tex_h, the minimal texture dissimilarity min(s^tex) is attained with a superpixel from the set S_Hair, and the same holds for the color dissimilarity s^col_u of the undefined superpixel.
An undefined superpixel s_u is added to the shoulder set S_Shoulder if, in addition to the conditions above, s_u lies in the defined shoulder region of the image; otherwise it cannot be part of the person's shoulders, since that would be impossible from the perspective of human anatomy (see technical specification 1).

Note that we consider the color and texture characteristics independently; it is possible that min(s^tex) and min(s^col) are attained by different superpixels. Otherwise we would need to define a relation between texture and color when combining the two results, which raises the questions of how to weight them and of how two normalized values that are so independent of each other can be compared. We evaluated this during this master thesis and concluded that a combined score does not make it possible to assign hair, shoulder and background superpixels correctly: when both cues are combined, it frequently happens that a supposed hair superpixel is labeled as background, even though e.g. its color similarity is not minimal.

The thresholds θ^tex_h, θ^col_h for possible hair similarity and θ^tex_s, θ^col_s for shoulder similarity ensure that the undefined superpixel has a certain minimum similarity to the foreground or background superpixels. The fact that it has to be matched twice, for color and for texture, with the corresponding region also stabilizes the procedure. A dissimilarity threshold of e.g. θ^col_h = 0.4 in color means that an unlabeled superpixel and its corresponding match have to be similar by at least 60%. A larger dissimilarity threshold works well for images with simpler (uniform) color regions, leading to fewer iterations of Algorithm 5.1 to generate a result. For more complex problem regions, as is the case for hair, clothes and the complex backgrounds we are facing, a smaller threshold should be used. Especially in the case of low contrast around the silhouette of the subject (e.g. a background with a color similar to the person's hair), the normalized differences are reduced, which leads to a higher acceptance rate of supposed background superpixels if the threshold is too large. The size of the superpixels has to be considered here as well: the smaller the superpixels, the larger the dissimilarity thresholds should be, because for bigger superpixels the similarity is in general higher.

Hair and clothes have different characteristics when it comes to color and texture: hair normally contains higher frequencies, whereas the shoulders, depending on the material and texture of the clothes the person is wearing, generally show the highest frequencies around their border if the contrast to the background is high. Therefore one more constraint, prioritizing color for an undefined superpixel s_u located in the ShoulderRegion, is added as a fourth condition during the dissimilarity test:

\[
\left(s^{col}_u \le \frac{\theta^{col}_s}{2}\right) \wedge (\min(s^{col}) \in S_{Shoulder}) \wedge (s_u \in ShoulderRegion) \Rightarrow s_u \in S_{Shoulder}
\tag{5.6}
\]

Here only color information is considered: if the color dissimilarity s^col_u of the undefined superpixel is below half of θ^col_s and the minimum dissimilarity is attained with a superpixel in the shoulder set S_Shoulder, then s_u is added to S_Shoulder as well, regardless of the texture information. The low threshold of θ^col_s/2 ensures that only superpixels that are very similar in color are added to the S_Shoulder set.
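Putting the pieces together, the two distance metrics and a labeling test in the spirit of condition 5.3 can be sketched as follows (Python for illustration; the model containers, names and the handling of the region constraint are assumptions):

import numpy as np

def chi_square(x, y, eps=1e-12):
    # Equation 5.1; in [0, 1] for histograms that each sum to one
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def city_block(x, y):
    # Equation 5.2: L1 distance between two histograms
    return np.sum(np.abs(x - y))

def nearest(hist, models, metric):
    # smallest dissimilarity of one histogram to each model's members;
    # models maps "hair" / "shoulder" / "bg" to lists of histograms
    return {name: min(metric(hist, m) for m in members)
            for name, members in models.items()}

def matches_hair(su, color_models, texture_models, theta_col_h, theta_tex_h):
    # condition 5.3: both cues must be closest to the hair model
    # and below their respective thresholds
    d_col = nearest(su["color"], color_models, chi_square)
    d_tex = nearest(su["texture"], texture_models, city_block)
    return (d_tex["hair"] <= theta_tex_h and d_tex["hair"] == min(d_tex.values())
            and d_col["hair"] <= theta_col_h and d_col["hair"] == min(d_col.values()))

Conditions 5.4 to 5.6 follow the same pattern, with the additional test that the superpixel lies in the shoulder region.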
In every iteration, undefined superpixels are assigned either to the foreground (hair or shoulder set) or to the background, and the algorithm terminates when no more changes are made. For all superpixels that remain unassigned, because their two cues were assigned to different classes or their dissimilarity was above the thresholds, a probabilistic mask is computed independently for color and texture. This probabilistic mask is used in the post-processing step for minor improvements of the result by additionally considering the location of the particular superpixel and its neighborhood. If a superpixel is e.g. surrounded by hair and has a high probability of actually being hair, either because of its color or its texture similarity, then it is assigned to hair, otherwise to the background (see Figure 5.5).

(a) Original image. (b) The higher the intensity, the higher the likelihood. (c) Human-head and shoulder segmentation result.

Figure 5.5: Example of the probabilistic mask (b) and the result (c) of our procedure on a particular sample image from the CALTECH database.

5.3 Results and Evaluation

In the following subsections we describe the dataset of frontal-view face images used for the quantitative and qualitative results. We evaluate the proposed method by assessing the consistency of the segmentation results with the manually labeled ground truth in terms of the overlap ratio, which was used in this context before by Xin et al. [64]. This evaluation criterion measures the error made when the segmented foreground contains background information and vice versa:

\[
\mathrm{overlap} = \frac{Ground \cap Segment}{Ground \cup Segment}
\tag{5.7}
\]

where Ground is the image ground truth and Segment is our segmentation result. In our case, well-known evaluation measurements such as accuracy, precision, recall or the F1 score (the harmonic mean of precision and recall) would not return meaningful information about the performance of the approach, since in all cases the values would approach 1: most of the image pixels are labeled correctly as background or foreground, and the only small erroneous area lies around the silhouette of the person. That is why the overlap ratio was chosen as the significant evaluation criterion. Based on this criterion and the qualitative results, we compare our methodology using both superpixel algorithms discussed in Chapter 4, SLIC and GS04.

5.3.1 Database

Experiments are conducted on a public dataset for which ground truth was manually segmented for 250 images in total. Representative sample images were selected to demonstrate the performance and limitations of our approach, covering a variation of ethnicity, gender, age, illumination, simple to complex backgrounds, outdoor and indoor scenes, long/short hair with different hairstyles and colors, and facial hair (three-day beard, mustache). The database was used previously in the skin detection evaluation (see Chapter 3) as well as in the comparison of the superpixel algorithms in Chapter 4.

• CALTECH 1: a frontal face dataset collected at the California Institute of Technology, capturing 27 people under different lighting conditions, facial expressions, ethnicity groups (mostly white and Asian), genders and complex backgrounds. It provides images under different conditions with a complex background, where the orientation of the head and shoulders faces the camera according to the criteria we focus on in this thesis. The database does not provide any ground truth.
Therefore, ground truth was generated for a set of 50 images. Additionally, to enlarge the dataset for significant quantitative and qualitative results, the background was replaced with arbitrarily chosen complex background images from the Berkeley Benchmark [38], resulting in a total of 250 images (see Figure 5.6).

1 Collected by Markus Weber at the California Institute of Technology: http://www.vision.caltech.edu/html-files/archive.html

Figure 5.6: Enlarged dataset: (a) complex backgrounds from the Berkeley Benchmark Dataset are combined with (b) the manually segmented ground truth from CALTECH, (c) resulting in new images for the quantitative and qualitative evaluations.

5.3.2 Human-head and Shoulder Segmentation Results

For the following quantitative and qualitative results, the decision tree was used as the classification learner for the skin detection. Furthermore, following the results of Chapter 4, a total of N = 1500 superpixels was selected for SLIC and k = 50 for GS04. The dissimilarity threshold values were chosen empirically: θ^tex_h = 0.35, θ^col_h = 0.4, θ^tex_s = 0.5, θ^col_s = 0.4.

Experiments are carried out on 250 frontal-view face images, comparing the segmentation result with the manually generated ground truth. As described above, the overlap ratio is the evaluation criterion, and the results are visualized in Figure 5.7. On this particular dataset the average overlap ratio of the head and shoulder segmentation using GS04 (0.9482) outperforms the one using SLIC (0.9449) by a very small margin. To give the reader an impression of what results with a high and a low overlap ratio look like, the best as well as the worst result is illustrated. The skin-like colors in the background of the worst sample lead to a failed skin detection, after which no hair samples can be found to initialize the hair model correctly. In the best sample, an almost perfect segmentation is achieved.

Figure 5.8 shows some qualitative results comparing our algorithm with both superpixel algorithms. It can be observed that indoor as well as outdoor scenes achieve good results. In most of the indoor scenes of this particular dataset the subject casts a shadow due to the illumination conditions used (see Figure 5.8, first, second and seventh sample). The color dissimilarity of these shadow regions with the rest of the background is often high, and they could falsely be detected as more similar to the hair, but with the additional texture dissimilarity constraint such problems are easily prevented, as can be seen e.g. in the second row. Samples six and eight illustrate two results for persons with longer hair: especially for the man in the last row, all hair regions are detected and labeled as foreground even though they are not connected to each other. Small erroneous regions are visible e.g. in the fifth sample, only for the segmentation using the SLIC superpixel algorithm, where parts of the background on the left side are detected as part of the shoulders; similar is the incorrect labeling of a superpixel above the head in the seventh sample. A possible problem occurs when the models are initialized incorrectly, containing superpixels that do not belong in the model, as e.g. in the last sample of Figure 5.8 for the segmentation using SLIC superpixels.
In this particular sample this is not the case for the segmentation using GS04. Similar is the behavior described above when the skin detection fails completely and no hair superpixel is added to the hair model (see the worst sample in Figure 5.7). In Figure 5.9 the initialization mask for the ACM, which produces the rough segmentation as one of the preprocessing steps, partly defines strands of hair as background. The superpixels generated with SLIC that contain these hair strands bias the segmentation in an unacceptable way. The segmentation with GS04, even though its background model is likewise wrongly initialized with hair superpixels, is not affected much and compensates the error, producing an acceptable result.

5.4 Discussion

With our proposed algorithm for segmenting hair and shoulders in a static frontal-view face image without prior knowledge of the person's appearance and the background complexity, we are able to handle all sorts of different input images that follow the conditions stated in the technical specification in Section 5.1. We evaluated our proposed algorithm quantitatively with the overlap ratio and showed representative qualitative results. Still, we have to keep in mind that this evaluation is limited, since the possibilities for input images are endless; the logical constraints and properties of our proposed hair and shoulder algorithm should therefore be taken into consideration.

Based on the properties of our algorithm, we can conclude that hats or other head coverings could also be handled, provided they do not cast significant shadows on the face region: instead of hair superpixels, hat or head-covering superpixels would then initialize and describe the model. The initialization of the shoulder model behaves similarly for an image of a bearded person (see technical specification 4). After the face skin detection, the shoulder model would be initialized partly with superpixels containing the person's beard and partly with the person's clothing; iterating our algorithm would then find the remaining beard superpixels as well as the shoulder superpixels.

Figure 5.7: Overlap ratio of our human-head and shoulder segmentation using the GS04 and SLIC superpixel algorithms. The best and worst overlap ratio results are shown above and below the boxplot, respectively, with (a) the original image and (b) the segmentation result.

Figure 5.8: Head and shoulder segmentation results. (a) Original image. (b) Ground truth. (c) Result using GS04 and (d) using SLIC as the oversampling algorithm.

Figure 5.9: After an incorrect rough segregation with the ACM due to a narrow initialization mask: (a) ACM result foreground mask (containing the person and possibly remaining background). (b) ACM result background mask (containing only background regions). (c) Head and shoulder segmentation with the GS04 superpixel algorithm. (d) Head and shoulder segmentation with the SLIC superpixel algorithm.

CHAPTER 6 Conclusion

In this thesis, we present a novel automatic human-head and shoulder segmentation algorithm for frontal-view portrait images with arbitrary, unknown complex backgrounds. We began by introducing the basic strategy of our approach and how the different subtasks are combined into the proposed methodology for the expected result.
We discussed these subtasks of our method independently in the main part of the thesis, as each can be viewed both as a subtask of our particular problem and as a possible approach for similar technical requirements: Face Skin Silhouette Detection in Chapter 3; oversampling the input image and comparing two different state-of-the-art superpixel algorithms in Chapter 4; and Hair and Shoulder Segmentation in Chapter 5. We evaluate each individual part and provide qualitative and quantitative results on different public databases to illustrate its performance.

As one main contribution, we introduce a new Face Skin Silhouette Detection algorithm based on supervised classification learners (FSCL), which adds automatically labeled pixel information from the query image to the training set, improving the performance of the classifier's prediction on the remaining query image significantly. Another contribution is the comparison and discussion of two different superpixel algorithms, SLIC and GS04, regarding their adherence to boundaries, evaluating boundary recall and under-segmentation error in general and the overlap ratio for the particular problem statement highlighted in this thesis. A further major contribution are the hair, shoulder and background models of our novel Hair and Shoulder Segmentation algorithm, composed of color, texture and the superpixels' relative positions in the query image, without any prior knowledge of the person or the background complexity. With these image-specific models, the remainder of the query image is classified into foreground and background. Based on the logical conditions and algorithm properties, an additional feature of our approach is that regions disconnected by occlusion are still found correctly and labeled with the same class. We achieve an automatic human-head and shoulder segmentation without changing any biometric features of the person's face, so that a subsequent identification remains possible.

Regarding future work, we would like to address the problem of handling baldness and semi-baldness as part of our human-head and shoulder segmentation. Moreover, incorporating alpha matting or guided image filtering [17] to create a smooth transition between the foreground and any new background may greatly improve the visual quality of the results.

We want to conclude this thesis by broadening the context. In our Hair and Shoulder Segmentation algorithm, the combination of color and texture information extracted from superpixels to initialize e.g. the hair model is able to label the remaining undefined areas of the image as hair, even though parts of those segments need not be connected. This could be useful for other applications that deal e.g. with a salient object which is partly occluded, resulting in multiple components. Furthermore, in our case the models are initialized only with information from the particular query image, but depending on the application they can easily be initialized with additional training data. Logical conditions can be added easily, and their weights can be modified, just as we used different threshold parameter settings for classifying clothes for the shoulder part and for hair.

List of Figures

1.1 (1) Input image. (2) Ground truth output image. . . . . . . . 3
2.1 Overview of our approach: (1) Input image. (2) Detect eyes and place control points for an initial mask (see blue line). (3*) ACM Results (3a) Foreground & (3b) Background. (4) Face Skin Detection.
(5) Superpixel Segmentation. (6) Trimap: decision procedure for ambiguous pixels in the undefined area considering their color, texture and location information. (7) Output image. . . . . . . . 15
3.1 Overview of the preprocessing steps: (1) Input image. (2) Detect eyes and place control points for the initial ACM mask (see blue line). (3*) ACM Results (3a) Foreground & (3b) Background. (4) Extracting skin pixels (in color). (5) Zoomed into the selection of the extracted skin information. . . . . . . . 19
3.2 1-D histograms of skin vs. non-skin pixels of the UCI database in RGB color space. . . . . . . . 21
3.3 1-D histograms of skin vs. non-skin pixels of the UCI database considering the chrominance components of the YCbCr color space. . . . . . . . 21
3.4 Incorrectly classified pixels from the decision tree. Left: result of the trained decision tree in RGB color space. Right: result of the trained decision tree in YCbCr color space. In orange the false positives and in blue the false negatives. . . . . . . . 21
3.5 Focus of the skin detection evaluation on the silhouette of the person. (1) Original image. (2) Ground truth. (3) New silhouette ground truth. . . . . . . . 22
3.6 Qualitative examples: image (1) is from the CALTECH database and images (2), (3) from Pratheepan. White pixels are skin, black non-skin; around the silhouette, green represents all true positives (TP) and true negatives (TN) and red all false positives (FP) and false negatives (FN). . . . . . . . 26
3.7 Examples of an under- and an overexposed image where the results of thresholdYCbCr and tree fail completely. Tree-FSCL improves the skin detection, but not sufficiently. . . . . . . . 27
4.1 Results of the SLIC algorithm with different numbers of superpixels N = {3000, 1500, 500}. . . . . . . . 30
4.2 Results of the GS04 algorithm with different k values: 50 (top), 100 (middle) and 500 (bottom). . . . . . . . 31
4.3 Quantitative evaluation measurements from Achanta et al. [3]: the SLIC and GS04 algorithms outperform most other state-of-the-art approaches in (a) boundary recall, (b) under-segmentation error and (c) speed. . . . . . . . 32
4.4 Overlap ratio for both superpixel algorithms SLIC and GS04 on 30 images of the CALTECH Database and the FEI Face Database. . . . . . . . 33
4.5 GS04 parameter k = {10, 50, 100, 200, 500} to control the size of the components. A larger k causes a preference for larger components and therefore a smaller number of superpixels. . . . . . . . 34
4.6 Example of superpixel granularity and the problem arising when superpixels are too large, merging background and foreground into one superpixel. . . . . . . . 35
5.1 Generated trimap after the rough ACM segregation and skin detection. . . . . . . . 39
5.2 Example of the automatically selected superpixels used to generate the corresponding hair, shoulder and background models to predict the rest of the superpixels in the undefined area. . . . . . . . 39
5.3 From Ojala et al. [41]: the 36 unique rotation-invariant binary patterns that can occur in the eight-pixel circularly symmetric neighbor set. Black and white circles correspond to bit values of 0 and 1 in the 8-bit output of the LBP8 operator. The first row contains the nine 'uniform' patterns. . . . . . . . 42
5.4 Simplified example color histograms with 8 bins and their distance metric results comparing Chi-Square χ², City Block D and Euclidean distance L2. . . . . . . . 43
5.5 Example of the probabilistic mask (b) and the result (c) of our procedure on a particular sample image from the CALTECH database. . . . . . . . 46
5.6 Enlarged dataset: (a) complex backgrounds from the Berkeley Benchmark Dataset are combined with (b) the manually segmented ground truth from CALTECH, (c) resulting in new images for the quantitative and qualitative evaluations. . . . . . . . 48
5.7 Overlap ratio of our human-head and shoulder segmentation using the GS04 and SLIC superpixel algorithms. The best and worst overlap ratio results are shown above and below the boxplot, respectively, with (a) the original image and (b) the segmentation result. . . . . . . . 50
5.8 Head and shoulder segmentation results. (a) Original image. (b) Ground truth. (c) Result using GS04 and (d) using SLIC as the oversampling algorithm. . . . . . . . 51
5.9 After an incorrect rough segregation with the ACM due to a narrow initialization mask: (a) ACM result foreground mask (containing the person and possibly remaining background). (b) ACM result background mask (containing only background regions). (c) Head and shoulder segmentation with the GS04 superpixel algorithm. (d) Head and shoulder segmentation with the SLIC superpixel algorithm. . . . . . . . 52

List of Tables

3.1 Evaluation on the testing database Pratheepan concentrating on the complete ground truth. . . . . . . . 24
3.2 Evaluation on the testing database Pratheepan concentrating on the silhouette as ground truth. . . . . . . . 25

List of Algorithms

5.1 Hair and Shoulder Detection Algorithm . . . . . . . 41

Bibliography

[1] P. Aarabi. Automatic segmentation of hair in images. In 2015 IEEE International Symposium on Multimedia (ISM), pages 69–72, Dec. 2015.
[2] M. Abdullah-Al-Wadud, M. Shoyaib, and O. Chae. A skin detection approach based on color distance map. EURASIP J. Adv. Signal Process, 2008:199:1–199:10, Jan. 2008. ISSN 1110-8657.
[3] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, Nov. 2012.
[4] I. Ahn and C. Kim. Face and hair region labeling using semi-supervised spectral clustering-based multiple segmentations. IEEE Transactions on Multimedia, 18(7):1414–1421, 2016.
[5] M. Barstugan, R. Ceylan, M. Sivri, and H. Erdogan. Automatic liver segmentation in abdomen ct images using slic and adaboost algorithms. In Proceedings of the 2018 8th International Conference on Bioscience, Biochemistry and Bioinformatics, ICBBB 2018, pages 129–133, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5341-0.
[6] R. B. Bhatt, G. Sharma, A. Dhall, and S. Chaudhury. Efficient Skin Region Segmentation Using Low Complexity Fuzzy Decision Tree Model. In 2009 Annual IEEE India Conference, pages 1–4, Dec. 2009.
[7] P. Bu, N. Wang, and H. Ai. Using Structural Patches Tiling to Guide Human Head-shoulder Segmentation. In Proceedings of the 20th ACM International Conference on Multimedia, MM '12, pages 797–800, New York, NY, USA, 2012. ACM.
[8] M. Chai, T. Shao, H. Wu, Y. Weng, and K. Zhou. Autohair: fully automatic hair modeling from a single image. ACM Transactions on Graphics (ToG), 35(4):116, 2016.
[9] J. Chatrath, P. Gupta, P. Ahuja, A.
[10] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, May 2002.
[11] X. Deng and X. Wu. Fast Head-and-Shoulder Segmentation. Master’s thesis, McMaster University, Canada, 2016.
[12] A. Diplaros, T. Gevers, and N. Vlassis. Skin detection using the EM algorithm with spatial constraints. In 2004 IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 3071–3075, Oct. 2004.
[13] A. Elgammal, C. Muang, and D. Hu. Skin detection: A short tutorial. Encyclopedia of Biometrics, pages 1–10, 2009.
[14] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, Sept. 2004.
[15] F. Gasparini and R. Schettini. Skin segmentation using multiple thresholding. In Internet Imaging VII, Proceedings of SPIE, volume 6061, pages 128–135, 2006.
[16] H. Greenspan, J. Goldberger, and I. Eshet. Mixture model for face-color modeling and segmentation. Pattern Recognition Letters, 22(14):1525–1536, 2001.
[17] K. He, J. Sun, and X. Tang. Guided image filtering. In European Conference on Computer Vision, pages 1–14. Springer, 2010.
[18] International Civil Aviation Organization (ICAO). Machine Readable Travel Documents: Part 5. Doc 9303, 7th edition. ICAO, Montréal, Canada, 2015.
[19] ISO/IEC 19794-5. Information technology - Biometric data interchange formats - Part 5: Face image data. ISO, 1st edition, June 2005.
[20] J. C. S. Jacques and S. R. Musse. Improved head-shoulder human contour estimation through clusters of learned shape models. In 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, pages 329–336, Aug. 2015.
[21] J. C. S. Jacques, C. R. Jung, and S. R. Musse. Head-shoulder human contour estimation in still images. In 2014 IEEE International Conference on Image Processing (ICIP), pages 278–282, Oct. 2014.
[22] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, New York, 1st edition, corrected 7th printing, Sept. 2017.
[23] D. Jensch, D. Mohr, and G. Zachmann. A comparative evaluation of three skin color detection approaches. Journal of Virtual Reality and Broadcasting, 12(1), Jan. 2015.
[24] P. Julian, C. Dehais, F. Lauze, V. Charvillat, A. Bartoli, and A. Choukroun. Automatic hair detection in the wild. In 2010 20th International Conference on Pattern Recognition (ICPR), pages 4617–4620. IEEE, 2010.
[25] P. Kakumanu, S. Makrogiannis, and N. Bourbakis. A survey of skin-color modeling and detection methods. Pattern Recognition, 40(3):1106–1122, Mar. 2007.
[26] Y. Kalantidis, L. Kennedy, and L.-J. Li. Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, pages 105–112. ACM, 2013.
[27] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, Jan. 1988.
[28] R. Khan, A. Hanbury, and J. Stöttinger. Universal seed skin segmentation. In Advances in Visual Computing, pages 75–84. Springer, 2010.
[29] R. Khan, A. Hanbury, and J. Stöttinger. Skin detection: A random forest approach. In 2010 17th IEEE International Conference on Image Processing (ICIP), pages 4613–4616. IEEE, 2010.
[30] J. Y. Lee and S. I. Yoo. An elliptical boundary model for skin color detection. In Proceedings of the International Conference on Imaging Science, Systems and Technology, 2002.
[31] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1):29–44, 2001.
[32] A. Levinshtein, C. Chang, E. Phung, I. Kezele, W. Guo, and P. Aarabi. Real-time deep hair matting on mobile devices. CoRR, 2017.
[33] L. Liang, C. F. Huitema, M. A. Simari, and S. E. Anderson. Camera system and method for hair segmentation. US Patent 9,767,586, Sept. 19, 2017.
[34] C. Liensberger, J. Stöttinger, and M. Kampel. Color-based and context-aware skin detection for online video annotation. In 2009 IEEE International Workshop on Multimedia Signal Processing, pages 1–6, Oct. 2009.
[35] W. Lü and J. Huang. Skin detection method based on cascaded AdaBoost classifier. Journal of Shanghai Jiaotong University (Science), 17(2):197–202, Apr. 2012.
[36] B. Ma, C. Zhang, J. Chen, R. Qu, J. Xiao, and X. Cao. Human skin detection via semantic constraint. In Proceedings of the International Conference on Internet Multimedia Computing and Service, ICIMCS ’14, pages 181:181–181:184, New York, NY, USA, 2014. ACM.
[37] I. Maglogiannis. Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in EHealth, HCI, Information Retrieval and Pervasive Technologies. Frontiers in Artificial Intelligence and Applications. IOS Press, 2007.
[38] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the 8th International Conference on Computer Vision, volume 2, pages 416–423, July 2001.
[39] R. Melán. Skin detection in frontal-view faces. Technical Report PRIP-TR-142, PRIP, TU Wien, 2018.
[40] U. R. Muhammad, M. Svanera, R. Leonardi, and S. Benini. Hair detection, segmentation, and hairstyle classification in the wild. Image and Vision Computing, 71:25–37, 2018.
[41] T. Ojala, M. Pietikäinen, and T. Mäenpää. Gray scale and rotation invariant texture classification with local binary patterns. In European Conference on Computer Vision, pages 404–420. Springer, 2000.
[42] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, pages 1990–1998, 2015.
[43] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.
[44] C. Platzer, M. Stuetz, and M. Lindorfer. Skin Sheriff: A machine learning solution for detecting explicit images. In Proceedings of the 2nd International Workshop on Security and Forensics in Communication Systems, SFCS ’14, pages 45–56, New York, NY, USA, 2014. ACM.
[45] C. Rother, V. Kolmogorov, and A. Blake. "GrabCut": Interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH 2004 Papers, SIGGRAPH ’04, pages 309–314, New York, NY, USA, 2004. ACM.
[46] C. Rousset and P. Y. Coulon. Frequential and color analysis for hair mask segmentation. In 2008 15th IEEE International Conference on Image Processing, pages 2276–2279, Oct. 2008.
[47] A. A. Sangüesa, N. K. Jorgensen, C. A. Larsen, K. Nasrollahi, and T. B. Moeslund. Initiating GrabCut by color difference for automatic foreground extraction of passport imagery. In 2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA), pages 1–6, Dec. 2016.
[48] F. Saxen and A. Al-Hamadi. Color-based skin segmentation: An evaluation of the state of the art. In 2014 IEEE International Conference on Image Processing (ICIP), pages 4467–4471, Oct. 2014.
[49] C. Schmid. Constructing models for content-based image retrieval. In 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 2. IEEE, 2001.
[50] K. B. Shaik, P. Ganesan, V. Kalist, B. S. Sathish, and J. M. M. Jenitha. Comparative study of skin color detection and segmentation in HSV and YCbCr color space. Procedia Computer Science, 57:41–48, Jan. 2015.
[51] Y. Shen, Z. Peng, and Y. Zhang. Image based hair segmentation algorithm for the application of automatic facial caricature synthesis. The Scientific World Journal, 2014, 2014.
[52] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, Aug. 2000.
[53] G. Shu. Human Detection, Tracking and Segmentation in Surveillance Video. PhD thesis, University of Central Florida, USA, 2014.
[54] L. Spillmann and J. S. Werner. Visual Perception: The Neurophysiological Foundations. Academic Press, 2nd edition, 1990.
[55] W. R. Tan, C. S. Chan, P. Yogarajah, and J. Condell. A fusion approach for efficient human skin detection. IEEE Transactions on Industrial Informatics, 8(1):138–147, Feb. 2012.
[56] D. V. Thombre, J. H. Nirmal, and D. Lekha. Human detection and tracking using image segmentation and Kalman filter. In 2009 International Conference on Intelligent Agent & Multi-Agent Systems, pages 1–5, July 2009.
[57] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. International Journal of Computer Vision, 62(1–2):61–81, 2005.
[58] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pages 705–718. Springer, 2008.
[59] L. Vincent and P. Soille. Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, June 1991.
[60] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, May 2004.
[61] N. Vu and B. S. Manjunath. Shape prior segmentation of multiple objects with graph cuts. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, June 2008.
[62] D. Wang, S. Shan, W. Zeng, H. Zhang, and X. Chen. A novel two-tier Bayesian based method for hair segmentation. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 2401–2404. IEEE, 2009.
[63] N. Wang, H. Ai, and S. Lao. A compositional exemplar-based model for hair segmentation. In Asian Conference on Computer Vision, pages 171–184. Springer, 2010.
[64] H. Xin, H. Ai, H. Chao, and D. Tretter. Human head-shoulder segmentation. In Face and Gesture 2011, pages 227–232, Mar. 2011.
[65] Y. Yacoob and L. S. Davis. Detection and analysis of hair. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7):1164–1169, 2006.
[66] C.-K. Yang and C.-N. Kuo. Automatic hair extraction from 2D images. Multimedia Tools and Applications, 75(8):4441–4465, 2016.