3D-Kopfverfolgung und Gestenerkennung mittels eines 8-mal-8 Infrarotsensor-Array

DIPLOMARBEIT
zur Erlangung des akademischen Grades Diplom-Ingenieur
im Rahmen des Studiums Visual Computing

eingereicht von Omar Ismail, B.Sc.
Matrikelnummer 01327702
an der Fakultät für Informatik der Technischen Universität Wien

Betreuung: Em.O.Univ.Prof. Dr. Walter G. Kropatsch
Mitwirkung: Darshan Batavia, Ph.D.
Dr. techn. Jiri Hladuvka

Wien, 20. Jänner 2024
Omar Ismail
Walter G. Kropatsch

3D Head Tracking and Gesture Recognition using an 8-by-8 Array of Infrared Sensors

DIPLOMA THESIS
submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Visual Computing

by Omar Ismail, B.Sc.
Registration Number 01327702
to the Faculty of Informatics at the TU Wien

Advisor: Em.O.Univ.Prof. Dr. Walter G. Kropatsch
Assistance: Darshan Batavia, Ph.D.
Dr. techn. Jiri Hladuvka

Vienna, 20th January, 2024
Omar Ismail
Walter G. Kropatsch

Erklärung zur Verfassung der Arbeit

Omar Ismail, B.Sc.

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

Wien, 20. Jänner 2024
Omar Ismail

Danksagung

Zu allererst möchte ich mich bei Em.O.Univ.Prof. Dr. Walter Kropatsch bedanken, welcher mich für Bildverarbeitung begeistern und motivieren konnte - seine Geduld hat letztendlich dazu geführt, dass ich den letzten Schritt meines Diplomingenieurs abschließe. Diese Diplomarbeit wäre ohne die Hilfe anderer nicht möglich gewesen und ich möchte mich hier bei ihnen bedanken.

Vielen Dank an die Studienbeihilfe Wien, welche mir überhaupt die Möglichkeit gegeben hat, mein Studium zu verfolgen und abzuschließen. Ohne sie wäre es mir nicht möglich gewesen, mir ein Studium ohne Vollzeitarbeit zu leisten. Ich möchte mich auch bei LEWITT GmbH für ihre Unterstützung während der Implementierungs- und Experimentenphase bedanken, insbesondere Dr. Christian Walter, Moritz Lochner und Pavol Puffler.

Nach dem Abschluss des praktischen Teils dieser Arbeit hatte ich große Schwierigkeiten beim Schreiben dieser Arbeit. Hier kamen meine geliebten Personen mit ihrer scheinbar unendlichen Unterstützung ins Spiel: meine Mutter Wafaa Ahmed, meine Geschwister Sherin Ismail, Mohamed Ismail, Rayan Ismail und Moaz Ismail, und meine Großmutter Aziza Ahmed waren immer da, als ich sie gebraucht habe. Meine unglaublichen Freunde Elias Marold, Ines Burgstaller, Lina Kröncke, Negín Sadeghi, Emilia Jäger, Cornelia Zimmel und Maja Zupančič haben mich während meines Studiums begleitet, mir den Weg gezeigt und mich inspiriert, mehr zu erreichen. Gemeinsam haben wir Schwierigkeiten erlebt und gemeistert und ich weiß, dass ich ohne sie nicht dort wäre, wo ich jetzt bin.

Zum Schluss möchte ich mich bei Sophia Hannes bedanken, welche mir mit ihrer nicht endenden Unterstützung, Geduld und Liebe die nötige Kraft gegeben hat, diese Diplomarbeit neben einer Vollzeitstelle zu schreiben und abzuschließen. Vielen Dank euch allen, die Welt ist ein besserer Ort mit euch in ihr.

Acknowledgements
First and foremost, I want to thank Em.O.Univ.Prof. Dr. Walter Kropatsch for inspiring and motivating me to keep pursuing academia - his patience is what ultimately led to me finishing the last step of my Master's degree. This thesis would not have been possible without the help of others, and I want to acknowledge them here.

I want to thank the Austrian Study Grant Authority for making it possible for me to pursue my master's. Without their help, I probably would not have been able to afford a Master's degree. I also want to thank LEWITT GmbH for supporting me during the implementation and experimentation phase of this thesis, specifically Christian Walter, Moritz Lochner, and Pavol Puffler.

After finishing the practical part, I struggled heavily with writing. That is where my loved ones came in and lent me their unending support: my mother Wafaa Ahmed, my siblings Sherin Ismail, Mohamed Ismail, Rayan Ismail, and Moaz Ismail, and my grandmother Aziza Ahmed were always there when I needed them. My amazing friends Elias Marold, Ines Burgstaller, Lina Kröncke, Negín Sadeghi, Emilia Jäger, Cornelia Zimmel, and Maja Zupančič were with me during my academic pursuit, guided me, and inspired me to do more. We have struggled and succeeded together, and I know that without them, I would not be where I am right now.

Finally, I want to thank Sophia Hannes, who, with her incredible support, patience, and love, gave me all the energy I needed to write and finish this thesis next to working a full-time job. Thank you all, the world is a better place with you in it.

Kurzfassung

Kopfverfolgung und Gestenerkennung sind bekannte Problemstellungen im Bereich der Bildverarbeitung mit Lösungen anhand von RGB-Kameras oder einer Kombination aus Infrarotsender und -empfänger. In dieser Diplomarbeit wird eine Methode für Kopfverfolgung und Gestenerkennung mit einem 8-mal-8 Infrarotsensor-Array vorgestellt. Dabei wird ein neuartiger Time-of-Flight-Abstandssensor verwendet, welcher sowohl finanziell als auch rechentechnisch günstig ist. Zusätzlich werden dank der sehr niedrigen Auflösung des Feldes Privatsphärenbedenken reduziert.

Die Methode besteht aus zwei Teilen: Zuerst wird ein Kopf mittels Kreiserkennung und Formeigenschaften in dem kombinierten Amplituden- und Tiefenbild gesucht. Wird kein Kreis gefunden, werden Annahmen über die Form getroffen, um die Position zu schätzen. Anschließend wird die Distanz des ermittelten Kopfzentroiden verwendet, um einen Raum zwischen Sensoren und berechneter Kopfposition zu definieren (Gestenraum). Dieser Raum wird anschließend von der Gestenerkennung für die Verfolgung von Bewegungen über fünf Bilder hinweg verwendet. Falls die Richtung der größten Bewegung eine Geschwindigkeit von mindestens vier Pixeln pro Sekunde hat, wird eine Geste erkannt.

Die Experimente sind in Feldexperimente (Sensor auf einem Tisch in einem Wohnzimmer mit Tageslicht und Fenster hinter der Person) und Laborexperimente (Sensor auf einem Drehtisch in einem Lichtzelt in einem Labor, mit künstlichem Licht von oben) aufgeteilt. Die Ergebnisse der Kopferkennung deuten auf eine durchschnittliche zweidimensionale Abweichung von 2.5 Pixel / 5.7 cm bei einer durchschnittlichen Distanz von 40.3 cm (Laborexperimente) bzw. 1.9 Pixel / 4.2 cm bei einer durchschnittlichen Distanz von 42.5 cm (Feldexperimente) zum Kopfmittelpunkt (= Nasenspitze) hin. Für die Gestenerkennung deuten die Ergebnisse auf eine durchschnittliche Erkennungsrate (Geste erkannt, unabhängig von der Richtung) von 33.54% bei Labor- bzw. 22.55% bei Feldexperimenten hin.
Die durchschnittliche Genauigkeit (Geste und Richtung korrekt erkannt) beträgt 42.33% bei Labor- bzw. 47.82% bei Feldexperimenten, und die durchschnittliche Falscherkennungsrate beträgt 28.73% bei Labor- bzw. 21.54% bei Feldexperimenten.

Abstract

Human head tracking and gesture recognition are both known problems with solutions using RGB cameras or an infrared emitter/receiver setup. In this thesis, we propose a method for head tracking and gesture detection using an 8-by-8 infrared sensor array. For this, a novel time-of-flight infrared sensor array is employed, which is both financially and computationally inexpensive, while also alleviating privacy concerns due to the very low resolution of the array.

The method is split into two parts: first, a human head is detected using circle detection on the filtered combination of depth and amplitude images. If no circle is detected, shape information is used to estimate the position of the head. To reduce false detections and outliers, the movement of the head is tracked over time. Using the depth value of the detected centroid, gesture detection then looks for movement in the space between the sensor and the detected centroid depth (gesture space) and tracks it over five frames. If the major movement direction exceeds a speed of four pixels per second, a gesture is detected.

The experiments are split up into field experiments (sensor on a desk in a living room with a window behind the person, daylight) and laboratory experiments (sensor on a turntable in a photography light tent in a lab with artificial ceiling lighting). The results of head detection suggest an average centroid deviation of 2.5 pixels / 5.7 cm at an average depth value of 40.3 cm (laboratory experiments), or 1.9 pixels / 4.2 cm at an average depth value of 42.5 cm (field experiments) from the middle of the head (= tip of the nose). For gesture recognition, the results suggest an average true detection rate (gesture detected, regardless of direction) of 33.54% (laboratory experiments), or 22.55% (field experiments). The average accuracy (gesture and direction correct) is 42.33% for laboratory experiments or 47.82% for field experiments, and the average false positive rate is 28.73% for laboratory experiments or 21.54% for field experiments.

Contents

Kurzfassung xi
Abstract xiii
Contents xv
1 Introduction 1
1.1 Time-of-Flight-Infrared-Sensors 2
1.2 Head Tracking 3
1.3 Gesture Recognition 4
2 Related Work 5
2.1 State of the Art for Head Tracking 5
2.2 State of the Art for Gesture Recognition 7
3 Theory of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images 11
3.1 Theory of Head Detection in Low-Resolution Infrared Amplitude Images 13
3.2 Theory of Gesture Recognition in Low-Resolution Infrared Amplitude Images 34
4 Algorithm of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images 41
4.1 Algorithm of Head Tracking in Low-Resolution Infrared Amplitude Images 41
4.2 Algorithm of Gesture Recognition 46
5 Challenges of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images 49
5.1 Internal Challenges 49
5.2 External Challenges 51
6 Experimental Results 53
6.1 Evaluation Goals 53
6.2 Evaluation Method 53
6.3 Laboratory Experiments 55
6.4 Field Experiments 60
7 Conclusion 65
List of Figures 67
List of Tables 71
Bibliography 73

CHAPTER 1
Introduction

With the advance of technology, humanity continuously tries to improve and add to the ways we interact and communicate with our devices and, by extension, each other. Human Interface Devices (HIDs) like keyboards, computer mice, joysticks, knobs, buttons, as well as touchscreens, work via physical interaction and direct contact. Infrared head tracking and gesture recognition are both non-physical interaction methods that could enable an alternative or better use of existing and upcoming technology.

As of 2023, a large corpus of literature offers solutions for both gesture recognition and head tracking using RGB cameras, infrared (IR) transmitters and IR cameras, or a combination thereof. Devices like the Microsoft Kinect offer a ready-to-use combination of an RGB camera and an IR transmitter/camera setup. The drawback of these methods is that they can be financially and computationally expensive, and - depending on the technology employed - do not work at close distances (e.g. the Kinect, which works best at distances between 1.2 and 3.5 meters, according to its manual). Another problem is the privacy concern: for motion recognition to work, the (high-resolution) cameras need to record multiple frames, which resembles the behaviour of video cameras. This might hinder the acceptance of these technologies.

In this thesis, we propose a handcrafted method to track a human head and detect gestures using a financially inexpensive 8×8 Time-of-Flight-Infrared (ToF-IR) sensor array. The output of this algorithm for each frame will be the position of the head in (x, y)-coordinates, its distance in millimeters and the type of gesture executed, if one was detected. Due to technical limitations of the sensor array, the algorithm was created with a distance of up to 73 cm in mind. The sensors used cannot capture the details of a human face due to the very low resolution, and can thus help alleviate privacy concerns. Combined with efficient pattern recognition, the computational costs are also held at a minimum. With advances in this field, non-physical interaction methods could become more widely accepted and employed.

This thesis is segmented into seven chapters. First, the Introduction in Chapter 1 describes the problem this thesis aims to solve and defines the key terms used throughout. Chapter 2 then presents related work in the area of head detection and gesture recognition, and Chapter 3 introduces the theory of the methodology used in this thesis. Then, the algorithm is described in Chapter 4, and external and internal challenges are discussed in Chapter 5. Finally, the experimental results are presented in Chapter 6, with the conclusion in Chapter 7.

1.1 Time-of-Flight-Infrared-Sensors

In this thesis, infrared (IR) sensors are categorised into passive and active IR sensors:

Definition 1.1.1 (Active IR Sensor).
An active IR sensor works by emitting infrared radiation and measuring the amount of photons (amplitude) reflected, returning a 1- dimensional output. Definition 1.1.2 (Passive IR Sensor). A passive IR sensor measures the infrared radiation emitted/reflected by objects in the field-of-view (FoV), returning a 1-dimensional output. Another variable in IR sensors is the wavelength or range of wavelengths they are sensible to. For example, Landsat satellites [RWL+14] or infrared thermometers [LBJP12] operate using long-wavelength infrared (LWIR), which lies between 8 and 14 µm [UVG+14]. The sensor used is an STMicroelectronics VL53L1X attached to a breakout board with a gyro-sensor, as seen in Figure 1.1, which emits and measures photons with a wavelength of 0.94 µm, putting it in the category of near-infrared (NIR) sensors [UVG+14]. Figure 1.1: The sensor used, an STMicroelectronics VL53L1X (black element at the centre of the board), with the gyro sensor connected to the green LEDs to the right to visualise the pose of the sensor. 2 1.2. Head Tracking In detail, the VL53L1X has a size of 4.9 × 2.5 × 1.56 mm, emits a 940 nm invisible laser (class 1), and receives using a single photon avalanche diode (SPAD) with an integrated lens. Its ranging can reach up to 4m distance (with degrading quality the farther away an object is), with a frequency of up to 50 Hz (not possible in 4 × 4 or 8 × 8 mode) and a full field of view (FoV) of 27°. By combining its IR sensor with a Time-of-Flight sensor, it is possible to calculate the depth by measuring the time the photons need to reflect, thus returning an additional 1-dimensional output. Definition 1.1.3 (Time-of-Flight Sensors). Time-of-Flight sensors are able to calculate the distance of an object (its depth value) by measuring the time the emitted photons need to reflect, returning a 1-dimensional output. The same principle has been used by Radar systems, where pulses of electromagnetic waves are emitted and their reflection is measured. Using signal processing and engineering, like with Synthetic Aperture Radars (SARs) [Cut90], the distance, velocity, angle and an image of the object reflecting the pulse can be determined. Arranging multiple sensors in a rectangular array allows for two 2D outputs: An amplitude image and an infrared image, both in very low size (as of 2023, 8×8). There are three ranging modes offered by the sensor: short (up to 136/135 cm in low/strong ambient light, respectively), medium (up to 290/76 cm in low/strong ambient light), and long (up to 360/73 cm in low/strong ambient light). The algorithm presented in this thesis uses the long-ranging mode, as it provides the lowest repeatability error (i.e. the highest measurement consistency), which is more important in scientific work. 1.2 Head Tracking 2D head tracking focuses on detecting and tracking a head in a two-dimensional space with two outputs: the vertical (usually the Y-coordinate) and horizontal position (usually the X-coordinate). 3D head tracking (which is where this thesis’ objective lies) adds a third dimension to the detection and tracking: depth (Z-coordinate). This is not to be confused with 3D head pose estimation [KKL+21], which not only aims to detect and track the three-dimensional position of a head but also its pose, i.e. the three-dimensional rotation in the space, adding three further outputs (the rotation for each axis). 
Working reliably, head tracking can have many usages in interaction and security: In the context of vehicle safety, a driver’s head and distance could be tracked to ensure that the head is always at a safe distance from the steering wheel in case of activation of the airbag. 2D-holograms, visualisations, and 3D-modelling software could simulate three-dimensional behaviour using head movement to manipulate the virtual camera. 3D 3 1. Introduction holograms could use the distance of the head to control the zoom, making the object bigger when users move their heads closer and vice-versa. In the context of entertainment, head tracking can be used to increase immersion by adapting camera movement in video games and movies to the head movement of players and viewers. 1.3 Gesture Recognition In this thesis, a gesture is defined as a movement of the hands and arms to express intent. A big distinction made here is between static and dynamic gestures: Definition 1.3.1 (Static Gestures). Static gestures describe a certain pose (e.g., certain fingers are extended, while others are not, or the angle at which an arm is extended) and can be captured in one single frame. They have little to no barycentric movement, i.e. the centre of the hand does not move between frames. Definition 1.3.2 (Dynamic Gestures). Dynamic gestures describe a certain movement (e.g. a swipe from left to right, or ”drawing” a shape in the air), necessitating the processing of multiple frames to discern. These gestures exhibit barycentric movement. This thesis focuses on the latter, intending to detect gestures made by the movement of a flat hand with its palm facing the sensor. By adding the constraint of only parallel movement along the XYZ-axes, we arrive at the definition of the directional gesture recognition employed in this thesis. Definition 1.3.3 (Directional Gestures). Directional gestures are a subset of dynamic gestures, where the barycentric movement (i.e. the movement of the hand) follows a straight line. We further define directional gestures to move parallel to the X-(horizontal) and Y-(vertical)axes (left, right, up, and down gestures), or perpendicular to them (parallel to the Z-axis, front and back gestures). Contact-less gesture recognition can be used to control screens and devices without needing to physically touch them, which is especially useful in environments where physical interaction is held to a minimum – like in medical institutions [MHWH17]. As with head detection, entertainment and future technologies like holograms could also benefit from contact-less gesture recognition. In the specific case of holograms, manipulation using the movement and rotation of the user’s hand could provide a viable interaction method. If the technology is reliable enough, another possible application could be found in car entertainment systems, as an alternative to conventional touch screens. 4 CHAPTER 2 Related Work Since our thesis is using a sensor with amplitude and depth data of small size (8 × 8) and narrow FoV (27°) – especially compared to systems like the Kinect (with an image size of 640 × 480 for the v1 and 1920 × 1080 for the v2 [WS17]) – we can only approximate and discuss the methodology separately. Of relevance are any papers that try to solve head detection/gesture recognition in low resolution and/or video. Since our technique transforms the sensor data into very low-resolution grey-scale images, we find that methods using RGB cameras are also relevant. 
Thus, the methods for head detection discussed here will be using IR- (not to be confused with the simpler infrared sensor array discussed in this thesis) and/or RGB-cameras. For gesture detection, a large corpus of scientific literature is dedicated to the recognition of static hand gestures (Definition 1.3.1). The task of dynamic or directional gesture (Definitions 1.3.2 and 1.3.3) recognition does not have the challenge of finer details such as finger pose, but adds the problem of tracking the hands or arms over multiple frames and ascertaining the flow of their movement. 2.1 State of the Art for Head Tracking In this section, we consider Head Pose Estimation if it uses infrared imaging for its data acquisition. For performance evaluation and validation of face detection algorithms, publicly available data sets exist, like the Color FERET dataset [Mar00], the CMU Pose, Illumination and Expression (PIE) dataset [SBB01], the SCface dataset [GDG11], the XM2VTS dataset [RMG+99], the VGGFace2 dataset [CSX+18] or the novel IRHP database [LWZ+20]. However, because the sensor used in this thesis is not a camera, but a set of IR emitters and receivers, no public data set for this type of data is available (as of 2023), making direct comparison impossible. 5 2. Related Work The problem of low-resolution (LR) face recognition has garnered interest in areas like cost-efficient and/or long-distance surveillance. In their review, Wang et al. [WMJW+14] give an overview of the LR face recognition and outline four challenges: Misalignment, where facial features are misaligned due to e.g. the angle of the camera and thus cannot be matched. Noise, which gets amplified by a lower resolution due to the higher impact of the camera pose, lighting, environmental and technical issues. Lack of effective features, making the extraction of features like Gabor or Local Binary Patterns difficult. Dimensional mismatch, leading to difficulties with some subspace learning methods (further detailed by Choi et al. [CRP08]). they [WMJW+14] have also named multiple papers that define a lower threshold for reliable face detection, where the required image size lies between 21 × 16 to 64 × 48, depending on the methodology used [LP02][BBSV06][FLCS12]. Following that, Wang et al. [WMJW+14] define problems below this size as low-resolution (LR) face recognition (FR), while Wilman WW Zou and Pong C Yuen. [ZY11] call them very low-resolution (VLR) face recognition. Since the output of our sensor array can be approximated to a 2D 8 × 8 image, our problem lies in the domain of VLR face recognition. Definition 2.1.1 (Very Low-Resolution Face Recognition). Very low-resolution face recognition is the task of recognising a human face with a size smaller than 21 × 16 pixels. To make detection feasible and learn features, three possible approaches are defined in the review of Wang et al. [WMJW+14]: Up-Scaling/Super-Resolution: Image is interpolated using methods like bi-cubic interpolation. With interpolation, no new features are added, but defects like noise are amplified. To combat this, super-resolution is employed, where an algorithm learns from a gallery of faces and takes advantage of a face’s symmetry and self- similarity. This way, features are added and the effective resolution is increased. Unified/Inter-resolution feature space: The low-resolution face is projected onto a common space with high-resolution faces and then compared. 
The issue with this technique is the possibility of noise being introduced with either of the bi-directional functions that project high and low-resolution images to the inter-resolution space. Down-Scaling: The training data is down-scaled to the resolution of the task at hand, losing features and thus being generally the least ideal option. 6 2.2. State of the Art for Gesture Recognition Shinji Hayashi and Osamu Hasegawa [HH06] were one of the first to tackle the problem of LR FR and solved it by using an upper body detector and training face recognition on upper body images using AdaBoost. To enhance recognition, the image is up scaled via bi-cubic interpolation, and a new detector is defined using features with height (H) and width (W) bigger than 4. Finally, a support vector machine (SVM) is trained with the output of the detectors, leading to an experimental detection rate of 73%. The solution of Wilman WW Zou and Pong C Yuen [ZY11] lies in a novel learning method for super-resolution algorithms, where the relationship between the high-resolution image space and the VLR image space is learned. Taking advantage of the self-similarity of the human face and adding two constraints, namely a new data constraint and a discriminative constraint, it is possible to learn a relationship operator which is used to interpolate features and to raise the effective resolution of the face. In 2004, Krotosky et al.[KCT04] compared the performance of stereo IR cameras with that of Long-Wave Infrared (LWIR) cameras regarding head detection for airbag safety. They found that stereo IR works more reliably than LWIR due to the stability of reflectance regardless of head-wear, while the LWIR heat image would be altered by them. However, in the case where a hand is in the frame and is roughly the same size as the head, LWIR manages to discern a difference in temperature between the human head and hand, while stereo IR struggles to differentiate between the two. To help with the lack of data for IR head pose recognition, Liu et al. [LWZ+20] created the IRHP database with 145 high-resolution low-light IR head pose images. Based on that, they propose a convolutional neural network (CNN) architecture that extracts and com- bines high and low-level features. The experimental results suggest a performance better than algorithms like DLDL [GXX+17], IndepCA(HOG) and CartCA/MvCA [CJH+19], and the CNN architecture proposed by Seungsu Lee and Takeshi Saitoh [LS18]. Khan et al. [KKL+21] did an extensive systematic review on that topic. Opplinger et al. [OGG+22] use a combination of an LWIR camera and a 3D ToF camera to detect living beings. By fusing the two outputs of the cameras, they manage to achieve better results than either one of them separately, since LWIR cameras can detect body heat while 3D ToF cameras can create a three-dimensional image and return the reflectance amplitude of any given being/object. 2.2 State of the Art for Gesture Recognition Gesture recognition using infrared is a topic which already finds its uses in entertainment and human-computer interaction (e.g. the Leap Motion Controller [WBRF13] or the Microsoft Kinect [HSXS13]). Note, that most of the work found and discussed here will focus on static gestures (Definition 1.3.1) while this thesis focuses on dynamic/directional gestures (Definitions 1.3.2 and 1.3.3). In 2013, Wojtczuk et al. [WBA+13] used a setup similar to ours for gesture detection; their sensor had four passive LWIR sensors (i.e. non-emitting and measuring body heat) 7 2. 
Related Work instead of an 8 × 8 active (emitting) sensor array, which were aligned along the positive and negative X-/Y-axes for vertical and horizontal movement. Due to the alignment of the sensors and the use of an aperture, movement/gestures parallel to the vertical or horizontal axes can be detected by tracking the amplitude response along a row or column of sensors [WBA+13]. For example, if the amplitude exceeds a certain threshold first at the left sensors and then at the right sensors, a move from left to right is registered. Conversely, if it first exceeds the threshold at the upper sensors and then at the lower ones, a move from top to bottom is registered. In their review of IR gesture recognition using machine/deep learning, Rubén E Nogales and Marco E Benalcázar [NB21] discern between two types of hand gestures: static and dynamic gestures, which are defined the same as in this thesis (Definitions 1.3.1 and 1.3.2). To compare and analyse the papers discussed in their review, they [NB21] define five different modules that can be used in combination: Data Acquisition: How data is acquired and in what modality. According to Rubén E Nogales and Marco E Benalcázar [NB21], there are only two ways for this: spatial position and depth data. Most papers in their review use the spatial position for IR gesture recognition. The setups used for data acquisition in these papers are Kinect, Leap Motion Controller (LMC), Intel RealSence, or an interactive gesture camera, with the Kinect and LMC being the most frequently used. Pre-Processing: Signal pre-processing to enhance gesture detection, ranging from slight adjustments to the input data to a complete transformation of it. Methods used include dimensionality reduction, normalisation, segmentation, or filters, with normalisation being the most frequently used technique. Feature Extraction: Extracting the relevant information that can be used to discern categories/classes. The techniques employed in the discussed papers include im- age segmentation, statistical operations, distance/spatial operations, convolution, Histogram of Gradients (HoG), and chronological-pattern indexing. Classification: Classifying the input data using unprocessed, pre-processed, and/or extracted information. Post-Processing: Post-processing of the output to filter false classifications or use the output as input for another module. During their work, Rubén E Nogales and Marco E Benalcázar. [NB21] have found that the LMC is used more often for detailed gestures including the position of the fingers while the Kinect is used for broader movements of the whole arm, sacrificing accuracy of the exact hand pose. Moreover, all papers discussed employed supervised learning to solve the problem of gesture recognition using heuristics found through trial and error. Unfortunately, a predominant number of papers do not disclose their code, with only 10 8 2.2. State of the Art for Gesture Recognition of them reporting the processing speed, rendering a proper reproduction of the results impossible. Tateno et al. [TZM19] proposed gesture recognition using a passive (thermal) 32 × 24 IR-array. Their technique employs barycentric movement detection and, depending on the movement, uses a CNN to detect static gestures (Definition 1.3.1) or a simple movement detection to detect moving gestures (Definition 1.3.2). 
By using the body temperature for background subtraction and normalising the frames afterwards, the movement of the barycenter is tracked along the X- and Y-axis, enabling the detection of up, down, left, and right gestures. Experimental results suggest a detection rate of 97% for moving gesture detection and a total accuracy of 87.5% for static gesture detection. 9 CHAPTER 3 Theory of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images Human head detection is the task of detecting a human head in an image, while head tracking is detecting the same head over a set of frames. Due to a lack of information in the data, an insufficient algorithm, or simply hardware unfit for the task, errors can occur in the tasks of head detection and tracking. Adding gesture recognition for directional gestures (Definition 1.3.3) to this task brings additional challenges. Due to the assumption that a human head is always in frame, the head needs to be ignored and only the movement of the hand should be tracked. Conversely, the hand is a potential false positive candidate for head detection, meaning that for head detection, the hands need to be ignored. Thus, this algorithm alternates between detecting the head and the hand while trying to minimise computational complexity. The functionality of the presented algorithm is limited to a frontal view, with the person facing the sensor at a distance of up to 73cm, using only one hand to execute a directional gesture. Due to the image size of our sensor array (8×8, see Section 1.1), our data lacks information to accurately and reliably detect a human head in every frame (see Chapter 2). At this image size, a human head and hand are very similar in both amplitude and shape, as seen in Figure 5.1. Furthermore, the limitation of efficiency means that the detection of the head needs to sacrifice accuracy for efficiency. 11 3. Theory of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images According to the data-sheet of the sensor [STM18], a higher time budget (thus a longer range mode) will yield a lower repeatability error (Figure 3.1). There, repeatability is de- fined as ”the standard deviation of the mean ranging value of 32 measurements” [STM18]. In other words, repeatability denotes the consistency of the distance measurements by the sensor. It does not denote accuracy – if the sensor is constantly off by exactly 5mm, it will have an accuracy of ±5mm, but a repeatability error of 0.0mm. Figure 3.1: Maximum distance and repeatability error vs. timing budget of the sensor. Tested on a target with 54% reflectance and no ambient light, actual distance in mm. TB = timing budget in ms, STDEV = standard deviation. The blue line denotes the mean range, while the red dots are the repeatability error. From the VL53L1X data-sheet [STM18]. A lower repeatability error also reduces sudden erroneous distance measurements, which could be misread as movement by the gesture recognition algorithm presented in this thesis. Thus, we have opted for a long-ranging mode with a timing budget of 200 ms, sacrificing ambient light stability, compared to the short-ranging mode described by the data sheet [STM18]. This however lowers the recommended range to 73 cm [STM18], hence the maximum operating range set in this thesis. 12 3.1. 
Theory of Head Detection in Low-Resolution Infrared Amplitude Images 3.1 Theory of Head Detection in Low-Resolution Infrared Amplitude Images The sensor used in this thesis has two outputs: amplitude from the infrared emit- ter/receiver and distance from the ToF sensor. An example of the two outputs for the same frame is shown in Figure 3.2. (a) Amplitude output of the sensor. The amplitude range for this frame is [25 − 672]. (b) Distance output of the sensor. The distance range for this frame is [202 − 391]. Figure 3.2: Examples of the two outputs of the sensor for the same frame. The range of the colour map is [0 − 700]. To improve head detection with the sensor employed in this thesis, amplitude can be beneficial. As seen in Figure 3.3, the National Institute of Standards and Technology has mapped out the reflectance of human skin over various wavelengths. Our sensor emits light in the range of 940 nm, meaning that human skin has a reflectance factor of around 60%. By filtering for an amplitude range, objects with a reflectance different from human skin are excluded. Without the limitation of maximising computational efficiency, many (or a combination) of the papers presented and discussed in Chapter 2 could be used after pre-processing the input ”image” sent by the sensor array. The task then becomes a question of hardware: the better the hardware, the more elaborate/precise the head and gesture recognition can be. One could also use multiple sensors with separate algorithms to recognise heads and gestures. Since the VL53L1X uses a lens to focus the photons on the sensing array, it is subject to perspective projection [SHB13]. This means that the field of view is akin to a pyramid, as seen in Figure 3.4. Thus, the measured depth and position values are not the same as the real-world distance and position of the object in frame from the sensor. With the projection from three dimensions to a two-dimensional image, one can use shape abstraction to reach the core assumption of the head detection algorithm employed in this thesis: 13 3. Theory of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images Figure 3.3: Reflectance of the human skin, according to the National Institute of Standards and Technology [CA13]. The thick red line marks the wavelength of the sensor used in this thesis, grey indicates the instrument’s uncertainty. Reflectance factor denotes the relative amount of photons reflected ([0.0, 1.0]) Figure 3.4: A 2D-visualisation of perspective projection exhibited by the sensor. The red line is the depth as measured by the ToF sensor, while the black line at the centre is the distance between the object and the sensor. Assumption 1. Abstracted into simple shapes, a human head posed upright and facing the sensor is an ellipsoid on top of a rectangle of similar width (neck) on top of a trapezoid/rectangle (torso and shoulders). This means that – using shape properties as a feature – to achieve the best possible results, the whole head needs to be in the frame at all times. Taking the limitation of the maximum recommended distance (73 cm) into account, the shoulders and torso could appear in frame, but not the lower body/legs. Under these conditions, multiple approaches can be taken: For example, a centroid is the ”balanced” centre of a given shape. One simple imple- mentation of the centroid is calculating the arithmetic mean of all pixel positions in a given region. If only the head is in the frame, this approach would reliably find its centre. 14 3.1. 
Theory of Head Detection in Low-Resolution Infrared Amplitude Images However, if the neck and/or torso are also visible, that centroid would shift downwards, which could lead to incorrect results, especially if the head is tilted away from/towards the sensor. Distance transforms are another possibility, which map a binary image into a distance matrix indicating the distance of each pixel of a region to its closest boundary. By only choosing the pixels with the highest distance, one can extract a morphological skeleton, which is a minimal representation of a shape [NGC92]. If only the head is in frame, the resulting skeleton would be a dot or line across the centre, with the centre of the line being the centre of the head (see Figure 3.5). The skeleton will contain multiple lines with branching points if the torso is also visible. Figure 3.5: Skeletons of an abstracted torso and a circle. Skeletons have been thickened for better visibility and have an actual thickness of 1 px. Both methods fail to accurately detect the head if the torso is in frame or the head is tilted. Recalling the core assumption described at the beginning of this chapter (Assumption 1), a head’s elliptical shape will not be influenced by its pose or the visibility of the torso; thus, ellipse detection can be a viable method to find a human head reliably. 3.1.1 Circles and Spheres If the ears are visible, they can alter the apparent shape in the sensors’ output in such a way that it appears more circular by adding width to the ellipsoid, making circle detection another viable method. Furthermore, due to the concave shape of parts of the face, like the eye sockets and the space between the bottom lip and chin (a fact that is taken advantage of by Haar-like features), we gain details that help determine the centre of the head (as seen in Figures 3.13 and 4.2). Remark. Haar-Like features were proposed by Paul Viola and Michael Jones [VJ01] as features for their Viola-Jones-Algorithm, taking advantage of variations in pixel intensity, like the shadows naturally cast by the human face to classify objects and faces. However, since the native output of the sensor array used in this thesis is 8 × 8, finer details like the Haar-Like features cannot be used reliably and only eight ”circles” are possible, with radii between 0.5 and 4 px, as seen in Figure 3.6 – provided that the head is precisely at the centre of the image. 15 3. Theory of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images (a) r=0.5, A=1 (b) r=1, A=4 (c) r=1.5, A=9 (d) r=2, A=12 (e) r=2.5, A=21 (f) r=3, A=32 (g) r=3.5, A=37 (h) r=4, A=52 Figure 3.6: The possible discrete circles in an 8 × 8 image with their centre at the image centre. Spheres projected into a two-dimensional image appear as grey-scale circles, as shown in Figure 3.7. (a) r=0.5, A=1 (b) r=1, A=4 (c) r=1.5, A=9 (d) r=2, A=12 (e) r=2.5, A=21 (f) r=3, A=32 (g) r=3.5, A=37 (h) r=4, A=52 Figure 3.7: Examples for 2D-projected spheres in an 8 × 8 image. The grey-scale values depend on the surface angles and can differ from the ones shown here. If the head is off-centre, the number of possible circles is reduced since parts of the head would be out of frame. 16 3.1. Theory of Head Detection in Low-Resolution Infrared Amplitude Images To be precise, the possible discrete pixel circles have a radius range of [0.5, ⌊d⌋] , d ≥ 1 (3.1) where d is the minimal distance between the centre of the head and the image boundaries along the X- or Y-axis. 
It is defined as d = min(min(x, w − x), min(y, h − y)) (3.2), where x, y are the X- and Y-coordinates of the head in the image space, and w, h are the image width and height, respectively. Of note is that circles with a radius below 2 are essentially squares, meaning that circle detection is impossible at this size range. To facilitate circle detection, bi-cubic interpolation (with its side effect, the blob formation, see Subsection 3.1.3) is employed. Due to the bigger image size and blob formation, small "circles" with radii below 2 appear circular, while bigger circles remain circular.

Using circle detection to more accurately detect a human head makes the algorithm even more dependent on the positioning. Suppose the head is at the boundaries of the sensor's field of view, either due to being too close to the sensor or too far away from the image centre. In that case, two complications can occur:

Clipping: Part of the head is outside the frame, thus altering the shape of the head visible to the sensor, see Figure 3.8a for a schematic example. This can be mitigated by taking advantage of the head's self-symmetry.

Edge loss: At least one side of the head is occluded by the image boundaries. Although the shape visually looks like a circle, the image boundaries are not recognised as edges. This leads to an ambiguity of shape, as it is not clear whether the visible shape represents the whole object or only a part of it. See Figures 3.8b and 3.8c for a schematic example.

In cases where circle detection is impossible – due to clipping, edge loss, a distance that is too small or too big, or a pose that alters the apparent shape of the head from the point of view of the sensor – shape properties are used to estimate the position of the head. Another assumption for this algorithm is that a hand has to be in front of the head to execute a dynamic gesture. In case two circles do get detected (hand in front and head in the back), one can ignore the circle with the lower depth value (i.e. the closer object).

Of the methods presented, circle detection is the most computationally expensive (with a best-case time complexity of O(n²) for circle detection [KRG94], vs. a best-case time complexity of O(log n) for the distance transform [BK23]), where n is the number of pixels in the image. Nevertheless, due to its stability regarding positioning and head pose, the algorithm discussed in this thesis uses circle detection as the primary and the computationally inexpensive shape centroid as the secondary/fallback method for head detection.

(a) Clipping due to being too far from the image centre or too close to the sensor. (b) Edge loss due to alignment with the image boundaries or being too close to the sensor. (c) Shape ambiguity due to edge loss. Is 3.8b truly the whole shape or just a clipped version of this?
Figure 3.8: Examples of sub-optimal head positioning.

3.1.2 Calculating the Angular Size

We can calculate the minimum distance for any given circular object to be entirely in frame: By visualising our FoV as a triangle, we can draw a line across the centre, resulting in two right triangles, as seen in Figure 3.9.

Figure 3.9: Side-way visualisation of the field of view. α denotes the angle of the field of view, D is the distance between an object and the sensor, and a is the image size of the object.
By applying the trigonometric equation

$\tan(\alpha) = \frac{a}{b}$  (3.3)

to one of the right triangles that make up half of the field of view, we get:

$\tan\left(\frac{\alpha}{2}\right) = \frac{r}{D}$  (3.4)

where α/2 is half the angle of the field of view, D is the distance between the sensor and a given object, and r is the radius of the object in the image. Now, by approximating a human head as a circle with a diameter of 18 cm, we can calculate the minimal distance needed for it not to take up the whole field of view:

$\tan\left(\frac{27^\circ}{2}\right) = \frac{9}{D} \Rightarrow D \cdot \tan(13.5^\circ) = 9 \Rightarrow D = \frac{9}{\tan(13.5^\circ)} \approx 37.5\ \text{cm}$  (3.5)

By changing α to be a quarter of its full size (= 6.75°), we can calculate the maximum distance at which the approximated shape is still registered by at least four sensors (Figure 3.6b):

$\frac{9}{\tan(3.375^\circ)} \approx 152.6\ \text{cm}$  (3.6)

However, this is under the assumption that a completely flat, parallel circle is in front of the sensor. The concavity of the eye sockets, the area between the lower lip and chin, and the angle of the nasal ridge will lead to non-orthogonal reflection of the photons. This means that a part of the photons will be reflected in such a way that they do not return to the receivers of the sensor array, which alters the apparent shape and size.

3.1.3 Bi-cubic Interpolation

When up-sampling an image (i.e. resizing it to a bigger size), the values of newly created pixels in between are unknown (see Figure 3.10).

(a) 4 × 4 image of a white square on a black background. (b) The same image after resizing it by a factor of 2, leading to an 8 × 8 image.
Figure 3.10: Schematic example of image resizing. The cyan pixels are newly created pixels with unknown values.

To interpolate (i.e. infer the value from the given data) the new pixel values, many methods exist, like nearest-neighbour interpolation, bi-linear interpolation, or bi-cubic interpolation.

Nearest-neighbour interpolation is a simple method, which assigns each new pixel the value of the closest original pixel. However, as can be seen in Figure 3.10b, a new pixel can have none or multiple closest original pixels. To consistently assign pixel values, the nearest-neighbour algorithm defines a fixed "direction" to look for the closest original pixel. The resulting image contains no new pixel values and can appear "blocky" due to the abrupt changes in pixel values. Given that the image size of our sensor array is 8 × 8, shapes like the head or the hand already appear blocky. Because we want to better differentiate between an organic shape like a body part and, for example, rectangular furniture, nearest-neighbour interpolation is a poor choice for us, as shown in Figure 3.12.

Bi-linear interpolation creates a linear approximation of the new pixel values, creating a smoother image than nearest-neighbour interpolation, but the resulting image loses sharpness and edges appear less defined due to the linear approximation of pixel values (ramps). As we need well-defined edges to reliably detect shapes efficiently, a function that creates ramps can be counterproductive. Thus, while bi-linear interpolation helps with rounding shapes (as seen in Figure 3.12), it is not favourable for shape detection.
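All three schemes are typically available through a single resize call in common image libraries. The following is a minimal sketch, assuming OpenCV and a placeholder 8 × 8 frame (not data from the actual sensor); it produces the kind of comparison shown in Figures 3.12 and 3.13, with the bi-cubic model itself described next.

```python
import cv2
import numpy as np

# Hypothetical 8x8 amplitude frame standing in for the sensor output (placeholder values).
frame = np.random.randint(0, 700, size=(8, 8)).astype(np.float32)

scale = 4  # illustrative up-sampling factor
size = (frame.shape[1] * scale, frame.shape[0] * scale)  # (width, height) for cv2.resize

# The three interpolation schemes discussed above.
nearest = cv2.resize(frame, size, interpolation=cv2.INTER_NEAREST)
bilinear = cv2.resize(frame, size, interpolation=cv2.INTER_LINEAR)
bicubic = cv2.resize(frame, size, interpolation=cv2.INTER_CUBIC)

# Bi-cubic interpolation may under- and overshoot beyond the original value range.
print(bicubic.min() < frame.min(), bicubic.max() > frame.max())
```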
Finally, bi-cubic interpolation interpolates the missing pixel values using the following model:

$f(x, y) = \sum_{j=0}^{3} \sum_{i=0}^{3} a_{ij}\, x^i y^j$  (3.7)

where a_ij is one of 16 coefficients and x, y is the position of the calculated pixel. Simplified, bi-cubic interpolation approximates the values using splines, which not only depend on the values of the starting and ending pixel but also on their tangents, thus also depending on the previous and the next pixels. A visualisation for a row of pixels is shown in Figure 3.11.

Figure 3.11: Visualisation of bi-cubic interpolation on a row of 6 pixels (red markers). The Y-axis denotes pixel value, and the X-axis is the pixel's position along the row.

While this approach leads to better-perceived image sharpness, it also tends to under- and overshoot, as seen in Figure 3.11: the values between pixels 1 and 2 are in the negative, while the values between 3 and 4 exceed the maximum value of the original pixel row (150). To show the effect of the various interpolations, Figure 3.12 shows an image of a square with uniform pixel values. Of note there is how the bi-cubic interpolation creates values outside the original value range and how both bi-linear and bi-cubic interpolation seem to make a "sphere" out of the square. Figure 3.13 compares the three interpolation methods applied to the sensor amplitude output.

By approximating the model using a convolution kernel, Keys [Key81] proposed a bi-cubic interpolator which sacrifices accuracy for efficiency, and can be used for Image Pyramids [Kro90] or convolution layers in a deep learning network.

(a) Unaltered (b) Nearest Neighbour (c) Bi-linear (d) Bi-cubic (e) Unaltered (f) Nearest Neighbour (g) Bi-linear (h) Bi-cubic
Figure 3.12: Comparison of various interpolation methods for a scale of 2 on a 6 × 6 image of a square.

(a) Unaltered (b) Nearest Neighbour (c) Bi-linear (d) Bi-cubic
Figure 3.13: Comparison of various interpolation methods for the sensor output.

3.1.4 Edge Detection

To analyse shapes present in an image, they first need to be extracted. Since a shape is defined by its edges (e.g. rectangles have four connected edges at 90° angles, squares are a special type of rectangle with all four edges having equal length, triangles have three connected edges with their three angles summing up to 180°, ...), one approach is extracting the edges present in an image. According to Barrow and Tenenbaum [BT81], an edge in an image can be defined as a discontinuity in pixel values, be it colour or brightness/intensity, which implies
1. A discontinuity in depth
2. A discontinuity in surface orientation
3. A variation in reflective properties
4. A variation in illumination

One method of extracting edges is filtering the image with convolution kernels. Edge detection kernels include the Prewitt [P+70] and Sobel [KVB88] filters, which are first-order kernels and function by smoothing and normalising the image before approximating the first derivative of the image. The resulting horizontal and vertical edge images can be combined into a single edge image containing all horizontal and vertical edges [P+70] [KVB88]. However, in cases where edges are neither horizontal nor vertical, both edge detection methods will fail.
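As a usage-level illustration of the first-order filtering just described, here is a minimal sketch assuming OpenCV and an already up-sampled amplitude image (the array contents are placeholders, not sensor data):

```python
import cv2
import numpy as np

# Placeholder for an up-sampled amplitude image (e.g. the bi-cubic result from Subsection 3.1.3).
img = np.random.rand(32, 32).astype(np.float32)

# First-order derivatives approximated with 3x3 Sobel kernels.
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)  # horizontal derivative (responds to vertical edges)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)  # vertical derivative (responds to horizontal edges)

# Combine both derivative images into a single edge-strength image.
edges = cv2.magnitude(gx, gy)
```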
Thus, second-order kernels like the Laplacian edge detector [MH80] can be used, which approximate the second-order derivative and discern between outward and inward edges. Due to using the second derivative of the image however, the Laplacian filter is more sensitive to noise, which can be mitigated by applying a noise-reducing filter like the Gaussian beforehand [MH80]. Finally, the Canny method [Can86] combines first-order kernels like Sobel and Prewitt with pre- and post-processing steps to achieve an edge detection that is less sensitive to noise. After calculating the gradient of the image, non-maximum suppression is used to determine the pixel with the highest intensity along the gradient direction of the edge and eliminate any other edge pixel that does not satisfy this condition – thus thinning the edges. Next, a double threshold is employed as a first step to eliminate false edges; any edge pixel with an intensity value above the high threshold is considered a strong pixel and thus contributes to the final edge, while edge pixels with an intensity value above the weak threshold, but lower than the strong threshold, are considered weak pixels, which need further post-processing to determine their contribution to the final edge image. Edge pixels below the given weak threshold are considered non-relevant and are discarded. 23 3. Theory of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images The resulting edge image only contains three pixel values: 0 for non-edge pixels, 255 for strong edge pixels, and 128 for weak edge pixels. Finally, weak edge pixels undergo Hysteresis, where they are either eliminated or trans- formed into strong edge pixels (i.e. raising their intensity value to 255). This is done by tracking every weak edge from one end to another; if the weak edge connects to a strong edge on any end, it is transformed into a strong edge. Otherwise, it is discarded by turning the intensity value of all its pixels to 0. Due to the multiple passes, the Canny method is computationally more taxing. Never- theless, the higher quality of the resulting edge image thanks to the noise stability and false positive elimination increases the odds of correctly identifying shapes, which is why it is chosen as the pre-processing for the next step, the Circular Hough Transform in Subsection 3.1.5. 3.1.5 Hough Transform One of the known methods for finding geometric shapes like lines, circles, or other classes of (parametric) shapes is the Hough Transform [Hou62]. By extracting an edge image out of an input image and then transforming it to a parameter space (also known as Hough space) with polar coordinates, imperfect shapes can be still recognised, if there is an intersection in the Hough space (examples in Figure 3.14). Instead of representing a line using the slope a and intercept b in the Cartesian system like y = ax + b, the polar representation uses ρ for the shortest distance between the origin and the line and θ for the angle between the X-axis and the distance line. Given ρ and θ, the following equation is true for any point along the line: ρ = x cos θ + y sin θ (3.8) Thus, a line can be mapped to a single point in the parameter space of ρ and θ. Now, if there is a point in image space – meaning that x and y are fixed – its representation in the parameter space is a sinusoid. 
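A small numeric sketch of this point-to-sinusoid mapping (assuming NumPy; the coordinates are arbitrary examples):

```python
import numpy as np

# A single edge point (x, y) in image space maps to a sinusoid in the (theta, rho) plane.
x, y = 12, 20                                   # arbitrary example coordinates
thetas = np.deg2rad(np.arange(0, 180))          # sample theta over [0, 180) degrees
rhos = x * np.cos(thetas) + y * np.sin(thetas)  # rho(theta) for this point, cf. Equation (3.8)

# Collinear points yield sinusoids that share one (theta, rho) value, which is what the
# accumulator-based voting described next exploits.
```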
Using these two properties of the parameter space, straight lines can be detected, even if they are imperfect or disconnected: Every point creates a sinusoid, which means that a drawn line in the image space will result in multiple sinusoids in the parameter space. These sinusoids will intersect at a single point: the point with the ρ and θ value of the line they are aligned with (see Figure 3.14 for examples). The point of intersection is found by creating an accumulator space containing the sinusoids of the parameter space. Via voting, every point along the sinusoids increases the corresponding point in the accumulator space. If sinusoids intersect, their intersection point will have n votes, where n is the number of intersecting sinusoids. Finally, the local maxima (i.e. the points with the highest number of votes/intersections) are chosen as the parameters for the line candidates in the image space. 24 3.1. Theory of Head Detection in Low-Resolution Infrared Amplitude Images (a) Dashed, shaky line. The intersection at θ ≈ 42.5 and ρ ≈ 530 shows, that the segments align along the line 530 = x cos 42.5 + y sin 42.5. (b) Two shaky lines. The two intersections mark the polar parameters of the two lines that align with them. Figure 3.14: Examples of Hough transformation with lines. Of course, due to this property, lines can also be detected where there are none, e.g. when objects in an image happen to align coincidentally. Another drawback is that the end of a line cannot be determined using Hough Transform alone, since it only works with lines of infinite length. In his work, Dana H Ballard [Bal81] present their Generalized Hough Transform, which is able to detect any arbitrary shape, but is computationally more taxing than the standard Hough Transform for lines. 25 3. Theory of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images Circular Hough Transform The circular Hough Transform is a version of Hough Transform which uses a three- dimensional parameter space: the X-position of the centre, the Y-position of the centre and the radius. Particularly at lower resolutions (and image sizes like 8 × 8 like in the case of this thesis), circle detection might provide a better detection rate than face detection techniques that rely on finer facial features. By taking advantage of the overshoot (or haloing) and blurring that occurs when using Bi-cubic Interpolation, circles with a radius ≥ 2 px appear more circular (see Figure 3.15). (a) r=1.5, unaltered (b) r=1.5, interpolated (c) Sobel of interpolation (d) r=2, unaltered (e) r=2, interpolated (f) Sobel of interpolation (g) r=3.5, unaltered (h) r=3.5, interpolated (i) Sobel of interpolation Figure 3.15: Four-time magnification of discrete circles using Bi-cubic Interpola- tion and their resulting Sobel edge images. Note how the ”circle” with r = 1.5 merely turns into a square with rounded corners. Furthermore, observe how the Sobel edge detection behaves when the circle is at the image boundary. Even though the circle is symmetrical, the edge image appears to have two protrusions to the image boundaries. 26 3.1. Theory of Head Detection in Low-Resolution Infrared Amplitude Images One such circle detection algorithm is the Circular Hough Transform, which, instead of taking the polar coordinates used to describe a line like in Hough Transform, uses a three-dimensional space with the parameters of a circle: The two coordinates of the centre (a, b) and the radius r. 
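Library implementations bundle the steps described next (edge extraction, voting, peak selection) into a single call. A hedged usage sketch with OpenCV's Hough-gradient variant follows; note that this is not the phase-code method referenced further below, and all parameter values are purely illustrative:

```python
import cv2
import numpy as np

# Placeholder: an up-sampled 8-bit amplitude image (e.g. 8x8 scaled by 4 to 32x32).
img = np.zeros((32, 32), dtype=np.uint8)
cv2.circle(img, (16, 16), 9, 255, -1)  # synthetic bright disc standing in for a head

circles = cv2.HoughCircles(
    img, cv2.HOUGH_GRADIENT, dp=1, minDist=16,
    param1=100,   # upper Canny threshold used internally
    param2=10,    # accumulator threshold: lower values admit more (possibly false) circles
    minRadius=4, maxRadius=16,
)
if circles is not None:
    x, y, r = circles[0][0]  # centre and radius of the strongest candidate
```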
If a circle of fixed radius is drawn on some/all points of an edge and all circles along the points intersect at one single point (found again using the accumulator matrix), then that point is the centre of the detected circle and its radius is equal to that of the circles in the parameter space, as seen in Figure 3.16. Figure 3.16: Example of Circular Hough Transform. The dashed circles are in the parameter space, and the solid is in the image space. As can be seen here, if their radius is equal to the image circle, they will intersect at the centre of it. However, finding the intersections of the circles in the three-dimensional parameter space is more computationally taxing than in a two-dimensional space. Taking our image size of 8 × 8 alone, a search for just the position of a circle (meaning the x and y coordinates) means looking at 64 possible points. If we also need to know the radius of a circle, we not only have the 64 possible centre points but also the 16 possible radii (see Figure 3.6). Thus, we would not have 64, but 64 · 16 = 1024 possibilities. One method to simplify this task is the Phase Code Hough Circular Transformation by Tim J Atherton and Darren J Kerbyson [AK99], which searches for circles in a range of radii, thus limiting the radius dimension and only needing to search in the two-dimensional x, y space of the circle centre. Unfortunately, even though the Phase Code Hough Circular Transformation is generally scale-invariant, a certain resolution is needed for a circle to be recognised as such (see Figure 3.6). Thus, the circle detection works best when the head is at a depth where it reflects to at least 12 receivers (e.g. Figure 3.6d), but not more than 34, as it would then be at at least one image boundary, making shape detection impossible. Furthermore, circle detection in the application presented in this thesis assumes that the head is always in frame, upright, and facing the sensor. If the head is facing upwards, it might lose its apparent circularity. 27 3. Theory of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images 3.1.6 Blob Detection and Connected-Component Labeling Blob detection and Connected-Component Labeling (CCL) are two other possibilities – essentially finding groups of pixels based on similarity and shape (blob detection) or connectivity and an attribute (CCL). Blob Detection Methods like the Laplacian of Gaussian (LoG) [Lin93] or Difference of Gaussian (DoG) [Low04] are used to detect blobs of multiple sizes by convolving the image with their respective kernel using different scales. The resulting responses create a scale space, and by searching for extrema in the scale space, the position and characteristic scale of the blobs are determined. Figure 3.17 shows an example output for both methods. Figure 3.17: Example for Laplacian of Gaussian and Difference of Gaussian blob detection on an image of a coffee cup. Connected Component Labeling The attribute is what defines a region CCL – pixels that are similar in regards to the chosen attributes belonging to a region, provided they are in the chosen connectivity. In our application, we have two attributes from the get-go: depth and amplitude. Of course, further attributes can be created by combining or computing data, making more options available. Connectivity defines the ”search window” and has multiple options, the common ones being the 4- and 8-connectivity. 
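Both connectivity variants are elaborated below; as a brief illustration, the following Python sketch labels the same binary mask under 4- and 8-connectivity using scipy.ndimage (the library choice and the toy mask are assumptions for illustration, not the implementation used in this thesis):

import numpy as np
from scipy import ndimage

# Toy binary mask: 1 marks pixels that pass the chosen attribute test.
mask = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
], dtype=bool)

# 4-connectivity: only horizontal/vertical neighbours count as connected.
labels4, num4 = ndimage.label(mask, structure=ndimage.generate_binary_structure(2, 1))
# 8-connectivity: diagonal neighbours count as well.
labels8, num8 = ndimage.label(mask, structure=ndimage.generate_binary_structure(2, 2))

print(num4, num8)  # here: 9 regions with 4-connectivity, 1 region with 8-connectivity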
4-connectivity compares the four neighbouring pixels (x ± 1, y) and (x, y ± 1), while 8-connectivity also compares the neighbouring corner pixels, i.e. looking at every pixel in (x ± 1, y ± 1). Figure 3.18 shows an example of CCL using 4- and 8-connectivity given an arbitrary attribute.

(a) 4-connectivity, 5 regions (b) 8-connectivity, 2 regions
Figure 3.18: Examples for 4- and 8-connectivity and their impact on the number of detected regions.

Going back to depth and amplitude, if we, for example, encode amplitude with colour hue and depth with brightness, we can visualise how connected component labelling could work depending on the attribute, as seen in Figure 3.19.

(a) amplitude, 2 regions (b) depth, 4 regions
Figure 3.19: Examples for connected component labelling on amplitude and depth, using 8-connectivity. Amplitude is visualised by colour hue, while depth is denoted by the pixel intensity (brightness).

The issue with blob detection and CCL is their indiscriminate functionality; blob detection will detect any shape, as long as it is compact (i.e. if it roughly fits inside a circle), while CCL will return every region that is connected and similar regarding the given attribute. This means that non-circular objects can be detected, increasing the false positive rate and thus reducing the accuracy of head tracking. Hence, we have decided to use circle detection for head detection.

3.1.7 Shape Properties

In computer vision, shape properties are metrics and attributes that describe a shape (i.e. a coherent region of pixels), which can be used for decision-making and feature extraction. Shape properties are calculated mainly from binary images and include attributes like area, perimeter, orientation, centroid, and extrema, which are presented here.

Circularity

One shape property is the circularity/roundness of a region, which is calculated as

circularity = 4πA / P² (3.9)

where A is the area and P is the perimeter of the region (i.e. the sum of the distances between boundary pixels in an 8-neighbourhood). Assuming a perfect circle, we know that the area is πr² and the perimeter is 2πr. Inserting this into the circularity equation we get:

4π(πr²) / (2πr)² = 4π²r² / 4π²r² = 1 (3.10)

However, as with the Circular Hough Transformation in Section 3.1.5, circularity is only reliable above a certain shape size. In the case of the sensors used in this thesis, the image size is too small, as seen in Equation 3.11, where a discrete circle with a diameter of 7 has a higher circularity than a theoretical perfect circle; with an area of 37 and a perimeter of ≈ 18.4, we get a circularity of ≈ 1.4:

4 · π · 37 / 18.4² ≈ 1.4 (3.11)

Note that even with an up-sampling of 4 times, a circle with a diameter of 28 would still have a circularity above 1. Nevertheless, rectangular shapes will have a circularity below 1 and thus, circularity can eliminate some non-organic shapes. The circularity of a rectangle with side lengths a and b is

4πab / (2a + 2b)² = 4πab / (4a² + 8ab + 4b²) = 4πab / (4ab(a/b + 2 + b/a)) = π / (a/b + 2 + b/a) (3.12)

Since π is a constant and the denominator is at least 4 (reached for a = b), the circularity will always be < 1, getting smaller the bigger the difference between a and b gets. The rectangle with the highest circularity is a square, which will always have a circularity of ≈ 0.785 since its area is a² and its perimeter is 4a.
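To make the circularity property tangible, the following Python sketch computes Equation 3.9 for a discrete circle of diameter 7, assuming scikit-image is available; it is only an illustration (perimeter estimates differ slightly between libraries, so the value will not exactly reproduce the ≈ 18.4 used above):

import numpy as np
from skimage import measure

# Binary mask of a discrete "circle" with diameter 7 (radius 3.5 px) on a 9 x 9 grid.
yy, xx = np.mgrid[0:9, 0:9]
mask = ((yy - 4) ** 2 + (xx - 4) ** 2 <= 3.5 ** 2).astype(np.uint8)

region = measure.regionprops(measure.label(mask))[0]
area, perimeter = region.area, region.perimeter   # area is 37 px for this mask

circularity = 4 * np.pi * area / perimeter ** 2   # Equation 3.9
print(area, round(float(perimeter), 2), round(float(circularity), 2))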
For the square (area a², perimeter 4a), the circularity equation evaluates to:

4π · a² / (4a)² = π / 4 ≈ 0.785 (3.13)

Centroid

The centroid is the centre of mass of any given shape. It is calculated as the arithmetic mean of all pixel coordinates that belong to the shape:

(cx, cy) = ( Σ (px, py) ) / n (3.14)

where (px, py) are the (x, y)-coordinates of a pixel and n is the total number of pixels in the region. Unlike pixels, however, which have (x, y)-positions solely in ℕ (e.g. [1, 2, 5]), the coordinates of a centroid are in ℝ (e.g. [1.4, 2.6, 5.8]), meaning that they can be "inside" pixels.

Keep in mind that this method of calculation can be unstable, especially at small image sizes. Because every detected pixel of a region has an equal influence on the outcome, the relative amount of outliers and artefacts will be higher in small image sizes compared to larger ones; for example, 4 "faulty" pixels in an 8 × 8 image amount to 6.25%, while 4 pixels in a 32 × 32 image are 0.39%. Thus, outliers and inexact shape boundaries will shift the centroid more, the smaller the shape is (see Figure 3.20). A way to mitigate this is calculating the weighted centroid, which also takes pixel values into account:

(cx, cy) = ( Σ pv · (px, py) ) / Σ pv (3.15)

where pv is the pixel value in [0, 1] and (px, py) are the x and y coordinates of the pixel. In the application presented in this thesis, the weight would be the depth or amplitude value. The higher the pixel value, the "heavier" it is, therefore shifting the centroid more towards it. Due to the spherical/ellipsoid shape of the human head, the centre should have a higher amplitude and/or lower depth, which is why the weighted centroid is employed in this thesis.

(a) 3 × 3, [4.00, 4.00] (b) 25 × 25, [16.50, 16.00] (c) 3 + 1 × 3, [4.20, 4.00] (d) 25 + 1 × 25, [16.52, 16.00]
Figure 3.20: Comparison of the impact of outlier pixels on two different sizes. One pixel shifts the centroid of the 3 × 3 square by 5% to the right, and that of the 25 × 25 square by 3%.

Remark. As mentioned at the beginning of this chapter (Section 3.1), the centroid is not the only possible method to calculate the "center" of a given shape. Another way to calculate it can be done in two steps: first, calculate the distance transform by Azriel Rosenfeld and John L Pfaltz [RP68] or the eccentricity transform by Kropatsch et al. [KIHF06], which is more stable in regards to noise. Then, choose the pixel with the highest value for the distance transform, or the lowest value for the eccentricity transform. In the case where multiple points have the highest distance, a simple average between the centre candidates would suffice. Both methods could deliver more accurate and stable points of reference.

Shape Extrema

Definition 3.1.1. Shape extrema are the highest and/or lowest (x, y)-positions belonging to a region and are calculated using an n × 2-sized matrix Pxy containing the pixel positions of all pixels belonging to a region.

By keeping one of the two coordinates fixed, eight extrema (shown in Figure 3.21) can be calculated: For example, if we choose right-bottom, we first look for the highest x-value of the region (the global maximum).
Having found that extremum e_x, we now search for the lowest y-value belonging to the region at that specific x-position (e_x, y) (the local minimum). Hence, the following extrema can be calculated:

left-bottom: global minimum of x, local minimum of y: e_x = min(Px), (x, y) = (e_x, min{y | (e_x, y) ∈ Pxy})
left-top: global minimum of x, local maximum of y: e_x = min(Px), (x, y) = (e_x, max{y | (e_x, y) ∈ Pxy})
right-bottom: global maximum of x, local minimum of y: e_x = max(Px), (x, y) = (e_x, min{y | (e_x, y) ∈ Pxy})
right-top: global maximum of x, local maximum of y: e_x = max(Px), (x, y) = (e_x, max{y | (e_x, y) ∈ Pxy})
bottom-left: global minimum of y, local minimum of x: e_y = min(Py), (x, y) = (min{x | (x, e_y) ∈ Pxy}, e_y)
bottom-right: global minimum of y, local maximum of x: e_y = min(Py), (x, y) = (max{x | (x, e_y) ∈ Pxy}, e_y)
top-left: global maximum of y, local minimum of x: e_y = max(Py), (x, y) = (min{x | (x, e_y) ∈ Pxy}, e_y)
top-right: global maximum of y, local maximum of x: e_y = max(Py), (x, y) = (max{x | (x, e_y) ∈ Pxy}, e_y)

If the whole region consists of just a single pixel, it still has all eight extrema – two at each corner. This process also works with concave shapes, as the local extrema are searched in all pixel positions of Pxy.

Figure 3.21: Example for extrema. It is possible for extrema to overlap, like with right-bottom and bottom-right.

Furthermore, one can combine two or more extrema to generate new ones, e.g. right-middle, which is the centre of right-top and right-bottom. Note, however, that those new extrema can sometimes lie outside of the shape, especially in the case of concavities. For example, calculating the average of top-left and left-top in the shape shown in Figure 3.21 would lead to such outside points.

3.2 Theory of Gesture Recognition in Low-Resolution Infrared Amplitude Images

Gesture recognition is a combination of movement tracking and pattern matching. The challenge lies in detecting the hand to track and adjusting the sensitivity to movement, since humans rarely stay perfectly still or recreate shapes perfectly. Additionally, hands can be used as part of communication (e.g. for emphasis), which should not be recognised as deliberate gestures. Another point of consideration is the limits of human anatomy in regards to hand and arm movement: While a hand may be able to "draw" a circle with its fingertips, its range of movement is limited [HGMBJ90], especially without movement of the arm, which is also limited in its movement [GST+20] [STM11].

In this thesis, we define a gesture as a deliberate, consistent movement in one direction (directional gesture, Definition 1.3.3). Thus, we have four parallel (up, down, left, right) and two perpendicular (front, back) gestures, corresponding to the 6-connectivity in three dimensions ℕ³. Due to the rectangular arrangement of the sensor field, diagonal or "shape" (e.g. circle, square, triangle, L-shape) gestures could be implemented, but are not in the scope of this thesis.

Since ToF-IR-sensors not only measure the amplitude but also the depth, we can already discern between parallel and perpendicular movement (see Figure 3.22). Parallel movements along the general axes can be detected by tracking the amplitude changes in every field of the array, while perpendicular gestures can be detected by tracking the depth changes.

Figure 3.22: Parallel and perpendicular movement and its influence on the depth value.
The sensor is the small grey rectangle on the left, while the ellipse is a moving object. With parallel movement, the depth measurements stay in a given range, while perpendicular movement can have any depth between 0 and ∞.

Since this thesis combines head detection with gesture recognition, the algorithm presented works under the assumption that a human head is in the field of view. Hence, just tracking the amplitude or depth changes without discerning between hand and head could lead to missing or spurious gesture recognition: For example, suppose the hand is still for three frames and the algorithm correctly detects the hand in the first frame, then wrongly the head in the second, and finally the hand correctly again in the third frame. This would cause a difference in position (and thus "movement") to be registered, equivalent to moving from the hand's position in frame 1 to the head's position in frame 2 and back to the hand's position in frame 3. If the centroid of the head is falsely recognised as the centroid of the hand over multiple frames, the movement of the head, not the hand, is tracked, leading to gestures being missed. Additionally, if we manage to remove the head from the field of view, we still have to track the palm of the hand and not the arm. Hence, a method is needed to recognise the arm in the scene and track the movement of its hand.

Finally, we have to consider the aperture problem [Hil84]: If the arm is perfectly horizontal or vertical and the hand is out of view, we cannot determine horizontal or vertical movement, since the edge information will not change between frames. Thus, the algorithm presented in this thesis also requires the hand to always be in frame during the execution of a gesture.

3.2.1 Gesture Space

At an image size of 8 × 8, detecting a hand cannot rely on finer features like fingers, since the image size is not big enough to visualise fingers even at 1 px thickness; under the assumption that the boundary or gap between fingers has a minimum width of 1 px, we have five fingers and four boundaries between them, meaning that we would need at least 9 px to visualise all fingers of the hand.

The algorithm in this thesis makes assumptions to simplify this task: Since the field of view of the sensors is narrow (27°) and our algorithm is limited to a close distance (up to 73 cm), we can assume that the hand must be in front of the head while gesturing and is the closest object to the sensor. This space between the sensor and head is what makes up the "gesture space" in this thesis (seen in Figure 3.23 as an exemplary visualisation).

Figure 3.23: Exemplary visualisation of the created gesture space (blue). The sensor is the grey rectangle on the left.

Thus, all that needs to be done is to take the closest point to the sensor and use everything within a distance ϵ of it to mask the image. Of course, this also means that, in order to minimise false positives, no other objects should be in the field of view in front of the head.

3.2.2 Shape Properties for Hand Position Estimation

As discussed in Subsection 3.1.7, there are ways to estimate the position of the centroid of a shape. Assuming only the hand is in frame, a centroid would be a sufficient choice for tracking the movement. However, since the arm will be visible at distances between 40 cm and 73 cm, calculating a simple centroid might yield a position on the forearm, which
can lead to a complication: the further down on the forearm the calculated centroid is, the smaller the perceived movement of the centroid when a gesture is performed without moving the elbow. This is because, if the elbow is fixed, the movement of the forearm can be considered as an arc, whose length can be calculated using

r · (θ / 180°) · π (3.16)

where r is the distance between the centre of rotation (in our case the elbow) and the tracked point, and θ is the angle between start and finish point (see Figure 3.24). By converting degrees to radians (1° = π/180 rad, hence 180° = π rad), we can simplify the equation to:

r · θ, θ ∈ [0 rad, π rad] (3.17)

Thus, the farther down on the forearm the centroid is, the smaller r is and the less likely it is for the movement to exceed the set threshold for gesture detection.

Figure 3.24: Visualisation of a left-to-right movement of an arm and the arcs created by it. θ is the angle between the start and finish point of the movement and r is the length between the elbow and another point of the arm (be it the middle of the forearm or the middle of the hand).

Because we want to make gesture recognition independent of arm pose, shape extrema cannot be considered, as they are highly pose-dependent in our application; since a hand could lie anywhere on a circle, it could be at any of the extrema, and calculating a centroid between top-left, top-right, right-top, and left-top can only work if the arm is facing upwards. Nevertheless, this approach could be combined with another shape property: orientation. Shape orientation is the angle between the major axis (the longest possible line) of the shape and the X-axis (in our case, the horizontal plane). Since the major axis of an arm is the line between joints, calculating its orientation and then combining a set of shape extrema depending on the orientation could lead to a hand (or end-of-forearm) detection with a tolerance of ≈ 10 cm, provided the hand and arm are the closest objects in the gesture space. Figure 3.25 visualises the major axis and the centroid estimated from shape extrema.

Figure 3.25: Schematic example for centroid calculation using shape properties and orientation. The dashed red line marks the major axis of the shape, the red dots are the shape extrema used for the given orientation, and the blue asterisk is the centroid calculated from the given extrema.

Another option would be to calculate the convex hull of the hand/arm and apply the medial axis transform to build a skeleton [LKC94], as seen in Figure 3.26. Then, the branching point would be a good approximation for the hand centroid.

Figure 3.26: Schematic example for convex hull (blue overlay) creation and skeletonization (red line) using the medial axis transform.

However, since a flat palm facing the sensors should have a higher amplitude than the arm (see Figure 3.3), the weighted centroid (see Subsection 3.1.7) can be used for pose-independent hand detection, which is computationally efficient. By using the amplitude data for pixel weights, the centroid should be skewed towards the end of the arm and, ideally, the hand.
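As a concrete sketch of the amplitude-weighted centroid (Equation 3.15), the following Python snippet estimates a hand position from a masked amplitude frame; the function name and the toy frame are illustrative assumptions, not the exact implementation of this thesis:

import numpy as np

def weighted_centroid(amplitude, mask):
    # Amplitude-weighted centroid of the masked region (cf. Equation 3.15).
    ys, xs = np.nonzero(mask)
    w = amplitude[ys, xs].astype(float)
    w /= w.sum()                          # normalise the weights
    return float((w * xs).sum()), float((w * ys).sum())

# Toy 8 x 8 amplitude frame: an "arm" of low amplitude with a brighter "palm" on its right end.
amp = np.zeros((8, 8))
amp[3:5, 0:6] = 1.0                       # forearm
amp[2:6, 5:8] = 3.0                       # flat palm reflecting more light
cx, cy = weighted_centroid(amp, amp > 0)
print(round(cx, 2), round(cy, 2))         # centroid is pulled towards the palm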
CHAPTER 4
Algorithm of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images

The algorithm described in this thesis combines the knowledge presented to track the position of a human head and detect gestures. This is done by first detecting the head using one of two approaches, depending on distance. After estimating the position of the head, the space in front of it (i.e. the depth values between 0 and the detected head position) is defined as the space used for hand and gesture recognition (gesture space). If head detection is disabled, the whole depth range [0 − 73] cm is considered as the gesture space. Then, a weighted centroid is used to estimate the position of the hand, and by tracking its movement to discern a direction, directional gestures (Definition 1.3.3) are recognised.

4.1 Algorithm of Head Tracking in Low-Resolution Infrared Amplitude Images

The head tracking algorithm consists of two parts: detecting the head and tracking its position over time. For head detection, a stream of frames is fed to the algorithm, containing amplitude and depth data. The first and most important step is to mask the image to improve the conditions our head detection has to work with. An ideal mask would only show the human skin (or better yet, the human head) and nothing else.

At close range (0 − 37.5 cm distance from the tip of the nose to the sensor), shape properties suffice, since the head takes up most of the visual field of the sensor (as shown in Figure 3.6). Using the amplitude at these distances, we can even differentiate between the palm of a hand and a head. This is because a hand with its palm facing the sensor will always have a higher average amplitude than a head, due to the roundness of a human head and the shadows cast by the eye sockets, nose and chin, which is corroborated in Figure 4.3. We can apply this knowledge to every depth value by employing the inverse square law of photometry [Gla14]:

amplitude = amplitude₀ / depth² (4.1)

where amplitude₀ is the amplitude at depth = 0, and amplitude is the measured reflectance at a given point. However, since this equation is an approximation, the resulting data still exhibits exponential behaviour, seen in Figure 4.1.

Figure 4.1: Amplitude (Y-axis, number of photons) over depth (X-axis, in mm) measurements from our experimental data. The measurements are taken from the tip of the nose.

Remark. The experiments in this thesis for the amplitude range seen in Figure 4.1 were done with only three subjects. Thus, we cannot make a generalised statement about the correct amplitude range.

By masking data outside this range and passing it on to the head detection algorithm, this approach virtually reduces the possible candidates to three: the head, the torso, and the hand. With enough data, infrared amplitude could be used to create a mask for human skin, as indicated by Mendenhall et al. [MNM15].

Due to the very low input image size, efficient real-time head detection using machine/deep learning could be achieved using both amplitude and depth data; ground truth can be created using a high-resolution sensor in tandem for annotation, and after training, the weights can be uploaded to a microchip running inference. Support Vector Machines [CV95] could also find a discriminating threshold for human skin.
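For reference, a minimal Python sketch of the amplitude masking based on the inverse square law (Equation 4.1) is given below; the function names and the empirical amplitude range [lo, hi] are assumptions for illustration (as noted in the remark above, a generalised range cannot be stated from our data):

import numpy as np

def depth_compensated_amplitude(amplitude, depth_mm):
    # Undo the inverse square law (Equation 4.1): amplitude_0 = amplitude * depth^2.
    amp0 = np.zeros_like(amplitude, dtype=float)
    valid = depth_mm > 0
    amp0[valid] = amplitude[valid] * depth_mm[valid].astype(float) ** 2
    return amp0

def skin_candidate_mask(amplitude, depth_mm, lo, hi):
    # Keep only pixels whose compensated amplitude falls into the empirical range [lo, hi].
    amp0 = depth_compensated_amplitude(amplitude, depth_mm)
    return (amp0 >= lo) & (amp0 <= hi)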
Both of these learning-based alternatives (machine/deep learning and Support Vector Machines), however, are outside the scope of this thesis, as we want to explore a comprehensible and reproducible method for solving this task.

Another idea could be depth masking, where one can take advantage of the fact that a human head is always attached to the body, which means that the lowest row of the IR array would always show part of the body: the torso, neck, or head. By taking the 50-percentile (median) of the depth values of the bottom row and creating a search window ϵ, a mask can be created that only shows measurements at the depth of the torso/neck/head ± the search window ϵ. However, if a person is positioned in such a way that the torso, neck, or head only takes up one sensor in the bottom row of the array, the other seven would sample the background, thus skewing the median towards the background depth values. The resulting 50-percentile would be a depth value closer to the background measurements, and the search window ϵ could, depending on the depth values of the background, completely exclude the person in view, since they would be significantly closer to the sensor.

4.1.1 Estimating the position of the head

Once the input is pre-processed (IR input masked on depth, reshaped to 8 × 8, and rotated based on the output of the gyro-sensor), head detection can begin. One of two modes is used, depending on the depth value of the closest measured point of the body (i.e. the measurement with the smallest depth value).

Body closer than 37.5 cm
The body is so close to the sensor that part of the head is clipped from the image (see Figure 3.5). Thus, the resulting shape cannot be detected as a circle, but we can use the weighted centroid of the shape, which is within 10 cm of the true centroid of the head. This is under the assumption that the object in frame indeed is a head and not another object of similar reflectance. In this mode, a hand can and will be detected as a head, since there is no discrimination or filtering.

Body between 37.5 cm and 73 cm
The head is sufficiently far away to make circle detection possible, provided it is centred. It is also not so far away as to be registered by fewer than four sensors (see Figure 3.6).

We assume that to make a gesture, a user's hand will always be closer to the sensor than their head. Thus, the difference between the closest and farthest depth measurement of the body (= depth range) will be larger with the hand in frame than without. This assumption can be used to mitigate false positive detection of the hand. If the depth range exceeds 10 cm (the approximate depth difference between the tip of the nose and the neck), the midway distance is calculated:

dmidway = (min(dbody) + max(dbody)) / 2 (4.2)

where dbody are the depth measurements of every sensor, filtered by the amplitude measurements as described at the beginning of Section 4.1. Afterwards, a mask can be created to filter out any fields with depth values lower (= closer) than the midway distance, which is then applied to the amplitude data. This leaves only amplitude pixels that have a depth measurement in [dmidway, max(dbody)]. This approach is not without fault, however: if the hand is closer to the body than the defined threshold (e.g. next to the face), the false positive rate and reliability will not improve, as the hand will still be in frame.
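A minimal sketch of the midway-distance masking described above might look as follows; it assumes Equation 4.2 denotes the midpoint between the closest and farthest body measurement, and the function name and the 10 cm default are illustrative:

import numpy as np

def gesture_depth_mask(depth_mm, body_mask, min_range_mm=100.0):
    # Mask out a hand held clearly in front of the body (Subsection 4.1.1).
    d_body = depth_mm[body_mask]
    if d_body.size == 0:
        return np.zeros_like(body_mask, dtype=bool)
    d_min, d_max = float(d_body.min()), float(d_body.max())
    if d_max - d_min <= min_range_mm:       # depth range too small: assume no hand in frame
        return body_mask
    d_mid = (d_min + d_max) / 2.0           # midway distance (Equation 4.2)
    return body_mask & (depth_mm >= d_mid)  # keep only measurements at or behind the midway distance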
The resolution of the image is then up-scaled using Bi-cubic Interpolation; to minimise the computational complexity of the following methods while still taking advantage of higher fidelity and blob formation, a factor of 4 was chosen, as seen in Figure 4.2.

Figure 4.2: A human head at ≈ 30 cm distance before and after up-scaling. Note the blob formation and higher amplitude values on the cheeks and chin after up-scaling.

Afterwards, the Circular Hough Transform is applied, searching for circles with radii between 4 and 16 pixels (corresponding to circles with radii of 1-4 pixels in an 8 × 8 image, as seen in Figure 3.6). This approach, however, has caveats:

1. If the head is at the edge of an image, its shape will not be circular (see Figure 3.15i) and will thus not be detected.
2. Circle detection does not differentiate between circular shapes. If there is another object in the field of view with a similar reflectance, circle detection will detect it, too.

If no circle can be detected, a centroid is calculated from the average of the top-left, top-right, left-top, and right-top shape extrema – under the assumption that the head is inside the field of view.

4.1.2 Selecting the correct head candidate

Since we have filtered for reflectance close to human skin, our head candidates should be one of three body parts: the head, the hand, or the torso. As mentioned in Subsection 3.1.1, the shape of the head will lead to non-orthogonal reflection. In comparison, both the hand and torso are relatively flat and most of the time parallel to the sensor (if the user is facing the sensor), leading to a higher average amplitude compared to the head. This assumption is further corroborated by the findings of Mohamad et al. [MSJO14], visualised in Figure 2 of their publication (see Figure 4.3).

Figure 4.3: Near infrared reflectance spectra of nine different parts of face and hand [MSJO14]. Note how the forehead and both cheeks always have a lower reflectance than the hands.

Hence, by calculating the average amplitude of each detected head candidate, the one with the lowest average amplitude is selected as the head.

4.1.3 Tracking the Head

Once a centroid has been selected and returned, it is stored in a list of positions and depths in memory with a size of 3 × m, for m frames. This array is updated every frame according to the first-in-first-out (FIFO) principle. To smooth the positioning, one can average the positions as follows:

Phead(xn, yn, zn) = ( Σ_{i=n−m+1}^{n} Phead(xi, yi, zi) ) / m (4.3)

where n is the number of the n-th (= latest) frame in the image series, m is the memory size, Phead(xn, yn, zn) is the smoothed centroid position and measured depth of the head in the n-th frame, and Phead(xi, yi, zi) is the centroid position and depth of the head in the i-th frame. Then, the gradients ∂Phead/∂x, ∂Phead/∂y, ∂Phead/∂z of the centroid positions and depth values in memory are computed for each frame to get the change in position/depth, essentially applying Optical Flow [HS81] to a single three-dimensional point.

4.2 Algorithm of Gesture Recognition

If head tracking is enabled, gesture recognition waits for the head detection to return the depth of the head position. The head position is then used to create a gesture space (see Figure 3.23), and the movement of objects with depth measurements in [0, depthhead] is tracked.
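The tracking memory of Subsection 4.1.3 (and its reuse for the hand in Subsection 4.2.2) can be sketched in a few lines of Python; the class name and the example values are assumptions for illustration only:

from collections import deque
import numpy as np

class CentroidMemory:
    # m-frame FIFO memory of (x, y, z) centroids with smoothing (Equation 4.3)
    # and summed absolute per-frame changes, as used later for gesture detection.

    def __init__(self, memory=5):
        self.positions = deque(maxlen=memory)

    def update(self, x, y, z):
        self.positions.append((x, y, z))

    def smoothed(self):
        return np.mean(np.asarray(self.positions), axis=0)

    def total_abs_change(self):
        p = np.asarray(self.positions)
        if len(p) < 2:
            return np.zeros(3)
        return np.abs(np.diff(p, axis=0)).sum(axis=0)

# Example: a centroid drifting one pixel to the right per frame at a constant depth.
mem = CentroidMemory(memory=5)
for i in range(5):
    mem.update(1.0 + i, 4.0, 350.0)
print(mem.smoothed(), mem.total_abs_change())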
If head tracking is disabled, the gesture space is set to the whole depth range of [0, 73] cm.

4.2.1 Shape Properties for Gesture Recognition

After creating the gesture space, the weighted centroid is calculated on the closest shape within it to estimate the position of the hand. As described in Subsection 3.2.2, the weighted centroid using amplitude data is an efficient and pose-independent approach to estimating the position of the hand.

4.2.2 Gesture recognition

As with tracking the head, the centroid positions and depth measurements of the hand are saved and averaged in a list of size 3 × k, for k frames:

Phand(xn, yn, zn) = ( Σ_{i=n−k+1}^{n} Phand(xi, yi, zi) ) / k (4.4)

and the gradients ∂Phand/∂x, ∂Phand/∂y, ∂Phand/∂z of the centroid positions of the hand are calculated. If the sum of the absolute changes exceeds a given threshold in at least one of the three axes,

( Σ_{i=n−k}^{n} |∂Phand/∂x|, Σ_{i=n−k}^{n} |∂Phand/∂y|, Σ_{i=n−k}^{n} |∂Phand/∂z| ) > ( tx, ty, tz ) (4.5)

the maximum absolute change is detected as a gesture in that direction (the X-axis determines left-right gestures, the Y-axis determines up-down gestures, and the Z-axis determines forward-backward gestures).

The threshold is highly dependent on the chosen array length k, which essentially defines the gesture speed. For example, if the array length is 15, it would be equal to a memory of 1 second (at a frame rate of 15 frames per second). Setting the threshold for the (x, y)-movement to 1 would mean that a movement needs to have a speed of at least 1 pixel (or sensor) per second to be recognised as a gesture. Furthermore, due to the sum of absolute changes being used for gesture detection, setting the array length to 15 could potentially cause a delay of 1 second if the oldest frame position Phand(xn−k, yn−k, zn−k) alone exceeds the set threshold. Thus, a balance between array length and threshold is needed. Since the gesture recognition in this thesis is solely concerned with the direction and does not recognise "shape" gestures (circle, square, triangle, etc.), an array length of 5 and an (x, y) threshold of 1.33 pixels were chosen. A shorter array length increases the detection sensitivity (thus also increasing the false positive rate), but also reduces the erroneous, delayed detection described above.

Of note are the different dimensions used by the axes. While the X- and Y-axes are in the image space/pixel dimension, the Z-axis uses depth data, which is in millimetres. Applying the same threshold (e.g. moving at least 1.33 units in one direction within the 5-frame memory) would make the algorithm overly sensitive to small movements along the Z-axis (moving at a speed of at least 0.266 mm per frame in this example). Thus, a separate (higher) threshold must be set for the depth values to be detected as a forward/backward gesture. In this thesis, the threshold set for Z-axis movement is 150 mm per second, to account for unconscious moving/shaking of the hand.

CHAPTER 5
Challenges of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images

The challenges are divided into two categories: internal (adjustable) and external (non-adjustable) challenges. Internal challenges refer to the challenges of our implementation, while external challenges are environmental.

5.1 Internal Challenges

The biggest internal challenges of our head tracking are efficiency and reliability in combination with low image resolution.
Detecting the head at low image sizes on the (non-thermal) infrared spectrum alone requires various compensations to mask objects that do not belong to the human body (like the amplitude range or the reflectance). At the original image size (8 × 8), circle detection does not work, as shape variety is limited, leading to shape similarities (like the hand and head in Figure 5.1). Shape variety and image fidelity can be increased by artificially increasing the resolution using Bi-cubic Interpolation and taking advantage of its blob formation, which helps with circle detection but increases the number of pixels in the image and thus, the length of computation. Additionally, circle detection will fail at certain distances and positions; if the head is too close and connects with the image borders, no circle can be detected, and if the head is too far, less than 4 of the 64 sensors can detect it and the resulting shape is thus neither 49 5. Challenges of Head Detection and Gesture Recognition in Low-Resolution Infrared Amplitude Images Figure 5.1: Sensor output showing the similarity of the shapes of a human hand (red) and head (blue). The hand is closer to the sensor, thus the higher amplitude/brightness. square nor elliptical/circular. In these cases, shape properties are used to estimate the position of the head, which are not as reliable. Choosing the correct head candidate, while keeping computational costs at a minimum is another considerable challenge. Our algorithm relies on the fact that a human head has a lower average amplitude than the palm of a hand or a torso. However, in the case where no hand is in frame, and only the head and another circular object with a similar reflectance like human skin is detected, this assumption might lead to false positive detection. For gesture recognition, the problem of efficiency is further exasperated by the computa- tional needs of head detection, thus forcing us to sacrifice accuracy to ensure efficient computation. Our algorithm has to rely on shape properties to detect the palm of a hand. Using shape extremes would not work, as depending on the arm pose, the palm of a hand could be at any of the left, top, and right extremes. One possible idea to alleviate this problem is to take advantage of the circle detection done in the head detection step; after selecting a head candidate, one of the discarded circles could be in front of the head, and in this case, it is likely to be a flat hand. However, the palm of a hand might not always be detected as a circle and we would have to fall back to shape properties again, increasing the computational cost with indeterminate benefit. 50 5.2. External Challenges Once the hand is detected, recognising a gesture reliably brings further challenges. For better usability, gesture recognition should allow for varying execution speeds, and have a threshold for angle and line imperfections (for example due to anatomy, as seen in Figure 3.24). Relying on speed or movement length alone will result in random hand movement, gesticulation for speech emphasis, or simply the adjustment of the hand position before executing a gesture to be falsely detected as a gesture. Thus, a combination of both speed and distance needs to be defined to define a gesture, which would rely heavily on heuristics. 5.2 External Challenges By far the biggest external challenge is the very low resolution of the sensor array, which in the case of our algorithm is ”misused” to work similarly to an IR camera. 
Since human heads are rarely rectangular, circle detection at such low resolutions is almost impossible. One idea to mitigate this, other than increasing the resolution, would be to place a fish-eye lens in front of the sensor, thus "warping" the input to be more circular. However, we could not test this theory.

Our 3D ToF-sensor faced the same problem as the stereo IR camera used by Krotosky et al. [KCT04] for their comparison: at low resolutions, the shape, size, and reflectance of a human hand are so close to those of a human head that it is difficult to differentiate between the two (see Figure 5.1).

However, the image size is not the only problem we face with an 8 × 8 sensor array: the field of view is very narrow at 27°. This in turn makes detecting larger movements difficult and leads to clipping of shapes at closer distances. Adding to that, the maximum frame rate in 8 × 8 mode is 15 frames per second, leading to choppy movement, potentially losing track of fast hand gestures. For example, by solving Equation 3.4 for a at a distance of D = 30 cm, we can calculate the width of the field of view, which is ≈ 14.4 cm. Knowing this, a horizontal gesture at a speed of 1 m/s will at best have two frames where the hand is moving, while 13 frames will have the hand outside of the field of view. While the sensor does allow for a faster frame rate (up to 30 Hz), it only offers this in 4 × 4 mode, which halves the (already small) sampling resolution.

As with every infrared sensor, one of the challenges is other infrared light sources such as candles and sunlight, which falsify the amplitude and depth measurements. Although the sensor used offers a short-range mode to increase stability in regards to ambient light, we are operating the sensor in long-range mode to reduce the repeatability error. Thus, we have found that direct sunlight can still interfere with its amplitude and depth readings (see Section 6.4).

Another challenge is the reflectance, as some materials/surfaces of clothing and accessories can change the measured reflectance of the head, hand, and torso. Furthermore, it is theoretically possible that a material has the same or a very similar reflectance as human skin, leading to false positives. For people using prostheses, the material of the prosthesis might not have the same reflectance, thus leading to different accuracy with gesture recognition. Further experimentation with a wide range of materials and prostheses is required to evaluate the performance in these cases.

Finally, simply the position and angle of the sensors can pose a challenge, since not only could the imaged shapes get distorted, but the movement might also not be strictly perpendicular or parallel to the sensor. For this, arm pose estimation could be used to correctly determine the gesture being performed. Regardless of the pose of the camera, distance is an important factor for gesture recognition: A person performing the same horizontal or vertical gesture at different distances will have varying calculations of gradients in the pixel dimension. Say, for example, a gesture is done once at a distance of 40 cm and once at a distance of 70 cm. Solving Equation 3.4 for r, one can see that tan(13.5°) · 40 ≈ 9.6 and tan(13.5°) · 70 ≈ 16.8.
Thus, the field of view covers a width/height of ≈ 19.2 cm with a horizontal/vertical resolution of ≈ 2.4 cm per pixel at a distance of 40 cm, while at 70 cm it covers a width/height of ≈ 33.6 cm with a horizontal/vertical resolution of ≈ 4.2 cm per pixel. To overcome this challenge, distance compensation is needed, which is outside the scope of this thesis.

CHAPTER 6
Experimental Results

For evaluation, two types of experiments were performed: laboratory and field experiments. Laboratory experiments were done under controlled conditions in a laboratory, using artificial, indirect lighting, a turntable for accurate control of the rotation, a filled glove on a pendulum for gesture simulation, and accurate distance measuring. Field experiments were done in an "organic" setting, i.e. a living room with an open window facing the sensor at the back of the room to introduce ambient lighting. Two persons conducted the field experiments during the late morning, afternoon, and evening with artificial lighting from above.

6.1 Evaluation Goals

The goal of the laboratory experiments was to determine the functional range of angles and distances of the sensor to the user and to provide a fixed set of parameters (distance, lighting, movement speed) that enable the quantifiable reproduction of the experiments performed. Field experiments focus on usability and "real-life" performance by having human subjects perform a set of movements in an environment with direct and indirect sunlight. With field experiments, the parameters of lighting, distance, gesture speed, and silhouette vary to determine edge cases and shortcomings of the algorithm and/or the sensor used.

6.2 Evaluation Method

Evaluation was done using three metrics described in this section. The average deviation over the number of frames is used for evaluating the performance of head and hand detection, while detection rate and accuracy evaluate the performance of the gesture recognition. The ground truth is created by annotating every frame, depending on the application:

Head Detection / Tracking: centre of the head as (x, y)-coordinates on the image.
Gesture Recognition: centre of the hand as (x, y)-coordinates on the image, and the direction of the gesture, if one was executed.

In total, 10,685 frames from 20 experiments were annotated for the evaluation of the algorithm presented in this thesis.

6.2.1 Deviation and Outliers

The average deviation is the distance between the ground truth and the centroid computed by the algorithm over every frame. The deviation for each frame is calculated using the Euclidean distance:

||[x1, y1] − [x2, y2]||₂ = √((x1 − x2)² + (y1 − y2)²) (6.1)

where [x1, y1] is the annotated centre of the head or hand in the ground truth, and [x2, y2] is the position of the centroid calculated by the algorithm.
To approximate the deviation in cm, further calculations are done: after calculating the Euclidean distance, we solve Equation 3.4 for a/2, using the distance of the head in the ground truth as D, to get half the approximate FoV in cm:

DGT · tan(13.5°) = a/2 (6.2)

Dividing the approximate FoV by 8 (the number of sensors along the horizontal/vertical dimension) yields the approximate resolution at distance DGT, which can then be multiplied by the Euclidean distance:

(2 · DGT · tan(13.5°) / 8) · ||[x1, y1] − [x2, y2]||₂ (6.3)

Deviations are considered outliers if they are bigger than the upper fence using the Interquartile Range:

||[x1, y1] − [x2, y2]||₂ > Q3 + (1.5 · IQR) (6.4)

where Q3 is the 75-quartile of all deviations and IQR = Q3 − Q1, with Q1 being the 25-quartile of all deviations.

6.2.2 Gesture Recognition Evaluation

The evaluation for gesture recognition is done for each frame by comparing the ground truth to the algorithm output in regards to four factors:

Accuracy: Do the detected gestures match in direction? This metric checks the equivalency of ground truth annotation and algorithm output. Any frame where the output differs from the ground truth is counted as an error.

True Positive Rate: Was a gesture recognised, regardless of direction? This metric is true for every frame where both the ground truth and output are non-zero (gesture). If the ground truth has annotated 0 (no gesture) for a given frame, it is ignored for this metric.

False Positive Rate: Was a gesture recognised when it was not annotated in the ground truth? This metric is true for every frame where the ground truth is 0 (no gesture) and the algorithm output is non-zero (gesture). If the ground truth is annotated as non-zero for a given frame, it is ignored for this metric.

Miss Rate: Was no gesture recognised when one was annotated in the ground truth? This metric is true for every frame where the ground truth is non-zero (gesture) and the algorithm output is 0 (no gesture). If the ground truth is annotated as 0 for a given frame, it is ignored for this metric.

6.3 Laboratory Experiments

These experiments were done in a laboratory with indirect artificial ceiling light. The sensor is placed at the centre of a turntable with a diameter of 60 cm inside a photography light tent with dimensions 50 × 50 × 50 cm (see Figure 6.1b). For additional experiments regarding gesture speed, a pendulum with a filled glove as an approximation for a human hand (Figure 6.1a) was used. The turntable was fixed at four different angles: -20°, 0°, 20°, and 45°. Experiments for head and hand detection of a human subject (hand length from palm to tip of the middle finger = 18 cm; length from chin to top of head = 21 cm) were performed at a fixed distance of 30 cm (edge of the turntable). To help with the annotation, a mobile camera with a video resolution of 1920 × 1080 px and a frame rate of 30 frames per second was recording at the same time.

(a) Turntable with the "Hand"-Pendulum. (b) Turntable with a ruler to mark the distance.
Figure 6.1: The turntable setup used for laboratory experiments.

6.3.1 Head Detection

Head detection evaluation is done by calculating the Euclidean distance for two-dimensional pixels, as described in Section 6.2, and then averaging the distance over all frames. 1,780 frames from 4 experiments were evaluated for laboratory head detection.
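The deviation metric of Subsection 6.2.1 can be summarised in a few lines of Python; this is a hedged sketch of Equations 6.1 to 6.3 (the function names are illustrative):

import numpy as np

def deviation_px(gt_xy, est_xy):
    # Euclidean deviation between annotated and computed centroid (Equation 6.1).
    return float(np.hypot(gt_xy[0] - est_xy[0], gt_xy[1] - est_xy[1]))

def deviation_cm(gt_xy, est_xy, depth_cm, half_fov_deg=13.5, sensors_per_axis=8):
    # Approximate the pixel deviation in cm at the annotated depth D_GT (Equations 6.2 and 6.3).
    fov_cm = 2.0 * depth_cm * np.tan(np.radians(half_fov_deg))
    return (fov_cm / sensors_per_axis) * deviation_px(gt_xy, est_xy)

# Example: a 2 px deviation at an annotated head distance of 40 cm (roughly 4.8 cm).
print(round(deviation_px((3, 4), (5, 4)), 2), round(deviation_cm((3, 4), (5, 4), depth_cm=40.0), 2))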
The results are shown in Tables 6.1 and 6.2:

Table 6.1: Head detection mean deviations in 2D for laboratory experiments (px).

              | Average Deviation (px) | Min / Max Deviation (px) | Number of Outliers | Number of Frames
Head (-20°)   | 2.4                    | 0.1 / 4.4                | 0                  | 347
Head (0°)     | 1.5                    | 0.1 / 3.3                | 0                  | 537
Head (20°)    | 2.3                    | 0.1 / 5.7                | 0                  | 555
Head (45°)    | 3.6                    | 2.4 / 4.6                | 0                  | 341
Total Average | 2.5                    | 0.7 / 4.5                | 0                  | 445
Median        | 2.4                    | 0.1 / 4.5                | 0                  | 442

Of note is that the performance is consistent in the range of ±20°. From 20° onward, part of the head is out of frame, which means clipping occurs.

Table 6.2: Approximate head detection mean deviations in 2D for laboratory experiments (cm).

              | Average Deviation (cm) | Min / Max Deviation (cm) | Average Depth Value (cm) | Number of Frames
Head (-20°)   | 6.2                    | 0.1 / 15.6               | 45.1                     | 347
Head (0°)     | 3.6                    | 0.1 / 8.2                | 43.1                     | 537
Head (20°)    | 6.1                    | 0.2 / 20.0               | 41.0                     | 555
Head (45°)    | 6.8                    | 4.9 / 9.8                | 32.0                     | 341
Total Average | 5.7                    | 1.3 / 13.4               | 40.3                     | 445
Median        | 6.2                    | 0.2 / 12.7               | 42.1                     | 442

While the average deviation of Head (45°) in Tables 6.1 and 6.2 appears to be similar to that of Head (±20°), the minimum deviation is more telling. Because the head is completely out of the frame for most of the time, the resulting average deviation is skewed due to the small number of frames to compare. Analysing the input after amplitude filtering shows another problem with the chosen approach: filtering the amplitude for a range based on the approximation in Equation 4.1 does not work as intended, as the plywood section of the turntable is still visible after filtering. Hence, when the head is entirely out of frame, the only object in frame is the turntable, which is at a distance of ≤ 30 cm. As such, the shape centroid is calculated and returned as the centre of the head, explaining the average maximum deviation of ≥ 4.5 px (i.e. more than half the image width) at an angle.

These results suggest that either amplitude filtering is not the right approach for pre-processing, or that the inverse square law of photometry is not accurate enough for amplitude filtering. Furthermore, looking at the average maximum deviation of 4.5 px / 13.4 cm, we can see that the chosen approach fails to reliably detect the head even under the best possible controlled conditions. Such a distance means that the detected centroid was most likely found away from the head, as 4.5 px is more than half the image size, and 13.4 cm away from the tip of the nose could already be the background. However, remember that the laboratory experiments were done to discern reproducible weaknesses in the methodology presented in this thesis. Placing the sensor at an angle to the user was done deliberately to test the field of view of the sensor, and as such, it seems that the sensor performs best between -20° and 20°.

6.3.2 Gesture Recognition

For the evaluation of gesture recognition, head detection was turned off, and a fixed gesture space of 73 cm was used. Additionally, since the sensor is placed on a turntable in a photography light tent (see Figure 6.1), and the gesture recognition algorithm takes the closest object without discriminating, only the photo tent or the turntable would be visible after masking, rendering the results useless. Thus, for these experiments only, a minimum depth value of 30 cm (radius of the turntable) was set for the gesture space. Each frame output is compared to the annotations and set to true or false depending on one of the four metrics described in Section 6.2.
Then, the total number of "true" values is averaged over the number of frames to achieve the average accuracy. Evaluation with a human subject (length between the palm of the hand and tip of the middle finger = 18 cm) was done by executing 24 gestures per hand, four per direction (up, down, left, right, front, back), resulting in a total of 48 gestures per instance. Pendulum evaluation let the pendulum swing thrice from left to right and thrice from back to bottom for each instance, letting it swing until it stood still. Since gesture detection relies on correct hand detection, the output of the hand detection step is evaluated as well; by annotating the position of the hand in the ground truth and calculating the distance between the hand detection output and the ground truth as in Section 6.2, we can evaluate the performance of the hand detection. In total, 3,114 frames from six experiments were evaluated for laboratory gesture recognition. The results for exact gesture recognition are seen in Table 6.3:

Table 6.3: Gesture recognition accuracy for laboratory experiments.

               | True Detection Rate | False Detection Rate | Miss Rate | Average Accuracy
Hand (-20°)    | 46.07%              | 50.53%               | 53.93%    | 19.78%
Hand (0°)      | 26.03%              | 14.49%               | 73.97%    | 44.98%
Pendulum (0°)  | 40.00%              | 17.59%               | 60.00%    | 56.74%
Hand (20°)     | 47.60%              | 68.62%               | 52.40%    | 23.89%
Pendulum (20°) | 29.52%              | 21.13%               | 70.48%    | 44.40%
Hand (45°)     | 11.97%              | 0.00%                | 88.03%    | 64.19%
Total Average  | 33.54%              | 28.73%               | 66.47%    | 42.33%
Median         | 34.80%              | 19.36%               | 65.24%    | 44.69%

Unfortunately, gesture recognition does not perform as well as head detection. Although a minimum depth value was set up to mitigate false positive detection of the light tent and turntable, it could not be eliminated, leading to a bad overall performance. Considering the average miss rate of 66.47%, it seems that the chosen method of hand detection (weighted centroid) is still too imprecise and results in the centroid being found on the forearm (see Figure 3.24). Looking at the results of the hand detection evaluation in Tables 6.4 and 6.5 provides further information.

Table 6.4: Gesture recognition mean distances in 2D for laboratory experiments (px).

               | Average Deviation (px) | Min / Max Deviation (px) | Number of Outliers | Number of Frames
Hand (-20°)    | 3.7                    | 2.1 / 7.6                | 32                 | 384
Hand (0°)      | 3.0                    | 0.3 / 4.5                | 1                  | 841
Pendulum (0°)  | 1.9                    | 0.3 / 5.3                | 0                  | 251
Hand (20°)     | 2.0                    | 0.3 / 7.4                | 17                 | 759
Pendulum (20°) | 1.9                    | 1.0 / 3.4                | 0                  | 327
Hand (45°)     | 2.7                    | 1.8 / 3.5                | 0                  | 552
Total Average  | 2.5                    | 1.0 / 5.3                | 8.3                | 519
Median         | 2.4                    | 0.7 / 4.9                | 0.5                | 468

Table 6.5: Approximate gesture recognition mean distances in 2D for laboratory experiments (cm).

               | Average Deviation (cm) | Min / Max Deviation (cm) | Average Depth Value (cm) | Number of Frames
Hand (-20°)    | 7.6                    | 1.2 / 18.1               | 36.7                     | 384
Hand (0°)      | 6.9                    | 0.6 / 12.6               | 36.6                     | 841
Pendulum (0°)  | 4.9                    | 0.7 / 14.0               | 40.6                     | 251
Hand (20°)     | 4.6                    | 0.2 / 20.8               | 39.2                     | 759
Pendulum (20°) | 4.2                    | 0.6 / 9.8                | 34.8                     | 327
Hand (45°)     | 6.7                    | 1.3 / 10.3               | 40.5                     | 552
Total Average  | 5.8                    | 0.8 / 14.3               | 38.1                     | 519
Median         | 5.8                    | 0.7 / 13.3               | 38.0                     | 468

As can be seen in Tables 6.4 and 6.5, the average two-dimensional deviation is 2.5 px / 5.8 cm, which means a deviation of more than a quarter of the image width/height, or more than half the width of the hand of the person performing the laboratory experiments. Additionally of note is the high number of outliers of Hand (±20°), which explains the particularly high false positive rate of both experiments. Outliers in hand position estimation will lead to movement being detected where there is none.
If the sensor is at an angle and the depth value of the hand from the sensor exceeds 40 cm, the calculated centroid would stay on the turntable or light tent, since they are the closest objects in view. Furthermore, if the hand/arm moves closer to the sensor than the set minimum depth value of 30 cm, it will not be considered for hand detection, again making the turntable or light tent the "closest" object to the sensor. Frames where the algorithm "jumps" from the hand/arm to the turntable or the light tent might lead to false positive recognition of movement. If the hand stays out of the depth range of 30 − 40 cm, the centroid will not move, which might explain the miss rate of 66.47%.

6.4 Field Experiments

The purpose of field experiments is to ascertain reliability in less controlled environments and more "realistic" situations. The sensor is placed in a living room at the edge of a table with a height of 74 cm, 4.75 metres across from an unobstructed window facing east. Figure 6.2 shows a schematic illustration of the room.

Figure 6.2: Schematic overview of the field experiment setup. Both tables are below the window on the right.

Head detection was evaluated during the late morning, when ambient light from the window is strong, and in the evening with artificial lighting. Gesture recognition was evaluated during the afternoon, when ambient light was weaker. As with the laboratory experiments, a mobile camera with a resolution of 1920 × 1080 px and a frame rate of 30 frames per second was recording simultaneously to help with the annotation of the ground truth.

6.4.1 Head Detection

To evaluate the performance of head detection in the field experiments, two human subjects with differing skin tones executed a set of head movements (Person 1: length from chin to top of head = 21 cm, Person 2: length from chin to top of head = 18 cm). To test the impact of different silhouettes, over-ear headphones and glasses were used. Person 1 was evaluated during the late morning, while Person 2 was evaluated during the evening. As with the head detection in the laboratory experiments (Section 6.3.1), the average deviation is calculated over every frame. 4,449 frames from 7 experiments were evaluated for head detection under these conditions. The results can be seen in Tables 6.6 and 6.7.

Table 6.6: Head detection mean distances in 2D for field experiments (px).

                             | Average Deviation (px) | Min / Max Deviation (px) | Number of Outliers | Number of Frames
Person 1 @ 20cm              | 3.8                    | 0.6 / 6.2                | 8                  | 855
Person 1 @ 35cm              | 1.9                    | 0.0 / 6.8                | 9                  | 904
Person 1 @ 40cm              | 1.9                    | 0.1 / 6.6                | 9                  | 660
Person 1 @ 50cm              | 2.3                    | 0.4 / 5.4                | 3                  | 941
Person 2 @ 45cm              | 0.8                    | 0.1 / 3.3                | 21                 | 283
Person 2 (Glasses) @ 45cm    | 0.6                    | 0.0 / 4.3                | 18                 | 363
Person 1 (Headphones) @ 45cm | 2.2                    | 0.1 / 7.0                | 10                 | 443
Total Average                | 1.9                    | 0.2 / 5.7                | 11.1               | 636
Median                       | 1.9                    | 0.1 / 6.2                | 9                  | 660

Table 6.7: Approximate head detection mean distances in 2D for field experiments (cm).
Table 6.7: Approximate head detection mean distances in 2D for field experiments (cm).

                              | Average Deviation (cm) | Min / Max Deviation (cm) | Average Depth Value (cm) | Number of Frames
Person 1 @ 20cm               | 4.9                    | 0.3 / 7.4                | 21.6                     | 855
Person 1 @ 35cm               | 4.6                    | 0.0 / 16.7               | 41.1                     | 904
Person 1 @ 40cm               | 4.9                    | 0.2 / 15.4               | 45.1                     | 660
Person 1 @ 50cm               | 6.0                    | 0.8 / 18.2               | 48.3                     | 941
Person 2 @ 45cm               | 2.1                    | 0.2 / 7.5                | 49.0                     | 283
Person 2 (Glasses) @ 45cm     | 1.7                    | 0.1 / 9.0                | 49.1                     | 363
Person 1 (Headphones) @ 45cm  | 5.2                    | 0.2 / 17.6               | 43.0                     | 443
Total Average                 | 4.2                    | 0.3 / 17.6               | 42.5                     | 636
Median                        | 4.9                    | 0.2 / 15.4               | 45.1                     | 660

The results indicate an average two-dimensional deviation of 1.9 px / 4.2 cm, with exceptionally good performance of the algorithm in the experiments with Person 2. Considering that the evaluation of Person 1 was performed during the late morning with strong ambient light from the window in the background, the sensor does appear to be susceptible to ambient infrared radiation. This is further indicated by the higher maximum deviations compared to the laboratory experiments in Tables 6.1 and 6.2. It therefore seems that the algorithm cannot perform head detection reliably under strong ambient light, which leads to the centroid jumping between head and window.

In contrast to the laboratory experiments, amplitude filtering successfully masks everything other than the head and torso of the participants: the change in silhouette due to over-ear headphones and glasses did not negatively impact performance. This further suggests that the effectiveness of amplitude filtering depends on ambient lighting.

Of note is that the higher number of outliers for Person 2 is due to the low average deviation, which leads to a smaller interquartile range and thus a higher sensitivity to variance. Overall, the head detection algorithm described in this thesis can be used in the context of (home-)entertainment. For medical and safety applications, such as in a vehicle, significantly higher reliability (i.e. fewer and smaller outliers) is needed; especially in vehicles, where ambient lighting changes rapidly, a lighting-independent method is necessary.

6.4.2 Gesture Recognition

For gesture recognition, a series of consecutive gestures was performed: two up gestures, two down gestures, twice an alternation of left and right gestures, a back gesture, and finally a front gesture. The gestures were performed during the afternoon; the length between the palm and the tip of the middle finger was 18 cm. As with the laboratory gesture recognition experiments, the four metrics and the deviation calculation described in Section 6.2 were used. This time, head detection was turned on to evaluate the performance of the combined algorithms. As there is no turntable in frame, no compensation was needed and the minimum depth value was removed.

In total, 1,342 frames from three experiments were evaluated for field gesture recognition. The results are shown in Table 6.8.

Table 6.8: Gesture recognition accuracy for field experiments (with head detection).

                 | True Detection Rate | False Detection Rate | Miss Rate | Average Accuracy
Person 1 @ 20cm  | 37.90%              | 9.16%                | 62.10%    | 66.47%
Person 1 @ 35cm  | 20.13%              | 39.02%               | 79.87%    | 32.39%
Person 1 @ 40cm  | 9.63%               | 16.45%               | 90.37%    | 44.60%
Total Average    | 22.55%              | 21.54%               | 77.45%    | 47.82%
Median           | 20.13%              | 16.45%               | 79.87%    | 44.60%

Immediately apparent is the worse performance compared to the laboratory experiments in Table 6.3, indicating that deriving the gesture space from the detected head might not be the right approach for reliable gesture recognition with the head in frame.
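For context, the sketch below shows one way the head-anchored gesture space and the five-frame movement test discussed here could be realised: hand centroids are only accepted between the sensor and the detected head depth, kept in a five-frame memory, and a direction is reported once the dominant displacement exceeds 1.33 px over that memory (roughly 4 px per second at about 15 frames per second). The class and parameter names, as well as the separate 2 cm depth threshold for front/back movement, are illustrative assumptions rather than the exact implementation used in this thesis.

```python
# Sketch of a head-anchored gesture space with a five-frame movement memory.
# GestureDetector, DEPTH_THRESHOLD_CM and the bookkeeping are assumptions for
# illustration; the 5-frame memory and 1.33 px threshold follow the values in the text.
from collections import deque

MEMORY_FRAMES = 5         # number of frames kept in the movement memory
THRESHOLD_PX = 1.33       # lateral displacement threshold over the five-frame memory
DEPTH_THRESHOLD_CM = 2.0  # assumed displacement threshold for front/back movement

class GestureDetector:
    def __init__(self, head_depth_cm: float):
        # The gesture space covers everything between the sensor and the detected head.
        self.max_depth_cm = head_depth_cm
        self.memory = deque(maxlen=MEMORY_FRAMES)  # (x, y, depth) hand centroids

    def update(self, x: float, y: float, depth_cm: float):
        """Feed one frame's hand centroid and return a gesture label or None."""
        if depth_cm < self.max_depth_cm:           # ignore points behind the head
            self.memory.append((x, y, depth_cm))
        return self._classify()

    def _classify(self):
        if len(self.memory) < MEMORY_FRAMES:
            return None
        x0, y0, z0 = self.memory[0]
        x1, y1, z1 = self.memory[-1]
        dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
        # Front/back movement is tested first, against its own threshold in centimetres.
        if abs(dz) >= DEPTH_THRESHOLD_CM:
            return "back" if dz > 0 else "front"
        # Otherwise take the dominant image-plane axis, thresholded in pixels.
        if max(abs(dx), abs(dy)) < THRESHOLD_PX:
            return None
        if abs(dx) >= abs(dy):
            return "right" if dx > 0 else "left"
        return "down" if dy > 0 else "up"

if __name__ == "__main__":
    detector = GestureDetector(head_depth_cm=73.0)  # fixed gesture space for illustration
    for x in (2.0, 2.5, 3.0, 3.6, 4.2):             # hand moving to the right
        print(detector.update(x, 4.0, 35.0))
```

In such a sketch, head_depth_cm would be refreshed from the detected head position when head detection is active, or set to a fixed value when it is not.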
Whether this is the case can be verified by turning off head detection and using a fixed gesture space of 73 cm, as shown in Table 6.9.

Table 6.9: Gesture recognition accuracy for field experiments (without head detection).

                 | True Detection Rate | False Detection Rate | Miss Rate | Average Accuracy
Person 1 @ 20cm  | 58.95%              | 25.10%               | 41.05%    | 54.91%
Person 1 @ 35cm  | 35.71%              | 63.42%               | 64.29%    | 19.81%
Person 1 @ 40cm  | 51.11%              | 71.71%               | 48.89%    | 16.38%
Total Average    | 48.59%              | 53.41%               | 51.41%    | 30.37%
Median           | 35.71%              | 63.42%               | 48.89%    | 19.81%

As Table 6.9 shows, both the true and the false detection rate increase when head detection is turned off. This might indicate that the memory length of five frames is too short or that the threshold of 1.33 px per five frames is too low, since both regulate the sensitivity to movement. Another possibility is that the gesture space created from the head detection is too small and that a larger space would yield better results. The two-dimensional deviations shown in Tables 6.10 and 6.11 offer further insight.

Table 6.10: Gesture recognition mean distances in 2D for field experiments.

                 | Average Deviation (px) | Min / Max Deviation (px) | Number of Outliers | Number of Frames
Person 1 @ 20cm  | 4.9                    | 0.0 / 7.8                | 0                  | 489
Person 1 @ 35cm  | 3.0                    | 0.2 / 8.7                | 0                  | 449
Person 1 @ 40cm  | 3.4                    | 0.2 / 7.9                | 0                  | 404
Total Average    | 3.8                    | 0.1 / 8.1                | 0                  | 447
Median           | 3.0                    | 0.2 / 7.8                | 0                  | 449

Table 6.11: Approximate gesture recognition mean distances in 2D for field experiments.

                 | Average Deviation (cm) | Min / Max Deviation (cm) | Average Depth Value (cm) | Number of Frames
Person 1 @ 20cm  | 7.5                    | 0.0 / 15.6               | 24.6                     | 489
Person 1 @ 35cm  | 8.3                    | 0.2 / 22.0               | 41.7                     | 449
Person 1 @ 40cm  | 8.7                    | 0.7 / 24.5               | 42.6                     | 404
Total Average    | 8.2                    | 0.3 / 20.7               | 36.3                     | 447
Median           | 8.3                    | 0.2 / 22.0               | 41.7                     | 449

As Tables 6.10 and 6.11 show, the average deviation of 3.8 px / 8.2 cm is consistent with the worse performance, and the maximum deviation is consistently above 7.8 px / 15.6 cm – almost an entire image width/height, or the length of the hand of the person performing the evaluations. This can be explained by ambient light causing false measurements, but it may also further indicate that a weighted centroid on the closest object alone is not the right approach for accurate and reliable gesture recognition.

Another point of consideration is the importance of positional stability compared to head tracking. With head tracking, as long as the centroid stays in the region of the head, an average deviation of 3.8 px / 8.2 cm does not necessarily result in poor performance. With gesture recognition, however, a centroid that abruptly changes position from frame to frame inevitably leads to false directional changes, even when the positions are averaged over the frames. Robust and reliable hand detection is therefore a prerequisite for usable gesture recognition.

CHAPTER 7
Conclusion

In this thesis, we have presented an efficient hand-crafted method for tracking a human head and recognising gestures using an 8 × 8 infrared sensor array. The sensor used is a novel 3D-ToF IR sensor array (ST VL53L1X) capable of returning both amplitude and depth data, resulting in a form of input data that has not been used before. We have discussed related work and given a broad overview of the theory and methodologies employed for head detection and gesture recognition.
Challenges faced during implementation and experimentation, as well as possible future complications, have been documented together with ways to mitigate them. The experimental results have been discussed and analysed in depth, with hypotheses explaining particular outcomes. Head detection has shown promising results (a total average deviation of 1.9 px / 4.2 cm in the field experiments and 2.5 px / 5.7 cm in the laboratory experiments) and, if efficiency is less of a concern, could be improved further; possible starting points are described in this thesis. Gesture recognition performed poorly (a total average true detection rate of 22.55% and false detection rate of 21.54% in the field experiments, and a true detection rate of 33.54% and false detection rate of 28.73% in the laboratory experiments); alternatives have been discussed that could make simultaneous head and gesture detection reliable.

The technology of IR-ToF sensors in combination with pattern recognition algorithms could be a viable machine interface in fields such as (home-)entertainment, medicine, and even automobiles. Especially at very low resolutions, privacy concerns can be kept to a minimum, which favours the acceptance of such technology for private use.

List of Figures

1.1 The sensor used, an STMicroelectronics VL53L1X (black element at the centre of the board), with the gyro sensor connected to the green LEDs to the right to visualise the pose of the sensor.
3.1 Maximum distance and repeatability error vs. timing budget of the sensor. Tested on a target with 54% reflectance and no ambient light, actual distance in mm. TB = timing budget in ms, STDEV = standard deviation. The blue line denotes the mean range, while the red dots are the repeatability error. From the VL53L1X data-sheet [STM18].
3.2 Examples of the two outputs of the sensor for the same frame. The range of the colour map is [0 − 700].
3.3 Reflectance of the human skin, according to the National Institute of Standards and Technology [CA13]. The thick red line marks the wavelength of the sensor used in this thesis, grey indicates the instrument's uncertainty. Reflectance factor denotes the relative amount of photons reflected ([0.0, 1.0]).
3.4 A 2D-visualisation of perspective projection exhibited by the sensor. The red line is the depth as measured by the ToF sensor, while the black line at the centre is the distance between the object and the sensor.
3.5 Skeletons of an abstracted torso and a circle. Skeletons have been thickened for better visibility and have an actual thickness of 1 px.
3.6 The possible discrete circles in an 8 × 8 image with their centre at the image centre.
3.7 Examples for 2D-projected spheres in an 8 × 8 image. The grey-scale values depend on the surface angles and can differ from the ones shown here.
3.8 Examples of sub-optimal head positioning.
3.9 Side-way visualisation of the field of view. α denotes the angle of the field of view, D is the distance between an object and the sensor, and a is the image size of the object.
3.10 Schematic example of image resizing. The cyan pixels are newly created pixels with unknown values.
3.11 Visualisation of bi-cubic interpolation on a row of 6 pixels (red markers). The Y-axis denotes pixel value, and the X-axis is the pixel's position along the row.
3.12 Comparison of various interpolation methods for a scale of 2 on a 6 × 6 image of a square.
3.13 Comparison of various interpolation methods for the sensor output.
3.14 Examples of Hough transformation with lines.
3.15 Four-time magnification of discrete circles using Bi-cubic Interpolation and their resulting Sobel edge images. Note how the "circle" with r = 1.5 merely turns into a square with rounded corners. Furthermore, observe how the Sobel edge detection behaves when the circle is at the image boundary. Even though the circle is symmetrical, the edge image appears to have two protrusions to the image boundaries.
3.16 Example of Circular Hough Transform. The dashed circles are in the parameter space, and the solid is in the image space. As can be seen here, if their radius is equal to the image circle, they will intersect at the centre of it.
3.17 Example for Laplacian of Gaussian and Difference of Gaussian blob detection on an image of a coffee cup.
3.18 Examples for 4- and 8-connectivity and their impact on the number of detected regions.
3.19 Examples for connected component labelling on amplitude and depth, using 8-connectivity. Amplitude is visualised by colour hue, while depth is denoted by the pixel intensity (brightness).
3.20 Comparison of impact of outlier pixels on two different sizes. One pixel shifts the centroid of the 3 × 3 square by 5% to the right, and that of the 25 × 25 square by 3%.
3.21 Example for extrema. It is possible for extrema to overlap, like with right-bottom and bottom-right.
3.22 Parallel and perpendicular movement and its influence on the depth value. The sensor is the small grey rectangle on the left, while the ellipse is a moving object. With parallel movement, the depth measurements stay in a given range, while perpendicular movement can have any depth between 0 and ∞.
3.23 Exemplary visualisation of the created gesture space (blue). The sensor is the grey rectangle on the left.
3.24 Visualisation of a left-to-right movement of an arm and the arcs created by it. θ is the angle between the start and finish point of the movement and r is the length between the elbow and another point of the arm (be it the middle of the forearm or the middle of the hand).
3.25 Schematic example for centroid calculation using Shape Properties and Orientation. The dashed red line marks the major axis of the shape, the red dots are the shape extrema used for the given orientation, and the blue asterisk is the centroid calculated from the given extrema.
3.26 Schematic example for convex hull (blue overlay) creation and skeletonization (red line) using medial axis transform.
4.1 Amplitude (Y-axis, number of photons) over depth (X-axis, in mm) measurements from our experimental data. The measurements are taken from the tip of the nose.
4.2 A human head at ≈ 30 cm distance before and after up-scaling. Note the blob formation and higher amplitude values on the cheeks and chin after up-scaling.
4.3 Near infrared reflectance spectra of nine different parts of face and hand [MSJO14]. Note how the forehead and both cheeks always have a lower reflectance than the hands.
5.1 Sensor output showing the similarity of the shapes of a human hand (red) and head (blue). The hand is closer to the sensor, thus the higher amplitude/brightness.
6.1 The turntable setup used for laboratory experiments.
6.2 Schematic overview of the field experiment setup. Both tables are below the window on the right.

List of Tables

6.1 Head detection mean deviations in 2D for laboratory experiments (px).
6.2 Approximate head detection mean deviations in 2D for laboratory experiments (cm).
6.3 Gesture recognition accuracy for laboratory experiments.
6.4 Gesture recognition mean distances in 2D for laboratory experiments (px).
6.5 Approximate gesture recognition mean distances in 2D for laboratory experiments (cm).
6.6 Head detection mean distances in 2D for field experiments (px).
6.7 Approximate head detection mean distances in 2D for field experiments (cm).
6.8 Gesture recognition accuracy for field experiments (with head detection).
6.9 Gesture recognition accuracy for field experiments (without head detection).
6.10 Gesture recognition mean distances in 2D for field experiments.
6.11 Approximate gesture recognition mean distances in 2D for field experiments.

Bibliography

[AK99] Tim J Atherton and Darren J Kerbyson. Size invariant circle detection. Image and Vision Computing, 17(11):795–803, 1999.
[Bal81] Dana H Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981.
[BBSV06] Bastiaan J Boom, GM Beumer, Luuk J Spreeuwers, and Raymond NJ Veldhuis. The effect of image resolution on the performance of a face recognition system. In 2006 9th International Conference on Control, Automation, Robotics and Vision, pages 1–6. IEEE, 2006.
[BK23] Majid Banaeyan and Walter G. Kropatsch. Distance Transform in Parallel Logarithmic Complexity. In 12th International Conference on Pattern Recognition Applications and Methods. SCITEPRESS, 2023.
[BT81] Harry G Barrow and Jay M Tenenbaum. Interpreting line drawings as three-dimensional surfaces. Artificial Intelligence, 17(1-3):75–116, 1981.
[CA13] Catherine C Cooksey and David W Allen. Reflectance measurements of human skin from the ultraviolet to the shortwave infrared (250 nm to 2500 nm). In Active and Passive Signatures IV, volume 8734, pages 152–160. SPIE, 2013.
[Can86] John Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6):679–698, 1986.
[CJH+19] Ke Chen, Kui Jia, Heikki Huttunen, Jiri Matas, and Joni-Kristian Kämäräinen. Cumulative attribute space regression for head pose estimation and color constancy. Pattern Recognition, 87:29–37, 2019.
[CRP08] Jae Young Choi, Yong Man Ro, and Konstantinos N. Plataniotis. Feature subspace determination in video-based mismatched face recognition. In 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, pages 1–6, 2008.
[CSX+18] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67–74. IEEE, 2018.
[Cut90] LJ Cutrona. Synthetic aperture radar. Radar Handbook, 2:2333–2346, 1990.
[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[FLCS12] Clinton Fookes, Frank Lin, Vinod Chandran, and Sridha Sridharan. Evaluation of image resolution and super-resolution on face recognition performance. Journal of Visual Communication and Image Representation, 23(1):75–93, 2012.
[GDG11] Mislav Grgic, Kresimir Delac, and Sonja Grgic. SCface – surveillance cameras face database. Multimedia Tools and Applications, 51:863–879, 2011.
[Gla14] Andrew S Glassner. Principles of Digital Image Synthesis. Elsevier, 2014.
[GST+20] Tiffany K Gill, E Michael Shanahan, Graeme R Tucker, Rachelle Buchbinder, and Catherine L Hill. Shoulder range of movement in the general population: age and gender stratified normative data using a community-based cohort. BMC Musculoskeletal Disorders, 21:1–9, 2020.
[GXX+17] Bin-Bin Gao, Chao Xing, Chen-Wei Xie, Jianxin Wu, and Xin Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 26(6):2825–2838, 2017.
[HGMBJ90] Mary C Hume, Harris Gellman, Harry McKellop, and Robert H Brumfield Jr. Functional range of motion of the joints of the hand. The Journal of Hand Surgery, 15(2):240–243, 1990.
[HH06] Shinji Hayashi and Osamu Hasegawa. A detection technique for degraded face images. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1506–1512. IEEE, 2006.
[Hil84] Ellen C Hildreth. The computation of the velocity field. Proceedings of the Royal Society of London. Series B. Biological Sciences, 221(1223):189–220, 1984.
[Hou62] Paul VC Hough. Method and means for recognizing complex patterns, December 18 1962. US Patent 3,069,654.
[HS81] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[HSXS13] Jungong Han, Ling Shao, Dong Xu, and Jamie Shotton. Enhanced computer vision with Microsoft Kinect sensor: A review. IEEE Transactions on Cybernetics, 43(5):1318–1334, 2013.
[KCT04] Stephen J Krotosky, Shinko Y Cheng, and Mohan M Trivedi. Face detection and head tracking using stereo and thermal infrared cameras for "smart" airbags: a comparative analysis. In Proceedings. The 7th International IEEE Conference on Intelligent Transportation Systems (IEEE Cat. No. 04TH8749), pages 17–22. IEEE, 2004.
[Key81] Robert Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(6):1153–1160, 1981.
[KIHF06] Walter G Kropatsch, Adrian Ion, Yll Haxhimusa, and Thomas Flanitzer. The eccentricity transform (of a digital shape). In International Conference on Discrete Geometry for Computer Imagery, pages 437–448. Springer, 2006.
[KKL+21] Khalil Khan, Rehan Ullah Khan, Riccardo Leonardi, Pierangelo Migliorati, and Sergio Benini. Head pose estimation: A survey of the last ten years. Signal Processing: Image Communication, 99:116479, 2021.
[KRG94] Senthil Kumar, Nathan Ranganathan, and Dmitry Goldgof. Parallel algorithms for circle detection in images. Pattern Recognition, 27(8):1019–1028, 1994.
[Kro90] Walter G. Kropatsch. Image pyramids and curves, 1990. Universität Innsbruck.
[KVB88] N. Kanopoulos, N. Vasanthavada, and R.L. Baker. Design of an image edge detection filter using the Sobel operator. IEEE Journal of Solid-State Circuits, 23(2):358–367, 1988.
[LBJP12] Barid Baran Lahiri, Subramaniam Bagavathiappan, T Jayakumar, and John Philip. Medical applications of infrared thermography: a review. Infrared Physics & Technology, 55(4):221–235, 2012.
[Lin93] Tony Lindeberg. Detecting salient blob-like image structures and their scales with a scale-space primal sketch: A method for focus-of-attention. International Journal of Computer Vision, 11(3):283–318, 1993.
[LKC94] Ta-Chih Lee, Rangasami L Kashyap, and Chong-Nam Chu. Building skeleton models via 3-D medial surface axis thinning algorithms. CVGIP: Graphical Models and Image Processing, 56(6):462–478, 1994.
[Low04] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
[LP02] A. Lemieux and M. Parizeau. Experiments on eigenfaces robustness. In 2002 International Conference on Pattern Recognition, volume 1, pages 421–424 vol.1, 2002.
[LS18] Seungsu Lee and Takeshi Saitoh. Head pose estimation using convolutional neural network. In IT Convergence and Security 2017: Volume 1, pages 164–171. Springer, 2018.
[LWZ+20] Hai Liu, Xiang Wang, Wei Zhang, Zhaoli Zhang, and You-Fu Li. Infrared head pose estimation with multi-scales feature fusion on the IRHP database for human attention recognition. Neurocomputing, 411:510–520, 2020.
[Mar00] Karen Marshall. Color FERET database. https://doi.org/10.18434/M31475 (Accessed 2024-01-18), 2000.
[MH80] David Marr and Ellen Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B. Biological Sciences, 207(1167):187–217, 1980.
[MHWH17] Andre Mewes, Bennet Hensen, Frank Wacker, and Christian Hansen. Touchless interaction with software in interventional radiology and surgery: a systematic literature review. International Journal of Computer Assisted Radiology and Surgery, 12:291–305, 2017.
[MNM15] Michael J Mendenhall, Abel S Nunez, and Richard K Martin. Human skin detection in the visible and near infrared. Applied Optics, 54(35):10559–10570, 2015.
[MSJO14] M Mohamad, ARM Sabbri, MZ Mat Jafri, and AF Omar. Correlation between near infrared spectroscopy and electrical techniques in measuring skin moisture content. In Journal of Physics: Conference Series, volume 546, pages 12–21. IOP Publishing, 2014.
[NB21] Rubén E Nogales and Marco E Benalcázar. Hand gesture recognition using machine learning and infrared information: a systematic literature review. International Journal of Machine Learning and Cybernetics, 12(10):2859–2886, 2021.
[NGC92] C Wayne Niblack, Phillip B Gibbons, and David W Capson. Generating skeletons and centerlines from the distance transform. CVGIP: Graphical Models and Image Processing, 54(5):420–437, 1992.
[OGG+22] Moritz Oppliger, Jonas Gutknecht, Roman Gubler, Matthias Ludwig, and Teddy Loeliger. Sensor Fusion of 3D Time-of-Flight and Thermal Infrared Camera for Presence Detection of Living Beings. In 2022 IEEE Sensors, pages 1–4. IEEE, 2022.
[P+70] Judith MS Prewitt et al. Object enhancement and extraction. Picture Processing and Psychopictorics, 10(1):15–19, 1970.
[RMG+99] Gaël Richard, Y Mengay, I Guis, N Suaudeau, Jérôme Boudy, Philip Lockwood, C Fernandez, F Fernández, Constantine Kotropoulos, Anastasios Tefas, et al. Multi modal verification for teleservices and security applications (M2VTS). In Proceedings IEEE International Conference on Multimedia Computing and Systems, volume 2, pages 1061–1064. IEEE, 1999.
[RP68] Azriel Rosenfeld and John L Pfaltz. Distance functions on digital pictures. Pattern Recognition, 1(1):33–61, 1968.
[RWL+14] David P Roy, Michael A Wulder, Thomas R Loveland, Curtis E Woodcock, Richard G Allen, Martha C Anderson, Dennis Helder, James R Irons, David M Johnson, Robert Kennedy, et al. Landsat-8: Science and product vision for terrestrial global change research. Remote Sensing of Environment, 145:154–172, 2014.
[SBB01] Terence Sim, Simon Baker, and Maan Bsat. The CMU Pose, Illumination and Expression database of human faces. Carnegie Mellon University Technical Report CMU-RI-TR-OI-02, 2001.
[SHB13] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis and Machine Vision. Springer, 2013.
[STM11] Matthew Sardelli, Robert Z Tashjian, and Bruce A MacWilliams. Functional elbow range of motion for contemporary tasks. JBJS, 93(5):471–477, 2011.
[STM18] STMicroelectronics. A new generation, long distance ranging Time-of-Flight sensor based on ST's FlightSense™ technology, 2018. Rev. 3.
[TZM19] Shigeyuki Tateno, Yiwei Zhu, and Fanxing Meng. Hand gesture recognition system for in-car device control based on infrared array sensor. In 2019 58th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), pages 701–706. IEEE, 2019.
[UVG+14] Rubén Usamentiaga, Pablo Venegas, Jon Guerediaga, Laura Vega, Julio Molleda, and Francisco G Bulnes. Infrared thermography for temperature measurement and non-destructive testing. Sensors, 14(7):12305–12348, 2014.
[VJ01] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–I. IEEE, 2001.
[WBA+13] Piotr Wojtczuk, David Binnie, Alistair Armitage, Tim Chamberlain, and Carsten Giebeler. A touchless passive infrared gesture sensor. In Adjunct Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology, pages 67–68, 2013.
[WBRF13] Frank Weichert, Daniel Bachmann, Bartholomäus Rudak, and Denis Fisseler. Analysis of the accuracy and robustness of the Leap Motion Controller. Sensors, 13(5):6380–6393, 2013.
[WMJW+14] Zhifei Wang, Zhenjiang Miao, QM Jonathan Wu, Yanli Wan, and Zhen Tang. Low-resolution face recognition: a review. The Visual Computer, 30:359–386, 2014.
[WS17] Oliver Wasenmüller and Didier Stricker. Comparison of Kinect v1 and v2 depth images in terms of accuracy and precision. In Computer Vision – ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 34–45. Springer, 2017.
[ZY11] Wilman WW Zou and Pong C Yuen. Very low resolution face recognition problem. IEEE Transactions on Image Processing, 21(1):327–340, 2011.