Tracking Related Multiple Targets in Videos

DISSERTATION

submitted in partial fulfillment of the requirements for the degree of

Doktor/in der technischen Wissenschaften

by

Nicole M. Artner
Registration Number 0727746

to the Faculty of Informatics
at the Vienna University of Technology

Advisor: O.Univ.Prof. Dipl.Ing. Dr.techn. Walter G. Kropatsch

The dissertation has been reviewed by:

(O.Univ.Prof. Dipl.Ing. Dr.techn. Walter G. Kropatsch)    (Prof. Em. Dr. Horst Bunke)

Wien, 10.10.2013    (Nicole M. Artner)

Technische Universität Wien
A-1040 Wien | Karlsplatz 13 | Tel. +43-1-58801-0 | www.tuwien.ac.at

Declaration Concerning the Authorship of this Work

Nicole M. Artner
Bahngasse 1, 7312 Unterpetersdorf

I hereby declare that I have written this work independently, that I have fully cited all sources and aids used, and that all parts of this work – including tables, maps and figures – that were taken from other works or from the Internet, whether verbatim or in substance, are clearly marked as borrowed, with the source indicated.

(Place, Date)    (Signature of Author)

Acknowledgments

I would like to thank all former and current people working at PRIP for their support. Thanks for being test subjects for my experiments, for giving feedback and for cheering me up. Special thanks go to my mentor and Doktorvater Walter G. Kropatsch for his constant support, inspiring ideas and patience. Thank you for the countless hours we spent discussing the research presented in this thesis. I would also like to thank Adrian Ion for his invaluable scientific and emotional support. Thanks for the endless brainstorming sessions, late-night paper submission sessions and philosophical discussions about the meaning of a PhD and life. Deep appreciation goes to my partner Martin. Thank you for loving and supporting me all these years. Without you this thesis would not have seen the light of day. Last but not least, I would like to thank my parents Maria and Josef as well as my brother Armin for their love and care.

Abstract

This cumulative thesis presents research in the field of tracking. Tracking is one of the most thoroughly researched problems in computer vision. The aim of tracking is to follow an object of interest (target) in a video. In this thesis, I focus on a special problem: tracking related multiple targets. Two important questions in tracking are: What is the target? and Where is the target? The core contributions of this thesis answer these two questions with the help of graph-based representations and methods.
The first core contribution is a fully automatic initialization for target models (What?), based on the principle that things which move together belong together. The input of the approach is a video showing the targets in motion. In this video, a set of salient points is tracked to extract the necessary motion information in the form of trajectories. A triangulated graph is built based on the initial positions of the tracked points. Then, the triangulated graph is deformed based on the motion encoded in the trajectories. This deformation of the triangulation over time is the input of a hierarchical grouping process, which is realized by an irregular dual graph pyramid. In the top level of the resulting pyramid, the rigid entities (e.g. the body parts of a human) are identified. Finally, the motion of these rigid entities is analyzed to find possible points of articulation connecting them (e.g. the upper and lower arm of a human).

The second core contribution is a novel approach for finding temporal correspondences of multiple related targets (Where?). This thesis proposes to represent the targets by a graph model, where each target is represented by a vertex and their relationships are encoded by edges. The traditional solution to find the temporal correspondences of a graph model is graph matching. In contrast to that, this thesis proposes a novel approach, which finds the correspondence of each vertex (target) by combining the appearance cue of a simple tracker with the structural cue deduced from a graph model. These two cues are combined in an iterative process inspired by the well-known Mean Shift algorithm. The outcome is a correspondence for each vertex and edge in the graph, which locally maximizes the similarity in appearance and locally minimizes the deviation from the structure encoded in the model.

Finally, the main goal of this thesis is to show the potential of graph-based representations and methods in tracking. This goal has been achieved through these two core contributions.

Kurzfassung

This dissertation presents research in the field of tracking. Tracking is one of the most thoroughly researched topics in computer vision. The aim of tracking is to follow a chosen object (target) in a video. This dissertation concentrates on a special problem in which multiple targets that are related to each other are to be tracked. Two important questions in tracking are: What is the target? and Where is the target? The two main scientific contributions of this dissertation answer these questions with the help of graphs.

The first contribution of the dissertation is a fully automatic initialization of target models (What?), based on the principle that things which move together belong together. The input is a video of the moving targets. In this video, salient points are tracked and the motion information is stored in the form of trajectories. Based on the positions of the tracked points in the first frame of the video, a triangulated graph is built. The graph is then deformed according to the motion information in the trajectories. This deformation of the graph serves as the input of the subsequent hierarchical grouping, which is realized by an irregular dual graph pyramid. At the top of the pyramid one finds the rigid components of the video (e.g. the body parts of a human).
In a final step, an analysis of the motion reveals whether components can be connected through points of articulation (e.g. the upper and lower arm of a human).

The second contribution is a novel approach for finding temporal correspondences of multiple interdependent targets (Where?). This dissertation proposes to represent the targets by a graph, where each target is stored as a vertex and their spatial relationships as edges. To establish a temporal correspondence for a graph between two frames of a video, one usually searches for the most similar graph in the second frame. In contrast to that, this dissertation presents a novel approach which searches for the correspondence of each vertex (each target) individually. It combines information from a simple tracking method, which depends on the appearance of the target, with structural information from the graph. These two kinds of information are combined in an iterative process similar to the well-known Mean Shift algorithm. The result is a set of correspondences for all vertices and edges in the graph which are locally optimal with respect to their appearance and structure.

The goal of this work was to demonstrate the potential of graphs in tracking. Through the two contributions of this dissertation, this goal has been achieved.

Contents

1 Introduction
  1.1 Motivation
  1.2 TWIST-CV Project
  1.3 Outline
    1.3.1 Outline Part I: Basic Methodologies and Concepts in Tracking
    1.3.2 Outline Part II: Selected Publications
    1.3.3 Outline Part III: Appendix

I Basic Methodologies and Concepts in Tracking

2 About Tracking
  2.1 What is Tracking?
  2.2 What is a Target?
  2.3 Levels of Difficulty in Tracking
    2.3.1 Input
    2.3.2 Output
    2.3.3 Scene
    2.3.4 Target
    2.3.5 Motion
    2.3.6 Distractors
    2.3.7 Occlusions
    2.3.8 Rating Levels of Difficulty
  2.4 Components of a Tracker
    2.4.1 Target Representations
    2.4.2 Finding Correspondences

3 Initialization of Target Models
  3.1 Motivation and Concept
  3.2 Irregular Dual Graph Pyramids
  3.3 Summary of Selected Publications
    3.3.1 Paper A
    3.3.2 Paper B

4 Finding Temporal Correspondences
  4.1 Motivation and Concept
  4.2 Mean Shift Tracking
  4.3 Summary of Selected Publications
    4.3.1 Paper C
    4.3.2 Paper D

5 Concluding Remarks
  5.1 Contributions
  5.2 Future Work

Bibliography

II Selected Publications

A Hierarchical Spatio-Temporal Extraction of Models for Moving Rigid Parts
B Spatio-Temporal Extraction of Articulated Models in a Graph Pyramid
C Multi-scale 2D Tracking of Articulated Objects Using Hierarchical Spring Systems
D Structural Cues in 2D Tracking: Edge Lengths vs. Barycentric Coordinates

III Appendix

A Curriculum Vitae

CHAPTER 1
Introduction

1.1 Motivation

Tracking aims to find the correspondence (e.g. location and pose) of an object of interest in every frame of a video sequence. It is one of the most thoroughly researched problems in computer vision. Even books [70] and reviews [100] are only able to cover a fraction of the field of tracking. The research in tracking is mostly application-driven, and the developed approaches are tailored to their respective applications. Tracking is often an important processing step in larger frameworks and is used in many different applications. These applications can be grouped into six main areas [70]:

1. Media production [40, 103] and augmented reality [50, 18]
2. Medical applications [92, 97, 53] and biological research [82, 94]
3. Surveillance [106, 105] and business intelligence [83, 104]
4. Robotics [84, 77] and unmanned vehicles [80, 44]
5. Tele-collaboration [87, 41] and interactive gaming [38, 23]
6. Art installations and performances [74, 79, 42]

Tracking is an active field of research. Nevertheless, there are still challenging and open problems. Imagine a tracking approach as a detective trying to follow a suspicious person (object of interest). The detective needs to be careful not to mistake the suspect for a similar person (distractors). Furthermore, the suspect he is tracking might hide (occlusion), change clothes (changes in appearance) or even run away (fast, unexpected motion). These challenges and more have to be tackled by a successful tracking approach (see Section 2.3 for difficulties in tracking).

The success of the detective's chase depends on his tracking skills (method for finding temporal correspondences) and his knowledge and understanding of the person of interest (target model). In cases where his suspect hides (occlusion), the detective is able to estimate the location based on his previous observations and his knowledge (target model). He might also be able to tell where the suspect will reappear, due to his understanding of the suspect (target model). If the suspect wears a disguise, the detective may still be able to recognize him by other information he knows (target model), e.g. body composition (target structure) and style of walking.
Hence, this thesis focuses on how to describe the object of interest (initialization of target models) and how to follow it over time (finding temporal correspondences).

The research presented in this thesis is based on the work of the Pattern Recognition and Image Processing (PRIP) group at the Vienna University of Technology, especially on their work in the TWIST project (see Section 1.2). Their expertise is graph-based representations and methods. Graph-based representations (for details see Section 2.4.1.3) are an elegant way to model an object of interest (target model) based on different kinds of information (e.g. appearance, geometry and structure). Coming back to the example with the detective, graph-based representations allow one to describe the suspect based on clothing, behavior, style of walking and body composition by storing this information and its inter-dependencies. Graph-based methods employ such descriptions to solve tasks like tracking. Thus, the main goal of this thesis is to study the potential of graph-based representations and methods in tracking related multiple targets.

1.2 TWIST-CV Project

This thesis is based on the research done within the TWIST-CV project [1]. “Tracking with Structure in Computer Vision” (TWIST-CV) was funded by the Austrian Science Fund – Fonds zur Förderung der wissenschaftlichen Forschung (FWF) – under the grant FWF-P18716-N13 and was carried out between March 2006 and December 2009. The main goal of this project was to solve open problems in computer vision with the help of graph-based methods. There were three sub-goals:

1. Finding object correspondences in image sequences.
2. Finding object correspondences in images from different viewpoints.
3. Finding object correspondences in image sequences from different viewpoints.

My contributions to this project are focused on the first sub-goal. The main findings were a graph-based representation [15, 7, 10], a compositional representation [16, 6] and an approach to automatically initialize graph-based representations [9, 72]. Figure 1.1 shows images from the related publications.

Figure 1.1: (a) Tracking through partial occlusion with a graph-based representation [10]. (b) Compositional representation in space and time [16]. (c) Initialization of a graph-based representation based on the observation of motion [9].

1.3 Outline

This thesis is a cumulative doctoral thesis and is divided into three parts (in addition to this brief introductory chapter).

1.3.1 Outline Part I: Basic Methodologies and Concepts in Tracking

The aim of this part is to provide the necessary knowledge to understand the selected publications in Part II. Chapter 2 covers an introduction to tracking, which is relevant for all publications in Part II. Chapter 3 presents basic concepts and methodologies of the initialization of target models for Papers A and B. Chapter 4 introduces methodologies and concepts relevant for the proposed tracking approach in Papers C and D. This part concludes by listing the original contributions of this thesis and possible future work (see Chapter 5).
1.3.2 Outline Part II: Selected Publications

Part II is a selection of four papers from my publications (see Appendix A). These four papers were selected as they cover the most important or recent research on the two main topics of this thesis: initialization of target models (see Papers A and B) and finding temporal correspondences (see Papers C and D). All four papers passed international peer review and are published as articles in international journals (Papers A and C) or as chapters in books (Papers B and D). Table 1.1 lists all publications related to the two main topics and briefly explains their dependencies.

Table 1.1: Publications associated with the two main topics of this thesis, ordered by their date of publication (decreasing, starting with the most recent). Read from bottom to top.

Initialization of target models
  Selected publications (see Part II):
    Paper B [13] – Determination of points of articulation for scenes with motion in the image plane.
    Paper A [11] – Extension of [9] to motion out of the image plane.
  Previous publications:
    [9] – Refinement of the ideas proposed in [72]. This approach improves results for articulated targets and accuracy in general.
    [72] – First concepts for the initialization of target models of rigid and articulated targets based on their motion in the image plane.

Finding temporal correspondences
  Selected publications (see Part II):
    Paper D [14] – Introduction of a novel structural cue. Simplification of the tracking algorithm proposed in [12]. Comparison of the novel structural cue against the previously used one.
    Paper C [12] – Refinement of the approach presented in [8]. Additional experiments.
  Previous publications:
    [8] – Adds a coarse-to-fine methodology to the tracking algorithm of [10].
    [10] – Extension of [15] to articulated targets. Additional experiments.
    [7] – Extension of [15] to articulated targets.
    [15] – Introduction of a novel tracking algorithm combining appearance and structural information for rigid targets.
    [71] – Motivation and first ideas for finding temporal correspondences using structural information.

1.3.2.1 Contributions of the co-authors

This section describes the contributions of Walter G. Kropatsch, Adrian Ion and myself to the selected papers of this cumulative thesis.

My contributions: As the first author, I was the main source of the ideas and the developed methods presented in all selected papers. For Papers A and B, I adapted the existing implementation of the PRIP group of the irregular dual graph pyramid for image segmentation to motion segmentation. The novel concepts (observing and measuring similarity in motion, and identification of points of articulation) were implemented from scratch. The implementation of the methods proposed in Papers C and D was solely done by myself (except for the explicitly cited third-party code in Paper C). I created the synthetic and real-life videos (except for the explicitly cited third-party videos), gathered ground truth data and conducted the experimental evaluations and their analysis for all papers. The text and figures of the four papers were mainly written and created by me, except for short paragraphs resulting from the feedback of Walter G. Kropatsch and Adrian Ion.

Contributions of Walter G. Kropatsch: As my mentor (Doktorvater) and head of the Pattern Recognition and Image Processing Group (PRIP), he was involved in all my research activities, including the selected papers of this thesis. Irregular graph pyramids are the long-term research topic of Walter G. Kropatsch. Therefore, he supported me with his in-depth knowledge during the development of the ideas leading to these publications and also in solving difficult problems. Furthermore, he provided feedback for the written text of all papers and gave ideas for possible future work.
Contributions of Adrian Ion: Adrian Ion worked at PRIP as a project assistant and his research partly overlapped with the topics of this thesis. Together with Walter G. Kropatsch and Yll Haxhimusa, he did research on cognitive vision [55, 54] and image segmentation [48, 62], which is related to the research presented in this thesis. For Papers A and B, he provided the implementation of the grouping framework based on irregular dual graph pyramids for image segmentation, and he supported me during the adaptation of the code to the problem of motion segmentation. Furthermore, he assisted in recording the videos for the experiments. For Papers A, B and C, Adrian Ion contributed to the text by giving feedback, especially for the paragraphs about the theoretical background of irregular dual graph pyramids.

1.3.3 Outline Part III: Appendix

This appendix includes my curriculum vitae and a complete list of my publications.

Part I
Basic Methodologies and Concepts in Tracking

CHAPTER 2
About Tracking

The aim of this chapter is to provide an introduction to tracking and the relevant terminology for the following chapters and the selected publications in Part II.

2.1 What is Tracking?

In the book of Maggio et al., tracking is defined as follows [70]:

“One fundamental feature essential for machines to see, understand and react to the environment is their capability to detect and track objects of interest. The process of estimating over time the location of one or more objects using a camera is referred to as video tracking.”

Maggio et al. use the term video tracking in their book to emphasize that the input is a video. A video is an ordered sequence of images, which are often called frames. In this thesis, I simply use the term tracking. Figure 2.1 visualizes the concept of a simple tracker over three frames.

Figure 2.1: Example of a simple tracker over three frames. Input: position p(v, 1) = (3, 3) of target v in the first frame and appearance described by a color vector rgb(v) = (255, 0, 0) (red). Tracking: starting from the position in frame 1, the tracking algorithm searches for the new position within a search window (green rectangle). Output: 2D trajectory consisting of a sequence of 2D positions, (p(v, 1); p(v, 2); p(v, 3)) = ((3, 3); (4, 3); (4, 2)).
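The simple tracker of Figure 2.1 can be written down in a few lines. The following is a minimal illustrative sketch (not code from this thesis), assuming the frames are given as NumPy RGB arrays; it exhaustively searches a small window around the previous position for the pixel most similar to the stored appearance:

```python
import numpy as np

def simple_tracker(frames, init_pos, appearance, radius=1):
    """Follow a colored target point through a list of RGB frames.

    frames:     list of HxWx3 arrays (the video)
    init_pos:   (row, col) of the target in the first frame
    appearance: target color vector, e.g. (255, 0, 0)
    radius:     half-size of the square search window
    Returns the 2D trajectory as a list of positions.
    """
    trajectory = [init_pos]
    target = np.asarray(appearance, dtype=float)
    for frame in frames[1:]:
        r, c = trajectory[-1]
        best, best_dist = (r, c), np.inf
        # Exhaustive search within the window around the last position.
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < frame.shape[0] and 0 <= cc < frame.shape[1]:
                    dist = np.linalg.norm(frame[rr, cc].astype(float) - target)
                    if dist < best_dist:
                        best, best_dist = (rr, cc), dist
        trajectory.append(best)
    return trajectory  # e.g. [(3, 3), (4, 3), (4, 2)] as in Figure 2.1
```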
2.2 What is a Target?

The object of interest in tracking is often called target object or, for short, target. This thesis additionally distinguishes between targets depending on their dimensionality and homogeneity (see Table 2.1).

Table 2.1: Targets in tracking.

  Term             | Dimensionality | Homogeneity
  target point     | 0-dimensional  | homogeneous
  target line      | 1-dimensional  | homogeneous
  target patch     | 2-dimensional  | homogeneous
  target volume    | 3-dimensional  | homogeneous
  target structure | arbitrary      | non-homogeneous

A target is homogeneous if its elements are of the same kind, e.g. a target patch is made up of pixels. Target structures consist of multiple related, non-homogeneous elements and can be of arbitrary dimension. They can be composed of several targets. For example, a human can be tracked as a target structure, which consists of target patches (body parts) connected through target points (points of articulation). In addition, target structures can be hierarchies built of several levels of different kinds of targets. Targets should not be confused with target representations, which are used to describe them (e.g. a target patch can be described by a histogram and a target structure by a graph; see Section 2.4.1).

2.3 Levels of Difficulty in Tracking

A tracking task is a concrete problem with a certain input and an expected output. The difficulty of a tracking task depends on several factors. Figure 2.2 gives an overview of these factors and their properties. In the following sections the factors are explained in detail.

Figure 2.2: The level of difficulty of a tracking task depends on several factors (yellow and green boxes) and their properties. Input and output differ from the other factors as they do not concern the video, but its acquisition and issues independent of the video (e.g. interaction).

2.3.1 Input

The input of a tracking task is the data on which the tracker operates. This data is not limited to videos, but includes user interaction and knowledge. Therefore, the factor input includes properties of the acquisition of the video and properties independent of the video.

The quality of the sensor, which is used to capture a video, has a direct influence on the video quality. Video quality includes the resolution of the target, but also the amount of image noise introduced into the video during acquisition. In general, the higher the video quality, the higher the resolution of the target (more data) and the lower the amount of noise. High-quality videos allow accurate tracking; low-quality videos allow fast tracking (less data to process).

The video input of a tracking task can originate from different numbers of sensors. Generally, the computational complexity of a tracking approach increases with the number of sensors. The inputs of multiple sensors need to be registered against each other, as each sensor typically captures the scene from a different viewing angle [47]. Having two or more sensors does not automatically increase the level of difficulty. For example, estimating the 3D position of a target from the input of a single sensor [65] is more challenging than from multiple sensors.

Prior knowledge is another form of input, which is available to the tracker and aims to improve its performance. It can be information about the scene (see Section 2.3.3), the target (see Section 2.3.4), the motion (see Section 2.3.5), the distractors (see Section 2.3.6) and the occlusions (see Section 2.3.7). The more knowledge there is, the more information is available to the tracker to find correspondences of the target over time. For example, a background model [30], which allows the tracker to separate the target from the background, can substantially reduce the difficulty of a tracking task. However, if the background model does not describe the background well (e.g. the model assumes a static background, but a tree is moving in the background), it has a negative influence on the tracking and may lead to worse results than tracking without a background model.
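For illustration, such a background model can be as simple as a running average of the frames. The following minimal sketch (an assumption for illustration, not code from this thesis) separates a target from a mostly static background by thresholded differencing of NumPy grayscale frames:

```python
import numpy as np

def foreground_mask(frame, background, threshold=25):
    """Separate target from background by differencing against a model.

    frame, background: HxW float arrays (grayscale).
    Returns a boolean mask that is True where the frame deviates from
    the background model, i.e. where a target may be.
    """
    return np.abs(frame - background) > threshold

def update_background(background, frame, alpha=0.05):
    """Running-average update; assumes a mostly static background.

    If this assumption is violated (e.g. a tree moving in the
    background), the model degrades and may mislead the tracker.
    """
    return (1.0 - alpha) * background + alpha * frame
```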
Tracking can be done manually, interactively or fully automated. In manual tracking, the user determines the temporal correspondences of the target at each frame (e.g. by selecting the center position of the target). Such manual tracking is very time consuming and is only employed in cases where automated procedures fail to deliver the necessary accuracy (e.g. tracking landmarks on faces of facial palsy patients [37]). Interactive tracking comes with different degrees of interaction. Some tracking approaches use a manual initialization of the target by the user in the first frame of the video. Other trackers ask for several manual corrections during tracking (e.g. when the target is lost). Fully automated tracking does not employ any user interaction. In general, the higher the degree of interaction, the easier the tracking (under the assumption that the input of the user is correct).

2.3.2 Output

Each tracking approach delivers an output, where the properties of this output depend on the task. The higher the dimensionality of the expected output, the higher the level of difficulty. A simple output is a binary one, where 1 indicates that something moved between two frames and 0 stands for no motion (i.e. two consecutive frames are equal). For most tracking tasks such a binary output is not sufficient, and the output consists of at least one or more trajectories (i.e. a sequence of target positions) in 2D or even 3D. Besides trajectories, some tasks require the output of the orientation, area, shape or even pose of a target.

A tracking task typically requires a certain accuracy of the output. High accuracy is usually related to a higher level of difficulty and/or computational complexity. Accuracy can be quantified by the deviation from the ground truth. For example, if the expected output is the 3D pose of a rigid target, the deviation from the ground truth can be calculated by comparing the values of the six degrees of freedom of the result (rotation around every axis and translation in all three dimensions) against the values in the ground truth data. Depending on the task, it is necessary to deliver results with high accuracy (e.g. medical applications [37]) or low accuracy (e.g. real-time tracking for gaming).

2.3.3 Scene

A scene can be defined as a fraction of reality captured by the sensor of a camera in the form of a discrete grid of pixels.

The composition of a scene is relevant for the tracking task. It divides the scene into foreground (target) and background (everything except the target). Scenes where the target occupies a significant part of the space (e.g. a target of 300 × 300 pixels in a video with frames of 800 × 600 pixels) are generally easier to track than scenes where the target is only a marginal part. Concerning the background, the level of difficulty depends on the amount of clutter (low: easy; high: difficult). A uniformly colored background has no clutter (e.g. white wall), whereas a highly textured background has a high amount of clutter (e.g. library).

The illumination in a scene influences tracking, as most features (e.g. color and edges) are not invariant to changes in illumination. Constant illumination, which can be found in controlled, indoor environments, reduces the difficulty of a tracking task, whereas non-constant illumination increases the difficulty.

2.3.4 Target

In tracking, a target is the object of interest which is tracked over time (see Section 2.2).

The difficulty of a tracking task depends on the number of targets. Tracking a single target is computationally less expensive than tracking multiple targets. It is difficult to deliver accurate tracking results (close to the ground truth) for multiple targets in real time, while keeping the complexity of the algorithm low. Furthermore, it is easier to track an object of interest as a single target patch (e.g. a human) [26] than as related multiple target patches (e.g. a target patch for each body part) [12].
Please note that tracking multiple targets can also be an advantage: it can be easier to track several targets than only one. For example, in a scene with pedestrians it can be difficult to avoid mixing them up, but if all pedestrians are tracked separately this problem becomes easier.

The type of a target is a relevant issue. One can distinguish between rigid (e.g. mobile phone), non-rigid (e.g. face) and articulated targets (e.g. robot arm). Articulated targets are built of several parts connected via points of articulation [11]. The most challenging target to track is probably an articulated target consisting of non-rigid parts (e.g. a human).

Furthermore, the spatial distribution of a target is of importance. The pixels of compact targets are more or less evenly distributed in space (2D: x- and y-axis; 3D: x-, y- and z-axis). Examples of compact targets are a face, a ball and a car. In non-compact targets, the pixels are unevenly distributed in space (e.g. airplane, nail and train).

2.3.5 Motion

The motion in a video sequence can originate from the motion of the camera or from the motion of entities in the scene (e.g. humans, cars, animals and trees moving in the wind). Videos where both the camera and entities in the scene move are more difficult to track than videos with only one source of motion.

The difficulty of motion can be quantified based on the degrees of freedom (DOF). A train moving along a track has one DOF, where the position of the train is determined by the traveled (moved) distance along the track. Rigid targets moving in 2D space have up to three DOF (translation along the x-axis, translation along the y-axis, rotation in the 2D plane). In 3D space, a rigid target may move with up to six DOF (rotation around every axis and translation in all three dimensions). For a non-rigid target, the quantification of the DOF of its motion is more complex. One possibility is to split the non-rigid target into patches and define the DOF of each patch. Another possibility is to approximate the motion of the non-rigid target by the DOF of a rigid target. This solution is suitable for cases where the focus lies on the global motion of the target rather than small local movements.

The third property of the factor motion is speed. In tracking, speed is often measured by how many pixels a target moves between two frames. For a target point it is easy to determine how many pixels it moved, but for a target patch or a target structure it is more complicated. If a target patch rotates around its center, the speed of its pixels differs: pixels closer to the center move slower than pixels farther away. A solution is to pick the maximum pixel distance of a target as the representative speed for the whole target. To categorize a certain speed as fast or slow, it is necessary to consider the size (in pixels) and type (shape) of the target. For example, the speed of a target point moving with 20 pixels per frame is considered fast, while the speed of a target patch of 200 × 200 pixels moving with 20 pixels per frame is considered moderate or even slow. If the target is non-compact (e.g. a pen), the speed of motion along its major axis needs to be treated differently than along its minor axis. Fast motion (high speed) is more difficult to track than slow motion (low speed). In addition, it is important to consider whether the speed is constant or non-constant.
A target with non-constant speed is more difficult to track, because its position cannot be estimated by a simple linear motion model.

2.3.6 Distractors

A distractor is an entity in the scene of a video which may distract the tracker from the target, leading to inaccuracy (deviation from the ground truth) or even tracking failure (the target is lost). The higher the number of distractors and the smaller their distance to the target, the higher the possibility that the tracker is influenced, and the higher the level of difficulty. The distance is measured in pixels and depends on the size and shape of the target and the distractor itself. For example, if both the target and the distractor are compact and have a diameter of ten pixels, a distance (measured from the centers) of 20 pixels or less is close. Under different circumstances, with a target and a distractor of smaller size (e.g. a diameter of five pixels), a distance of 20 pixels is less problematic.

Besides the distance, the similarity of a distractor to the target influences the level of difficulty. Similarity can be measured based on different criteria (e.g. appearance, geometry or motion). Dealing with distractors becomes more difficult the more similar the distractors are to the tracked target.

2.3.7 Occlusions

During an occlusion, the target or parts of the target become invisible. An occluder can be a static entity in the scene (e.g. a traffic sign), a moving entity (e.g. a car) or a part of the target (e.g. a leg of a human occluding the other leg while walking, a so-called self-occlusion). If a target leaves the scene, this can also be called an occlusion.

An important property of an occlusion is its degree. The degree quantifies how much of the target is occluded. This can be measured in percent for target patches or in the number of vertices for target structures like graphs. In general, the higher the degree of occlusion, the higher the level of difficulty. Full occlusions, where the whole target becomes invisible, are the most difficult cases. In such situations trackers rely on prior knowledge (e.g. the motion during previous frames) to estimate the motion and behavior of the target during the occlusion.

Besides the degree, it is also relevant to consider the duration of an occlusion. The longer the occlusion, the higher the possibility for tracking errors, and the higher the difficulty. Usually a tracker estimates the state of an occluded target based on the state of the target before the occlusion. Due to changes in the motion of the target, tracking errors accumulate over time. Thus, the longer the occlusion, the higher the accumulated error.

As for distractors (see Section 2.3.6), occlusions are especially problematic if the occluders are similar to the target.

2.3.8 Rating Levels of Difficulty

Based on the factors described above (see Figure 2.2), the level of difficulty of a tracking task can be rated. To the best of my knowledge, there is no standard approach for the rating of a tracking task. On the contrary, this is an open issue in the field of tracking. Rating the difficulty of a tracking task is relevant, as it allows one to better understand the strengths and weaknesses of a certain approach, and it helps to judge the quality of the delivered results. Setting up a rating system is a challenging and complex task, because one has to consider the factors with their properties, their temporal dynamics (e.g.
the degree of an occlusion changes over time due to the motion of the occluder) and their inter-dependencies (e.g. the type of the target limits the DOF of the motion). In Table 2.2 a simple rating system is proposed, which is used in the discussions in Chapters 3 and 4. The speed of a target v can be calculated as follows:

    speed = ||p(v, t) − p(v, t−1)||₂ / size(v),    (2.1)

where size(v) is the length of the major axis of target v and p(v, t) is the position of target v in frame t (consequently, the size of a target point is 1). As this thesis is about tracking related multiple targets, the speed of the fastest target is chosen as representative for all other targets.

Table 2.2: Rating the level of difficulty of tracking tasks. This rating results in a vector of scalars, where each element represents the difficulty (0 = not occurring, 1 = low, 2 = medium, 3 = high) of the related factor. For example (1, 1, 2, 3, 2, 2, 1, 0), where the ordering of the factors is as listed in this table.

  Factor      | Low difficulty (1)                                | Medium difficulty (2)                                                                                       | High difficulty (3)
  Input       | number of targets known and manual initialization | number of targets unknown and manual initialization OR number of targets known and no manual initialization | number of targets unknown and no manual initialization
  Output      | single trajectory                                 | multiple trajectories                                                                                       | multiple trajectories in a hierarchy
  Scene       | background not cluttered                          | background cluttered                                                                                        | background highly cluttered
  Target      | rigid                                             | articulated with rigid parts                                                                                | articulated with non-rigid parts
  Motion      | DOF ≤ 3 and speed < 1.0                           | 3 < DOF < 6 or speed ≥ 1.0                                                                                  | DOF ≥ 6 or speed ≥ 5.0
  Distractors | similarity < 0.5                                  | similarity ≥ 0.5                                                                                            | similarity ≥ 0.7
  Occlusions  | degree < 0.5, duration < 0.5 and similarity < 0.5 | degree ≥ 0.5, duration ≥ 0.5 and similarity ≥ 0.5                                                           | degree ≥ 0.7, duration ≥ 0.7 and similarity ≥ 0.7

The similarity of a distractor or an occluder towards a target is determined based on their appearance (i.e. a distance measure is employed to calculate the similarity). If several distractors or occluders appear, the value of the most similar one is selected. Furthermore, I only consider distractors which have an influence on the tracker (which are close to the target(s)). The degree of occlusion is measured by how many targets are occluded. For example, if ten target patches are tracked and five of them are occluded, the degree of occlusion is 0.5. The duration of occlusion is the percentage of frames of a video where not all targets are visible (independent of the degree of occlusion).
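To make these measures concrete, the following minimal sketch (illustrative only, not code from this thesis) computes the speed of Equation (2.1) and the degree and duration of occlusion, assuming trajectories and visibility flags are given as NumPy arrays:

```python
import numpy as np

def speed(positions, size):
    """Equation (2.1): per-frame speed of a target, normalized by its size.

    positions: Tx2 array of 2D positions p(v, t); size: length of the
    major axis of the target (1 for a target point).
    """
    return np.linalg.norm(np.diff(positions, axis=0), axis=1) / size

def occlusion_degree(visible):
    """Fraction of targets occluded in one frame (visible: boolean array)."""
    return 1.0 - np.mean(visible)

def occlusion_duration(visible_per_frame):
    """Fraction of frames in which not all targets are visible.

    visible_per_frame: TxN boolean array (frames x targets).
    """
    return np.mean(~np.all(visible_per_frame, axis=1))
```

As stated above, for related multiple targets the maximum of the per-target speeds would be taken as the representative value.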
2.4 Components of a Tracker

A tracker consists of several interdependent components. Maggio et al. identify five components [70]: (i) feature extraction, (ii) target representation, (iii) finding correspondences, (iv) track management and (v) meta-data extraction. Besides these five components, tracking frameworks often include pre- and post-processing steps. Possible pre-processing steps are denoising [3] and segmentation [33, 61]. During post-processing, a common step is to smooth trajectories.

This thesis contributes to two of the five components: target representations (see Papers A and B) and finding correspondences (see Papers C and D). Hence, these two components are introduced in more detail in Sections 2.4.1 and 2.4.2.

The aim of feature extraction is to extract discriminative information from the input. Features are used to describe the target and to distinguish the target from the background of the scene. Depending on the input data, certain features are suitable for this purpose. There are three levels of feature extraction, where higher levels generally come with higher computational extraction costs [70]: low-level (color, gradient, motion), mid-level (edges, corners, regions) and high-level (background, objects).

Track management deals with the organization of trajectories. This component is especially important for applications with multiple targets where there is no user interaction (e.g. manual initialization in the first frame). In such cases, targets need to be detected automatically and a new trajectory is initialized through track management – this is called the birth of a target. The death of a target occurs if it leaves the scene or if it is occluded. It is particularly difficult to re-identify targets which have been occluded and become visible again. Mostly, these targets are identified as new targets and a new trajectory is created instead of continuing the existing trajectory.

The meta-data extraction is a component of the post-processing stage of a tracker. After acquiring the temporal correspondences, the task of the meta-data extraction is to extract additional data from these correspondences. Depending on the application, this meta-data could be 3D information used for 3D reconstruction, navigation commands for a robot, or recognized gestures which are used to interact with a game.

2.4.1 Target Representations

This section gives a structured overview of well-known, basic target representations. In practice, a target representation consists of a combination of these representations. A target representation aims to describe the target in a discriminative way so as to distinguish it from the background and other entities in the scene. In general, a suitable representation is chosen based on the application, the target and the expected output. The actual description of a target is called the target model in this thesis.

Representations can be distinguished based on the encoded information. The following three categories are distinguished in this thesis: (i) appearance representations, (ii) geometric representations, and (iii) graph-based representations. This section does not cover temporal representations (i.e. motion models), because they are not relevant for this thesis.

Every representation described in the following sections has parameters, which can change over time. For example, if the target is described by a 2D point, its parameters are the x- and y-coordinates in the image. These coordinates change if the target or the camera moves. Updating the parameters of a representation is challenging and remains an open problem [73, 31].

2.4.1.1 Appearance Representations

The appearance of a target is how it looks from a particular viewpoint and under certain illumination conditions [35].

Templates are a common appearance representation and date back to 1981 [67]. They describe the target by its pixel information, consisting of color or gray values and the corresponding coordinates. A template can be initialized by the user through a manual selection of a region of interest including the target, or by automatic detection. Furthermore, a template can be built from one or several images in a training set. Tracking algorithms employing a template representation often assume that the appearance of the target remains more or less the same throughout the video. However, this assumption is only valid for a limited field of applications.
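As an illustration of this constant-appearance assumption, template tracking is often implemented as a local search minimizing a pixel-wise dissimilarity such as the sum of squared differences (SSD). The following minimal sketch (not code from this thesis) follows that common scheme:

```python
import numpy as np

def track_template(frame, template, prev_pos, radius=10):
    """Locate a template in a frame by exhaustive local SSD search.

    frame:    HxW grayscale array; template: hxw grayscale array
    prev_pos: (row, col) of the template's top-left corner in the
              previous frame; radius: half-size of the search window
    Returns the top-left position with the smallest SSD.
    """
    h, w = template.shape
    best_pos, best_ssd = prev_pos, np.inf
    for r in range(max(0, prev_pos[0] - radius),
                   min(frame.shape[0] - h, prev_pos[0] + radius) + 1):
        for c in range(max(0, prev_pos[1] - radius),
                       min(frame.shape[1] - w, prev_pos[1] + radius) + 1):
            diff = frame[r:r + h, c:c + w].astype(float) - template
            ssd = np.sum(diff ** 2)
            if ssd < best_ssd:
                best_pos, best_ssd = (r, c), ssd
    return best_pos
```

Since the template is never updated in this sketch, it inherits the limitation discussed above.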
Histograms are another frequently used appearance representation. In comparison to templates, histograms generally do not encode position information, and the appearance information is usually quantized into a certain number of bins. Histograms represent a target by the statistical distribution of certain appearance information within a region of interest and are invariant to 2D transformations up to a certain degree. Color histograms are built from the color values of the pixels representing the target and are, for example, employed by the popular Mean Shift tracking algorithm [26, 63]. Maggio et al. [69] propose a target representation based on multiple color histograms, which are computed from a region of interest in a semi-overlapping manner to additionally integrate spatial information. Dalal et al. introduce histograms of oriented gradients in [28], where the idea is to describe a target by local histograms of image gradient orientations. This is realized by dividing the region of interest containing the target into “cells” (small spatial patches) and by extracting a local 1D histogram of gradient directions or edge orientations from each cell. The combined histogram entries form the representation. Figure 2.3 shows examples for a template and a histogram representation.

Figure 2.3: (a) Image with selected target patch (region inside blue rectangle: face); (b) Template of the target patch; (c) RGB histograms of the target patch with 16 bins for each color channel.
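For illustration, RGB histograms like those in Figure 2.3 (c) can be computed in a few lines; this minimal sketch (not code from this thesis) assumes the target patch is given as an h×w×3 NumPy array:

```python
import numpy as np

def rgb_histograms(patch, bins=16):
    """Quantize each color channel of a target patch into `bins` bins.

    patch: hxwx3 uint8 array (the region of interest).
    Returns one histogram per channel; position information is
    discarded, which is what makes the representation robust to 2D
    transformations up to a certain degree.
    """
    return [np.histogram(patch[:, :, c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
```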
2.4.1.2 Geometric Representations

The simplest representation for a target is through a single point, which describes a certain 2D or 3D position. This kind of representation has been thoroughly researched by the radar community [70]. A set of points is commonly used to represent and track corner points [89, 57] for applications like structure from motion [5] and video object segmentation [36, 11].

Primitive shapes describe a target with a generalizing, primitive 2D shape (e.g. bounding rectangle or ellipse) or a 3D volume (e.g. bounding cylinder or sphere). In comparison to the point representation, primitive shapes provide additional information about the size of a target. Neither representation is able to give detailed information about the spatial decomposition of a target. Hence, they are suitable for compact, rigid targets (e.g. man-made objects like cars). A popular application of this representation is tracking with Mean Shift [26, 63]. Bradski [22] additionally extracts the orientation of the target.

Representations giving more information about the spatial decomposition of the target are the convex hull, silhouette and contour. An intuitive explanation of the convex hull is to imagine an elastic band stretched around a given target to encompass it [102]. Convex hulls are used for human action recognition from videos [68, 24]. Silhouette and contour are related: a contour is a sequence of coherent pixels separating a target from its background – a boundary, whereas a silhouette is the region inside a contour. They are both suitable for tracking complex non-rigid shapes [99] or for action recognition [96, 43]. A strength of contours and silhouettes is that they are robust to changing illumination conditions. Nevertheless, initializing and updating the representation is challenging with regard to the problem of separating the target from the background.

Especially popular in applications with articulated targets (e.g. humans) are skeletons. They can be extracted by using the medial axis transformation [19] or the distance transformation [56]. An unsolved problem with skeletons lies in their extraction, as it is not robust against noise. Applications for skeletons are, for example, motion capture [40, 103] and human activity detection [81]. Figure 2.4 shows the presented geometric representations.

Figure 2.4: (a) Point; (b) Primitive shape (rectangle); (c) Convex hull; (d) Contour; (e) Silhouette; (f) Skeleton. The lines in these drawings are exaggerated for better visibility (e.g. the point representation is not a point but a small circle).

2.4.1.3 Graph-based Representations

Graph-based representations are used in various fields (mathematics, computer science, etc.) and have a long history of research [46, 64]. They can be employed to represent different kinds of information (e.g. appearance, shape, and spatial, hierarchical and temporal relationships).

The baseline representation is a graph G consisting of a set of vertices V, which are connected via a set of edges E ⊂ V × V. The simplest way to represent a digital 2D image is by a 4- or 8-neighborhood graph, where each pixel is represented by a vertex and the edge structure describes the local neighborhood of each vertex. Such a representation consists of the same number of vertices as there are pixels in the image. In general, representations used for tracking try to limit the stored information to relevant, discriminative information, which reduces processing time and increases the performance of a tracker.

A graph may be weighted, where weights w are stored with the edges and/or vertices. These weights are normally real numbers. Weighted graphs are attributed graphs (AG). In this thesis, the term attributed graph is used for graphs where the vertices store features (e.g. discriminative features extracted from the region covering a target patch) and not only real numbers. The edges of an attributed graph encode the spatial relationships of the features (attributes). Typically, attributes are stored in the vertices, but there are attributed graphs which store attributes with other entities as well (e.g. edges and faces) [14]. The edges in a graph can be directed, resulting in a directed graph. In a directed graph, the weight wij ≠ wji in general, where vi and vj are the vertices, and eij and eji are the two possible oriented edges [64]. A graph is called a directed graph as soon as one edge is directed. In contrast, an undirected graph with weights requires that wij = wji for all edges. Please note that there are undirected graphs without weights.

A region adjacency graph (RAG) [93] represents adjacency between image regions. Typically, first an image segmentation algorithm is applied to group the pixels of an image into regions based on some homogeneity criterion (e.g. color). Then, the RAG represents each region with a vertex and connects vertices with edges where the corresponding regions are adjacent to each other.
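To make the idea of an attributed graph concrete, the following minimal sketch (illustrative only; the names and attributes are assumptions, not the representation used in the papers) stores appearance features in the vertices and spatial relationships in the edges:

```python
from dataclasses import dataclass, field

@dataclass
class AttributedGraph:
    """Vertices store appearance features; edges store spatial relations."""
    vertex_attrs: dict = field(default_factory=dict)  # vertex id -> feature
    edge_attrs: dict = field(default_factory=dict)    # frozenset({i, j}) -> attribute

    def add_vertex(self, v, feature):
        self.vertex_attrs[v] = feature

    def add_edge(self, i, j, length):
        # Undirected edge: the stored attribute is the expected spatial
        # distance between targets i and j (so w_ij = w_ji).
        self.edge_attrs[frozenset((i, j))] = length

# Example: a toy model of two related targets (e.g. two body parts).
g = AttributedGraph()
g.add_vertex("head", feature=[0.2, 0.5, 0.3])   # e.g. a quantized color histogram
g.add_vertex("torso", feature=[0.1, 0.1, 0.8])
g.add_edge("head", "torso", length=40.0)        # expected distance in pixels
```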
Graph-based representations are also used to describe hierarchies. A simple hierarchy is a regular pyramid, which consists of at least two levels. Each level can be represented by a graph, but it is more common to represent the levels by arrays (more efficient). Regular pyramids represent parent-child relationships between the successive levels with edges. In a regular pyramid, the input image is the base level. The successive levels reduce the data by a constant reduction factor λ > 1.0 and with reduction windows of constant size. Regular pyramids are an efficient representation, as their vertical structure is fixed, but they are not invariant to translation, rotation and scaling, and they do not preserve connectivity [61]. Irregular pyramids try to overcome the drawbacks of regular pyramids. They are shift-invariant and adapt to the image data. Hence, the structure of irregular pyramids is not fixed, and their building process depends on the data [61]. The levels of an irregular pyramid are frequently represented by graphs, and they can be built using dual graph contraction [85, 60] or graph decimation [75].

2.4.2 Finding Correspondences

The objective of this component is to find temporal correspondences of a target based on the inputs (video, knowledge and interaction) and the target model. A tracking approach finds the correspondence between the status of a target in frames t and t + 1, where the correspondence is a relative, n-dimensional vector. Its dimensionality depends on the elements of the target model. The components of the correspondence vector are divided into similarity components and change components. Similarity components are used by the tracking approach to establish the correspondence. For example, the model of a target point consists of its 2D position (x, y) and its color in RGB (r, g, b). This results in a 5D correspondence vector (x, y, r, g, b), where the RGB values are the similarity components. Imagine the position of the target point at frame t is (10, 10) and its color is (1, 0, 0). At frame t + 1, the employed tracking approach finds the correspondence of the target with the help of the similarity components, because at position (5, 15) the RGB values are equal to the model (i.e. the similarity is 100 %). The resulting correspondence is the relative vector (−5, +5, 0, 0, 0).

2.4.2.1 Single-hypothesis versus Multi-hypothesis Methods

One can distinguish methods for finding temporal correspondences based on the number of possible candidates [70]. Single-hypothesis methods find at each frame one candidate for the temporal correspondences of a target. Well-known trackers employing a single-hypothesis methodology are the Kanade-Lucas-Tomasi tracker [89, 67] and Mean Shift [26]. In contrast to that, multi-hypothesis methods work with multiple candidates. A popular multi-hypothesis method is the particle filter [17]. Single-hypothesis methods are computationally less expensive than multi-hypothesis methods. However, single-hypothesis methods are in general more sensitive to occlusions and have problems in dealing with distractors.

2.4.2.2 Processing Strategies

Traditional tracking approaches find temporal correspondences in a sequential frame-to-frame manner: at each frame, the correspondence of a target is found based on the status of the target in the previous frame and its similarity components. This frame-to-frame processing is not always the best choice, as correspondences based on the information of two consecutive frames can be prone to noise and errors. In [59], a forward-backward processing is proposed which allows one to reliably detect tracking failure and select valid correspondences.
Such a processing strategy is also helpful if the target is occluded [95]. Furthermore, if the temporal resolution is high (high-speed cameras) or the motion of the target is known, it is not necessary to process every frame to achieve reliable correspondences [52]. There are tracking approaches using space-time processing, where several frames (space-time volumes) are processed at once and the correspondences are estimated by fitting a motion model to the space-time data [16].

2.4.2.3 Local versus Global Optimization

The appearance and shape of a target usually change over time due to image noise, motion or occlusions. Hence, the correspondence at a certain frame is not equal to the target description in the model, but similar. Therefore, finding temporal correspondences can be seen as an optimization problem. In tracking, one can distinguish between local and global optimization. Local optimization searches for correspondences in a local neighborhood (often called search window), which is mostly around the position of the target in the previous frame [26]. Global optimization looks for the best correspondence in the whole image. Thus, local optimization is in general computationally less expensive. Apart from computational issues, local optimization is mostly inferior to global optimization. Fast or unexpected motion is problematic for local optimization, as the search window might be too small. If the target disappears due to occlusions, it is more difficult to search for its reappearance within a local neighborhood instead of the whole image. Setting the size of the search window itself is challenging, especially if the target undergoes scale changes.

There is another interpretation of local and global optimization in the case of related multiple targets. Let us assume the object of interest is a target structure consisting of vertices and edges. Local optimization finds correspondences by considering local neighborhoods of the target structure (e.g. a vertex, its incident edges and direct neighbors) [12, 14], whereas global optimization finds correspondences by taking into account the whole target structure [27].

This is the end of this chapter about basic knowledge in tracking. The introduced terms, concepts and methods are relevant for the following Chapters 3 and 4, and the selected publications (Papers A, B, C, and D).

CHAPTER 3
Initialization of Target Models

This chapter is about the initialization of target models. Section 3.1 is about the motivation and the concept of the proposed approach. Section 3.2 is a recall of irregular dual graph pyramids, which are the basis of the presented approach. Section 3.3 summarizes the selected publications Paper A and Paper B and discusses the most important results.

3.1 Motivation and Concept

The representation of the target object is an important component of a tracker (see Section 2.4). Desirable properties of a target model are:

discriminative: to distinguish the target from the background and similar entities in the scene (distractors);

invariant: to different kinds of transformations due to the motion and pose changes of the target;

efficient: to avoid storing redundant and unnecessary information and to allow an easy extraction of the relevant information.

The quality of a target model depends on its initialization. There are tracking approaches where the user manually initializes the model.
The advantage of such an initialization is that humans are in general able to easily initialize the desired target even if the scene is difficult (e.g. contains distractors). Disadvantages are the effort of doing the initialization manually, the subjectivity of the user (different users may initialize the same target differently) and the required knowledge of the user (some applications require expert knowledge).

There are approaches which initialize the target model automatically. One possibility is to solve the initialization as a recognition task [34, 32]. The success of such approaches highly depends on the training methodology and the training set. If the target is similar to the training examples, such approaches perform nicely and deliver a reliable initialization. In cases where the variability of the target is limited, such an initialization is a good choice. Nevertheless, there are applications where the targets differ or unexpected targets occur. Furthermore, even though such approaches can be fully automatic, training sets are mostly at least partly labeled by users, which again brings in the previously mentioned disadvantages (effort, subjectivity and expert knowledge).

The solution I chose in this thesis is based on segmentation and does not require any prior knowledge or training. In computer vision, segmentation is often applied to single images to group their pixels into similar or even meaningful regions [33, 49, 93, 90, 21, 61]. In the case of image segmentation, these pixels are grouped based on certain criteria, which are often similarity in color and spatial proximity. Papers A and B present an approach based on irregular dual graph pyramids to extract a target model for rigid or articulated targets by applying segmentation in the temporal domain (video). The basic idea is: “things that move together belong together”. Hence, the proposed grouping criterion is the observed motion in the input video. This idea stems from the field of cognitive psychology. In 1973, Gunnar Johansson published a well-known work on biological motion [58]. He made the observation that humans are able to recognize a human figure based on a few bright spots moving along the major joints of a human body. The biomotion lab of Prof. Dr. Nikolaus Troje at Queen’s University in Kingston, Ontario offers demos about biological motion on their website [2].

3.2 Irregular Dual Graph Pyramids

This section introduces irregular dual graph pyramids, which are the basis of the proposed approach for target model initialization in Papers A and B. Its aim is to shortly explain the most important concepts. For detailed information on irregular dual graph pyramids and the underlying graph theory the reader is referred to [61, 60].

Before explaining the building process of an irregular dual graph pyramid, it is necessary to introduce planar graphs and dual graphs. A graph $G = (V, E)$ is planar if it can be embedded into the plane $\mathbb{R}^2$, where all vertices $V \subset \mathbb{R}^2$, every edge $e \in E$ is an arc between two vertices and no two edges cross each other. A planar graph $G$ divides the plane into a set of faces $F$. These faces and their adjacencies can be represented by the dual graph $\bar{G} = (\bar{V}, \bar{E})$. Each $\bar{v} \in \bar{V}$ represents a face $f \in F$. Every pair of vertices in $\bar{G}$ whose corresponding faces $f \in F$ share a common edge $e \in E$ is connected by an edge $\bar{e} \in \bar{E}$, so that the edges $e$ and $\bar{e}$ cross. There is a one-to-one correspondence between the vertices $V$ of $G$ and the faces $\bar{F}$ of $\bar{G}$, and between the edges $E$ of $G$ and the edges $\bar{E}$ of $\bar{G}$. Furthermore, the dual of $\bar{G}$ is again $G$ [61]. In the following, the graph $G$ is called primal graph and $\bar{G}$ is called dual graph.
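As an illustration of this duality, the following sketch builds the dual of a triangulation given as a list of faces. The face representation (triples of vertex indices) is an assumption made for this example, and the background face is omitted, as in Figure 3.1(b).

```python
from itertools import combinations

def dual_graph(faces):
    """Build the dual of a triangulation. faces: list of faces, each a
    triple of vertex indices. Returns the dual edges as pairs of face
    indices: two faces are adjacent if they share a primal edge."""
    edge_to_faces = {}
    for f_idx, face in enumerate(faces):
        # Every pair of face vertices is one primal edge of the triangle
        for u, v in combinations(sorted(face), 2):
            edge_to_faces.setdefault((u, v), []).append(f_idx)
    # A primal edge shared by exactly two faces induces one dual edge;
    # border edges (one incident face) would connect to the omitted
    # background vertex and are skipped here.
    return [tuple(fs) for fs in edge_to_faces.values() if len(fs) == 2]
```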
Irregular dual graph pyramids are built in a bottom-up manner, where level $G_k$ results from dually contracting the preceding level $G_{k-1}$ (see Figure 3.1). Hence, such a pyramid is a stack of successively reduced planar graphs $P = \{(G_0, \bar{G}_0), \ldots, (G_n, \bar{G}_n)\}$, where $G_0$ is the bottom level, $G_n$ is the top level and $n$ is the height of the pyramid [61]. The dual graph contraction of each level $(G_k, \bar{G}_k)$, $0 < k \leq n$, consists of two steps [61]:

1. Edge contraction in $G_{k-1}$ if the corresponding vertices should be merged based on a certain similarity criterion, which is equivalent to edge removal in $\bar{G}_{k-1}$;
2. Edge removal in $G_{k-1}$ to simplify the structure, which is equivalent to edge contraction in $\bar{G}_{k-1}$.

For the dual graph contraction, it is necessary to define so-called contraction kernels. A contraction kernel is a tree of depth one consisting of a subset of non-surviving edges and a subset of vertices, where one vertex is called the surviving vertex. In Papers A and B, the non-surviving edges for the contraction kernels are selected by the Minimum Spanning Tree algorithm of Borůvka [78]. The vertices of a contraction kernel in level $k-1$ form the reduction window $W(v)$ of the respective surviving vertex $v$ in level $k$. The receptive field $F(v)$ of $v$ is the (connected) set of vertices from level 0 that have been “merged” to $v$ over the levels $0 \ldots k$. All surviving vertices of level $k-1$ make up the set of vertices $V_k$ of graph $G_k$ after the contraction process.

3.3 Summary of Selected Publications

This section summarizes the selected publications about initializing target models and discusses the most important results.

3.3.1 Paper A

Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Hierarchical spatio-temporal extraction of models for moving rigid parts. Pattern Recognition Letters, 32(16):800–810, December 2011. (Published by Elsevier: http://www.sciencedirect.com/science/article/pii/S0167865511001401)

Figure 3.1: Example of an irregular dual graph pyramid. (a) Graph $G_0$: a triangulation with faces having different gray values. (b) Dual graph $\bar{G}_0$ of the triangulation, without the background vertex for better visibility. (c) Graph pyramid: contracted edges are marked with an arrow. Reprinted from [11] with permission from Elsevier.

3.3.1.1 Summary

This paper proposes a novel approach for the initialization of target models for target structures (related multiple targets) based on an input video. Each rigid entity in the scene of the video (e.g. the static background and each body part of a human) is identified and represented by a hierarchical graph model. Besides the video, there is no other input, i.e. no prior knowledge (e.g. number of targets and type of target) and no interaction (e.g. manual initialization by a user). Hence, the approach is fully automatic and the built target model is based on the information which can be extracted from the video. The first step is to extract motion information from the video. This is realized by tracking a set of target points over time. In Papers A and B, the set of target points is tracked with the help of the Kanade-Lucas-Tomasi tracker [20].
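The following sketch illustrates how such trajectories can be extracted with a KLT-style tracker. It uses OpenCV's implementation rather than the one of [20], and the parameter values are illustrative only.

```python
import cv2
import numpy as np

def extract_trajectories(frames, max_points=150):
    """Track salient points over a whole video and return one trajectory
    (list of 2D positions) per point. Only points tracked through every
    frame yield the complete trajectories required by Papers A and B."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    pts = cv2.goodFeaturesToTrack(gray[0], max_points,
                                  qualityLevel=0.01, minDistance=7)
    trajectories = [[p.ravel()] for p in pts]
    alive = np.ones(len(pts), dtype=bool)
    for prev, cur in zip(gray, gray[1:]):
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None)
        alive &= (status.ravel() == 1)  # drop points lost by the tracker
        for i, p in enumerate(pts):
            if alive[i]:
                trajectories[i].append(p.ravel())
    # Keep complete trajectories only
    return [t for i, t in enumerate(trajectories) if alive[i]]
```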
The higher the density of the target points, the more motion information is collected in the form of 2D trajectories. Please note that the aim of this step is not to tackle difficult cases in tracking (e.g. occlusions), but to extract the motion information for the following processing steps. Hence, any arbitrary tracking approach can be employed. The following steps are based on the extracted trajectories only (the video is not used anymore).

To apply the basic idea “things that move together belong together”, it is necessary to identify whether there is some correlation in the motion of the target points (things). This is realized by representing the initial spatial configuration of the target points (i.e. their 2D positions in the first frame) by a triangulated graph. The vertices of the graph represent the target points and its edges their spatial relationships and distances. By keeping the same graph structure (connectivity between vertices), it is possible to observe a deformation of the graph due to the motion encoded in the trajectories. This deformation of the triangulation delivers the necessary information about the correlation in motion.

Figure 3.2: From left to right: A triangulated graph is built based on the positions of the target points at time t. While keeping the connectivity of the vertices, the graph is deformed over time based on the motion encoded in the trajectories. Looking at the deformation over time, it is possible to filter out triangles which separate rigid entities and are not of interest for the anticipated aim of the approach (identifying and modeling rigid entities).

For each triangle (face) in the graph, the proposed approach answers two interdependent questions: (i) Does it lie on a rigid entity? (ii) To which rigid entity does it belong? The first question is answered in a filtering step, where each triangle is labeled as relevant or separating. Relevant triangles probably lie on a rigid entity and are processed further. Separating triangles connect rigid entities and are filtered out. Figure 3.2 visualizes the deformation of the triangulation over time and the filtering step. The answer to the second question (To which rigid entity does it belong?) is found with the help of an irregular dual graph pyramid (see Section 3.2). First the dual graph of the triangulation is built, where each vertex represents a triangle (face) and the edges describe their adjacency. Starting from this dual graph (base level), an irregular dual graph pyramid is built. The grouping criterion used for the building process of the pyramid is the observed motion of the triangles: their corresponding vertices are merged if their motion is similar. In the ideal case, every vertex in the top level of the pyramid represents one rigid entity of the scene and its receptive field allows the identification of the corresponding triangles. Figure 3.3 visualizes the observation of motion and the grouping based on the irregular dual graph pyramid.

Figure 3.3: Relevant triangles can be grouped based on their similarity in motion with the help of an irregular dual graph pyramid. Each top vertex of the resulting pyramid represents a rigid entity of the scene.
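Before turning to the experiments, a minimal sketch of the filtering idea: a triangle whose side lengths vary strongly while the graph deforms is likely to separate rigid entities. The variation measure and the threshold are illustrative assumptions; Paper A defines its own criterion.

```python
import numpy as np

def label_triangles(trajs, faces, rel_threshold=0.1):
    """Label each triangle as 'relevant' (probably lies on a rigid entity)
    or 'separating' (connects rigid entities), based on how much its side
    lengths vary while the triangulation deforms over time.
    trajs: array (n_points, n_frames, 2); faces: triples of point indices."""
    labels = []
    for i, j, k in faces:
        variations = []
        for a, b in ((i, j), (j, k), (k, i)):
            d = np.linalg.norm(trajs[a] - trajs[b], axis=1)  # length per frame
            variations.append(np.std(d) / np.mean(d))        # relative variation
        labels.append('relevant' if max(variations) < rel_threshold
                      else 'separating')
    return labels
```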
3.3.1.2 Discussion

This section discusses the most important results of Paper A. For Paper A, six experiments were conducted. This discussion covers the experiments “human 1” and “human dancing 1” (these are the names of the experiments in Paper A). For the convenience of the reader, some of the corresponding figures of [11] are displayed in this section. For all videos, the ground truth was gathered manually. Furthermore, only complete trajectories are used (trajectories which cover the whole length of the video). Each discussed experiment is rated based on its level of difficulty according to Table 2.2 in Section 2.3.8.

Figure 3.4: Three frames of the video “human 1” with the current state of the graph. White triangles are relevant for the grouping process in the irregular dual graph pyramid. Gray triangles separate foreground and background and are filtered out. Reprinted from [11] with permission from Elsevier.

Human 1: This experiment was about identifying the rigid entities of a human based on in-plane motion. The video sequence is self-produced. Information about the video: 640 × 480 pixels, 860 frames, 134 target points and level of difficulty = (3, 3, 1, 2.5, 1, 0, 0). The grade 2.5 for the target results from the partially non-rigid motion, which appears due to the behavior of clothing and skin. Figure 3.4 shows three frames, where the current state of the graph is drawn. It is noticeable that with the help of the filtering step one can achieve a foreground-background separation for in-plane motion.

Figure 3.5 visualizes the final outcome of the proposed approach. All six rigid entities are identified and all 180 triangles are correctly associated, with no errors or outliers. This result is the best-case scenario, where each rigid entity of the target is identified and the foreground is correctly separated from the background. Each vertex in the top level of the irregular pyramid represents one rigid entity.

Figure 3.5: Outcome of experiment “human 1”. Equally colored triangles belong to the same rigid entity. Reprinted from [11] with permission from Elsevier.

This ideal result could be achieved due to several factors: (i) the video quality was adequate, (ii) the motion of the target was controlled and mostly in-plane, (iii) the non-rigid motion due to clothing was reduced to a minimum.

Human dancing 1: In this sequence, the task is again to identify the rigid entities of a human, but under out-of-plane motion. The source of the video sequence is [98]. Information about the video: 360 × 240 pixels, 62 frames, 83 target points and level of difficulty = (3, 3, 1, 3, 1, 0, 0). The grade 3 for the target results from the non-rigid motion, which appears due to the behavior of clothing and skin. In comparison to the experiment “human 1”, this video is considerably shorter (less than 10 % of the length of “human 1”). Furthermore, the non-rigid motion is especially problematic on the shirt of the human subject. This non-rigidity is difficult to handle for the presented approach as it looks for rigid entities. The filtering step is skipped in this experiment as the provided trajectories only describe the foreground.

Figure 3.6 shows the triangulated graph and the final result. Four rigid entities were identified, with 366 correctly associated triangles and 25 outliers. For this experiment, it is difficult to decide on the ground truth. The human subject does not properly rotate the upper arms. Therefore, the four rigid entities, where the upper arms are grouped with the torso, are evaluated as correct. This experiment verifies that the proposed approach also works for out-of-plane motion. I expect even better results (with fewer outliers) for videos with more explicit movements.
Figure 3.6: (a) Triangulated graph. (b) Result of the grouping by the irregular pyramid. Reprinted from [11] with permission from Elsevier.

3.3.2 Paper B

Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Spatio-temporal extraction of articulated models in a graph pyramid. In 8th IAPR-TC-15 International Workshop on Graph-Based Representations in Pattern Recognition, volume 6658 of Lecture Notes in Computer Science, pages 215–224, Münster, Germany, May 2011. Springer. (Published by Springer: http://link.springer.com/chapter/10.1007/978-3-642-20844-7_22)

3.3.2.1 Summary

Paper B continues the work of Paper A. The approach presented in Paper A is able to identify rigid entities based on the observed motion in a video sequence. For a target model it is also important to know how the rigid entities are connected. Hence, the approach proposed in this paper finds the points of articulation for articulated targets. In its current state, this approach is able to find points of articulation based on motion in the image plane.

A point of articulation connects two or more rigid entities. The presented approach is based on the knowledge that connected entities are able to move independently, but they always keep the same distance to the point of articulation. Figure 3.7 visualizes the two steps of the proposed approach: (i) generation of hypotheses for points of articulation and (ii) verification of these hypotheses. In contrast to the grouping of the triangles in the irregular pyramid, a point of articulation is not identified among the target points. It is possible that a point of articulation does not lie on a foreground entity (see Figure 3.7).

Figure 3.7: Each pair of points (red, green and blue) represents a rigid entity. For each pair of rigid entities a hypothesis for a point of articulation is generated. If a hypothesis is valid, the corresponding rigid entities should always keep the same distance to it. All hypotheses which do not fulfill this criterion are discarded. The valid hypothesis is marked with a circle and the invalid ones are crossed out.

For each pair of rigid entities a hypothesis for a connecting point of articulation is created. Let us assume that $A$ and $B$ are two rigid entities. For the generation of the hypothesis, it is not necessary to consider all target points of the rigid entities, but only two pairs of reference points: $\{v_1, v_2\}$ for $A$ and $\{v_3, v_4\}$ for $B$. Having the positions of these two point pairs at two time instances, the following system of equations can be built:

$$
\begin{cases}
p(v_1, t) = R_A \big(p(v_1, t-1) - p(a, t-1)\big) + p(a, t-1) + o\\
p(v_2, t) = R_A \big(p(v_2, t-1) - p(a, t-1)\big) + p(a, t-1) + o\\
p(v_3, t) = R_B \big(p(v_3, t-1) - p(a, t-1)\big) + p(a, t-1) + o\\
p(v_4, t) = R_B \big(p(v_4, t-1) - p(a, t-1)\big) + p(a, t-1) + o,
\end{cases}
$$

where $p(v, t)$ is the position of reference point $v$ at time $t$, $R_A$ and $R_B$ are the rotation matrices of the rigid entities and $o$ is an offset. By solving this system of equations, one can determine the position of the point of articulation $p(a, t-1)$, the offset $o$, and the elements of the rotation matrices $R_A$ and $R_B$, i.e. $\sin(\theta_A)$, $\cos(\theta_A)$, $\sin(\theta_B)$, $\cos(\theta_B)$. Geometrically, one can imagine that a pair of reference points builds a local coordinate system, where one of the reference points becomes the origin and the second reference point defines the orientation of the two axes.
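The following sketch illustrates this geometric interpretation: the articulation point is expressed once in the local frame of a pair of reference points and can then be reconstructed at every frame. The function names are hypothetical; this is not the solver used in Paper B.

```python
import numpy as np

def local_frame_coords(p1, p2, a):
    """Express point a in the local coordinate system spanned by the
    reference points p1 (origin) and p2 (defines the axis orientation)."""
    x_axis = (p2 - p1) / np.linalg.norm(p2 - p1)
    y_axis = np.array([-x_axis[1], x_axis[0]])   # perpendicular axis
    d = a - p1
    return np.array([d @ x_axis, d @ y_axis])

def articulation_trajectory(ref1, ref2, local_a):
    """Reconstruct the articulation point at every frame from a rigid
    entity's reference point trajectories. For a valid hypothesis, the
    trajectories reconstructed from both connected entities overlap."""
    out = []
    for p1, p2 in zip(ref1, ref2):
        x_axis = (p2 - p1) / np.linalg.norm(p2 - p1)
        y_axis = np.array([-x_axis[1], x_axis[0]])
        out.append(p1 + local_a[0] * x_axis + local_a[1] * y_axis)
    return np.array(out)
```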
Within this coordinate system, a point of articulation will always have the same position. This is equivalent to the previous statement that rigid entities keep the same distance to their point of articulation. By using this property, a hypothesis for a point of articulation can be verified. At each frame, the position of the point of articulation is determined using the local coordinate system of each “possibly” connected rigid entity. If the hypothesis is valid, the resulting trajectories of the point of articulation will overlap or at least be similar (see Figure 3.7). Besides finding the points of articulation, this paper studies the performance of the proposed approach for articulated targets with non-rigid parts, because for such targets the grouping is especially challenging.

Figure 3.8: Sequence 1: (a) Deformation of the edges of the triangulation over time. (b) Dual graph (in color) of the triangulation (in gray). The edges of the dual graph visualize the dissimilarity of the motion of the corresponding triangles. The color bar describes the used colors, where red is high and blue is low. Reprinted from [13] with kind permission from Springer Science and Business Media.

3.3.2.2 Discussion

This section discusses the most important results of Paper B. Three self-produced videos are used for the experiments in this paper. For the convenience of the reader, some of the corresponding figures of [13] are displayed in this section. As in the discussion of Paper A, only complete trajectories are employed (trajectories which cover the whole length of the video). Each discussed experiment is rated based on its level of difficulty according to Table 2.2 in Section 2.3.8.

Sequence 1: This video was also used in Paper A. It shows the in-plane motion of a human. Information about the video: 640 × 480 pixels, 860 frames, 134 target points and level of difficulty = (3, 3, 1, 2.5, 1, 0, 0). The grade 2.5 for the target results from the partially non-rigid motion, which appears due to the behavior of clothing and skin.

Figure 3.8 gives an insight into the motion of the triangles and visualizes the difficulty of this grouping problem. There are cases where the motion of adjacent triangles is not similar even though they belong to the same rigid entity. For example, tracking target points at the border between a rigid entity and the background is challenging and often results in unexpected drift (e.g. the target points on the head of the human close to the background in Figure 3.8). Furthermore, in local neighborhoods with non-rigid motion, the motion of the target points and of the related triangles is ambiguous. Figure 3.9 shows the outcome of grouping the triangles based on their motion with a global threshold. Due to the difficult nature of the motion (see Figure 3.8), it is not possible to achieve the correct result (six rigid entities) with this simple baseline approach.

Figure 3.9: Sequence 1: (a) Grouping with global threshold 0.25 results in too many rigid entities. (b) Grouping with global threshold 0.6 results in only one rigid entity for the left upper arm and the torso instead of two. The different colors visualize the identified rigid entities. Reprinted from [13] with kind permission from Springer Science and Business Media.

Sequence 2: This video shows the in-plane motion of a finger. Information about the video: 640 × 480 pixels, 674 frames, 112 target points and level of difficulty = (3, 3, 2, 2.5, 2, 0, 0).
The grade 2.5 for the target results from the partially non-rigid motion, which appears due to the behavior of skin. Figure 3.10 visualizes the outcome of the experiment on sequence 2. Even though there is non-rigid motion due to the skin of the finger, the proposed approach is able to find the four rigid entities of the target structure in the foreground (finger). Additionally, three points of articulation are correctly identified and connect the four rigid entities.

Sequence 3: In this synthetic sequence the task is again to identify the rigid entities of the target structure and to find the point of articulation. Information about the video: 640 × 480 pixels, 151 frames, 210 target points and level of difficulty = (3, 3, 2, 3, 2, 0, 0). The grade 3 for the difficulty of the target results from the non-rigid behavior of the parts of the articulated target. In Figure 3.11, it is shown that the rigid entities and their connecting points of articulation are identified. This is possible even under the non-rigid nature of the entities, because the global motion (rotation around the point of articulation) is more significant (the distance traveled between two frames is bigger) than the local non-rigid motion.

All in all, the proposed approach is able to successfully identify rigid entities and points of articulation in videos which are meant for initialization purposes. In videos with a higher level of difficulty for the factors occlusion and distractors, the quality of the results may decrease. The quality of the output highly depends on the provided trajectories (observation of motion), and thus on the performance of the tracking approach.

Figure 3.10: Sequence 2: (a) and (b) show the deformation of the triangulation in two selected frames. (c) Result of the grouping. (d) Detected points of articulation. Reprinted from [13] with kind permission from Springer Science and Business Media.

Figure 3.11: Sequence 3: (a) and (b) show the deformation of the triangulation in two selected frames. (c) Result of the grouping. (d) Detected point of articulation. Reprinted from [13] with kind permission from Springer Science and Business Media.

CHAPTER 4
Finding Temporal Correspondences

This chapter presents the ideas for finding temporal correspondences based on a graph model. Section 4.1 motivates and introduces the concept. Section 4.2 describes appearance-based tracking on the example of Mean Shift tracking. Section 4.3 summarizes the selected publications Paper C and D and discusses the most important results.

4.1 Motivation and Concept

Tracking an object of interest in a video is often solved by tracking the corresponding target patch based on its appearance. The employed target model is called an appearance-based model. Appearance-based models enable simple trackers [26] to quickly and efficiently find correspondences over time. They describe the appearance of a target patch by its texture or with feature descriptors, which can be as simple as a color histogram of a region of interest. Finding the best correspondence of a target patch is solved by searching for the position where the extracted visual information is as similar as possible to the appearance-based model. Unfortunately, tracking based on appearance alone often fails to overcome challenging situations like distractors and occlusions (see Section 2.3).

This thesis proposes an approach for tracking target structures which are represented by graph models. Graph models offer high representational power, allowing to describe both the appearance and the structure of the elements of a target structure. The structure within a target structure encodes spatial relationships and dependencies. It is an important invariant, i.e. it does not change due to the targets’ motion. From such a graph model, structural cues can be deduced, which enable a tracking approach to deal with tracking tasks of a high level of difficulty.
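A minimal sketch of such a graph model could look as follows; the class layout is an assumption made for illustration and not the data structure used in Papers C and D.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Vertex:
    position: np.ndarray    # 2D position of the target patch center
    descriptor: np.ndarray  # appearance descriptor (e.g. a color histogram)

@dataclass
class GraphModel:
    vertices: list                              # one Vertex per target patch
    edges: list = field(default_factory=list)   # pairs of vertex indices

    def rest_length(self, e):
        """Spatial relationship encoded by edge e: the initial distance
        between its two target patches (the structural invariant)."""
        i, j = self.edges[e]
        return float(np.linalg.norm(self.vertices[i].position -
                                    self.vertices[j].position))
```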
Unfortunately, tracking target structures by finding the most similar graph to the graph model (graph matching) is in general NP-hard. Recently, Solnon et al. [91] showed that in the case of planar graphs the matching can be solved in polynomial time; nevertheless, the degree of the polynomial may be high. Furthermore, the survey by Conte et al. [27] shows that research on graph matching in the temporal domain (video analysis) has been sparse over 30 years. Graph models and graph-based methods are typically used in areas like 2D and 3D image analysis, document processing, biometric identification and image databases [27].

In this thesis, a novel method for finding temporal correspondences for graph models is proposed. The basic concept of the approach is to track an object of interest as a target structure, which is represented by a graph model. The vertices of the graph model represent target patches, which are described by feature descriptors, and its edges encode spatial relationships. Instead of graph matching, a novel tracking approach is used, where the correspondence of each vertex in the model is found by combining the hypothesis of a simple appearance-based tracker with structural cues deduced from the graph model. The hypothesis of the appearance-based tracker for each vertex can be generated by any arbitrary tracker (in this thesis: Mean Shift). This hypothesis is combined in an iterative process with the structural cues extracted from the graph model. For each vertex of the graph model a correspondence is found, where the similarity in appearance is locally maximized and the deviation in structure is locally minimized. The proposed approach approximates a globally optimal solution, which could be achieved by graph matching. Figure 4.1 illustrates and motivates the concept based on an example with distractors.

Figure 4.1: Dealing with distractors with the help of structural cues deduced from a graph model. Left (2D example): The target structure consists of three target patches, which are the three marked black fields of this checkerboard. As the appearance of all target patches is the same, it is a challenging or even impossible task to find correspondences without mixing up the targets among each other or with other black fields on the board. Right (1D simplification): At the top, one can see a 1D function of the similarity in appearance with three local maxima (at each black field crossed by the green line). Without further information, finding the correct correspondences is error-prone. At the bottom, one can see the same 1D function of similarity in appearance, but with an additional curve representing the deviation from the spatial structure of the target structure (i.e. from the graph model on the left). Now, finding the correspondence can be solved by finding the position where the similarity in appearance is maximized and the deviation in spatial structure is minimized.
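The 1D simplification on the right of Figure 4.1 can be sketched in a few lines: among candidate positions, pick the one that maximizes appearance similarity while minimizing the deviation from the structurally expected position. The linear weighting with a fixed gain is an illustrative simplification of the iterative process used in this thesis.

```python
import numpy as np

def combined_correspondence(appearance, positions, expected_pos, gain=0.5):
    """1D toy version of Figure 4.1. appearance: similarity score per
    candidate position; expected_pos: position predicted by the graph
    structure; gain: weight of the structural cue."""
    structure_dev = np.abs(positions - expected_pos)
    structure_dev = structure_dev / (structure_dev.max() + 1e-9)  # to [0, 1]
    score = (1 - gain) * appearance - gain * structure_dev
    return positions[int(np.argmax(score))]
```

With three identical appearance maxima, as in the checkerboard example, the structural term breaks the tie in favor of the candidate closest to the expected spatial configuration.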
Figure 4.2: Mean Shift tracking procedure. This simple 1D example illustrates the iterative mode seeking process of Mean Shift. Left: Starting from the position of the target v in the previous frame, p(v, t − 1), Mean Shift searches within a local neighborhood (green rectangle) for the maximum. An offset vector is generated (red arc) pointing towards the position of the maximum. In appearance-based tracking, this is the position where the similarity in appearance is locally maximized. Middle: The search window is centered around the position of the maximum of the first iteration and the algorithm again looks for the maximum in the local neighborhood. Right: The final position p(v, t) is found and the algorithm converges, as the local maximum of the probability distribution has been reached.

4.2 Mean Shift Tracking

The aim of an appearance-based tracker is to find at each frame the position with the highest similarity in appearance with regard to the description of the target. In this thesis, I decided to use a tracking method based on the Mean Shift algorithm. The Mean Shift algorithm was proposed in 1975 by Fukunaga et al. [39] and is a non-parametric, iterative procedure. Mean Shift can be used for various tasks like clustering, image segmentation, image smoothing and object tracking [25, 26]. Here, the Mean Shift algorithm is employed to associate the vertices of the graph model from frame to frame, finding the locally optimal position for each vertex. Figure 4.2 illustrates the Mean Shift procedure. My implementation of Mean Shift mainly follows the ideas in [26]. Color histograms are often used in combination with Mean Shift, but as stated by Comaniciu et al. [26], other feature descriptors can be used as well. In this thesis, I employed color histograms (see Paper D) and Sigma Sets (see Paper C) to represent the appearance of the target patches. In the following, it is explained how the Mean Shift algorithm finds temporal correspondences using color histograms, as proposed by [26].

In the initial frame, a 3D color histogram is extracted for each vertex $v$ in the graph model from the corresponding target patch at position $p$ (i.e. the center of mass of the target patch). This color histogram is the target model $\hat{q}$ of the target patch and estimates its discrete distribution of color probabilities. Every dimension of the histogram corresponds to one channel of the RGB color space. $\hat{q}$ is usually divided into bins $u = 1 \ldots m$ to group similar colors. The discrete distribution of color probabilities is determined for each bin as follows [26]:

$$\hat{q}_u(p) = C \sum_{i=1}^{n} k\big(\|x_i^*\|^2\big)\, \delta\big(b(x_i) - u\big), \qquad (4.1)$$

where $C$ is a normalizing factor such that $\sum_{u=1}^{m} \hat{q}_u = 1$. $k$ is the Epanechnikov kernel [25] and is used to weight the pixels from which the histogram is created by their distance to $p$. $x_i$ are the pixels of the target patch and $b$ is a function mapping a pixel in the 2D image space to the corresponding histogram bin depending on its RGB value. $x_i^*$ are the normalized positions of the pixels, where the position $p$ in the center of the patch is $(0, 0)$. The idea behind the weighting with $k$ is that pixels close to $p$ have a greater influence on $\hat{q}$ than pixels farther away. $\delta$ is the Kronecker delta function.

At each frame and in every iteration of the Mean Shift procedure, a candidate model $\hat{p}$ is calculated by Equation (4.1) from the patch within the search window at the current position $p$ of $v$.
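A minimal sketch of Equation (4.1) for an RGB patch is given below; the bin count and the uint8 input are illustrative assumptions.

```python
import numpy as np

def target_model(patch, bins=8):
    """Kernel-weighted RGB histogram q of a target patch, in the spirit of
    Equation (4.1). Pixels near the patch center get higher weight via the
    Epanechnikov profile k(r) = 1 - r for r <= 1. patch: uint8 (h, w, 3)."""
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized squared distance of every pixel to the patch center
    r2 = ((xs - w / 2) / (w / 2)) ** 2 + ((ys - h / 2) / (h / 2)) ** 2
    k = np.maximum(1.0 - r2, 0.0)                 # Epanechnikov profile
    b = (patch // (256 // bins)).reshape(-1, 3)   # histogram bin per pixel
    q = np.zeros((bins, bins, bins))
    # Accumulate the kernel weight of every pixel into its color bin
    np.add.at(q, (b[:, 0], b[:, 1], b[:, 2]), k.ravel())
    return q / q.sum()                            # normalizing factor C
```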
With this candidate model and the target model, the new position $p'$ of the target $v$ is calculated, where this position is the local maximum within the search window [26]:

$$p' = \frac{\sum_{i=1}^{n} x_i\, w_i}{\sum_{i=1}^{n} w_i}\,.$$

$w_i$ weights the pixel positions $x_i$ based on the target and the candidate model:

$$w_i = \sum_{u=1}^{m} \sqrt{\frac{\hat{q}_u}{\hat{p}_u(p)}}\; \delta\big(b(x_i) - u\big).$$

Tracking the target patches of a target structure by multiple independent Mean Shift trackers is prone to the problems occurring in difficult tracking tasks. Difficult tracking tasks may come with cluttered backgrounds, articulated or even non-rigid targets, complex and fast motion (high degrees of freedom), distractors with a high similarity to the target, and occlusions hiding the visual appearance of the target for a long time. For detailed information about the level of difficulty of tracking tasks see Section 2.3.

4.3 Summary of Selected Publications

This section summarizes the selected publications about finding temporal correspondences with graph models and graph-based methods and discusses the most important results.

Figure 4.3: A hierarchical spring system represents a target structure by an attributed graph pyramid. The vertex in the top level represents one target patch covering the whole target. In the bottom level, there are multiple vertices, each representing a small target patch. The edges encode the spatial relationships in the bottom level and parent-child links between the levels. Reprinted from [12] with permission from Elsevier.

4.3.1 Paper C

Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Multi-scale 2D tracking of articulated objects using hierarchical spring systems. Pattern Recognition, 44(4):800–810, April 2011. (Published by Elsevier: http://www.sciencedirect.com/science/article/pii/S0031320310005091)

4.3.1.1 Summary

This paper presents a flexible framework to track rigid and articulated target structures through multiple scales and occlusions in 2D. A rigid target structure consists of multiple related target patches and is represented by an attributed graph pyramid. The bottom level consists of multiple vertices representing local, discriminative patches extracted from the region covering the target, and of edges encoding their spatial relationships. In the top level there is only one vertex, representing the whole target as one patch, which is connected through vertical edges to all vertices in the bottom level. During tracking, the spatial relationships encoded by the edges of this graph model are enforced in a tolerant, spring-like manner. Hence, the representation proposed in this paper is called a hierarchical spring system (HSS). Figure 4.3 shows the levels of an HSS for one target structure.

There are several reasons for the hierarchical nature of the target model. The presented approach aims to be applicable for tracking arbitrary targets. Therefore, it is not possible to decide in general whether it is better to track the target as a single target patch or as a target structure consisting of multiple target patches. This decision typically depends on the difficulty of the tracking task (see Section 2.3). In general, it is computationally less expensive to describe and track a target as a single target patch. Describing and tracking a target by only one patch is suitable for targets having a uniform appearance (e.g. one color), where no discriminative local features (patches) can be reliably extracted. If the target moves with high speed (speed > 1, see Equation 2.1 in Section 2.3.8), it is easier to follow a target patch covering the whole area of the target than smaller local target patches.
On the other hand, a target structure additionally includes structural information about the composition of the target, which can be useful in cases of occlusions or distractors. For example, if the degree of occlusion is > 50 % and the target is tracked as a single target patch, it is difficult to estimate the state of the target. If the visible area of a target shrinks by 50 %, this can be the result of an occlusion or of the motion of the target along the z-axis (away from the camera). By tracking a target structure, it is possible to evaluate the properties of all target patches and their spatial relationships. During an occlusion, the size of the target patches as well as their spatial distances do not change. Therefore, the state of the hidden target patches can be reconstructed from the visible ones. The idea behind the HSS is to incorporate both possibilities (target patch and target structure), to combine their strengths and to overcome their weaknesses.

The temporal correspondences are found by a novel, iterative algorithm based on the Mean Shift procedure. Each vertex in the HSS is assigned to a tracker finding its temporal correspondences by combining hypotheses based on appearance and on structure. The structural hypotheses are extracted from the edges of the HSS, which act like springs pushing and pulling the vertices to reduce the deformation of the structure of the graph. Tracking is done in a top-down or bottom-up manner depending on the situation. If the top vertex (one target patch for the whole target) can be tracked reliably, the positions of the vertices in the bottom level (small target patches) are derived from the correspondence (current position) of the top vertex. In ambiguous situations (e.g. during occlusions), tracking is done bottom-up: first the positions of the vertices in the bottom level are determined and then the position of the top vertex is calculated from them. Switching between these two types of processing increases efficiency, as tracking one target patch is computationally less expensive than tracking multiple target patches.

Articulated targets are represented by multiple HSS connected via points of articulation. For example, a human can be represented by ten HSS, one for each body part (head, torso, lower and upper arms, lower and upper legs), connected by nine points of articulation (head-torso, torso-upper arm, etc.). A point of articulation is not related to a visible feature. Its task is to transfer position information between connected rigid parts, following the principle that rigid parts always keep the same distance to their points of articulation.
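To make the spring analogy concrete, the following sketch computes, for every vertex of the bottom level, the structural offset induced by its incident edges. The stiffness parameter and the equal splitting of the correction are illustrative assumptions, not the exact update rule of Paper C.

```python
import numpy as np

def spring_offsets(positions, edges, rest_lengths, stiffness=0.5):
    """Structural cue of the HSS: every edge acts like a spring pushing or
    pulling its endpoints so that its current length approaches the rest
    length stored in the model. Returns one offset vector per vertex."""
    offsets = np.zeros_like(positions)
    for (i, j), rest in zip(edges, rest_lengths):
        d = positions[j] - positions[i]
        length = np.linalg.norm(d)
        if length < 1e-9:
            continue
        # Positive correction pulls the endpoints together (edge too long),
        # negative correction pushes them apart (edge too short)
        correction = stiffness * (length - rest) * (d / length)
        offsets[i] += 0.5 * correction
        offsets[j] -= 0.5 * correction
    return offsets
```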
4.3.1.2 Discussion

This section discusses the most important results in greater detail than in Paper C. For Paper C, five experiments were conducted. The following discussion covers experiments 1, 3 and 4. For the convenience of the reader, some of the corresponding figures of [12] are displayed in this section.

The proposed approach has been evaluated with publicly available videos [45, 4] and videos recorded by myself. For all videos, the ground truth for a rigid target was its center position (the positions of the individual bottom vertices could not be evaluated). Tracking with Mean Shift was the baseline approach. The proposed approach was evaluated for two different HSS: (i) a planar (triangulated) graph in the bottom level and (ii) a fully connected graph in the bottom level (see Figure 4.4). For all experiments, the size of the small target patches in the bottom level was 13 × 13 pixels. The size of the target patch of the top level resulted from the size of the rigid targets (e.g. the bounding rectangle of the head, upper arm, lower arm, etc.). Each discussed experiment is rated based on its level of difficulty according to Table 2.2 in Section 2.3.8.

Figure 4.4: Evaluation of tracking with Mean Shift (a) against the proposed approach with an HSS with a planar graph in the bottom level (b) and a fully connected graph in the bottom level (c). Red boxes visualize the area of the small target patches of the bottom level. Reprinted from [12] with permission from Elsevier.

Experiment 1: The task in this experiment was to track a face under heavy occlusions. Figure 4.5 visualizes the results and the degree of occlusion over time. Information about the video: 352 × 288 pixels, 899 frames, 16 target patches in the bottom level and level of difficulty = (1, 3, 1, 1, 1, 2, 2.5).

The outcome of experiment 1 is that, considering the total error over the whole video, the best result could be achieved by an HSS with a fully connected bottom level graph. When a target object undergoes challenging occlusions (medium to high difficulty in Table 2.2), there are several vertices in the bottom level which do not have a direct neighboring vertex that is visible. Thus, if the bottom level is a planar graph, there is no direct influence from vertices farther away and the propagation of the necessary position information becomes problematic (it takes too many iterations). Therefore, the HSS with a fully connected bottom level is superior.

The proposed approach (independent of the bottom level layout) finds temporal correspondences by combining hypotheses based on appearance and on structure. This combination is controlled by a gain that is dynamically calculated from the properties of the target (e.g. the similarity of the current appearance to the target model). There are cases (e.g. the occluder looks similar to the target) where the determined gain weights the influence of appearance and structure in a way that hinders tracking (e.g. transfer of false position information within the HSS). This is why there are frames where the result with the HSS (planar and fully connected) is worse than with the baseline approach. The advantage of the baseline approach in such a case is that the targets are tracked independently and tracking errors are not distributed among them. Nevertheless, Figure 4.5 shows that the proposed approach is able to recover from these difficulties.

Figure 4.5: Experiment 1: Deviation from ground truth (sum over absolute differences to ground truth). (full) using an HSS with a fully connected graph, (planar) using an HSS with a triangulated graph, (without) using only tracking with Mean Shift. The color bar at the bottom encodes the degree of occlusion. Yellow: up to 45 %; orange: up to 50 % and longer than yellow; red: up to 62 %. Reprinted from [12] with permission from Elsevier.

Experiment 3: The task of this experiment was to track an articulated object through scaling (motion along the z-axis, towards or away from the camera). Figure 4.6 shows the results and the degree of scaling over time. Information about the video: 640 × 480 pixels, 621 frames, 60 target patches in the bottom level and level of difficulty = (1, 3, 1, 2, 2, 2, 0).
The main conclusion from this experiment is that the proposed approach (both planar and fully connected) results in lower error rates than the baseline approach. In comparison to experiment 1, the proposed approach is not outperformed by the baseline approach in certain frames. Another interesting observation is that for this experiment there is no big difference between a fully connected and a planar bottom level graph in the HSS. This can be explained by the fact that the scaling factor of the target structure is estimated and distributed globally, whereas under occlusion the position information is distributed locally.

Figure 4.6: Experiment 3: Deviation from ground truth. The position error in pixels is a sum over the absolute differences to ground truth over all rigid parts of the articulated target. The bar at the bottom visualizes the scale change of the target: from 100 % in frame 1 to 130 % in frame 180 and down to 63 % in frame 520. Reprinted from [12] with permission from Elsevier.

The peaks in the graphs showing the position error in Figure 4.6 result from the tolerance of the HSS. Minor changes in the spatial configuration of the target patches are compensated by the spring-like behavior of the HSS. Therefore, the change in scale of the target needs to be considerable before the HSS reacts and adapts itself.

Experiment 4: This experiment studies the behavior of the HSS under fast motion (speed > 1, see Equation 2.1 in Section 2.3.8). Figure 4.7 shows a selection of interesting frames. Information about the video: 640 × 480 pixels, 216 frames, 18 target patches in the bottom level and level of difficulty = (1, 3, 1, 2, 2, 2, 0).

From Figure 4.7 one can make two relevant observations. In frame 155, tracking with Mean Shift fails to correctly associate two vertices due to a distractor (the color distribution of the face is similar to that of the hand). The proposed approach suffers from motion blur, but the spatial arrangement of the target structure stays intact with the help of the structural information and the position information of the upper arm. Hence, recovery is easier under such conditions. In frame 170, the proposed approach successfully recovers tracking the lower arm, but the baseline approach is still not able to deal with the distractor problem.

Figure 4.7: Experiment 4: Tracking an articulated object through motion blur. Frames 1, 68, 155 and 170 are shown. (top) Tracking with Mean Shift and (bottom) my approach with an HSS and fully connected graphs. Reprinted from [12] with permission from Elsevier.

All in all, the proposed approach outperforms the baseline approach in cases of occlusion, scaling and fast motion. An HSS with a fully connected bottom level graph delivers equal or superior results in comparison to a planar bottom level graph. An advantage of the proposed combined iterative tracking is that the computational complexity is not influenced by the connectivity of the graph. Hence, using a fully connected bottom level graph does not slow down the tracking. On the contrary, it can speed up the convergence of the combined iterative mode seeking algorithm [66], as the propagation of information is faster (due to the pairwise connection of all vertices).

4.3.2 Paper D

Nicole M. Artner and Walter G. Kropatsch. Structural cues in 2D tracking: Edge lengths vs. barycentric coordinates. In 18th Iberoamerican Congress on Pattern Recognition, Lecture Notes in Computer Science, page in print. Springer, November 2013. (Published by Springer: http://www.ciarp.org/xviii/index.php/proceedings)
4.3.2.1 Summary

This paper is a continuation of the work in Paper C. The aim of this paper is to analyze the potential of structural information in tracking. In the proposed combined iterative algorithm, temporal correspondences for target structures are found based on an appearance cue and a structural cue. The appearance cue is delivered by the Mean Shift algorithm and is based on the similarity to the appearance stored in the model. Structural cues are directly deduced from the current spatial configuration of the target patches in comparison to the initial configuration stored in the model. In Paper C, the spatial relationships of the vertices in the bottom level graph are enforced by the spring-like behavior of the edges; thus, the structural cue used for the HSS is the edge lengths. This structural cue is called the edge cue in Paper D. The experimental evaluation (see Section 4.3.1.2) shows that the information propagation in triangulated, planar graphs is not efficient enough in challenging situations (e.g. during occlusions). Therefore, one contribution of this paper is a novel structural cue for triangulated bottom level graphs based on barycentric coordinates. This structural cue is called the triangle cue in Paper D.

Barycentric coordinates were introduced by August Ferdinand Möbius in 1827. They are frequently used in computer graphics [76, 51], but also in mathematics [101] and computer vision [29, 86]. Barycentric coordinates describe the position of each vertex in a triangulated graph based on the positions of the three corners (vertices) of any triangle in the graph. The position $p$ of a vertex $v$ can be calculated as a linear combination of the three vertices $\{v_1, v_2, v_3\}$ of a triangle as follows:

$$p(v) = (x, y, 1) = (\beta_1, \beta_2, \beta_3) \begin{pmatrix} p(v_1)^T & 1 \\ p(v_2)^T & 1 \\ p(v_3)^T & 1 \end{pmatrix},$$

where $\beta_1 + \beta_2 + \beta_3 = 1$ are the so-called coefficients.

In addition to the novel structural cue, this paper systematically analyzes the performance of the proposed combined iterative tracking approach with the help of synthetic videos. These synthetic videos allow the comparison of the results against exact ground truth data. A drawback of the experimental evaluation of Paper C is the coarse ground truth data, which does not allow determining the error in each vertex, but only the deviation from the center of mass of a rigid target structure (e.g. head and torso). Furthermore, the ground truth data of the videos recorded by myself is manually annotated, which already introduces a certain error. The focus of the evaluation in this paper lies on different DOF of motion, distractors, occlusions and noise (see Section 2.3). Furthermore, this paper is the first attempt to study the influence of different parameters of the algorithm on the results.
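A minimal sketch of the triangle cue follows: the coefficients are computed once from the initial configuration and later used to predict a vertex position from any visible triangle. The helper names are hypothetical.

```python
import numpy as np

def barycentric_coefficients(p, tri):
    """Coefficients (b1, b2, b3) of point p with respect to triangle tri
    (3x2 array of corner positions), such that p = b1*v1 + b2*v2 + b3*v3
    and b1 + b2 + b3 = 1."""
    A = np.vstack([tri.T, np.ones(3)])   # rows: x-coords, y-coords, ones
    return np.linalg.solve(A, np.array([p[0], p[1], 1.0]))

def predict_position(coeffs, tri):
    """Triangle cue: reconstruct a vertex position from the current corner
    positions of any (visible) triangle and the stored coefficients."""
    return coeffs @ tri
```

Because the coefficients are defined with respect to any face of the graph, a hidden vertex can be predicted directly from a distant visible triangle, which is exactly the property the edge cue lacks.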
4.3.2.2 Discussion

This section discusses the most important results of Paper D. For the convenience of the reader, some of the corresponding figures of [14] are displayed in this section.

Both structural cues and the baseline approach were evaluated with 36 synthetic videos with a size of 400 × 600 pixels and a length of 10, 30 or 37 frames (2 DOF: 10 frames, 3 DOF: 37 frames, 4 DOF: 30 frames). The target structure in all videos consists of nine target patches and their spatial relationships, which are represented by a graph model. Please note that the proposed approach is not limited to nine vertices. The size of the target patches is 11 × 11 pixels. Each video can be rated based on its level of difficulty according to Table 2.2 in Section 2.3.8. Three different test sets were created for each of two different target structures:

Test set 1: three videos with level of difficulty = (1, 2, 1, 1, 1 to 2, 2, 0), where 1 to 2 for motion results from increasing the DOF from 2 to 4;
Test set 2: nine videos with level of difficulty = (1, 2, 1, 1, 1 to 2, 2, 1 to 2), where 1 to 2 for occlusion results from different degrees of occlusion from 11 % to 66 %;
Test set 3: six videos with level of difficulty = (2, 2, 1, 1, 1 to 2, 2, 0), where 2 for input results from induced noise (Gaussian white noise and 10 % salt-and-pepper noise).

This results in 36 videos, which were evaluated with different parameter sets. I used three different choices (0–2) for the weight (gain) mixing the appearance cue with the structural cue and three different orderings (0–2) of the vertices in the combined, iterative algorithm. All in all, this results in nine different parameter combinations {00, 01, 02, 10, 11, 12, 20, 21, 22} and 324 (36 · 9) test cases for each cue.

Figures 4.9 and 4.10 show the results of the edge cue (edge lengths) and the triangle cue (barycentric coordinates) in comparison to the baseline approach. The difference between Figures 4.9 and 4.10 lies in the target structure. For the results in Figure 4.9, a target structure with equally distributed target patches is used, which results in a graph where each face has the same geometry and the faces are equal-sided. Figure 4.10 is based on a target structure where the faces differ in their geometry and no face is equal-sided. The curves in both figures visualize the mean error in a vertex at each frame. This error is calculated as the Euclidean distance from the ground truth position, averaged over all vertices in a graph and all video sequences in the corresponding test set.

The most important observation is that in all test cases the best result of the triangle cue outperforms the best result of the edge cue (and the baseline approach). By choosing the best parameter set for the triangle cue, it is possible to achieve a total error per vertex (averaged over all 324 test cases) of only ≈ 1.06 pixels. In contrast to that, the best parameter set for the edge cue results in a total error per vertex of ≈ 6.15 pixels.

Figure 4.8: Ambiguity of the structural cue based on edge lengths. (a) Vertex degree 1: all positions on the circle are minima. (b) Vertex degree 2: two minima. (c) Vertex degree 3: one unique minimum.

Looking at all test cases in Figures 4.9 and 4.10, the best parameter set for the triangle cue (barycentric coordinates) is 20 and the worst is 00. The edge cue (edge lengths) performs best with the set 00 and worst with 10. The edge cue is inferior to the triangle cue because the reliability of the edge cue highly depends on the layout of the graph. As can be seen in Figure 4.8, the edge cue is ambiguous for vertices with a degree smaller than three. As edges are one-dimensional entities, they are only capable of providing distance information. Therefore, if a triangle flips, this may not be noticed. This problem can be avoided with the triangle cue by checking the signs of the coefficients of the barycentric coordinates: if a triangle flips, the signs of the coefficients change.
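A sketch of such a flip check is given below, using the sign of the signed triangle area as a plain geometric equivalent of the coefficient-sign test; the function is illustrative.

```python
import numpy as np

def triangle_flipped(tri_t0, tri_t1):
    """Detect a flipped triangle by comparing the sign of its signed area
    at two time instances (tri_*: 3x2 arrays of corner positions). A sign
    change of the area corresponds to a sign change of the barycentric
    coefficients of points expressed in this triangle."""
    def signed_area(t):
        return 0.5 * np.cross(t[1] - t[0], t[2] - t[0])
    return np.sign(signed_area(tri_t0)) != np.sign(signed_area(tri_t1))
```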
Another weakness of the edge cue was already stated in Paper C: the information propagation in a triangulation can become problematic (e.g. in cases of occlusion). The edge cue is determined from the direct neighbors of a vertex (paths of length one). Hence, it is a local cue with no direct influence from vertices farther away. The triangle cue of a vertex can be determined from any other triangle (face) in the graph. Therefore, it is possible to propagate position information within one iteration from a visible triangle to a hidden triangle, independent of their distance (path length) in the graph.

Figure 4.9: Results with a regular-sized triangulation (i.e. the triangles are equal-sided), for test sets 1, 2 and 3. Left: edge cue; right: triangle cue. Vertical axis: error; horizontal axis: frame. The curves correspond to the parameter sets 00–22 and to MS = Mean Shift (baseline approach). Reprinted from [14] with kind permission from Springer Science and Business Media.

Figure 4.10: Results with an irregular-sized triangulation (i.e. the triangles are unequal-sided), for test sets 1, 2 and 3. Left: edge cue; right: triangle cue. Vertical axis: error; horizontal axis: frame. The curves correspond to the parameter sets 00–22 and to MS = Mean Shift (baseline approach). Reprinted from [14] with kind permission from Springer Science and Business Media.

CHAPTER 5
Concluding Remarks

In this concluding chapter, the original contributions of this thesis are summarized and possible future work is presented.

5.1 Contributions

The main contributions of this thesis are within the field of tracking, more specifically in the field of tracking related multiple targets.

One core contribution is a novel approach for the initialization of target models, which is presented in Papers A and B. The approach is a hierarchical grouping framework based on irregular dual graph pyramids, which solves the initialization task with an idea inspired by cognitive psychology [58]. In a nutshell, the idea is: “things that move together belong together”. Even though there is related work which also solves the task by grouping pixels or features based on their motion [98], the approach in this thesis additionally considers spatial proximity and relationships. The proposed approach is fully automatic and does not require any prior knowledge about the scene or user interaction. Furthermore, the proposed hierarchical grouping framework allows the usage of different grouping criteria depending on the motion in the video (i.e. motion in the image plane or motion out of the image plane).
To the best of my knowledge, there is no other work on irregular pyramids that initializes target models based on the motion of tracked target points. The local-to-global grouping in the irregular pyramid enables the approach to deal with non-rigid motion and tracking errors up to a certain degree (i.e. as long as the motion which results from non-rigid behavior or from tracking errors is insignificant in comparison to the global motion, the grouping will be successful). Besides the identification of the rigid entities in the scene, the approach proposed in Paper B additionally finds the points of articulation connecting the rigid entities of articulated targets, thus providing a complete model for articulated targets. Furthermore, the output provides more information than related works, as it is a hierarchical description, which can be useful for coarse-to-fine tracking approaches.

Another core contribution is a novel, combined iterative tracking approach for target structures of arbitrary rigid or articulated targets. These target structures consist of related multiple target patches, which are represented by a hierarchical (Paper C) or planar graph model (Paper D). The correspondences for these target structures are established by the proposed tracking approach, which combines appearance and structural cues to iteratively search for the locally optimal solution. Any standard appearance-based tracker (in this thesis Mean Shift) can be employed to deliver the appearance cue (i.e. a correspondence for a target patch based on the similarity to the appearance information stored in the model). The structural cue is deduced from the graph model. In Paper C, the structural cue is determined from the spatial relationships and dependencies encoded in the edges of the graph model. This structural cue simulates the behavior of springs, pushing and pulling the target patches to reduce structural deformations (this is also the reason why the representation is called a hierarchical spring system). Paper D proposes a novel, additional structural cue based on barycentric coordinates derived from the faces (triangles) of the graph model. For articulated targets, each rigid entity (e.g. body part) is modeled by a separate graph model, and points of articulation connect these graph models to transfer position information between them. The rigid entities are allowed to move independently of each other while keeping a fixed distance to their point(s) of articulation. This novel tracking approach enables a simple appearance-based tracker like Mean Shift to solve challenging tracking tasks by incorporating a graph model and structural cues in a combined iterative tracking process.

All in all, the main goal of this thesis, which is to study the potential of graph-based representations and methods in tracking, could be achieved.

5.2 Future Work

The proposed approach for the initialization of target models is based on analyzing the motion encoded in trajectories. At the moment, this approach is only able to process complete trajectories. A complete trajectory is the result of tracking a target point from the first until the last frame. In the future, I plan to also incorporate incomplete trajectories, which frequently appear in videos with targets moving out of the image plane. For example, if a target rotates around its major axis, some target points will disappear and new target points will appear (i.e. incomplete trajectories).
5.2 Future Work

The proposed approach for the initialization of target models is based on analyzing the motion encoded in trajectories. At the moment, this approach is only able to process complete trajectories. A complete trajectory is the result of tracking a target point from the first until the last frame. In the future, I plan to also incorporate incomplete trajectories, which frequently appear in videos with targets moving out of the image plane. For example, if a target rotates around its major axis, some target points will disappear and new target points will appear (i.e. incomplete trajectories). The approach presented in Paper A is able to deal with motion out of the image plane, but it is only able to process a reduced set of trajectories (only the complete ones). Thus, the resulting target model contains less information. By incorporating incomplete trajectories, the output of my approach will improve.

Paper B presents an approach to identify the points of articulation connecting the rigid entities of articulated targets. At the moment, this approach is limited to motion in the image plane. Therefore, a logical task for future work is to extend this approach to motion out of the image plane. Combined with the previously mentioned future work, the resulting approach will be able to initialize target models consisting of rigid entities and their points of articulation based on arbitrary trajectories in 2D and 3D.

The combined iterative tracking approach offers many possibilities for future work. Currently (in Papers C and D), the tracking is limited to motion with four DOF (degrees of freedom): translation along the x-, y- and z-axis (e.g. the target moves closer to the camera or farther away) and rotation around the z-axis. In the future, I plan to extend the tracking to six DOF. This extension requires structural cues in 3D. The simple structural cue based on distances in Paper C will be replaced by the structural cue based on barycentric coordinates, as they are defined for n dimensions (see the sketch below). Furthermore, the results in Paper D showed that this novel structural cue is superior. As long as the target moves with only four DOF, it is sufficient to update the appearance information in the target model. When the target moves with six DOF, it becomes necessary to also update the target patches and their relationships. New target patches will appear and should be included in the target model, which requires an update of the structural information. Existing target patches may become invisible if the target rotates around its major axis. In such cases it will be helpful to pay special attention to the “borders” of the target, because this is where changes will take place.
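To indicate why barycentric coordinates transfer directly to higher dimensions, the sketch below computes the barycentric coordinates of a point with respect to a simplex by solving one small linear system, and then uses the stored coordinates to predict the point’s position after the simplex has moved. The function names barycentric and predict are illustrative and not taken from Paper D; in 2D the simplex is a triangle, in 3D a tetrahedron, and the same code covers both.

    import numpy as np

    def barycentric(p, simplex):
        """Barycentric coordinates of p w.r.t. a simplex of shape (n+1, n):
        solve sum_i w_i * v_i = p subject to sum_i w_i = 1."""
        n = simplex.shape[1]
        A = np.vstack([simplex.T, np.ones(n + 1)])  # (n+1) x (n+1) system
        b = np.append(p, 1.0)
        return np.linalg.solve(A, b)

    def predict(weights, simplex):
        """Structural position proposal: reconstruct the point from its
        stored weights and the current vertex positions."""
        return weights @ simplex

    tri_t0 = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])  # triangle in frame t
    w = barycentric(np.array([1.0, 1.0]), tri_t0)            # stored in the model
    tri_t1 = tri_t0 + np.array([5.0, 2.0])                   # triangle in frame t+1
    print(predict(w, tri_t1))                                # -> [6. 3.]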
Bibliography

[1] TWIST-CV project: http://www.prip.tuwien.ac.at/twist/.
[2] Biomotion Lab: http://www.biomotionlab.ca/.
[3] A. Acharya and S. Meher. Robust video denoising for better subjective evaluation. In 2011 International Conference on Image Information Processing, pages 1–5. IEEE, 2011.
[4] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. Robust fragments-based tracking using the integral histogram. In IEEE Conference on Computer Vision and Pattern Recognition, 2006, pages 798–805. IEEE, 2006.
[5] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Trajectory space: A dual representation for nonrigid structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7):1442–1456, 2011.
[6] Nicole M. Artner, Csaba Beleznai, and Walter G. Kropatsch. Tracking using a hierarchical structural representation. In 33rd Workshop of the Austrian Association for Pattern Recognition, volume 254, pages 249–260. Austrian Computer Society, May 2009.
[7] Nicole M. Artner, Adrian Ion, and Walter Kropatsch. Tracking articulated objects using structure. In Computer Vision Winter Workshop 2009, pages 51–58. Pattern Recognition and Image Processing Group, TU Wien, February 2009.
[8] Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Coarse-to-fine tracking of articulated objects using a hierarchical spring system. In 13th International Conference on Computer Analysis of Images and Patterns, volume 5702 of Lecture Notes in Computer Science, pages 1011–1018, Münster, Germany, September 2009. Springer.
[9] Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Rigid part decomposition in a graph pyramid. In 14th Iberoamerican Congress on Pattern Recognition, volume 5856 of Lecture Notes in Computer Science, pages 758–765, Mexico, November 2009. Springer.
[10] Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Tracking objects beyond rigid motion. In 7th IAPR-TC-15 International Workshop on Graph Based Representations in Pattern Recognition, volume 5534 of Lecture Notes in Computer Science, pages 82–91, Venice, Italy, May 2009. Springer.
[11] Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Hierarchical spatio-temporal extraction of models for moving rigid parts. Pattern Recognition Letters, 32(16):800–810, December 2011.
[12] Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Multi-scale 2d tracking of articulated objects using hierarchical spring systems. Pattern Recognition, 44(4):800–810, April 2011.
[13] Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Spatio-temporal extraction of articulated models in a graph pyramid. In 8th IAPR-TC-15 International Workshop on Graph-Based Representations in Pattern Recognition, volume 6658 of Lecture Notes in Computer Science, pages 215–224, Münster, Germany, May 2011. Springer.
[14] Nicole M. Artner and Walter G. Kropatsch. Structural cues in 2d tracking: Edge lengths vs. barycentric coordinates. In 18th Iberoamerican Congress on Pattern Recognition, Lecture Notes in Computer Science, page in print. Springer, November 2013.
[15] Nicole M. Artner, S. B. López Mármol, C. Beleznai, and W. G. Kropatsch. Kernel-based tracking using spatial structure (received best paper award). In Challenges in the Biosciences: Image Analysis and Pattern Recognition Aspects. Proceedings of 32nd OEAGM Workshop, volume 232, pages 103–114. Austrian Computer Society, May 2008.
[16] Nicole M. Artner, Salvador B. López Mármol, Csaba Beleznai, and Walter G. Kropatsch. Tracking by hierarchical representation of target structure. In Joint IAPR International Workshop on Structural and Syntactic Pattern Recognition, volume 5342 of Lecture Notes in Computer Science, pages 441–450. Springer, December 2008.
[17] M.S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
[18] H. Belghit, N. Zenati-Henda, A. Bellabi, S. Benbelkacem, and M. Belhocine. Tracking color marker using projective transformation for augmented reality application. In International Conference on Multimedia Computing and Systems, pages 372–377. IEEE, 2012.
[19] A. G. Bharatkumar, K. E. Daigle, M. G. Pandy, Q. Cai, and J. K. Aggarwal. Lower limb kinematics of human walking with the medial axis transformation. In Workshop on Motion of Non-Rigid and Articulated Objects, pages 70–76, Austin, Texas, USA, 1994. IEEE.
[20] S. Birchfield. KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker. http://www.ces.clemson.edu/~stb/klt/, October 2013.
[21] M. Bister, Jan Cornelis, and Azriel Rosenfeld. A critical view of pyramid segmentation algorithms. Pattern Recognition Letters, 11(9):605–617, 1990.
[22] G.R. Bradski. Real time face and object tracking as a component of a perceptual user interface. In 4th IEEE Workshop on Applications of Computer Vision, pages 214–219, 1998.
[23] A.T.S. Chan, H.V. Leong, and Shui Hong Kong. Real-time tracking of hand gestures for interactive game design. In International Symposium on Industrial Electronics, pages 98–103. IEEE, 2009.
[24] Duan-Yu Chen, Sheng-Wen Shih, and Hong-Yuan Mark Liao. Human action recognition using 2-d spatio-temporal templates. In International Conference on Multimedia and Expo, pages 667–670, Beijing, China, 2007. IEEE.
[25] Dorin Comaniciu and Peter Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[26] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564–575, 2003.
[27] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(3):265–298, 2004.
[28] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005, volume 1, pages 886–893. IEEE, 2005.
[29] F. Dornaika and F. Davoine. On appearance based face and facial action tracking. IEEE Transactions on Circuits and Systems for Video Technology, 16(9):1107–1124, 2006.
[30] Ahmed Elgammal. Background subtraction: Theory and practice. Augmented Vision and Reality, pages 1–21. Springer Berlin Heidelberg, 2013.
[31] Baojie Fan, Yingkui Du, Yang Cong, and Yandong Tang. Active drift correction template tracking algorithm. In 19th International Conference on Image Processing (ICIP), pages 397–400, 2012.
[32] Pedro F. Felzenszwalb. Pictorial structures for object recognition. International Journal of Computer Vision, 61:55–79, 2005.
[33] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[34] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[35] Robert Fisher, Ken Dawson-Howe, Andrew Fitzgibbon, Craig Robertson, and Emanuele Trucco. Dictionary of Computer Vision and Image Processing. Wiley, 2005.
[36] K. Fragkiadaki, Geng Zhang, and Jianbo Shi. Video segmentation by tracing discontinuities in a trajectory embedding. In IEEE Conference on Computer Vision and Pattern Recognition, 2012, pages 1846–1853. IEEE, 2012.
[37] M. Frey, P. Giovanoli, H. Gerber, M. Slameczka, and E. Stussi. Three-dimensional video analysis of facial movements: A new method to assess the quantity and quality of the smile. Plastic and Reconstructive Surgery, 104(7):2032–2039, 1999.
[38] Yun Fu and T.S. Huang. hMouse: Head tracking driven virtual computer mouse. In IEEE Workshop on Applications of Computer Vision, 2007, pages 30–30. IEEE, 2007.
[39] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.
[40] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H. Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2009, pages 1746–1753. IEEE, 2009.
[41] J. Gemmell, K. Toyama, C.L. Zitnick, T. Kang, and S. Seitz. Gaze awareness for videoconferencing: a software approach. IEEE MultiMedia, 7(4):26–35, 2000.
[42] A.B. Godbehere, A. Matsukawa, and K. Goldberg. Visual tracking of human visitors under variable-lighting conditions for a responsive audio art installation. In American Control Conference (ACC), 2012, pages 4305–4312, 2012.
[43] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253, 2007.
[44] V. Grabe, H.H. Bulthoff, and P.R. Giordano. Robust optical-flow based self-motion estimation for a quadrotor UAV. In International Conference on Intelligent Robots and Systems, pages 2153–2159. IEEE, 2012.
[45] Ralph Gross and Jianbo Shi. The CMU motion of body (MoBo) database. Technical Report CMU-RI-TR-01-18, Robotics Institute, Pittsburgh, PA, June 2001.
[46] Frank Harary. Graph Theory. Addison-Wesley, 1969.
[47] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[48] Y. Haxhimusa, A. Ion, and W.G. Kropatsch. Evaluating hierarchical graph-based segmentation. In 18th International Conference on Pattern Recognition, volume 2, pages 195–198. IEEE, 2006.
[49] Y. Haxhimusa and W. G. Kropatsch. Segmentation graph hierarchies. In International Workshops on Structural, Syntactic, and Statistical Pattern Recognition S+SSPR, volume 3138 of Lecture Notes in Computer Science, pages 343–351, 2004.
[50] H. Himberg, Y. Motai, and A. Bradley. A multiple model approach to track head orientation with delta quaternions. IEEE Transactions on Cybernetics, 43(1):90–101, 2013.
[51] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Mesh optimization. In Proceedings of the 20th annual conference on Computer graphics and interactive techniques, SIGGRAPH, pages 19–26. ACM, 1993.
[52] Yanlong Huang, De Xu, Min Tan, and Hu Su. Trajectory prediction of spinning ball for ping-pong player robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3434–3439, 2011.
[53] D.E. Ilea, C. Duffy, L. Kavanagh, A. Stanton, and P.F. Whelan. Fully automated segmentation and tracking of the intima media thickness in ultrasound video sequences of the common carotid artery. IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control, 60(1), 2013.
[54] Adrian Ion, Hubert Hausegger, Walter Kropatsch, and Yll Haxhimusa. How humans describe short videos. In Proceedings of the Second International Cognitive Vision Workshop, 2006.
[55] Adrian Ion, Yll Haxhimusa, and Walter G. Kropatsch. A graph-based concept for spatiotemporal information in cognitive vision. In Luc Brun and Mario Vento, editors, Graph-Based Representations in Pattern Recognition, volume 3434 of Lecture Notes in Computer Science, pages 223–232. Springer Berlin Heidelberg, 2005.
[56] S. Iwasawa, K. Ebihara, J. Ohya, and S. Morishima. Real-time estimation of human body posture from monocular thermal images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pages 15–20. IEEE, 1997.
[57] Jinwei Jiang and A. Yilmaz. Good features to track: A view geometric approach. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 72–79. IEEE, 2011.
[58] Gunnar Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14(2):201–211, 1973.
[59] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 2756–2759, 2010.
[60] Walter G. Kropatsch. Building irregular pyramids by dual-graph contraction. IEE Proceedings - Vision, Image and Signal Processing, 142(6):366–374, December 1995.
[61] Walter G. Kropatsch, Yll Haxhimusa, and Adrian Ion. Applied Graph Theory in Computer Vision and Pattern Recognition, volume 52 of Studies in Computational Intelligence, chapter Multiresolution Image Segmentations in Graph Pyramids, pages 3–42. Springer, 2007.
[62] Walter G. Kropatsch, Yll Haxhimusa, and Adrian Ion. Multiresolution image segmentations in graph pyramids. In Abraham Kandel, Horst Bunke, and Mark Last, editors, Applied Graph Theory in Computer Vision and Pattern Recognition, volume 52 of Studies in Computational Intelligence, pages 3–41. Springer Berlin Heidelberg, 2007.
[63] Ido Leichter, Michael Lindenbaum, and Ehud Rivlin. Mean shift tracking with multiple reference color histograms. Computer Vision and Image Understanding, 114(3):400–408, 2010.
[64] Olivier Lézoray and Leo Grady. Image Processing and Analysis with Graphs: Theory and Practice. Digital Imaging and Computer Vision Series. CRC Press, Taylor and Francis Group, 2012.
[65] Rui Li, Ming-Hsuan Yang, Stan Sclaroff, and Tai-Peng Tian. Monocular tracking of 3d human motion with a coordinated mixture of factor analyzers. In Aleš Leonardis, Horst Bischof, and Axel Pinz, editors, 9th European Conference on Computer Vision, volume 3952 of Lecture Notes in Computer Science, pages 137–150. Springer Berlin Heidelberg, 2006.
[66] Xiangru Li, Zhanyi Hu, and Fuchao Wu. A note on the convergence of the mean shift. Pattern Recognition, 40(6):1756–1762, 2007.
[67] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, pages 674–679. IEEE, 1981.
[68] Mabel Iglesias-Ham, Edel Bartolo García-Reyes, Walter G. Kropatsch, and Nicole Maria Artner. Convex deficiencies for human action recognition. Journal of Intelligent & Robotic Systems, 64(3):353–364, January 2011.
[69] E. Maggio and A. Cavallaro. Multi-part target representation for color tracking. In IEEE International Conference on Image Processing, 2005, volume 1, pages I–729–32. IEEE, 2005.
[70] Emilio Maggio and Andrea Cavallaro. Video Tracking: Theory and Practice. Wiley, 2011.
[71] Salvador B. López Mármol, Nicole M. Artner, Mabel Iglesias, Walter G. Kropatsch, Markus Clabian, and Wilhelm Burger. Improving tracking using structure. In Computer Vision Winter Workshop 2008, pages 69–76. Slovenian Pattern Recognition Society, February 2008.
[72] Salvador B. López Mármol, Nicole M. Artner, Adrian Ion, Walter G. Kropatsch, and Csaba Beleznai. Video object segmentation using graphs. In 13th Iberoamerican Congress on Pattern Recognition, volume 5197 of Lecture Notes in Computer Science, pages 733–740. Springer, September 2008.
[73] L. Matthews, T. Ishikawa, and S. Baker. The template update problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):810–815, June 2004.
[74] Denys J.C. Matthies, Ngo Dieu Huong Nguyen, Shaunna Janine Lucas, and Daniel Botz. Moving shapes: a multiplayer game based on color detection running on public displays. In 15th International Conference on Human-Computer Interaction with Mobile Devices and Services, MobileHCI ’13, pages 558–563, New York, NY, USA, 2013. ACM.
[75] Peter Meer. Stochastic image pyramids. Computer Vision, Graphics, and Image Processing, 45(3):269–294, 1989.
[76] Mark Meyer, Alan Barr, Haeyoung Lee, and Mathieu Desbrun. Generalized barycentric coordinates on irregular polygons. Journal of Graphics Tools, 7(1):13–22, 2002.
[77] C. Micheloni, B. Rinner, and G.L. Foresti. Video analysis in pan-tilt-zoom camera networks. IEEE Signal Processing Magazine, 27(5):78–90, 2010.
[78] Jaroslav Nešetřil, Eva Milková, and Helena Nešetřilová. Otakar Borůvka on minimum spanning tree problem: translation of both the 1926 papers, comments, history. Discrete Mathematics, 233(1–3):3–36, 2001.
[79] K.C. Ng. Music via motion: transdomain mapping of motion and sound for interactive performances. Proceedings of the IEEE, 92(4):645–655, 2004.
[80] J. Nikolic, M. Burri, J. Rehder, S. Leutenegger, C. Huerzeler, and R. Siegwart. A UAV system for inspection of industrial facilities. In Aerospace Conference, pages 1–8. IEEE, 2013.
[81] Wee-Hong Ong, T. Koseki, and L. Palafox. Unsupervised human activity detection with skeleton data from RGB-D sensor. In 5th International Conference on Computational Intelligence, Communication Systems and Networks, pages 30–35. IEEE, 2013.
[82] C. Poff, H. Nguyen, T. Kang, and M.C. Shin. Efficient tracking of ants in long video with GPU and interaction. In IEEE Workshop on Applications of Computer Vision, pages 57–62, 2012.
[83] M. Popa, L. Rothkrantz, Zhenke Yang, P. Wiggers, R. Braspenning, and Caifeng Shan. Analysis of shopping behavior based on surveillance system. In International Conference on Systems Man and Cybernetics, pages 2512–2519. IEEE, 2010.
[84] J.A. Rivera-Bautista, A. Marin-Hernandez, and L.F. Marin-Urias. Using color histograms and range data to track trajectories of moving people from a mobile robot platform. In 22nd International Conference on Electrical Communications and Computers (CONIELECOMP), pages 288–293, 2012.
[85] Azriel Rosenfeld. Arc colorings, partial path groups, and parallel graph contractions. Technical Report TR-1524, University of Maryland, Computer Science Center, July 1985.
[86] M. Salzmann, R. Hartley, and P. Fua. Convex optimization for deformable surface 3-d tracking. In 11th International Conference on Computer Vision, pages 1–8, 2007.
[87] O. Schreer, I. Feldmann, N. Atzpadin, P. Eisert, P. Kauff, and H. J. W. Belt. 3d presence - a system concept for multi-user and multi-party immersive 3d videoconferencing. In 5th European Conference on Visual Media Production, pages 1–8, 2008.
[88] Mubarak Shah, Krishnan Rangarajan, and Ping-sing Tsai. Motion trajectories. Pattern Recognition, 23:1138–1149, 1992.
[89] J. Shi and C. Tomasi. Good features to track. In IEEE Conference on Computer Vision and Pattern Recognition, 1994, pages 593–600. IEEE, 1994.
[90] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[91] Christine Solnon, Guillaume Damiand, Colin de la Higuera, and Jean-Christophe Janodet. On the complexity of submap isomorphism. In Walter G. Kropatsch, Nicole M. Artner, Yll Haxhimusa, and Xiaoyi Jiang, editors, Graph-Based Representations in Pattern Recognition, volume 7877 of Lecture Notes in Computer Science, pages 21–30. Springer Berlin Heidelberg, 2013.
[92] Shih-Yu Sun, M. Gilbertson, and B.W. Anthony. 6-dof probe tracking via skin mapping for freehand 3d ultrasound. In 10th International Symposium on Biomedical Imaging, pages 780–783. IEEE, 2013.
[93] A. Tremeau and P. Colantoni. Regions adjacency graph applied to color image segmentation. IEEE Transactions on Image Processing, 9(4):735–744, 2000.
[94] A. Veeraraghavan, R. Chellappa, and M. Srinivasan. Shape-and-behavior encoded tracking of bee dances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):463–476, 2008.
[95] L. Wang, H. Yan, H.-Y. Wu, and C. Pan. Forward-backward mean-shift for visual tracking with local-background-weighted histogram. IEEE Transactions on Intelligent Transportation Systems, 14(3):1480–1489, 2013.
[96] Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249–257, 2006.
[97] Xianliang Wu, J. Housden, N. Varma, YingLiang Ma, D. Rueckert, and K. Rhode. Catheter tracking in 3d echocardiographic sequences based on tracking in 2d x-ray sequences for cardiac catheterization interventions. In 10th International Symposium on Biomedical Imaging, pages 25–28. IEEE, 2013.
[98] Jingyu Yan and Marc Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):865–877, 2008.
[99] A. Yilmaz, L. Xin, and M. Shah. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1531–1536, 2004.
[100] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys, 38(4), 2006.
[101] Paul Yiu. The uses of homogeneous barycentric coordinates in plane euclidean geometry. International Journal of Mathematical Education in Science and Technology, 31(4):569–578, 2000.
[102] Dengsheng Zhang and Guojun Lu. Review of shape representation and description techniques. Pattern Recognition, 37:1–19, 2004.
[103] Quanshi Zhang, Xuan Song, Xiaowei Shao, Ryosuke Shibasaki, and Huijing Zhao. Unsupervised skeleton extraction and motion capture from 3d deformable matching. Neurocomputing, 100(0):170–182, 2013. Special issue: Behaviours in video.
[104] H. Zhao and R. Shibasaki. A real-time system for monitoring pedestrians. In 7th IEEE Workshops on Application of Computer Vision, volume 1, pages 378–385, 2005.
[105] Junda Zhu, Yuanwei Lao, and Y.F. Zheng. Object tracking in structured environments for video surveillance applications. IEEE Transactions on Circuits and Systems for Video Technology, 20(2):223–235, 2010.
[106] W.W. Zou, P.C. Yuen, and R. Chellappa. Low-resolution face tracker robust to illumination variations. IEEE Transactions on Image Processing, 22(5):1726–1739, 2013.

Part II
Selected Publications

PAPER A
Hierarchical Spatio-Temporal Extraction of Models for Moving Rigid Parts
Published in [11] by Elsevier.
Important notice: This is the electronic version of my thesis. It does not include the selected papers in Part II, but only their references.

PAPER B
Spatio-Temporal Extraction of Articulated Models in a Graph Pyramid
Published in [13] by Springer.
Important notice: This is the electronic version of my thesis. It does not include the selected papers in Part II, but only their references.
PAPER C
Multi-scale 2D Tracking of Articulated Objects Using Hierarchical Spring Systems
Published in [12] by Elsevier.
Important notice: This is the electronic version of my thesis. It does not include the selected papers in Part II, but only their references.

PAPER D
Structural Cues in 2D Tracking: Edge Lengths vs. Barycentric Coordinates
Published in [14] by Springer.
Important notice: This is the electronic version of my thesis. It does not include the selected papers in Part II, but only their references.

Part III
Appendix

APPENDIX A
Curriculum Vitae

Personal Data
First name: Nicole
Middle name: Maria
Last name: Artner
Birth date: 23.01.1983
Place of birth: Oberpullendorf, Austria
Nationality: Austria

Education
Since 2007: PhD student, Computer Science, Vienna University of Technology, Austria
2005–2007: Master, Digital Media, University of Applied Sciences Hagenberg, Austria
2002–2005: Bachelor, Media Technology and Design, University of Applied Sciences Hagenberg, Austria
1997–2002: General qualification for university entrance, Business school, Oberpullendorf, Austria

Career History
Since June 2010: Assistant at Vienna University of Technology under Prof. Walter G. Kropatsch
Feb. 2012 – Aug. 2012: Internship with Information and Media Processing Labs. of NEC, Kawasaki, Japan
Jan. 2010 – Dec. 2010: Junior researcher at Austrian Institute of Technology (AIT)
Nov. 2007 – Dec. 2009: Project assistant at Austrian Institute of Technology (AIT), FWF-project: TWIST-CV

Career Related Activities
2013: Editor of Special Issue in Elsevier's journal Pattern Recognition on GbR2013
2013: Co-Chair of 9th IAPR-TC-15 Workshop on Graph-based Representations in Pattern Recognition, Proceedings: LNCS Springer
2013: Member of program committee of 15th International Conference on Computer Analysis of Images and Patterns (CAIP)
2013: Member of program committee of 18th Computer Vision Winter Workshop
2010: Member of program committee of 15th Computer Vision Winter Workshop
2009: Additional reviewer of 13th International Conference on Computer Analysis of Images and Patterns (CAIP)
2009: Local organization of 14th Computer Vision Winter Workshop

Awards
2008: Best Student Paper Award at OAGM 2008 for “Kernel-Based Tracking Using Spatial Structure” [15]

List of Publications
In the following, all publications of the author are grouped by their type and listed in chronological order. In total, this list consists of 25 publications: six journal papers, 14 peer-reviewed conference and workshop papers, one master thesis and four other publications.

Journal papers
1. Chieh-Han John Tzou, Nicole Artner, Walter Kropatsch, and Manfred Frey. Three-dimensional surface-imaging systems. Plastic and Reconstructive Surgery, 131(4):668e–670e, April 2013.
2. Chieh-Han John Tzou, Igor Pona, Eva Placheta, Alina Hold, Maria Michaelidou, Nicole M. Artner, Walter G. Kropatsch, Hans Gerber, and Manfred Frey. Evolution of the 3-dimensional video system for facial motion analysis: Ten years' experiences and recent developments. Annals of Plastic Surgery, 69(2):173–185, August 2012.
3. Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Hierarchical spatio-temporal extraction of models for moving rigid parts. Pattern Recognition Letters, 32(16):800–810, December 2011.
4. Mabel Iglesias-Ham, Edel Bartolo García-Reyes, Walter G. Kropatsch, and Nicole Maria Artner. Convex deficiencies for human action recognition. Journal of Intelligent & Robotic Systems, 64(3):353–364, January 2011.
5. Adrian Ion, Nicole M. Artner, Gabriel Peyré, Walter G. Kropatsch, and Laurent D. Cohen. Matching 2d and 3d articulated shapes using the eccentricity transform. Computer Vision and Image Understanding, 115(6):817–834, June 2011.
6. Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Multi-scale 2d tracking of articulated objects using hierarchical spring systems. Pattern Recognition, 44(4):800–810, April 2011.

International, peer-reviewed conference and workshop papers
7. Nicole M. Artner and Walter G. Kropatsch. Structural cues in 2d tracking: Edge lengths vs. barycentric coordinates. In 18th Iberoamerican Congress on Pattern Recognition, Lecture Notes in Computer Science, page in print. Springer, November 2013.
8. Samuel de Sousa, Nicole M. Artner, and Walter G. Kropatsch. On the evaluation of graph centrality for shape matching. In Graph-Based Representations in Pattern Recognition, 9th IAPR-TC-15 International Workshop, volume 7877 of Lecture Notes in Computer Science, pages 204–213. Springer, May 2013.
9. Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Spatio-temporal extraction of articulated models in a graph pyramid. In 8th IAPR-TC-15 International Workshop on Graph-Based Representations in Pattern Recognition, volume 6658 of Lecture Notes in Computer Science, pages 215–224, Münster, Germany, May 2011. Springer.
10. Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Rigid part decomposition in a graph pyramid. In 14th Iberoamerican Congress on Pattern Recognition, volume 5856 of Lecture Notes in Computer Science, pages 758–765, Mexico, November 2009. Springer.
11. Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Coarse-to-fine tracking of articulated objects using a hierarchical spring system. In 13th International Conference on Computer Analysis of Images and Patterns, volume 5702 of Lecture Notes in Computer Science, pages 1011–1018, Münster, Germany, September 2009. Springer.
12. Nicole M. Artner, Adrian Ion, and Walter G. Kropatsch. Tracking objects beyond rigid motion. In 7th IAPR-TC-15 International Workshop on Graph Based Representations in Pattern Recognition, volume 5534 of Lecture Notes in Computer Science, pages 82–91, Venice, Italy, May 2009. Springer.
13. Nicole M. Artner, Csaba Beleznai, and Walter G. Kropatsch. Tracking using a hierarchical structural representation. In 33rd Workshop of the Austrian Association for Pattern Recognition, volume 254, pages 249–260. Austrian Computer Society, May 2009.
14. Nicole M. Artner, Adrian Ion, and Walter Kropatsch. Tracking articulated objects using structure. In Computer Vision Winter Workshop 2009, pages 51–58. Pattern Recognition and Image Processing Group, TU Wien, February 2009.
15. Nicole M. Artner, Salvador B. López Mármol, Csaba Beleznai, and Walter G. Kropatsch. Tracking by hierarchical representation of target structure. In Joint IAPR International Workshop on Structural and Syntactic Pattern Recognition, volume 5342 of Lecture Notes in Computer Science, pages 441–450. Springer, December 2008.
16. Salvador B. López Mármol, Nicole M. Artner, Adrian Ion, Walter G. Kropatsch, and Csaba Beleznai. Video object segmentation using graphs. In 13th Iberoamerican Congress on Pattern Recognition, volume 5197 of Lecture Notes in Computer Science, pages 733–740. Springer, September 2008.
17. Adrian Ion, Nicole M. Artner, Gabriel Peyré, Salvador B. López Mármol, Walter G. Kropatsch, and Laurent Cohen. 3d shape matching by geodesic eccentricity. In Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE Computer Society, June 2008.
18. Nicole M. Artner, S. B. López Mármol, C. Beleznai, and W. G. Kropatsch. Kernel-based tracking using spatial structure (received best paper award). In Challenges in the Biosciences: Image Analysis and Pattern Recognition Aspects. Proceedings of 32nd OEAGM Workshop, volume 232, pages 103–114. Austrian Computer Society, May 2008.
19. Salvador B. López Mármol, Nicole M. Artner, Mabel Iglesias, Walter G. Kropatsch, Markus Clabian, and Wilhelm Burger. Improving tracking using structure. In Computer Vision Winter Workshop 2008, pages 69–76. Slovenian Pattern Recognition Society, February 2008.
20. Stephan A. Drab and Nicole M. Artner. Motion detection as interaction technique for games & applications on mobile devices. In Pervasive Mobile Interaction Devices (PERMID 2005) Workshop at Pervasive, pages 48–51, 2005.

Master thesis
21. Nicole M. Artner. Analyse und Reimplementierung des Mean-Shift Tracking-Verfahrens (Analysis and reimplementation of the mean shift tracking method). Master's thesis, Fachhochschul-Masterstudiengang Digitale Medien, Hagenberg, Austria, August 2007.

Other
22. Walter G. Kropatsch, Nicole M. Artner, Yll Haxhimusa, and Xiaoyi Jiang, editors. Proceedings of the 9th IAPR-TC-15 Workshop on Graph-based Representations in Pattern Recognition, volume 7877 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, May 2013.
23. Fuensanta Torres, Walter G. Kropatsch, and Nicole M. Artner. Predict pose and position of rigid objects in video sequences. In Proceedings of International Conference on Systems, Signals and Image Processing, IWSSIP, 2012.
24. Walter G. Kropatsch, Adrian Ion, and Nicole M. Artner. Describing when and where in vision (abstract only). In 16th Iberoamerican Congress on Pattern Recognition, volume 7042 of Lecture Notes in Computer Science, page 25. Springer, November 2011.
25. Nicole M. Artner. A comparison of mean shift tracking methods. In 12th Central European Seminar on Computer Graphics, pages 197–204, April 2008.