Features in Visual Media Analysis

DISSERTATION submitted in partial fulfillment of the requirements for the degree of Doktorin der Sozial- und Wirtschaftswissenschaften by Maia Zaharieva, Registration Number 9707986, to the Faculty of Informatics at the Vienna University of Technology

Advisor: Prof. Dr. Christian Breiteneder

The dissertation has been reviewed by:
Prof. Dr. Christian Breiteneder
Prof. Dr. Stéphane Marchand-Maillet

Wien, 31.10.2011
Maia Zaharieva

DECLARATION

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

October 31, 2011, Vienna, Austria
Maia Zaharieva

ABSTRACT

Today, film analysis is still a tedious process performed mostly manually by film experts. Existing computer vision approaches aim at improved retrieval and summarization methods rather than at film understanding. While current research is predominantly focused on the question of what we can learn and extract from a film as the final product, this thesis studies the filmmaking process as a source of high-level content information.

The central question of this thesis is: what can computer vision methods provide to support film analysis as performed by film experts? We discuss a possible mapping between factors that influence the production, presentation, and perception of movies, their application by means of well-established film techniques, and existing feature extraction methods in computer vision. This novel view on film analysis allows for the exploration and identification of three areas in the domain of automated film analysis and understanding. The first area comprises research tasks that have been subject to active research in the recent past. The second area covers research topics that are not immediately solvable by a fully automated computer vision approach without any prior knowledge. The last area identifies research tasks that are still open in the context of automated film analysis and understanding.

Finally, we introduce three novel research questions and possible solutions: camera take reconstruction, film comparison, and recurring element detection. The experiments performed reveal two significant potentials. First, the proposed methods can assist film experts by providing support for tasks that are currently performed manually. Second, the proposed algorithms blaze the trail for advanced application scenarios such as the analysis of different montage patterns, the identification of missing shots, the reconstruction of the original film cut, or the detection of recurring elements.

ZUSAMMENFASSUNG

Despite great advances in automated image and video processing, many investigations in film analysis are today still carried out manually. Existing computer vision approaches and applications mostly aim at finding relevant information or at presenting large amounts of data compactly rather than at understanding films.
While current research in automated film analysis addresses the question of what we can learn and extract from the film as such, this work investigates the production process of a film as a possible starting point for automated film analysis. The central question of this work is: "To what extent can computer vision methods support film scholars?" We discuss a possible link between factors that influence the creation, design, and perception of films and existing computer vision methods. This new view on film analysis enables the identification and exploration of three groups of research questions in the context of automated film analysis. The first group comprises research questions that have been actively investigated for several years. The second group contains research questions that cannot be solved immediately given the current state of the art. The third group represents questions that are of great interest to film scholars but have not yet been investigated in computer vision.

In the practical part of this work we present three new research directions and possible solutions in detail: the reconstruction of the original camera take sequence, the comparison of different film versions, and the detection of recurring elements in films. The results of the experiments performed exhibit two essential characteristics. First, time-consuming tasks in manual film analysis can be effectively supported by automated methods. Second, the proposed approaches open up the space for further research questions in film analysis, such as the analysis of montage patterns, the identification of lost image and film sequences, and the detection of recurring elements.

To my father.

ACKNOWLEDGMENTS

I have not failed. I've just found 10,000 ways that won't work. — Thomas Edison

First and foremost I would like to thank my boss and supervisor, Christian Breiteneder, for giving me the opportunity to follow my ideas in this thesis. His support, advice, and valuable feedback have been pivotal for my work. To my second supervisor, Stéphane Marchand-Maillet, for the constructive criticism and discussions that helped me sort out ideas and details of my work. I would also like to thank my colleagues, Dalibor Mitrović and Matthias Zeppelzauer, for sharing research ideas, sweets, or a beer whenever needed. To Horst Eidenberger for being the first one to really convince me to write a thesis a long time ago and for sharing his workspace with me although (or maybe because of the fact that) I did not pursue his ideas. To all my friends for enduring my frustration, sharing the joy, and reminding me of life outside of research. Finally, I would like to thank my family. There are no words to describe the love, care, and support through the years.

PUBLICATIONS

Some ideas and figures have appeared previously in the following publications:

journal articles

1. M. Zeppelzauer, M. Zaharieva, D. Mitrović, and C. Breiteneder: Retrieval of Motion Composition in Film. Digital Creativity. accepted. 2011.

2. M. Zaharieva, D. Mitrović, M. Zeppelzauer, and C. Breiteneder: Film analysis in archive documentaries. IEEE MultiMedia. 18:38–47. 2011.

3. M. Zaharieva, M. Zeppelzauer, D. Mitrović, and C.
Breiteneder: Archive film comparison. International Journal of Multimedia Data Engineering and Management. 1(3):41–56. 2010. peer-reviewed conference publications 1. M. Zaharieva and C. Breiteneder: Recurring Element Detection in Movies. In International Multimedia Modeling Conference (MMM’12). accepted. 2012. 2. D. Mitrović, M. Zeppelzauer, M. Zaharieva, and C. Breiteneder: Retrieval of Visual Composition in Film Analysis. In International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS’11). 2011. 3. D. Mitrović, S. Hartlieb, M. Zeppelzauer, and M. Zaharieva: Scene Segmentation in Artistic Archive Documentaries. In Symposium of the Workgroup Human-Computer Interaction and Usability En- gineering (USAB’10). LNCS 6389/2010. pp. 400–410. 2010. 4. M. Zaharieva, M. Zeppelzauer, C. Breiteneder, and D. Mitrović: Camera take reconstruction. In International Multimedia Modeling Conference (MMM’10). LNCS 5916/2010. pp. 379–388. 2010. 5. M. Zeppelzauer, M. Zaharieva, D. Mitrović, and C. Breiteneder: A novel trajectory clustering approach for motion segmentation. In In- ternational Multimedia Modeling Conference (MMM’10). LNCS 5916/2010. pp. 433–443. 2010. 6. M. Zaharieva, M. Zeppelzauer, D. Mitrović, and C. Breiteneder: Finding the missing piece: Content-based video comparison. In IEEE International Symposium on Multimedia (ISM’09). pp. 330–335. 2009. xiii C O N T E N T S 1 introduction 1 1.1 Summary of Contributions . . . . . . . . . . . . . . . . 2 1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . 3 i background 5 2 media aesthetics in film 7 2.1 Fundamental Media Aesthetic Elements . . . . . . . . . 8 2.1.1 Light and Color . . . . . . . . . . . . . . . . . . . 8 2.1.2 Space . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.3 Time and Motion . . . . . . . . . . . . . . . . . . 12 2.1.4 Sound . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Advanced Media Concepts . . . . . . . . . . . . . . . . 14 2.2.1 Composition . . . . . . . . . . . . . . . . . . . . . 14 2.2.2 Continuity . . . . . . . . . . . . . . . . . . . . . . 16 2.2.3 Motif . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.4 Rhythm, Tempo and Pace . . . . . . . . . . . . . 17 2.3 Perception of Visual Media . . . . . . . . . . . . . . . . 19 2.3.1 Light and Color . . . . . . . . . . . . . . . . . . . 19 2.3.2 Space . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3.3 Time and Motion . . . . . . . . . . . . . . . . . . 23 2.3.4 Sound . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.4.1 Loudness . . . . . . . . . . . . . . . . . 25 2.3.4.2 Pitch . . . . . . . . . . . . . . . . . . . . 26 2.3.4.3 Timbre . . . . . . . . . . . . . . . . . . . 26 2.4 Computational Media Aesthetics . . . . . . . . . . . . . 27 3 media features 29 3.1 Global Features . . . . . . . . . . . . . . . . . . . . . . . 29 3.1.1 Color features . . . . . . . . . . . . . . . . . . . . 29 3.1.2 Edge features . . . . . . . . . . . . . . . . . . . . 32 3.2 Local Features . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2.1 Interest Point Detectors . . . . . . . . . . . . . . 35 3.2.2 Local Descriptors . . . . . . . . . . . . . . . . . . 36 3.2.3 Matching Strategies . . . . . . . . . . . . . . . . 39 3.3 Motion Features . . . . . . . . . . . . . . . . . . . . . . . 40 ii film analysis 43 4 visual-based computational media aesthetics 45 4.1 Film Factor Space . . . . . . . . . . . . . . . . . . . . . . 46 4.2 Fundamental Media Elements . . . . . . . . . . . . . . . 48 4.2.1 Light and Color . . . . . . . 
. . . . . . . . . . . . 48 4.2.2 Space . . . . . . . . . . . . . . . . . . . . . . . . . 52 xv xvi contents 4.2.3 Time and Motion . . . . . . . . . . . . . . . . . . 56 4.3 Advanced Media Concepts . . . . . . . . . . . . . . . . 62 4.3.1 Composition . . . . . . . . . . . . . . . . . . . . . 62 4.3.2 Continuity . . . . . . . . . . . . . . . . . . . . . . 68 4.3.3 Motif . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.4 Rhythm, Tempo and Pace . . . . . . . . . . . . . 74 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5 case studies 81 5.1 Archive Video Data . . . . . . . . . . . . . . . . . . . . . 81 5.2 Camera Take Reconstruction . . . . . . . . . . . . . . . 83 5.2.1 Camera Take Detection . . . . . . . . . . . . . . 85 5.2.1.1 Continuity Analysis . . . . . . . . . . . 85 5.2.1.2 Motion Smoothness Analysis . . . . . 86 5.2.2 Experiments . . . . . . . . . . . . . . . . . . . . . 89 5.2.2.1 Camera Take Detection . . . . . . . . . 89 5.2.2.2 Montage Reconstruction . . . . . . . . 90 5.2.3 Related Work . . . . . . . . . . . . . . . . . . . . 91 5.2.4 Conclusion and Discussion . . . . . . . . . . . . 93 5.3 Film Comparison . . . . . . . . . . . . . . . . . . . . . . 94 5.3.1 Underlying Methodology . . . . . . . . . . . . . 96 5.3.1.1 Frame Level . . . . . . . . . . . . . . . 96 5.3.1.2 Shot Level . . . . . . . . . . . . . . . . . 97 5.3.1.3 Video Level . . . . . . . . . . . . . . . . 97 5.3.2 Methods Compared . . . . . . . . . . . . . . . . 98 5.3.2.1 Shot Boundary Detection . . . . . . . . 99 5.3.2.2 Keyframe Selection . . . . . . . . . . . 100 5.3.2.3 Feature Extraction . . . . . . . . . . . . 101 5.3.3 Experiments . . . . . . . . . . . . . . . . . . . . . 101 5.3.3.1 Video Data and Case Studies . . . . . . 102 5.3.3.2 Shot Boundary Detection . . . . . . . . 102 5.3.3.3 Keyframe Selection and Feature Repre- sentation . . . . . . . . . . . . . . . . . 103 5.3.3.4 Unique Shot Detection . . . . . . . . . 108 5.3.4 Related Work . . . . . . . . . . . . . . . . . . . . 109 5.3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . 111 5.4 Recurring Element Detection . . . . . . . . . . . . . . . 111 5.4.1 Approach . . . . . . . . . . . . . . . . . . . . . . 113 5.4.1.1 Region Detection . . . . . . . . . . . . . 113 5.4.1.2 Region Analysis and Representation . 114 5.4.2 Experiments . . . . . . . . . . . . . . . . . . . . . 116 5.4.2.1 Contemporary Movie . . . . . . . . . . 116 5.4.2.2 Archived Documentaries . . . . . . . . 119 5.4.3 Related Work . . . . . . . . . . . . . . . . . . . . 122 5.4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . 124 contents xvii iii summary 125 6 summary and conclusion 127 6.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . 127 6.2 Future Development . . . . . . . . . . . . . . . . . . . . 128 bibliography 131 L I S T O F F I G U R E S Figure 2.1 Lighting in film. . . . . . . . . . . . . . . . . . . . 8 Figure 2.2 Color in film. . . . . . . . . . . . . . . . . . . . . 9 Figure 2.3 Space in film: aspect ratio. . . . . . . . . . . . . . 11 Figure 2.4 Space in film: camera techniques. . . . . . . . . 11 Figure 2.5 Space in film: from deep to shallow space. . . . 12 Figure 2.6 Time manipulation in film. . . . . . . . . . . . . 13 Figure 2.7 Film compositions in Pulp Fiction (1994). . . . . 14 Figure 2.8 Balancing the scene using character and camera motion. . . . . . . . . . . . . . . . . . . . . . . . . 15 Figure 2.9 Continuity error in The Dark Knight (2008). . . . 16 Figure 2.10 Clock-motif in Run Lola Run (1998). . . 
. . . . . 17 Figure 2.11 Shot length distribution in the Odessa Steps. . . 18 Figure 2.12 Rhythm analysis of the sequence Odessa Steps from The Battleship Potemkin (1926). . . . . . . . 20 Figure 2.13 Color perception: contrast. . . . . . . . . . . . . 21 Figure 2.14 Space perception in The Dark Knight (2008). . . . 22 Figure 2.15 Space perception: size constancy. . . . . . . . . . 23 Figure 2.16 Motion perception. . . . . . . . . . . . . . . . . . 25 Figure 2.17 Equal-loudness contours. . . . . . . . . . . . . . 26 Figure 3.1 Color histograms vs. content. . . . . . . . . . . . 30 Figure 3.2 MPEG-7 edge types. . . . . . . . . . . . . . . . . 32 Figure 3.3 An example for differences in Moving Picture Experts Group (MPEG)-7 edge histograms due to illumination changes. . . . . . . . . . . . . . . . 33 Figure 3.4 Line detection. . . . . . . . . . . . . . . . . . . . 34 Figure 3.5 Multiple matching of local features. . . . . . . . 40 Figure 4.1 Film factors space. . . . . . . . . . . . . . . . . . 47 Figure 4.2 Mapping legend. . . . . . . . . . . . . . . . . . . 48 Figure 4.3 Media element vs. computer vision: Light and Color. . . . . . . . . . . . . . . . . . . . . . . . . . 48 Figure 4.4 Examples for high- and low-key lighting. . . . . 49 Figure 4.5 Examples for above-/below-eye-level lighting. . 50 Figure 4.6 Examples for lighting techniques. . . . . . . . . 51 Figure 4.7 Examples for context-dependent color distribu- tions. . . . . . . . . . . . . . . . . . . . . . . . . . 51 Figure 4.8 Media element vs. computer vision: Space. . . . 52 Figure 4.9 Examples for camera angles. . . . . . . . . . . . 53 Figure 4.10 Examples for different shot types. . . . . . . . . 54 Figure 4.11 Examples for blocking. . . . . . . . . . . . . . . . 55 Figure 4.12 An approach for the detection of blocking char- acter/object at an edge of a frame. . . . . . . . . 55 xix xx List of Figures Figure 4.13 An example for racking focus. . . . . . . . . . . 57 Figure 4.14 Media element vs. computer vision: Time & Mo- tion. . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Figure 4.15 An example for a freeze frame. . . . . . . . . . . 58 Figure 4.16 An example for skip frames in Run Lola Run (1998). 60 Figure 4.17 Examples for sharply defined split-screens from the TV series 24 (2001-2010). . . . . . . . . . . . 61 Figure 4.18 Examples for split-screens in Run Lola Run (1998). 61 Figure 4.19 Examples for superimposition. . . . . . . . . . . 62 Figure 4.20 Examples for large motion in a shot. . . . . . . . 63 Figure 4.21 Media element vs. computer vision: Composition. 63 Figure 4.22 Examples for horizontal and vertical framing directions. . . . . . . . . . . . . . . . . . . . . . . 64 Figure 4.23 Examples for horizontal/vertical and tilted fram- ing directions. . . . . . . . . . . . . . . . . . . . . 65 Figure 4.24 Rule of thirds. . . . . . . . . . . . . . . . . . . . . 67 Figure 4.25 Detection of symmetrical film compositions. . . 67 Figure 4.26 Motion-aided visual composition. . . . . . . . . 68 Figure 4.27 Media element vs. computer vision: Continuity. 69 Figure 4.28 Axis of action in a dialog scene. . . . . . . . . . 70 Figure 4.29 Examples for match cutting. . . . . . . . . . . . 71 Figure 4.30 Media element vs. computer vision: Motif. . . . 72 Figure 4.31 Automated motif detection. . . . . . . . . . . . . 72 Figure 4.32 Examples for detected recurring elements. . . . 73 Figure 4.33 Media elements vs. computer vision: Rhythm, Tempo and Pace. . . . . . . . . . . . . . . . . . . 
74 Figure 4.34 Keyframes of the first two scenes of Quantum of Solace (2008). . . . . . . . . . . . . . . . . . . . . . 75 Figure 4.35 Motion content. . . . . . . . . . . . . . . . . . . . 76 Figure 4.36 Pace function analysis. . . . . . . . . . . . . . . . 77 Figure 4.37 Recent computer vision applications. . . . . . . 78 Figure 5.1 Vertov’s demonstration of cinema artificiality. . 82 Figure 5.2 Examples for highly similar repeating shots. . . 82 Figure 5.3 Examples for artifacts in archive film material. . 83 Figure 5.4 Camera takes vs. film scenes. . . . . . . . . . . . 84 Figure 5.5 Camera take detection workflow. . . . . . . . . . 86 Figure 5.6 Motion smoothness for frames of the same cam- era take. . . . . . . . . . . . . . . . . . . . . . . . 87 Figure 5.7 Motion smoothness for frames of similar but not consecutive shots. . . . . . . . . . . . . . . . . . . 88 Figure 5.8 False positive camera take. . . . . . . . . . . . . 89 Figure 5.9 Ambiguous shots. . . . . . . . . . . . . . . . . . 90 Figure 5.10 Detected camera takes in The Eleventh Year (1928). 91 Figure 5.11 Detected camera takes in Man with a Movie Cam- era (1929). . . . . . . . . . . . . . . . . . . . . . . 91 Figure 5.12 Montage schemas for Kino Eye (1924). . . . . . . 92 Figure 5.13 Identical vs. similar shots. . . . . . . . . . . . . . 95 Figure 5.14 Workflow and information propagation. . . . . 96 Figure 5.15 An example for minima/maxima suppression at video-level. . . . . . . . . . . . . . . . . . . . . . 98 Figure 5.16 Video sequence with matched and unknown shots. 98 Figure 5.17 Examples for differences in corresponding shots in different film versions. . . . . . . . . . . . . . 105 Figure 5.18 Examples for false positives in shot matching. . 106 Figure 5.19 Experimental results from the automated film comparison. . . . . . . . . . . . . . . . . . . . . . 107 Figure 5.20 Examples for differences in corresponding shots as result from a preprocessing step, coding, and compression technology. . . . . . . . . . . . . . . 107 Figure 5.21 Examples for motifs in movies. . . . . . . . . . . 112 Figure 5.22 Algorithm workflow. . . . . . . . . . . . . . . . . 113 Figure 5.23 Initial region detection. . . . . . . . . . . . . . . 115 Figure 5.24 Region dropping and merging. . . . . . . . . . . 116 Figure 5.25 Contemporary movie: Recurring elements vs. linked shots. . . . . . . . . . . . . . . . . . . . . . 117 Figure 5.26 Contemporary movie: Distribution of the size of detected recurring elements. . . . . . . . . . . . 118 Figure 5.27 Contemporary movie: Examples for detected re- curring elements. . . . . . . . . . . . . . . . . . . 119 Figure 5.28 Contemporary movie: Detected recurring region and corresponding linked elements. . . . . . . . 120 Figure 5.29 Archived documentaries: Recurring elements vs. linked shots. . . . . . . . . . . . . . . . . . . . . . 121 Figure 5.30 Archived documentaries: An example for de- tected recurring element. . . . . . . . . . . . . . 121 Figure 5.31 Archived documentaries: Distribution of the size of detected recurring elements. . . . . . . . . . . 122 Figure 5.32 Cross-movie analysis. . . . . . . . . . . . . . . . 123 L I S T O F TA B L E S Table 4.1 Tasks in visual-based computational media aes- thetics. . . . . . . . . . . . . . . . . . . . . . . . . 80 Table 5.1 Performance results on camera take detection. . 89 Table 5.2 Recall (R) / Precision (P) results for shot bound- ary detection. . . . . . . . . . . . . . . . . . . . . 103 Table 5.3 Recall (R) / Precision (P) results for CS1. . . . . 
105 Table 5.4 Recall (R) / Precision (P) results for CS2. . . . . 108 Table 5.5 Unique shot detection. . . . . . . . . . . . . . . . 108 xxi A C R O N Y M S ASA American Standards Association ANMRR Average Normalized Retrieval Rate CCV Color Coherence Vector DCT Discrete Cosine Transform DOG Difference-of-Gaussian GBR Geometry-based region GLOH Gradient Location and Orientation Histogram GoF Group of Frames GoP Group of Pictures HSV Hue-Saturation-Value IBR Intensity-based region JPEG Joint Photographic Experts Group KLT Kanade-Lucas-Tomasi LSD Line Segment Detector MPEG Moving Picture Experts Group MSER Maximally Stable Extremal Regions MSF Markov Stationary Features PCA Principal Components Analysis RANSAC RANdom SAmple Consensus RGB Red-Green-Blue SIFT Scale Invariant Feature Transform SURF Speeded Up Robust Features xxii 1 I N T R O D U C T I O N The camera is no innocent eye. — Jesse J. Prinz Visual media is a broadly used term usually referring to TV, movies, photography, etc. In general, it can be distinguished between still and moving pictures. Both types involve (predominantly) visual senses, share the same visual features, and often require the same computer vision methods for their automated analysis, understanding, and retrieval. Moving pictures, such as films, videos or TV broadcasts, additionally face new possibilities and challenges introduced by the dimension of time. Following, this work is focused primarily on the study of visual features for film analysis and understanding. It neglects the auditory aspect, although, crucial characteristics and details are explained where necessary. Film studies is a research discipline that explores the application of various film techniques and conventions, the production and distri- bution of films, how they are responded by the audience, and what economical, technological, social and aesthetic practices influence their creation and perception [41, 135, 179]. Despite the recent progress in technology and in computer vision methods, film analysis is still a tedious process performed mostly manually by film scholars and film experts. This situation is substantially caused by missing mutual understanding. The two main questions are: • What are film experts interested in? and • What can computer vision methods provide for their support? Today, the focus of computer vision methods for film and video analysis moves from low-level tasks such as shot boundary detection to high-level analysis such as genre, scene or event recognition. However, such approaches are driven by the requirements of the audience and consumers and aim at improved retrieval and summarization methods rather than film understanding. This work presents an attempt to build a bridge between the two disciplines: film studies and computer vision. It is based on the theory that films have a manipulative aspect. Next to narrative and involved actors, manifold factors in the process of filmmaking influence the way a film is perceived by the audience. Some film factors have attention models as origin. For example, color, contrast, positioning of an object or its motion can be applied to attract the attention of the audience to 1 2 introduction a certain area in the scene. Other factors are the result of the artistic nature and creativity of the people involved in the creation process. The study of such film factors and the exploration of the feasibility of corresponding computer vision approaches for their detection and analysis is the subject of this thesis. 
In this work we investigate three aspects of features in visual media: 1. What features of visual media do influence their production, presen- tation, and perception most? In this context, we explore features (mainly) from a filmmaking point of view and provide a basic understanding for common film techniques and their intent. 2. What features can be applied to represent given characteristics of vi- sual media? This topic deals with features as a basic module in computer vision. It provides a brief overview over well- established features in automated film and video analysis. 3. What features can be linked together in praxis? In the core of this work, we explore the feasibility and the potential of a mapping between the previously discussed views on features in visual media. 1.1 summary of contributions The contributions of this thesis can be summarized as follows: 1. A novel view on visual media analysis. The understanding of cru- cial elements in the process of filmmaking and the technical knowledge about methods of computer vision allows for the def- inition of a thorough mapping between the two disciplines. Such a mapping does not solely improve the mutual understanding of the two research areas. Moreover, it allows for the identification of feasible research tasks for an automated film analysis and outlines the limitations of existing computer vision methods. 2. The identification and exploration of novel applications in the con- text of film analysis and understanding. Following the study and the analysis discussed above, this work presents novel appli- cation scenarios in the context of automated film analysis and understanding. Performed experiments and evaluations show the potential of the proposed methods to assist a number of tedious tasks currently performed manually by film experts such as the comparison of different film versions or the identification of unique shots. Furthermore, such methods can be applied as an intermediate step towards a high-level film analysis such as the analysis of montage patterns of different filmmakers or editors. 1.2 thesis outline 3 1.2 thesis outline Chapter 2 explores the features of visual media that most influence their production, presentation, and perception. In the process of this study we discuss the process of filmmaking and the intentions behind different film techniques in detail. This chapter provides a basic un- derstanding of the elements of media aesthetics in filmmaking and allows for the identification of possible tasks for automated visual media analysis. Chapter 3 briefly outlines a set of features for visual media rep- resentation from a computer vision point of view. In the following, discussed features are applied in experiments and evaluations in the remaining chapters. Chapter 4 creates the link between visual media features in film- making and in computer vision. Various applications of the elements of media aesthetics in filmmaking by means of well-established film techniques are mapped to existing approaches for their automated detection and analysis in computer vision. This study allows for the identification of three areas in the domain of automated film analysis and understanding: 1. tasks that have been subject to active research in the last years; 2. tasks that are not directly solvable for a fully automated com- puter vision approach; and 3. research tasks that are still open in the context of automated film analysis and understanding. Chapter 5 presents three novel application scenarios in this con- text. 
All experiments are performed in an unconventional data set of archived documentaries. The explored data set bears challenges from both artistic and technological point of view and outlines possible limitations of existing approaches in computer vision. All three case studies show the high potential to improve the process of film analysis, understanding, and retrieval. Finally, Chapter 6 provides a summary of the work presented in this thesis and discusses some ideas and directions for future research. Part I B A C K G R O U N D 2 M E D I A A E S T H E T I C S I N F I L M A system of aesthetics can never confine, within one interpretation, notions which must include them all. — Jean Mitry [105] In this chapter we focus on the first research question of this thesis: Aim of the chapter What features of visual media do influence their production, presentation, and perception? This question is important for several aspects: to pro- vide a basic understanding of the elements of media aesthetics in filmmaking, to identify possible tasks for automated visual media analysis, and to stress the significance of the mapping proposed in this thesis. Next to the narrative story of a film and featuring actors, its overall presentation strongly influences the perception (aesthetic experience) of a movie. Since the whole process of filmmaking is a chain of aesthetic decisions, the understanding of such background factors provides high-level structural and semantic information that can be missed by conventional automated retrieval methods. After defining the term of media aesthetics, we address the basic media elements as proposed by Zettl: light and color, space, time and motion, and sound [178, 179]. Furthermore, we present several advanced me- dia concepts that are the result of the combination of and interaction among various fundamental media elements. Following, we provide a brief background on the basic media elements from a human percep- tion point of view. The idea is to understand what human perception is most sensitive to in order to identify those computational features that can best describe visual media. Finally, we discuss existing research on automated processing of media elements as defined by the concept of computational media aesthetics. (Applied) media aesthetics is the study of elements that influence the What is media aesthetics?production, perception, and presentation of media [179]. Manifold fac- tors influence the creation of visual media and our perceptual reaction to them. Firstly, the creation of any subject of art is accompanied by a series of decisions by its creator: which format to use (e.g. sculpture, painting, image, film, etc.), what do depict, and how to arrange it within the chosen format. Secondly, media elements are influenced by the time and the context in which they were created. For example, the introduction of color to the film production process essentially influenced the perception of a movie and encouraged the artistry of the filmmakers. Finally, our own experience affects the way we per- 7 8 media aesthetics in film ceive media elements [105]. However, in this chapter we focus on the intended purpose of the presented media elements. 2.1 fundamental media aesthetic elements As fundamental media aesthetic elements we refer to the aesthetic ele- ments as proposed by Herbert Zettl: light and color, space, time and motion, and sound [179]. 
Although first published in the early 1970s, Zettl's book is today still considered one of the best and most comprehensive books on film aesthetics.

2.1.1 Light and Color

Both lighting and color imply certain aesthetic messages and are commonly used to guide the observer's attention, to set up a specific mood, or to create motifs. Lighting is mostly manipulated as the interaction of light and shadows. For example, in Casablanca (1942) lighting draws the attention of the audience to the tears in the eyes of Ilsa, which can easily be missed otherwise (see Figure 2.1a). Another example can be found in Delicatessen (1991), where lighting and color settings are used to increase the tension of the scene (see Figure 2.1b).

(a) In Casablanca (1942) sidelight draws the attention to the tears in the eyes of Ilsa. (b) In Delicatessen (1991) lighting increases the tension of the scene.
Figure 2.1: Lighting in film.

Common techniques for lighting manipulation include e.g. high- and low-key lighting, referring to the overall light level in a scene. High-key lighting usually conveys ambience ranging from normalcy to energy and enthusiasm (see Figures 2.2a-2.2b) while low-key lighting is mostly associated with crime and mystery (see Figures 2.1a-2.1b). Depending on the light direction, a difference can be made between above-eye-level lighting (associated with normalcy) and below-eye-level lighting (well known from horror movies). The combination thereof creates the three main types of lighting techniques: chiaroscuro, flat, and silhouette. Chiaroscuro lighting is characterized by selective illumination, an overall low light level, and distinct and dense shadows. These elements contribute essentially to the direction of the audience's attention and to the creation of expressive visual compositions. By contrast, the light direction in flat lighting is not exactly definable, which leads to light or transparent shadows. Flat lighting suggests normalcy, efficiency, and an upbeat mood. Finally, in silhouette lighting, as the name implies, the background is brightly lit and the actors or objects in the foreground are unlit. Thus, silhouette lighting is characterized by its extreme light-dark contrast and its accent on contour rather than texture and volume.

Similar to lighting, color can cause a specific feeling, guide the attention of the audience, and stress the quality of an event or object. For example, the color settings of the kitchen scene in Tampopo (1985) make the widow's red dress stand out and draw the attention of the audience to her (see Figures 2.2a-2.2b). A change in color settings can also support the narrative development of the film, such as a change of scene or a leap in time. In Traffic (2000) different colors are used to influence the mood of the audience and stress the difference in the location settings (see Figures 2.2c-2.2d).

(a) In Tampopo (1985) red color is used to shift the attention of the audience from the food counter ... (b) ... to the widow who takes the challenge to match the customers' orders [18]. (c) In Traffic (2000) color post processing is used to influence mood. All scenes that are shot in Washington DC have a blue tone ... (d) ... while all the scenes shot in Mexico have a hazy yellow look.
Figure 2.2: Color in film.
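The overall light level that separates high-key from low-key lighting, as described above, lends itself to a simple quantitative illustration. The sketch below is only an assumption-laden example (the cutoff values are arbitrary), not a technique proposed in this thesis; it labels a frame according to its mean brightness and its share of dark pixels.

```python
# Illustrative sketch: estimate whether a frame leans towards high-key or
# low-key lighting from simple luminance statistics (all thresholds are assumptions).
import cv2

def lighting_key(frame_bgr, dark_cutoff=60, dark_share_limit=0.5, mean_limit=100):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    mean_brightness = float(gray.mean())          # overall light level
    dark_share = float((gray < dark_cutoff).mean())  # fraction of dark pixels
    if mean_brightness < mean_limit and dark_share > dark_share_limit:
        return "low-key"
    return "high-key"
```

Such a statistic captures only the overall light level; the direction of the light and the character of the shadows, which distinguish chiaroscuro from flat lighting, are not covered by it.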
Despite the fact that, from a perception point of view, color is not as strongly perceived as form, color is one of the main resources of art expressivity [105]. The use of color by means of harmony, dynamics, transformation, and contrast strongly influences the way a film or a picture is experienced by the audience. However, the relativity of color should not be disregarded. Color depends on the context: it depends on the surface, on the surrounding colors, on the lighting conditions, and on the subjectivity of the color perception of different people. Despite the fact that certain colors set off the expression of a particular emotion better than others (for example, a violent scene is usually associated with cool colors), the interpretation of colors varies according to the person perceiving it and according to the subjectivity of the creator, i.e. a filmmaker can use color according to color theory (harmonically), contrapuntally (to provoke shock or to create tension), or with little or no intentional meaning. As Zettl states: "In general, the 'psychological' interpretation of colors is a very slippery business" [179].

2.1.2 Space

Various factors contribute to the presentation and arrangement of space within a frame. First, the dimensions and shape of a frame control the available space and its overall composition. Early films, such as Casablanca (1942), were shot in an Academy ratio (1.33:1 or 1.37:1) and had an almost square shape (see for an example Figure 2.1a). In the early 1950s, Cinerama employed three cameras and three projectors, allowing the effect of peripheral vision for the audience, and had an aspect ratio of approximately 1:2.85 (see Figure 2.3a) [14]. Today, wide screens emphasize horizontal compositions and allow for the arrangement of multiple areas of interest. Figures 2.3b-2.3c show the difference in the visible areas between different frame dimensions.

The specific dimensions of a frame together with the camera position (i.e. angle, level, height, and distance of shooting) define the field of view of a film space. The decision for a specific camera position causes a drastic difference in the framing of the image and in the perception of the filmed event [18]. Apart from the narrative significance, camera techniques can guide the attention of the audience and increase visual interest. For example, a close-up shot can reveal details that can easily be missed otherwise (see for an example Figure 2.4a). Furthermore, similar to the interpretation of color meanings (see Section 2.1.1), certain camera techniques are often assigned absolute meanings. An example of such an association is the low-angle shot with a powerful character. Bordwell, however, claims that there is no such absolute or general meaning [18]. In some cases camera techniques do carry such meanings, but there are no hard and fast rules, which preserves the uniqueness and richness of many individual films (see for an example Figure 2.4b).

(a) Cinerama in The Wonderful World of the Brothers Grimm (1962). (b) A scene from Blood Diamond (2006) with the original aspect ratio of 2.35:1 ... (c) ... and cropped at an aspect ratio of 4:3 (TV standard).
Figure 2.3: Space in film: aspect ratio.

(a) An extreme close up from Let's Make Money (2008) revealing details that can be easily missed in any other shot type.
(b) In Citizen Kane (1941) low-angle shots are often used to convey Kane's power. However, the lowest angle occurs at the point of his most humiliating defeat, the lost gubernatorial campaign [18].
Figure 2.4: Space in film: camera techniques.

Besides the field of view, another essential characteristic of space is the perception of depth of field (note that by depth of field we do not mean the film term referring to "the distance within which objects appear in sharp focus" [14] but rather depth from a visual perception point of view). Various elements, such as lighting, setting, objects, and camera positioning, contribute to the impression of space depth. Such depth cues originate mostly in the visual perception and can additionally stress or even dampen the perception of depth (see Figure 2.5; for more details on space perception see Section 2.3.2). Finally, character and camera movements additionally influence the perception of space. Changes in the camera angle, level, height, or distance during the shot support the information about the space of the image (visible and not visible) and make its objects and characters become sharper and more vivid.

(a) The establishing shot shows a view on all participants of the conference and their environment. (b) With increasing tension in the discussion the perception of depth is constantly reduced ... (c) ... by means of camera techniques (motion and field of view) and lighting settings.
Figure 2.5: Space in film: from deep to shallow space. The sequence shows the G8 Conference on Diamonds in Blood Diamond (2006).

2.1.3 Time and Motion

Motion distinguishes film from other visual media – it has duration and is, thus, dependent on time. Film time is not the time of the action or the production; it is the perceived time. Hence, the control of time is essential in making the audience perceive the desired pace and rhythm of the story [39]. Time manipulation usually involves a shortened plot time in comparison with the story time. Such time jumps may omit time spans from seconds to centuries. The audience must recognize that time has passed. Often, such time manipulation relies on human experience in omitting scenes that are of no importance to the narrative development. For example, the time from waking up to breakfast in the morning is not necessarily shown to the audience, since the filmmaker can rely on the experience and imagination of the audience. In other cases, time manipulation can be achieved by means of various cinematic techniques, such as accelerated montage (associated with fast time passing), diverse shot transitions, color manipulation, or slow motion (see for an example Figure 2.6).

(a) As Manni explains to Lola what happened earlier ... (b) ... both a wipe shot transition and color manipulation ... (c) ... are used to visualize the time jump in the story line.
Figure 2.6: Time manipulation in film. Run Lola Run (1998).

2.1.4 Sound

In the early days of filmmaking, silent films were projected to the accompaniment of music. The absence of sound prevented the audience from experiencing a real feeling of duration or time passing [105]. Today, the presence, or even the absence, of sound can create powerful effects. For example, hearing someone's sobbing may provoke much stronger feelings than seeing him/her crying; a quiet sequence in a film can increase the tension, etc.
Sound can actively shape the perception and interpretation of an image, guide the attention of the audience, and even form expectations. Sound can "clarify image events, contradict them, or render them ambiguous" [18]. Sound strongly influences the perception of rhythm in a movie. Rhythm involves (patterns of) beat and tempo and is most recognizable in film music. However, sound effects can also exhibit rhythmic characteristics (e.g. church bells, horse gallop, etc.). Finally, speech also has rhythm, since people exhibit characteristic frequencies and amplitudes and distinct patterns of pacing and syllabic stress [18]. Usually, the rhythm of sound is in accordance with the rhythm of editing and motion within a scene. Sound can also support or undermine the credibility of a scene by supporting or contradicting the visual impressions. Usually, sound that is unfaithful to its visual source is used for a dramatic transition or comic effect. Finally, sound possesses temporal and spatial dimensions. The spatial dimensions relate to the source of the sound. The temporal dimensions originate in the relation between the sound and the event that takes place at the same time (simultaneous / nonsimultaneous sounds).

2.2 advanced media concepts

The fundamental media elements, as presented in the previous section, are rarely applied individually. Rather, the combination of and interaction among various basic media elements create powerful and expressive tools that shape the perception of visual media. We call such an interplay of fundamental media elements an advanced media concept and focus on those concepts that are extensively presented in film theory and film studies: composition, continuity, motif, and rhythm, tempo and pace.

2.2.1 Composition

Composition refers to the use of light and color and the arrangement of objects, characters, and the camera position for photographic and dramatic expression [14]. Filmmakers usually try to balance the image. The extreme type of image balance is called bilateral symmetry (see Figure 2.7a) [18]. More common, however, is a looser image balance that can also create strong effects, for example, to stress the significance of a character or object or to imply the power of a character (see Figure 2.7b). Finally, unbalanced shots are often applied to increase the tension or dynamics of a scene (see Figures 2.7c-2.7d).

(a) Near-perfect balance between the left and the right halves of the frame (bilateral symmetry). (b) Asymmetrically balanced composition implying the power of Jules. (c) Unbalanced dynamic composition in the restaurant that makes the audience look back ... (d) ... and forth between Vincent and Jules.
Figure 2.7: Film compositions in Pulp Fiction (1994).

While compositions in still images, such as photographs and paintings, can be thoroughly planned, the creation of a composition in a film scene is additionally complicated by the dimension of motion. Through character and camera motion, compositions in film scenes become more dynamic and ever changing [96]. For example, camera motion is often required to follow or to lead characters and to make adjustments for motion in the frame (see for an example Figure 2.8).

(a) The scene is balanced between the three main characters: Rick and Ilse are separated by Ilse's husband, Victor. (b) As Ilse takes a step towards the two men ... (c) ...
the camera moves around to keep the balance of the scene and to anticipate the character movement ... (d) ... and to face Ilse and her husband leaving Casablanca together.
Figure 2.8: Balancing the scene using character and camera motion. The final scene at the airport in Casablanca (1942).

Contrast and color are further compositional elements. Human eyes are sensitive to (even small) differences and changes in contrast and color. On a dark background, brightly lit objects will stand out while the darker regions tend to fade, and vice versa. The same principle applies to color: a bright object on a subdued background draws the attention of the audience (see for an example Figures 2.2a-2.2b).

2.2.2 Continuity

The overall structure of a film is defined by its physical (e.g. shots) and logical (e.g. scenes) units. Even if the audience is aware of it, the film structure should not be presented too obviously. The art of filmmaking consists of creating a unity and a feeling of continuity [105]. In general, continuity refers to the coherence and clearness of the story [14]. It is achieved by maintaining the unity and coherence of manifold factors such as narration, drama, space, time, movement, action, and point of view. Since scenes set at the same time are sometimes shot several days apart, discontinuities can appear in the final version: items or actors may be misplaced, new details may appear or disappear, etc. (see Figure 2.9 for an example).

(a) In the opening scene there is a deep shadow over the roof of the bank building ... (b) as the camera switches to the robbers getting ready and sliding across to the roof ... (c) the shadow is completely gone.
Figure 2.9: Continuity error in The Dark Knight (2008): a scene shot at different times of day.

2.2.3 Motif

A motif is defined as any "recurring element in a motion picture that gains in dramatic importance through its repetition" [14]. It can be an object, a color, a place, a person, a sound, a movement, a pattern of lighting, a camera position, or even a story line. A motif can be easily recognizable, such as the ringing sound in Children of Men (2006) or the use of the color red in Run Lola Run (1998). However, in most cases, motifs require semantic understanding and rely on the experience and attentiveness of the audience. For instance, note the variety in the visual appearance of the clock motif in Run Lola Run (1998) in Figure 2.10.

(a) From the first clock in the opening ... (b) to the three clocks in the animated credit sequence, ... (c) the clock in Lola's room, ... (d) in the bank ... (e) and the casino, ... (f) to the clock Manni watches as he waits for Lola, ... (g) the watch of the old lady in the front of the bank, ... (h) and an aerial shot of a fountain looking like a clock.
Figure 2.10: Clock-motif in Run Lola Run (1998).

2.2.4 Rhythm, Tempo and Pace

There are no generally accepted definitions for the terms rhythm, tempo, and pace. Moreover, they are often confused and used interchangeably in the literature. Many sources refer to the terms with the assumption that the reader knows what is meant.

Zettl defines pace as the perceived speed of an event and tempo as the perceived duration [179]. Adams defines tempo as the filmic counterpart to the musical term, namely as "rate of delivery" [5]. Both definitions relate tempo/pace to the perceived time and speed in a film.
Zettl argues that perceived time is dependent on event density, which can be manipulated by camera and object motion and by "motion" induced by editing [179]. Encyclopedia Britannica adds further factors that can influence the tempo/pace that an audience senses in a film: "the actual speed and rhythm of movement and cuts within the film, the accompanying music, and the content of the story" [20].

The definition of rhythm in film analysis is very blurry. Beaver confuses rhythm and pace and defines pace as the rhythm of the film [14]. Bordwell and Thompson state that "the issue of rhythm in cinema is enormously complex and still not well understood" [18]. Mitry gives one of the most exhaustive discussions on rhythm in a film [105]. According to Mitry, the term rhythm is not to be confused with speed or related to any metric pattern. Moreover, rhythm is not made up of simple relationships of durations but rather of "relationships of intensity contained within relationships of duration". The intensity of a shot is influenced by the amount of movement in the shot and by the length of time it lasts. Thus, two shots of the same length may be perceived as longer or shorter depending on the dynamics of their content and the presentation of the content itself. The essential factor in rhythm is for Mitry "not actual duration itself but the impression of duration, it is this quality and it alone, not a predetermined metric length, which serves as a referent". The more static the content and the narrower the presentation, the less attention and the shorter perception time it demands, and the longer the shot appears to be.

The most cited example of a rhythmic montage is the Odessa Steps sequence from the silent film The Battleship Potemkin (1925) by the Russian filmmaker Sergei Eisenstein (e.g. [14, 105]). The sequence shows the massacre of the Odessa townspeople by the White Guards. A look at the shot length distribution shows no visible pattern behind the strongly perceived rhythm (see Figure 2.11). Eisenstein builds the sequence using a parallel montage – shots of the White Guards alternate with shots of the townsfolk and victims. The analysis of the parallel sequences on their own, even considering the shot content intensity, does not reveal any explicit rhythm (see Figures 2.12a-2.12b). The dramatic art of the sequence is additionally intensified by the employment of drastically long close-up and detail shots of victims in contrast to the short close-up shots of boots or rifles of the White Guards (see Figure 2.12c). The shots of the marching soldiers stand out against the rhythmic background of the town people. This example illustrates the challenge in the automated detection, measurement, and visualization of rhythm. The perception of rhythm is the product of content intensity and the narration itself. In this example, the rhythmic marching of the soldiers imposes the feeling of rhythm onto the shots of the crowd. Although both Figure 2.12b and Figure 2.12c may lead one to assume rhythm in the sequence due to the alternating ups and downs in the depicted distributions, their actual irregularity does not allow for the definition of a reliable detection algorithm.

Figure 2.11: Shot length distribution in the Odessa Steps from The Battleship Potemkin (1926). There is no visible pattern originating in the shot length alone.
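The limitation discussed above can be made concrete with a small sketch. Given detected cut positions for a sequence such as the Odessa Steps, the following example computes shot lengths and a naive pace curve (the reciprocal of the locally averaged shot length). It is an illustrative example under assumed parameters, not a measure used in this thesis, and it deliberately captures editing speed only, not the content intensity that Mitry identifies as essential for perceived rhythm.

```python
# Illustrative sketch: shot lengths and a naive "pace" curve from cut positions.
# Editing speed only; content intensity and narration are not modelled.
import numpy as np

def shot_lengths(cut_frames, total_frames, fps=24.0):
    """Shot durations in seconds from a sorted list of cut frame indices."""
    bounds = [0] + list(cut_frames) + [total_frames]
    return np.diff(bounds) / fps

def pace_curve(lengths, window=5):
    """Reciprocal of the locally averaged shot length: higher value = faster cutting."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(lengths, kernel, mode="same")
    return 1.0 / np.maximum(smoothed, 1e-6)
```

Applied to the Odessa Steps, such a curve would show the alternating ups and downs visible in Figure 2.11 but, as argued above, no regular pattern that a detector could rely on.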
(a) Sequence of the marching White Guards (a circle marks a detail shot and a square a long shot). (b) Sequence of the townsfolk (a rhomb identifies a detail shot and a triangle a long shot). (c) Rhythmic parallel montage (solid line depicts the shots of the White Guards, dashed line those of the town people; blue circles on solid lines and blue rhombs on dashed lines mark detail shots of the White Guards and town people respectively; red squares and red triangles the corresponding long shots).
Figure 2.12: Rhythm analysis of the sequence Odessa Steps from The Battleship Potemkin (1926).

2.3 perception of visual media

The term perception refers to "the ability to see, hear, or become aware of something through the senses" [99]. It is the joint product of sensory stimuli and the recognition and interpretation of those stimuli. While the physical aspect of the process is well defined and understood (e.g. how the human eye sees an object), the psychological aspect (e.g. how an object is interpreted by the human brain) still poses manifold challenges for computational media understanding. We do not present an exhaustive introduction into the topics of visual and sound perception; rather, we briefly address only the perception of the fundamental media aesthetic elements: light and color, space, time and motion, and sound.

2.3.1 Light and Color

The primary component of visual perception is light. Physical objects absorb and reflect light and thereby become visible for the human eye. The brightness of an object depends on (according to [10]):

1. the distribution of light,
2. the various optical and physiological processes in the human eye and nervous system, and
3. the capacity of an object to emit and reflect light.

The last factor is called luminance and is often mistaken for illumination. Illumination is a varying property of light. It refers to the amount of light (luminous flux per unit area) falling on a surface and is measured in lux or footcandles. By contrast, luminance is a constant property of any surface and refers to the light coming off the surface in the direction of the observer. It is measured in candela per square meter (cd/m²). In general, the percentage of light an object throws back remains the same. However, different objects reflect light very differently, from nearly perfectly specular surfaces such as some metal objects to perfectly absorbing surfaces such as any black object. Noteworthy is the fact that, even in the presence of varying lighting, the human eye sees objects in relation to the lighting sources of the whole setting and can quite well judge their brightness. Brightness is a subjective attribute of light as perceived by the human eye, i.e. it is perceived and not measured. Humans usually assign labels ranging from very bright to very dark. Since luminance is the measurable term that most closely corresponds to brightness, brightness is often referred to as perceived luminance.

Pure light contains all colors of the visible spectrum although it is perceived as colorless. When it hits an object, some colors are reflected and some are blocked. The reflected colors contribute to the viewer's perception of color. Color appearance is strongly affected by the context in which it is perceived: general lighting settings, surrounding objects and their colors, size of the object itself, etc.
The mechanisms of the human eye allow for the distinction of millions of different colors. Amazingly, however, the perceptual categories by which humans grasp the world develop from the simple to the complex [10]. There is no guarantee that two persons perceive or name a particular color exactly the same way. A more reliable distinction (and the simplest one) is between brightness and darkness. The number of colors that can be distinguished reliably is usually limited to very few colors – the primaries plus, sometimes, the secondaries connecting them. Human perception can easily identify subtly different shapes. However, it is very limited in the identification of a particular color by memory or at some spatial distance. Finally, human perception is more sensitive to color relations and contrast than to absolute luminance (see Figure 2.13 for an example of color contrast applied to draw the attention of the audience).

(a) The original scene from Tampopo (1985) applies color contrast to draw the attention of the audience to the widow at the food counter. (b) A change in the color settings of the widow's dress, i.e. less contrast to the surrounding objects, may lead to loss of the audience's attention in this crowded scene.
Figure 2.13: Color perception: contrast.

2.3.2 Space

Any object can be described within a three-dimensional space by means of its characteristics such as size, shape, position, motion, and direction. The space itself has attributes as well, for example depth, distance, location, direction, etc. In the course of time some of these characteristics may change. Hence, space perception refers to the study of the four-dimensional perceptual world (three-dimensional space plus time) experienced by an observer [10, 62].

Figure 2.14 visualizes some of the most essential factors for the perception of spatial depth in a two-dimensional frame. Blocking objects or persons are probably the strongest depth cue, dividing the frame space into multiple depth levels (from the blocking object in the foreground to occluded objects in the farther background). Furthermore, the farther away the objects are in the distance, the smaller they appear and the denser their textures become (note the size and texture gradients of the persons sitting at the table in Figure 2.14). Additionally, shadows and reflections provide information about the positioning of an object in the frame. Linear perspective, which is based on the observation that parallel lines converge to a common vanishing point, also supports the perception of depth (note the ceiling lighting and its reflection on the table). Finally, motion is another indicator for space and distance. Noteworthy in this context are the kinetic and the stereokinetic effects. The kinetic effect refers to the ability of the human visual system to reconstruct complex three-dimensional shapes or rigid objects from motion information rather than perceiving multiple single elements (e.g. lines, dots, etc.) in space. In contrast, the stereokinetic effect is a visual illusion: rotating two-dimensional shapes that are assembled in a specific way can create the illusion of a solid, three-dimensional object.

Figure 2.14: Strong space perception in The Dark Knight (2008).

The human ability to perceive people and objects as having normal size regardless of changes in the viewpoint or distance is called perceptual constancy.
Known objects are judged by experience while unknown objects are judged by putting them into context of a known reference or by judging the area the object occupies (see for an example Figure 2.15). Noteworthy in the context of space perception are also some pecu- liar characteristics of human perception. In general, pictorial elements are read from left to right (regardless from the way of writing). As consequence, any objects at the right side of the frame looks heavier or more important. A diagonal from the bottom left to the top right suggests an uphill while the opposite diagonal suggests a downhill. Finally, movement to the right is perceived as being easier than move- ment to the left. This phenomenon is known as frame asymmetry and has been subject to considerable academic controversy [10, 179]. 2.3 perception of visual media 23 (a) At the beginning of the scene the windows in the background ap- pear to be of ordinary size (judged by experi- ence). (b) As Kane appears in the background, the windows become gi- ant (judged by refer- ence). (c) Finally, as Kane moves in the foreground, the windows appear to be of ordinary size again. Figure 2.15: Space perception: size constancy. Citizen Kane (1941). 2.3.3 Time and Motion The term time perception refers to the awareness of time passing [21]. Time perception In contrast to e.g. color or motion perception, time perception is not a tangible attribute and does not have a direct trigger. Time is actually not perceived as such but rather changes or events in time. Ricard A. Block identifies four interacting factors that influence the perception of time [17]: 1. personality characteristics such as sex, interests, psychological and physiological state, and experience, 2. characteristics of the time period itself, i.e. number and com- plexity of the events happening, their modality, duration, and constellation, 3. activities during the time period such as attentional demands or performance of competing tasks, and 4. changes in time-related behaviors that occur when temporal judgment or estimation is required (simultaneity, rhythm, order, etc.). Closely related to the concept of time is motion perception since Motion perception motion is only possible within a given time span. Motion appeals to a basic human instinct and, thus, quickly attracts attention. In general, motion implies change in the surrounding environment and, thus, may require for an action. Such changes can have a positive conno- tation such as the appearance of a friend or a negative one such as an approaching danger. The lower limit for the detection of (isolated) motion is set to about 35ms temporally and about 1 min of arc spa- tially [148]. However, these values vary depending on the illumination, on the motion duration and velocity, and on the region of the retina stimulated. Noteworthy is the fact that humans are more reliable at 24 media aesthetics in film the detection of relative motion, i.e. the detection of a moving object (or group of objects) relative to another object(s) [130]. Two essential characteristics of motion are direction and speed. The perceived direc- tion of a motion is depending on the context in which the movement occurs and is subject to the law of simplicity (i.e. grouping similar things together) [10]. Following, the motion of e.g. a flock of birds is perceived as a global motion despite the individual moves of the wings. As stated above, the perception of motion is possible only within a limited range of speed (e.g. 
the sun seems to stand still although the earth is in permanent motion, or the quick move with a flash light appears to create a still lighted line). However, also the size of the moving object influences the perceived speed, i.e., in general, large objects seem to move slower than small ones. Finally, motion pictures allow for the experience of motion that is otherwise not perceivable by either accelerating or reducing the speed of recording [10]. This makes it possible to see a plant growing and dying within a minute or the cracking of a glass in thousands of bits for several seconds or even minutes. The perception of motion in a film is actually the result of an opti-Motion perception in motion pictures cal illusion. Moving pictures are in fact sequences from still images but perceived as continuous motion (see for an example Figure 2.16). Various theories exist about how this illusion comes about. An early theory that tries to explain the phenomenon of apparent motion is the persistence of vision. It basically refers to the after-images, i.e. images that still appear on the retina of the eye for a fraction of a second after the original image has ceased. This optical illusion goes back to Peter Mark Roget in 1824 and was later adopted by Terry Ramsaye in 1926 for the explanation of perceived motion in film. More recent works explore the question if different mechanisms react to apparent and to real motion. Apparent motion can be differentiated in short-range (i.e. motion over short distances and of brief duration, producing motion aftereffects) and long-range apparent motion (i.e. motion over long distances and of long duration, no motion aftereffects) [8]. Recent research suggests common perception of short-range apparent motion and those of real motion while there is a notable difference to the perception of long-range motion [8]. Motion pictures fall within the limits of the short-range category since the changes between consecu- tive frames are small enough. Hence, according to this theory, motion in film is transformed by the same physical mechanisms as the real continuous movement does. Cavanagh and Mather argue that the dif- ferences between short- and long-range motion processing are a direct result of different stimuli and are not the evidence for two different motion processes [26, 27]. Instead, the authors propose a classification based on first- and second-order motion stimuli. A first-order motion process is the result of displacements of first-order differences in lumi- nance and color (first-order statistics). Similarly, a second-order motion 2.3 perception of visual media 25 process responds to differences in second-order characteristics such as texture, contrast, motion or binocular properties. Figure 2.16: Motion perception in motion pictures: frame sequence from A man with a movie camera (1929). 2.3.4 Sound We are usually unaware of the amount of sound information that we are processing in our everyday experience: street noise, people talking, church bells, etc. Several aspects of sound, as we perceive it, are relevant to the film’s use of sound: loudness, pitch, and timbre [18]. They interact with each other to create the audio picture of a film, enable us to recognize characters, and shape our experience of a film as a whole. 2.3.4.1 Loudness Abrupt shifts in volume are often used to startle the audience of a movie which makes the estimation of a sound loudness an useful tool for the film analysis. 
The loudness of the sound describes its perceived intensity (from quiet to loud), i.e. loudness is a subjective quantity and cannot be measured directly [109]. In general, loud sounds have higher amplitude and softer sounds have lower amplitude. Hence, magnitude estimation can be applied to determine the relationship be- tween physical intensity and perceived loudness. However, sounds of the same intensity can appear to be differently loud due to the sensitivity of the human ear to different sound frequencies. This is indicated in the equal-loudness contours (see Figure 2.17). The figure illustrates two essential characteristics of the dependance of loudness upon its frequency in conjunction with the sound intensity. Firstly, the sensitivity to loudness drops significantly below approx. 350Hz and above approx. 15kHz. Secondly, the dependance of loudness upon frequency is different at different intensity levels (note the change of the curves shape as the intensity increases). Another two factors that can influence the loudness of a sound are its bandwidth and its duration. Two sounds of the same intensity but different spectra can appear differently loud to the audience. Above a critical bandwidth of 160Hz at a centre frequency of 1kHz, loudness increases with in- creasing bandwidth [45]. Below the critical bandwidth, changes in the bandwidth (at fixed overall intensity level) does not result in a change in loudness. Finally, the loudness of a sound is time dependent: the 26 media aesthetics in film loudness decreases for durations smaller than 100ms while for longer durations the loudness is almost independent of duration [45]. Figure 2.17: Equal-loudness contours (from [109]). 2.3.4.2 Pitch Similar to loudness, pitch is a perceptual sound attribute that allows for the ordering of sounds on a scale from low to high. In a film, pitch supports the differentiation among different sounds and objects. It can also be used to create a joke (for example, a young boy trying to speak in a deep voice). Changes in the pitch are often related to changes in the shots (e.g. from a medium shot to a close-up) [18]. In general, pitch is closely related to frequency – large frequencies result in a higher pitch than low frequencies. Pitch also depends on sound pressure level. In general, increasing the intensity of a sound decreases its pitch for low frequencies (below approx. 2kHz) and increases its pitch for high frequencies (above approx. 4kHz) [109]. The presence of additional sounds, partial masking, can also influence the perception of pitch shifts. In average, additional sounds with a lower frequency than the original (test) sound results in a positive pitch shift, while partial masking produced by sounds at a higher frequency than that of the test sound yields a negative pitch shift [45]. 2.3.4.3 Timbre In the psychoacoustics timbre is also known as sound color or sound quality. It depends mainly on the harmonic components of the sound 2.4 computational media aesthetics 27 or the relative number of overtones. The American Standards Asso- ciation (ASA) defines timbre as "that attribute of sensation in terms of which a listener can judge that two sounds having the same loud- ness and pitch are dissimilar" [1]. The timbre of a sound allows for the recognition of familiar sounds and, moreover, for the differentia- tion of musical instruments (for example, if a specific note is played on a piano or a guitar). 
Timbre is a multidimensional perceptual attribute of sound that is closely related to the excitation pattern of that sound [109].

2.4 computational media aesthetics

The term computational media aesthetics was first defined by Chitra Dorai and Svetha Venkatesh in 2001 [111]. It comprises the algorithmic study of media elements and the analysis of their application for "clarifying, intensifying, and interpreting an event for the audience". Research in the field of computational media aesthetics is performed mostly by the group of the original authors, see e.g. [3, 4, 5, 6, 39, 106, 107, 108, 152, 154]. However, earlier works (e.g. [7, 122]) as well as more recent works (e.g. [125]) also explore film editing techniques as a basis for film annotation and retrieval, although they do not explicitly refer to the term computational media aesthetics. For example, Radev et al. propose a general film model that comprises structural, semantic, and syntactic elements of film [122]. The model represents a conceptual schema graph whose nodes are the basic film structural elements and features derived from an analysis of film theory. A more practical approach is proposed by Aigrain et al. for the detection of macro video segments or scenes [7]. The authors employ a set of "medium knowledge-based" rules that account for shot transitions, shot repetition, contiguous shot settings, shot length, soundtrack (music detector), and camera motion. Although the authors present a thorough exploitation of explicit film grammar, they report no evaluation results. Recently, Rasheed et al. exploited a set of low-level features including average shot length, color variance, motion content, and lighting key for the genre recognition (comedy, action, drama, horror) of film previews [125]. Experiments with over a hundred film previews show that the combination of visual cues with cinematic principles is a powerful tool for genre classification. Depending on the retrieval task, existing research exploring film editing techniques usually combines the analysis of several media elements. Examples of research work focused on a single media element are the exploitation of color and sound. Based on the assumption that color is closely related to mood, Truong et al. cluster scenes that share the same color information [154]. Furthermore, they detect "color events", i.e. shots in which the filmmaker intentionally uses color to draw the attention of the audience. The authors hypothesize that the rarer a color composition is in a film, the more strongly it attracts the attention of the audience. Moncrieff et al. propose a method for the automated detection of affective sound events in a film based on the dynamics of the sound energy of the audio track [106, 107, 108]. Despite the subjective nature of the affects associated with sound energy events, experiments on four horror movies confirm their correlation. Pfeiffer et al. explore audio editing practices for scene determination [120]. The authors argue that it is possible to identify scenes which are created through sound overlaps, and explore the change patterns of audio features to automatically determine relations between consecutive shots and group them into scenes. Various feature combinations that exploit editing techniques have been proposed recently. The majority of the explored applications are focused on affective film analysis and automated genre recognition.
Dominant features are average shot length, color and motion informa- tion, see e.g. [34, 125, 141, 149, 152, 161, 162, 166]. For the determina- tion of scene types (e.g. dialog, action) several authors additionally explore the montage patterns of the corresponding scenes [31, 32, 169]. Finally, film tempo has been explored by means of shot length and motion information for the localization of dramatic events and section boundaries, e.g. [4, 5, 6]. 3 M E D I A F E AT U R E S If I could have expressed the same in a song, I would have written a song instead. — Bob Dylan In computer vision, media representation involves the selection of Aim of the chapter those features that best describe relevant aspects of media content. In general, there are two broad approaches for media representation: global or local features. Global descriptors represent media by a sin- gle feature computed from the whole image. On the opposite, local features represent media by a set of descriptors computed at some points of interest in the corresponding media. In this chapter, we in- troduce global, Section 3.1, and local features, Section 3.2, that have been successfully applied in recent research for the representation and retrieval of visual media. Since motion detection and description can be based on either global or local features, motion features are discussed separately in Section 3.3. In this chapter, we do not aim at a comprehensive coverage of existing features in computer vision but focus on those set of features that are addressed through performed experiments and evaluations discussed in the remaining chapters. 3.1 global features Global feature based approaches characterize media by their entire image. Despite the low computational costs and compact representa- tion, such approaches (usually) possess limited application due their sensitivity to local changes, such as occlusion, illumination changes, etc. In general, such approaches are only applicable for the recognition of rigid objects and often require preliminary segmentation. 3.1.1 Color features Color analysis can be performed for any kind of color spaces, such as Red-Green-Blue (RGB) and Hue-Saturation-Value (HSV). For monochro- matic images the intensity (grey-level) information is usually explored. Color histograms represent the distribution of colors in an image. Histograms In general, histograms are robust to rotation but not to illumination changes and occlusion. Furthermore, global histograms contain no spatial information and, thus, different images can share the same color distribution even if they are semantically not related (see for an 29 30 media features example Figure 3.11). Applications of color histograms comprise e.g. image retrieval [139], object tracking [68] and image fingerprinting [52]. To introduce spatial information images are often divided into M×N parts. Following, each part can be described separately. (a) A shot from Pulp Fiction (1994) ... (b) ... and one from Run Lola Run (1998) ... (c) ... and the corresponding color (RGB) histograms for the shot from Pulp Fic- tion ... (d) ... and those from Run Lola Run. Figure 3.1: An example for similar color histograms despite different shot content. A different way of incorporating spatial information into the color histogram is the Color Coherence Vector (CCV) proposed by PassCCV et al. [119]. CCVs split a color histogram into two parts: a coherent vector and a non-coherent vector. A pixel is regarded as coherent if it belongs to a large region of the same color. 
A CCV of an image is the combination of the histograms over all coherent and all incoherent pixels of the image. Further color descriptors that consider spatial information are color correlograms and color patches. The correlogram expresses the spatialCorrelograms correlation of pairs of colors [65]. The autocorrelogram is a variation of the color correlogram which considers the correlation between identical colors only [65]. Color patches measure the coarse color dis-Color patches tribution of an image. The image is divided into M×N regions and for each region the mean intensity is computed [59]. Recently, Li et al. proposed a semi-global feature, Markov Stationary Features (MSF),MSF that has been shown to be effective for the task of TRECVID2 concept 1 The similarity between the presented color histograms is measured using histogram intersection (see Section 3.3, equation 3.3). 2 http://trecvid.nist.gov/. 3.1 global features 31 detection [86]. MSF extends the histogram-based features by character- izing the spatial co-occurrence of histogram patterns using Markov chain models. It therefore encodes spatial structure information both within and between histogram bins. Color moments are based on the assumption that the color distri- Moments bution of an image can be interpreted as a probability distribution and, thus, can be characterized by its moments [145]. Since the low order moments capture most of the color distribution, usually, only the first three moments are used (mean, standard deviation, and skewness). Similar to the color histograms, color moments are sensi- tive to illumination changes and partial occlusions. Color moments have been successfully applied e.g. for image and video indexing and retrieval [133, 145], and cut detection [51]. Li et al. detect and track dominant color objects in the HSV color Dominant color regionsspace [89]. The authors represent each shot by a dominant color histogram based on detected objects depending on their temporal duration. A more robust approach to illumination changes are meth- ods based on color ratio gradients [55, 64]. Color ratio gradients are Color ratio gradients derived from the color constant ratios of corresponding (key-) frames. Due to its insensitiveness to object position and geometry, shadows, and illumination changes, the approach has been successfully applied for shot and object representation. Finally, MPEG defines five color feature descriptors: Dominant Color, MPEG-7 Scalable Color, Color Structure, Color Layout, Group of Frames (GoF)/ Group of Pictures (GoP) Color [66]. The first three represent the color distribution in an image. The later two descriptors describe the color relation between sequences (or groups) of images. The Dominant Color Descriptor describes the representative colors in an image or image region. It allows for the specification of a small number of dominant color values and their statistical properties such as distribution and variance. The Scalable Color Descriptor is a color histogram in the HSV color space encoded by a Haar transform. Similarly, the Color Structure Descriptor is also based on color histograms. However, it uses a small structuring window to identify localized color distribution. The exten- sion of the descriptor to a group of frames or a group of images is the GoF/GoP Color Descriptor. Finally, the Color Layout captures the spatial layout of the representative colors in an image. The representation is based on the coefficients of the Discrete Cosine Transform (DCT). 
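As a concrete illustration of the color moments described above, the following minimal sketch computes the first three moments (mean, standard deviation, and skewness) per channel; the optional grid parameter mimics the M×N sub-image division mentioned earlier for adding coarse spatial information. The function name and parameters are chosen for illustration only and are not taken from the cited works.

import numpy as np

def color_moments(image, grid=(1, 1)):
    """First three color moments (mean, standard deviation, skewness)
    per channel, optionally computed over an M x N grid of sub-images."""
    h, w, c = image.shape
    rows, cols = grid
    feats = []
    for r in range(rows):
        for s in range(cols):
            block = image[r * h // rows:(r + 1) * h // rows,
                          s * w // cols:(s + 1) * w // cols]
            pixels = block.reshape(-1, c).astype(np.float64)
            mean = pixels.mean(axis=0)
            std = pixels.std(axis=0)
            # Signed cube root of the third central moment (skewness indicator).
            skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
            feats.extend(np.concatenate([mean, std, skew]))
    return np.array(feats)

For an RGB image and a 1×1 grid this yields a nine-dimensional descriptor; comparing two images then reduces to measuring a distance between such vectors.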
MPEG-7 color descriptors have been successfully applied (mostly in combination with other features) for video clip matching [16]. In an evaluation of the performance of the MPEG-7 color descriptors Evaluations in a visual surveillance scenario the Color Structure outperformed the remaining descriptors in terms of Average Normalized Retrieval Rate (ANMRR) [9]. Similarly, an empirical evaluation of the MPEG-7 color descriptors and color correlograms [65] shows the superior per- formance of the Color Structure Descriptor in the retrieval of semantic 32 media features image categories in terms of recall and precision [116]. Recently, Van de Sande et al. investigated the invariance (to intensity, color changes and shifts) and the distinctiveness of different color descriptors [158]. The authors compared the performance of histogram-based descriptors, color moments and color moment invariants, and color descriptors based on Scale Invariant Feature Transform (SIFT). The performed experiments reaffirm that global color descriptors alone are often not sufficient for content-based retrieval. Despite their partial invariance to illumination changes histogram- and moment-based descriptors are clearly outperformed by the color-enhanced local descriptors on both image and video category retrieval. 3.1.2 Edge features Edge features can represent both shape and texture characteristics of visual content and are commonly used for the detection of region boundaries [43, 144], text areas [134, 171], shot boundaries [60, 138], etc. In general, an edge describes intensity variation in close surroundings of a pixel. The computation of edges is based on gradients: edge magnitude is the magnitude of the gradient, and the edge direction is perpendicular to the gradient direction. The MPEG-7 edge histogram describes the local distribution of ori-MPEG-7 entations of edges and distinguishes between five types of edges: horizontal, vertical, 45-degree diagonal, 135-degree diagonal, and non- directional edges (see Figure 3.2) [66]. Each image/frame is divided into 4 × 4 non-overlapping sub-images and for each sub-image a local edge histogram with 5 bins for the corresponding edge types is computed. MPEG-7 edge histogram has been proved to be effective for image similarity retrieval [97] and – in combination with further visual descriptors – for video shot boundary [138] and copy detec- tion [16]. A general drawback of edge features is their sensitivity to illumination changes. Figure 3.3 illustrates the influence of changes in the illumination on the corresponding MPEG-7 edge histograms. (a) horizontal (b) vertical (c) 45-degree (d) 135-degree (e) non- directional Figure 3.2: MPEG-7 edge types [66]. Line detection is an allied area to edge detection. It is usually appliedLine detection for the recognition of specific shapes (e.g. sport fields, scratches, wires, etc.), camera orientation, vanishing points, etc. Conventional methods 3.1 global features 33 (a) (b) (c) (d) Figure 3.3: An example for differences in MPEG-7 edge histograms due to illumination changes: two frames from two different film compi- lations and the corresponding MPEG-7 edge histograms. for line detection are usually based on a Hough transform [11]. Such methods extract lines exceeding predefined length and gap thresholds (see for an example Figure 3.4a). Depending on the selected thresh- olds and present textures in the image, such methods can lead to a significant number of falsely detected or missed lines. 
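The threshold dependence discussed above can be made explicit with a small sketch of a Hough-based pipeline. OpenCV is assumed here purely as an implementation vehicle, and the minimum length and maximum gap parameters are exactly the thresholds whose choice leads to false or missed detections.

import cv2
import numpy as np

def detect_lines(frame_bgr, min_length=50, max_gap=10):
    """Probabilistic Hough transform on a Canny edge map.
    min_length and max_gap are the length and gap thresholds that
    govern the trade-off between false and missed lines."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)  # edge detection (Canny operator)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 80,
                            minLineLength=min_length, maxLineGap=max_gap)
    # Each detected line is returned as its two end points (x1, y1, x2, y2).
    return [] if lines is None else [tuple(l[0]) for l in lines]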
In general, line detection is not necessarily based on preceding edge detection. Burns et al., for example, present an algorithm based only on the gradient orientations at each pixel [24]. Although the algorithm is able to extract low contrast lines, it still depends on fixed thresholds. The method is improved by von Gioi et al. [160] who propose a linear-time Line Segment Detector (LSD) which is a combination of Burn’s method and the meaningful alignments by Desolneux et al. [38]. This method is very fast and accurate with a minimal amount of falsely detected or missed lines (see Figures 3.4b-3.4c). Line detection approaches aim at the detection of "real" lines. However, in some situations humans perceive lines where there are none. In the discussed example in Figure 3.4, lines are defined by the lightings on the ceilings and the corresponding mirroring on the table. Such lines are essential for the perception of depth in the scene space (or a vanishing point in the back of the scene). A possible solution includes the consideration of 34 media features present symmetry, e.g. by measuring the phase symmetry of points in an image [76] (see Figures 3.4d-3.4e). (a) Simple line detection method based on the Hough transform [11]. Top left: edge detection using the Canny operator [25]. Right: Hough transform of the edge image. Bottom left: detected lines by finding peaks in the Hough transform matrix. (b) Line detection and grouping using LSD [160] and J-Linkage [147]. Each line class is represented by a different color [46]. (c) The dominant line class (in red) indi- cates a vanishing point sideways in the scene. (d) Preceding phase symmetry detection [77] ... (e) ... allows for the recognition of the vanishing point in the depth of the scene. Figure 3.4: Line detection in a scene from The Dark Knight (2008). The Original frame is depicted in Figure 2.14. 3.2 local features Opposite to global features, local features do not describe entire media but only a local neighborhood around a set of salient (interest) points. Following, local features prove higher reliability in the recognition of media parts (e.g. objects) despite significant changes such as clutter, occlusion, variations in the illumination, etc. In general, the common process workflow for a local feature-based application scenario passes three well-defined stages. First, salient points in both model and test image are identified. Second, their local characteristics are captured by 3.2 local features 35 (invariant) feature descriptors. Finally, a matching strategy ascertains if two feature vectors belong to the same keypoint in both images. In the following, we briefly present related work for all three stages. 3.2.1 Interest Point Detectors A wide variety of interest point detectors exist in the literature. In the following we briefly present the most recent ones. For a thorough survey and evaluation of the performance of interest point detectors in the context of repeatability please refer to [102, 129]. The Harris detector is based on the use of an auto-correlation ma- Harris-based detectorstrix [61]. Interest points are detected if the matrix has two significant eigenvalues. Harris points are geometrically stable and reliable in the presence of image rotation, illumination and viewpoint changes [129]. However, the performance of the detector fails when the image reso- lution changes significantly. To adapt the Harris detector to the scale factor, Mikolajczyk et al. propose the Harris-Laplace detector [100, 103]. 
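The basic, single-scale Harris measure described above can be sketched as follows; OpenCV's corner response function and the relative threshold are assumptions made for this illustration.

import cv2
import numpy as np

def harris_keypoints(gray, block_size=3, aperture=3, k=0.04, rel_thresh=0.01):
    """Harris corner response derived from the local auto-correlation matrix.
    A pixel is kept when its response (a proxy for two large eigenvalues)
    exceeds a fraction of the global maximum response."""
    response = cv2.cornerHarris(np.float32(gray), block_size, aperture, k)
    ys, xs = np.where(response > rel_thresh * response.max())
    return list(zip(xs, ys))

Because the response is computed at a single, fixed neighborhood size, the detected points are not repeatable across strong changes in image resolution; this is the limitation the Harris-Laplace detector addresses.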
It selects corners at location where a Laplacian attains extrema in scale space. The Harris-Affine detector additionally extends Harris- Laplace by using a second moment matrix to achieve affine invari- ance [101, 103]. The Hessian-Laplace detector searches for local maxima of the Hes- Hessian-based detectorssian determinant and selects characteristic scale via a Laplacian [103]. Hessian-Laplace obtains a higher localization accuracy in scale space as Harris-Laplace. Laplacian scale selection acts as a matched filter and works better on blob-like structures than on corners since the shape of a Laplacian kernel fits to the blobs. Similar to Harris-Affine, the Hessian-Affine detector extends Hessian-Laplace to achieve invariance to affine transformations [103]. The affine neighborhood is determined by an affine adaptation process based on a second moment matrix. Bay et al. introduced recently a further detector based on the Hessian matrix – the Fast-Hessian detector [13]. It approximates a Gaussian second order derivative with box filter. Image convolution with the box filter is computed rapidly using integral images. Maximally Stable Extremal Regions (MSER) is a watershed-based al- MSER gorithm based on intensity values. The algorithm detects connected intensity regions below and above a certain threshold and selects those which remain stable over a set of thresholds [98]. Tuytelaars et al. proposed two region detectors [156]. The first detec- tor, the Geometry-based region (GBR) detector, is an edge-based method. GBR It starts from points detected using the Harris approach and uses the nearby edges. Two nearby edges – which are required for each point, limit the number of potential features. A parallelogram region is bound by these two edges. The parallelogram is determined by several intensity based function. The second detector, the Intensity- based region (IBR) detector, is an intensity extrema based algorithm. It IBR 36 media features investigates the intensity profiles along rays going out of the local extremum. An ellipse is fitted to the region determined by significant changes in the intensity profiles. The Difference-of-Gaussian (DOG) detector was introduced by LoweDOG as keypoint localization method for the SIFT approach [91, 92]. Interest points are identified at peaks (local maxima and minima) of Gaussian function applied in scale space. All keypoints with low contrast or keypoints that are localized at edges are eliminated using a Laplacian function. A common criticism to edge-based methods is that it is more sen-Limitations sitive to noise and changes in neighboring texture. Interest point detectors which are less sensitive to changes in texture perform well in a classification scenario since they recognize and capture those features that are common for all instances in a given class. On the opposite, identification relies on those features that are unique for a given object. Thus, the question arises: How can we measure the distinc- tiveness of an interest point? Schmid et al. defines information content as measure of the distinctiveness of an interest point [129]. It is based on the characteristics of the local shape of the image at the interest point. If all descriptors lie close together, they do not convey any information, that is the information content is low. Thus, matching would fail since any point can be matched to many others. However, we cannot simply ignore features with low information content. 
For example, in a shoeprint or coin classification scenario, there is a large number with similar (or even equal) local appearance. In spite of the low (local) information content, single descriptors can play an essential role in the global context. The main limitation of local features turns out to be their locality. Moreover, methods which detect most interest points do not necessarily perform the best. First, we are faced with the problem of overfitting (i.e. a single interest point can be matched to many others). Second, since misleading features may appear (e.g. due to background changes), the information captured per interest point plays an essential role. Thus, in the next section we give a short overview of state-of-the-art local feature descriptors. 3.2.2 Local Descriptors Given a set of interest points, the next step is to choose the most appropriate descriptor to capture the characteristics of a provided region. Different descriptors emphasize different image properties such as intensity, edges or texture. In the following, we focus on four descriptors which show outstanding performance with respect to changes in illumination, scale, rotation and blur. For a thorough survey on the performance of local feature descriptors please refer to [102]. Lowe introduced the SIFT descriptor which is based on gradientSIFT 3.2 local features 37 distribution in salient regions – at each feature location, an orienta- tion is selected by determining the peak of the histogram of local image gradient orientations [92]. Subpixel image location, scale and orientation are associated with each SIFT feature vector (4× 4 location grid × 8 gradient orientations). SIFT features show outstanding per- formance in existing evaluations. However the main drawback and critical point is their computation time. Two SIFT modifications – Fast Approximated SIFT and Principal Components Analysis (PCA)-SIFT – claim on approximating or even outperforming accuracy and faster matching. The Fast Approximated SIFT descriptor is accelerated by the use of an integral orientation histogram [57]. The authors evaluate their approach on a data set of fifteen images and report a speed up by a factor of eight while the matching and repeatability performance is slightly decreased. PCA-SIFT aims at the reduction of the descriptor dimensionality by applying a PCA to the scale-normalized gradient patches [72]. Presented evaluation results on the INRIA Graffiti data set show that the PCA-SIFT descriptor performs slightly worse than SIFT. Finally, since the SIFT descriptor is not invariant to light color changes, several color-based SIFT descriptors have been introduced recently [2, 19, 23, 54, 157, 159]. Van de Sande et al. present a comple- mentary evaluation on the performance of several color-based features in respect to light intensity changes and light intensity shifts [158]. The authors show that SIFT-based descriptors outperform histogram- and moment-based features on both image and video recognition. In the presented evaluations, the most promising descriptor for the task of category recognition without any a priori knowledge is OpponentSIFT. OpponentSIFT describes all of the channels in the opponent color space using SIFT descriptors. Mikolajczyk and Schmid propose another extension of the SIFT GLOH descriptor – Gradient Location and Orientation Histogram (GLOH) – de- signed to increase the robustness and distinctiveness of the SIFT de- scriptor [102]. 
Instead of dividing the path around the interest points into a 4× 4 grid, the authors divide it into a radial and angular grid. A log-polar location grid with 3 bins in radial and 8 bins in angular directions is used. The gradient orientations are quantized into 16 bins which gives a 272 bin histogram further reduced in size using PCA to 128 feature vector dimension. Belongie et al. introduce Shape Context as feature descriptor for Shape context shape matching and object recognition [15]. The authors represent the shape of an object by a discrete set of points sampled from its internal or external boundaries. Starting points for the presented approach are edge pixels as found by the Canny detector [25]. Following, for each point the relative location of the remaining points is accumulated in a coarse log-polar histogram. Mikolajczyk et al. extend the original shape context to capture orientations [102]. Hence, the obtained fea- 38 media features ture vector is of dimension 36 (location is quantized into 9 bins and orientation into 4 bins). Speeded Up Robust Features (SURF) are fast scale and rotation invari-SURF ant features [13]. The descriptor captures distributions of Haar-wavelet responses within the neighborhood of an interest point. Each feature descriptor has only 64 dimensions which results in fast computa- tion and comparison (4× 4 location grid × 4 wavelet responses in horizontal and vertical direction). Schmid et al. performed a complementary evaluation on the perfor-Evalluations mance of local descriptors with respect to rotation, scale, illumination, and viewpoint change, image blur and Joint Photographic Experts Group (JPEG) compression [102]. In most of the tests SIFT and GLOH clearly outperformed the remaining descriptors: shape context, steer- able filters, PCA-SIFT, differential invariants, spin images, complex filters, and moment invariants. Furthermore, Stark and Schiele re- port that the combination of Hessian-Laplace detector with SIFT and GLOH descriptor outperform local features such as Geometric Blur, k-Adjacent Segments and shape context in an object categorization scenario [143] . For their evaluation the authors used three different data sets containing quite distinguishable objects such as cup, fork, hammer, knife, etc. On the contrary, we performed an evaluation of the classification and identification power of various combinations of interest point detectors and local feature descriptors on a dataset of ancient coins bearing large intra- and small interclass variations [71]. The achieved results show the overall outstanding performance of DOG and SIFT in both identification and classification tests closely fol- lowed by Hessian-based detectors in a combination with the Shape Context descriptor. The main critical point of the SIFT features is their computation time. However, since feasibility and not time is the focus of the experiments performed within the scope of this work, we apply SIFT as thecurrently most reliable local feature. More recently, spatial local features have been extended to theSpatio-temporal features temporal dimension. Spatio-temporal features find application in event and action recognition [80, 115, 163], video summarization [79], video signatures and video copy detection [82, 85, 164]. Laptev extends the scale-invariant Harris-Laplace interest point detector by a 3× 3 spatio- temporal second-moment matrix [80]. 
The author applies an iterative method to find points that are both spatial maxima of the Harris corner detector [61] and extrema of a scale-normalized Laplacian over scale. Recently, Leon et al. proposed a method for video signatures based on video tomography [85]. Video tomography captures spatio-temporal changes and is, thus, a measure for local and global motion in videos. A tomography image is composed by taking a fixed line from each frame, e.g. at the center of the frame, and arranging these lines from top to bottom to create an image (shot signature). Law-To et al. report outstanding performance of differential-based descriptors on video copy detection in comparison to ordinal intensity signatures and Laptev's spatio-temporal features [83]. The differential descriptors are computed in three steps [82]. First, keypoints are identified using the Harris corner detector [61] in each frame of a shot. Then, local features are computed at four spatial positions around each keypoint as the Gaussian differential decomposition of the grey-level signal I up to second order:

f = \left( \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}, \frac{\partial^2 I}{\partial x \partial y}, \frac{\partial^2 I}{\partial x^2}, \frac{\partial^2 I}{\partial y^2} \right). (3.1)

The resulting feature vector is a 20-dimensional descriptor. Finally, keypoints are associated across frames to build trajectories that are represented by the average descriptors. Additionally, the authors assign a behavior label to each trajectory and distinguish between moving and motionless points.

3.2.3 Matching Strategies

The Euclidean distance is a widely used measure to determine whether two feature vectors belong to the same keypoint in different images [13, 72, 92, 102, 103]. Other techniques, such as the Mahalanobis distance, can also be applied [12, 47, 100, 101, 102, 150]. However, the Mahalanobis distance bears two main disadvantages. First, it applies a single covariance matrix to all interest points. Second, it is based on the assumption that the errors of the descriptors follow a normal distribution; an assumption that is verified neither theoretically nor experimentally. Terasawa et al. show in experimental tests that the distribution of rotation-invariant descriptors is not normal [150]. Given a set of distances between corresponding keypoints, a very simple matching strategy is to accept all matches whose distance falls below a predefined threshold: threshold-based matching (e.g. [102]). Adjusting the threshold allows for the selection of an appropriate trade-off between false positive and false negative matches. An essential characteristic of this approach is that a descriptor can have several matches. Figure 3.5 illustrates the challenge. Given the two ancient Greek coins, the depicted owl bears similar characteristics on both its breast and back. The provided descriptor results in a total of 9 matches, all of which are locally correct. However, there is only one correct match in the global context. Furthermore, multiple descriptors in the training image can be matched against the same descriptor from the test image. The misleadingly high number of matching descriptors penalizes recognition algorithms which rely on the total number of matches. Thus, a further processing step, such as cross correlation [101, 136], is required to increase the reliability of detected matches. Nearest neighbor matching is a strategy to limit the number of possible matches.
A descriptor is matched against its nearest neighbor only if the distance between them is below a given threshold [72, 102, 103]. Nearest neighbor ratio matching additionally considers the distance to the second nearest neighbor [12, 13, 92, 102]. Again, there can be multiple matches when different descriptors from the test image are matched against the same descriptor from the training image (see for example Figure 3.5b). To overcome this problem, one can either ignore all ambiguous matches (e.g. [136]) or keep the one with the lowest distance. Both approaches lead to a loss of potentially stable matches. More recent matching strategies use methods of geometric filtering based on the local spatial arrangements of the matched descriptors [101, 128, 136], or on multiple view geometric relations [47]. A limiting assumption of these strategies is that model and test image are related by a homography or by epipolar geometry.

(a) A single descriptor from the test image is matched against multiple descriptors from the training image, and (b) vice versa [90]. Figure 3.5: Multiple matching of local features.

3.3 motion features

Motion analysis comprises a set of research tasks of diverse complexity. While the detection of the presence of motion in a temporal sequence of images or a video sequence is the simplest task, the distinction between camera and object motion is a challenging task that cannot yet be solved reliably. Recently, a feature describing the motion content of a shot based on histogram intersection has been applied for shot similarity measurement [124, 175]. The motion content feature MF for a shot s is defined as:

MF_s = \frac{1}{N} \sum_{f=1}^{N} \bigl( 1 - \mathrm{HistInter}(f, f+1) \bigr), (3.2)

where N is the number of frames in the shot and the histogram intersection HistInter for any two frames (or still images) x and y is defined as [146]:

\mathrm{HistInter}(x, y) = \sum_{b \in \mathrm{bins}} \min\bigl( H_x(b), H_y(b) \bigr). (3.3)

In general, the most basic motion detection method is frame subtraction (given a stationary camera position and constant illumination). The difference frame f_d for two frames at time points s and t can be defined as a binary image whose non-zero elements represent areas with substantial motion:

f_d(i, j) = \begin{cases} 0 & \text{if } |f_s(i, j) - f_t(i, j)| \leq \varepsilon, \\ 1 & \text{otherwise,} \end{cases} (3.4)

where ε is a predefined threshold (a small positive number) [142]. To gather information about the direction of motion, a cumulative difference frame can be constructed:

f_d^{\mathrm{cum}}(i, j) = \sum_{k=1}^{n} a_k \, |f_1(i, j) - f_k(i, j)|, (3.5)

where a_k is the significance weight for the corresponding frame k. While a difference frame carries information about motion presence, the motion characteristics that can be extracted from it are not very reliable [142]. Another approach to motion analysis is optical flow computation [22, 63, 93]. Optical flow aims at the establishment of the motion field3 for a given video sequence and results in the determination of motion direction and motion velocity at (possibly) all image points. In general, optical flow computation is based on the assumptions of constant illumination and relatively small motion between consecutive frames. Furthermore, accurate optical flow estimation is computationally expensive [22]. A possible solution to these limitations is the determination of a sparse motion field by means of tracking interest points instead of all image points.
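A minimal sketch of such sparse interest point tracking between two consecutive frames is given below. The use of OpenCV's pyramidal Lucas-Kanade tracker and the chosen parameter values are assumptions of this illustration, not part of the cited approaches.

import cv2
import numpy as np

def sparse_motion_field(prev_gray, curr_gray, max_points=200):
    """Track corner points between two consecutive frames and return
    per-point displacements (a sparse approximation of the motion field)."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_points,
                                 qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return np.empty((0, 2)), np.empty((0, 2))
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    good = status.ravel() == 1
    start = p0.reshape(-1, 2)[good]
    motion = p1.reshape(-1, 2)[good] - start  # displacement per tracked point
    return start, motion

Aggregating the magnitudes and directions of the returned displacement vectors gives a rough estimate of motion velocity and direction without computing a dense flow field.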
3 A motion field is the (ideal) two-dimensional representation of three-dimensional motion where each image point is assigned a velocity vector [142]. Part II F I L M A N A LY S I S 4 V I S U A L - B A S E D C O M P U TAT I O N A L M E D I A A E S T H E T I C S It is frequently at the edges of things that we learn most about the middle ... — Walter Murch [110] The first part of this work provided a basic understanding of the Aim of the chapter elements of media aesthetics in filmmaking that most influence the production and perception of a movie. Additionally, we discussed these elements from a visual perception point of view and presented existing approaches of computer vision for their analysis and represen- tation. In this chapter, we present a direct linkage/mapping between the previously addressed media aesthetic elements, their application in filmmaking by means of well-established film techniques, and existing approaches in computer vision. This view of media analysis allows for the exploration and identification of three areas in the domain of automated film analysis and understanding. The first area comprises research tasks that have been subject to thorough research in the last years such as scene recognition or that have even been declared as solved by the research community such as the detection of shot cuts. The second area covers research tasks that are not directly solvable for a fully automated computer vision approach. Such tasks require additional knowledge, e.g. about the shape and structure of an object for the detection of the camera angle or about the normal movements of an object for the distinction between fast and slow motion. Finally, the third area identifies research tasks that are still open in the context of automated film analysis and understanding. The exploration of the last area and the discussion of initial approaches for the presented research tasks is the main focus of this chapter. While film analysis and media understanding is still in the early stages of development, the focus of existing approaches slowly shifts from basic research tasks such as the identification of meaningful video clips for digital preservation to advanced film analysis such as scene and event detection, genre and emotion recognition, etc. Currently, semantics is explored mostly by means of incorporated metadata (e.g. title, synopsis, script, manually generated metadata, etc.) [56, 168] or available domain knowledge (e.g. events in sport videos or the fact that a scream belongs to a horror or an action scene rather than to a romantic one) [151, 165]. Such information originates in the final product, i.e. the study of the characteristics of a particular domain (e.g. news or sport videos) or the comparison amongst different film types 45 46 visual-based computational media aesthetics (e.g. comedy and horror movies). Following, while existing approaches predominantly explore the question what can we learn from the film as finished product, we aim at the study of the origins of a film and the filmmaking process as a source for high-level information and semantics. 4.1 film factor space As already discussed in Chapter 2, it is the combination of and inter- action among manifold media aesthetic elements that exert influence on audience and film critics. The (way of) story telling is significant for how well a movie is perceived by the audience. In general, media aesthetic elements can be grouped into three categories according to their origins in the filmmaking process: 1. 
(Pre-)production. The selection of actors as well as locations for filming is part of the pre-production phase. Actors influence strongly the perception and attendance of a movie. Both presence (e.g. favorite actor) and play (e.g. great/weak performance) are important aspects most audience is intuitively aware of. During the production (or shooting) stage each scene is usually shot multiple times in slightly different versions which results in many thousands of feet of exposed film. During shooting, many people contribute to the shaping of the film such as the script supervisor (continuity from shot to shot), photography director (lighting and camera techniques), sound mixer, visual-effects supervisor, etc. [18]. 2. Post-production (Editing). Post-production describes the process of editing of previously recorded material, including the use of special effects and audio dubbing [35]. Usually, this stage in the filmmaking does not happen independently but rather parallel to the production itself. In this way, adjustments or additional shots can be undertaken during the production phase. The goals of the editor is to find the rhythm of the movie, to create a narrative continuity and arrange it in a way that will create the dramatic emphasis so that the film will be effective. Editing decisions range from straightforward presentation of material to the alteration of the meaning of that material. 3. Technology. Three aspects of the technology employed in filmmak- ing can be distinguished. First, technology is present in every aspect of filmmaking, e.g. camera, lights, editing and audio recording equipment, etc. Second, the target device (e.g. projector or TV) influences e.g. the visual composition by setting limits on the frame size. Third, the medium (film vs. video) may influence the perception of the visual quality of a movie. 4.1 film factor space 47 Since no film effect or technique is the product of a single media aesthetic element but rather the combination of and interaction be- tween various factors, the presented categories construct a triangular space where the position of an element indicates the contributing fac- tors. In Figure 4.1 we show the distribution of some film techniques1. Continuity, for example, refers to the coherence and clearness of the story [14, 18] (for details see Section 2.2.2). The narrative continuity of the story line as well as the general setup (e.g. lighting, camera position, actors’ appearance, etc.) are subjects of the production stage. However, the editing phase ensures the smooth flow of space, time and action over a series of shots using e.g. an establishing shot, eyeline match, shot/reverse-shot pattern, etc. A motif describes a recurring element that gains in dramatic importance through its repetition [14]. A motif can be some visual component (e.g. color, character or object), action (motion pattern), or sound (for details see Section 2.2.3). Again, it is the combination of two factors that raises a film element to a motif: the production has to envision for the motif, and the post-production ensures the rhythm of recurrence. Figure 4.1: Film factors space: gray boxes indicate fundamental media ele- ments, purple boxes advanced media concepts. In the following sections, we will discuss a possible mapping be- tween the presented media aesthetic elements in filmmaking, their application by means of well-established film and editing techniques, and feasible approaches in computer vision for each element in detail. 
Figure 4.2 summarizes the color conventions used for the mapping graphs in the following sections2. 1 Please note, that we do not aim at a comprehensive coverage of existing film tech- niques but rather at a simple and clear visualization of the film factor space. 2 Please note, that the distinction between visual- and motion-based approaches is based on the time component of the corresponding features, e.g. static vs. motion features. 48 visual-based computational media aesthetics Figure 4.2: Mapping legend. 4.2 fundamental media elements 4.2.1 Light and Color The following Figure 4.3 depicts the mapping between color and light as basic media elements and computer vision approaches (highlighted in yellow) that can be applied for the automated detection of the corresponding film editing technique. Figure 4.3: Media element vs. computer vision: Light and Color. As already discussed in Section 2.1.1 light is mostly manipulated as the interaction of light and shadows for the purpose of orientation and to establish a specific mood. The overall light level of a scene or a frameLight level can be easily measured by the corresponding intensity histogram (see for an example Figure 4.4). Rasheed et al. apply the product of mean and variance of an intensity histogram to distinguish between low- and high-key lighting [125]. Similarly, mean and standard deviation are applied in [141, 166]. The authors essentially assume that low-key 4.2 fundamental media elements 49 lighting results in lower mean and variance values in comparison to high-key lighting. In fact, the variance in a low-key frame is lower due to a more balanced distribution of the illumination values. However, the mean is sensitive to outliers and may be an unreliable indicator for the light level (note the identical mean values in Figures 4.4c-4.4d). To overcome this limitation Wang et al. propose the use of the median (as an indicator for the general level of brightness) and the proportion of shadow area (dark pixels with values below a predefined threshold of 0.18) within a frame [161]. (a) Low-key lighting. (b) High-key lighting. (c) 1×1 intensity histogram showing pre- dominantly dark regions (mean:855; variance: 4.9053e+06; median: 107.5; shadow area: 0.85). (d) 1×1 intensity histogram showing more balanced intensity distribution (mean: 855; variance: 8.1768e+05; me- dian: 454.5; shadow area: 0.47). Figure 4.4: Examples for high- and low-key lighting. First row: frames from Pulp Fiction (1994). Second row: corresponding intensity his- tograms. The detection of the position of the main light source in a scene, Light direction i.e. above-/below-eye-level lighting (see Figure 4.5), requires for a priori knowledge of the structure of the lighted objects in the scene. Thus, a fully automated detection of the light direction is only possible if predefined models of the objects exist in the scene. However, the three main lighting techniques – flat, Chiaroscuro, and silhouette – are characterized by their illumination, light level, and the resultant shadow types and, thus, bear characteristics that can be generalized and automatically detected. Flat lighting has no exactly definable light direction which results in light shadows and an overall high light level (see Figure 4.6a). The corresponding intensity histogram shows a broad distribution of intensity values (see Figure 4.6d). Figure 4.6b shows an example for Chiaroscuro lighting (selective illumination, over- all low light level, and distinct and dense shadows). 
A global intensity histogram (even if a very rough feature descriptor) indicates a narrow intensity range and, thus, low contrast (see Figure 4.6e). Furthermore, the low intensity values point to an overall low light level. Finally, silhouette lighting is characterized by its extreme contrast of light and dark and its accent on contour (see Figure 4.6c). The intensity histogram shows two peaks in the low and high regions of the intensity range and few values in between (see Figure 4.6f). Since distinct contours and little texture are typical for silhouette lighting, edge and texture analysis can additionally be applied to distinguish among the various lighting techniques (see Figures 4.6g-4.6i). (a) Below-eye-level lighting. (b) Above-eye-level lighting. Figure 4.5: Examples for above-/below-eye-level lighting from Citizen Kane (1941). The main light source of the nightclub scene is the table lamp. The opening long shot places the leading characters (the waiter and the reporter Thompson) in below-eye-level lighting. Moving into the story, a close-up sets the camera focus on Kane's ex-wife Susan, whose head is below the level of the table lamp. Due to its strong dependency on context, color analysis is only feasible to a limited extent. Color-based retrieval (e.g. color-based mood analysis) presumes that filmmakers apply color in compliance with the subjective experience of the audience. This assumption fails for two reasons. Firstly, any filmmaker expresses his or her own creativity and may follow established conventions or contradict them to establish a certain feeling with the audience. Secondly, the subjectivity of color perception cannot be unified for the entire audience. Figure 4.7 shows four example scenes from Run Lola Run (1998) depicting Lola and her boyfriend Manni. All scenes showing Lola, and likewise those showing Manni, have similar color distributions for the respective character despite different shooting settings, and a color setting distinct from that of the other character. Hence, such information can be used to retrieve related scenes. However, as already discussed in Section 3.1.1, different scenes can share the same color distributions even if they are not semantically related (see Figure 3.1). Thus, color can only be one of the features for a specific retrieval task (e.g. analysis within a single movie and possibly within the works of the same filmmaker) but not the sole solution. (a) Flat. (b) Chiaroscuro. (c) Silhouette. (d) Intensity histogram of flat lighting with broad intensity range. (e) Intensity histogram of Chiaroscuro lighting with narrow intensity range. (f) Intensity histogram of silhouette lighting with high values at the margins and low values in between. (g) Despite its texture, the frame has less distinct contours than under silhouette lighting. (h) Due to the soft lighting, a Chiaroscuro frame shows less prominent contours. (i) Distinct contours are typical for silhouette lighting. Figure 4.6: Examples for lighting techniques. First row: frames from The Cheat (1915). Second row: corresponding intensity histograms. Third row: Sobel-based edge detection.
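To make the lighting key indicators discussed above tangible, the sketch below computes the median brightness and the shadow area proportion (pixels below the 0.18 threshold mentioned above) for a grey-level frame. It is an illustrative reading of the idea in [161], not the authors' implementation; the normalization of intensities to [0, 1] is an assumption.

import numpy as np

def lighting_key_features(gray_frame, shadow_thresh=0.18):
    """Median brightness and shadow area proportion of a grey-level frame.
    Intensities are normalized to [0, 1]; pixels below shadow_thresh count
    as shadow. Low-key frames tend towards a low median and a large
    shadow proportion."""
    g = gray_frame.astype(np.float64)
    if g.max() > 1.0:
        g /= 255.0
    median_brightness = float(np.median(g))
    shadow_area = float((g < shadow_thresh).mean())
    return median_brightness, shadow_area

A frame with a low median and a large shadow proportion points towards low-key lighting; the concrete decision boundaries remain scene and application dependent.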
(a) Different scenes with Lola showing very similar color distributions. (b) Similarly, scenes with her boyfriend, Manni, are (mostly) shot in a different color cloud space. Figure 4.7: Examples for context-dependent color distributions in Run Lola Run (1998) [49]. 52 visual-based computational media aesthetics Recently, three further color features are often applied in the context of content-based video retrieval, e.g. for the tasks of genre classifi- cation and affective film analysis: dominant color [149, 162], color variance [125, 141], and color energy [34, 161, 166]. Color variance represents the variety of color used in a video. For an example, a comedy is usually more colorful than a horror movie. Rasheed et al. employ color variance as the generalized variance of the CIE Luv color space [125]. Zettl defines color energy as "the relative aesthetic impact a color has on us" [179]. As contributing factors, Zettl identifies the hue, saturation and brightness attributes as well as the size of the color area and the relative contrast between background and foreground colors. 4.2.2 Space Figure 4.8 shows both visual- (yellow highlighted) and motion-based (green highlighted) approaches for the detection and analysis of edit- ing techniques focused on the arrangement of space within a scene. Figure 4.8: Media element vs. computer vision: Space. As discussed in Section 2.1.2 the available space in a frame is firstAspect ratio and foremost controlled by the dimension and shape of a frame whose determination is quite a trivial task in computer vision. Within the specific dimensions of a frame the camera angle and distance of shooting define the field of view of a scene. The detection of camera angle (see Figure 4.9) poses an unsolvableCamera angle task for computer vision. Humans make use of their experience in ev- eryday life to recognize known and to estimate unknown objects, their size and position and, in result, the camera view of a particular scene. Despite recent advances in object modeling, camera angle detection cannot be performed reliably at this state of scientific and technical knowledge. 4.2 fundamental media elements 53 (a) A high-angle framing. (b) A low-angle framing. (c) A Dutch angle (canted framing). (d) A straight-angle framing. Figure 4.9: Examples for camera angles from Quantum of Solace (2008). The distance from a camera to an object or a person is referred to Field of distance as field of distance and results in different types of shots: close up shot, medium shot, long shot, etc. There are no unified and reliable definitions for the different shot types [179]. In this analysis we follow the definitions from Bordwell and Thompson [18] who use the human body as standard measure to distinguish among the different shot types (see for examples Figure 4.10). Recently, Cherif et. al proposed the use of the human body’s proportions for the classification of shot types according to the camera distance [33]. Thus, the use of a reliable face detection algorithm may allow for an approximation of the size of a head and an estimation of the size of the body and its relation to the height of the frame. However, the application of the approach is restricted by the reliability of the underlying face detection algorithm. Common limitations of existing face detection methods are the sensitivity to lighting conditions, face orientation, size range, and/or partial occlusion. 
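As an illustration of this idea, the following sketch estimates the shot type from the largest detected face. It assumes OpenCV and its bundled frontal-face Haar cascade as a stand-in for a more robust face detector; the ratio break points are hypothetical examples and are not the proportions proposed by Cherif et al.

import cv2

# Hypothetical mapping from face height / frame height to a shot-type label.
SHOT_TYPES = [(0.05, "long shot"), (0.15, "medium long shot"),
              (0.30, "medium shot"), (0.45, "medium close-up"),
              (0.70, "close-up")]

def estimate_shot_type(frame_bgr,
                       cascade_path=cv2.data.haarcascades + "haarcascade_frontalface_default.xml"):
    """Rough shot-type estimate from the largest detected face.
    Returns None if no face is found (e.g. extreme long shot, occlusion,
    or simply a detector failure)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cv2.CascadeClassifier(cascade_path).detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    face_h = max(h for (_, _, _, h) in faces)
    ratio = face_h / gray.shape[0]
    for limit, label in SHOT_TYPES:
        if ratio <= limit:
            return label
    return "extreme close-up"

As noted above, the reliability of any such mapping is bounded by the reliability of the underlying face detector.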
The second key characteristic of space is the perception of depth of field affected by the arrangement and movements of objects, characters, and camera within a scene but also by the camera focusing mechanism. Blocking refers to the arrangement of characters and objects within Blocking a frame [14]. Despite the endless possibilities of characters and objects positioning, two blocking types bear an outstanding expressiveness. A strong sense of depth is created by the placement of a character (or an object) at the edge of a frame in the immediate foreground and the arrangement of further characters and objects in the mid- and back- ground (see for an example Figure 4.11a). Figure 4.11b shows another illustration on how to effectively convey a message through a frame composition. The presentational blocking of a character, i.e. a character looking directly into the camera lens, creates a dynamic relationship with the audience. Furthermore, when a character actually addresses 54 visual-based computational media aesthetics (a) Extreme long shot: person is barely visi- ble or not present. (b) Long shot: person is visible but less dom- inant than the back- ground. (c) Medium long shot: person is framed from about the knees up. (d) Medium shot: person is framed from about the waist up. (e) Medium close-up: per- son is framed from about the chest up. (f) Close-up shot: empha- sis on head, hands, feet, etc.. (g) Extreme close-up shot: emphasis on a part of a face or an object. Figure 4.10: Examples for different shot types from Run Lola Run (1998). the audience (by words or gestures), the audience is not a passive observer but an active participant of the story. While the blocking of a character can be detected using a combination of texture and color clustering approach (see Figure 4.12), the detection of presentational blocking is an unsolvable task due to the diversity in the appearance and the complex motion possibilities of head, eyes, and camera and the combination thereof. As discussed earlier, camera focusing mechanism is another essen-Camera focus tial property of a video camera that can influence the perception of space depth within a scene. Scenes that are shot using deep focus cin- ematography, i.e. all depth planes appear in sharp focus, may require some effort to distinguish specific depth planes. The use of selective focus allows for the choice of a single plane and lets the other planes blur [18]. Recent applications for focus/blur detection include primar- ily image restoration and quality measurements. Existing approaches 4.2 fundamental media elements 55 (a) Blocking of a character at the edge of a frame: Jules (on the left side of the frame, shown in back) points his gun at Brett (sitting on a table in the middle of the frame). (b) Presentational blocking of a charac- ter: Captain Kooks (facing the camera) talking to young Butch (the camera is the kid’s point of view). Figure 4.11: Examples for blocking from Pulp Fiction (1994). (a) Texture detection (black regions cor- respond to low-textured areas). (b) Color segmentation (different gray shades correspond to different color clusters). (c) Detected blocking region by the combination of tex- ture and color information. The frame border is in- troduced for a better visualization and is not a part of the original frame. Figure 4.12: An approach for the detection of a blocking character/object at an edge of a frame using the example of Figure 4.11a. 
For a region to be a blocking region, it has to be positioned at the edge of the frame, of similar color, low texture, and certain size. 56 visual-based computational media aesthetics can be distinguished in edge-based algorithms and methods based on power spectrum analysis. Edge-based approaches rely on the fact that in-focus regions bear higher contrast and more distinct edges than out- of-focus regions. Methods based on power spectrum show that in the Fourier domain camera defocusing results in the eduction of power in higher Fourier frequencies. Quality measurement, i.e. the classification of scenes in out-of-focus and in-focus, is less relevant to a high-level film analysis. However, the detection of selective and racking focus gives a clue about the applied camera techniques and the intended audience attention model. Selective focus refers to the camera focus on a single detail of the scene while racking focus describes the focus shifts within a scene (see Figures 4.13a-4.13b). A simple approach for the detection of racking focus is the combination of edge detection and local features. While firstly detected edges disappear and new edges appear with a focus shift (see Figures 4.13c-4.13d), blur-independent local features assure the consistency of the scene (see Figure 4.13e). Finally, as discussed in Section 2.1.2, the perception of space is also influenced by the movements of characters and camera within a scene. Since the following Section 4.2.3 has a closer look into the subject of motion detection and analysis, we refrain from a further discussion at this point to avoid redundancy. 4.2.3 Time and Motion The concepts of time and motion are notably correlating. Since motion is happening in time, the perception of time is strongly influenced by the presence of camera and/or object motion. Figure 4.14 shows both visual- (yellow highlighted) and motion-based (green highlighted) ap- proaches for the detection and analysis of editing techniques focused on motion analysis and time manipulation within a scene. Depending on the pace of time, manipulations of time can be distin-Time manipulations guished in accelerated and decelerated time perception (with respect to normal tempo of time passing), simultaneous time passing of two or more events, and time stops. Filmmakers can freeze the time for example to increase tension or toTime stops isolate and emphasize a dramatic moment within a sequence [14]. Such time stops can be communicated by the use of still (frame) sequences (also associated with the feeling of timelessness) and freeze frames. Both editing techniques imply the absence of motion. However, the use of freeze frames is probably more expressive since time stops while the scene is still in motion and full of life. Figure 4.15 shows an example for the application of freeze frames in the final scene from Run Lola Run (1999). There are different editing and camera techniques to suggest accel-Acceleration / deceleration erated or decelerated action or tempo of time passing. One possibility is the manipulation of the amount of frames per second in the record- 4.2 fundamental media elements 57 (a) At the supermarket robbery the focus of the camera shifts from Manni in the foreground ... (b) ... to the security guard in the back- ground of the scene. (c) The edge detection using the Sobel op- erator shows scratches from Manni’s face and few features in the back- ground at the beginning of the scene ... (d) ... 
while at the end of the scene, the face completely disappears from the foreground and far more details ap- pear in the background. (e) Matched local features (SIFT) despite the blur effect. Figure 4.13: An example for racking focus from Run Lola Run (1998). 58 visual-based computational media aesthetics Figure 4.14: Media element vs. computer vision: Time and Motion. Figure 4.15: An example for the use of freeze frames in the final scene of Run Lola Run (1999). Top: Lola and Manni walk side by side. As Manni asks what’s in the bag, the sequence freezes on the frame where the audience can see the smile on Lola’s face. Bottom: x-rays of the motion trajectories of the keypoints. Both x- and y- direction reveal the point in time where all the motion suddenly stops. 4.2 fundamental media elements 59 ing and in the playback process. For example, the use of high speed cameras allows for the capturing of scenes which typically cannot be seen with naked eyes, e.g. breaking of glass or a falling drop. In general, at least 18 frames per second are required for the perception of smooth motion. While conventional movies are shot (and played back) at 24 to 25 frames per second, a high speed camera can acquire up to several thousands frames per second. When played at normal speed the scenes create the impression of slowed time and action. The recognition of such scenes requires high-level context understanding and experience and lies beyond the means of automated computer vision techniques. Another technique to manipulate the perception of time is picture jogging. Picture jog results in a sequence which is not perceived as smooth motion but rather as a jerky sequence. The effect of picture jog can be caused not only by e.g. interrupted transmission but also by intended frame dropping leading to the perception of accelerated action. The detection of such skip frames requires for an analysis of the motion continuity and smoothness in consecutive frames or shots (see for an example Figure 4.16). Another editing technique to suggest in a brief period events occurring over a longer time span is the Holly- wood montage [14]. It usually involves fast cutting and various optical effects conveying the essence of an event or the passage of time (e.g. newsreel) [95]. The detection of such montage sequences requires for semantical knowledge and, thus, cannot be performed fully automati- cally. In contrast, accelerated montage refers to continuously decreasing length of the shots in a given sequence which intensifies the effect of increasing speed of action [14]. While this editing technique can be easily detected by e.g. an analysis of the changes in the average shot length rather than counting the shot frames, manipulations of the presented speed (fast/slow motion) or direction of action (reverse motion) cannot be recognized automatically by methods of computer vision without a priori knowledge of the "normal" motion model. To suggest the simultaneity of two (or more) events a filmmaker Simultaneous events can divide the video frame into several areas showing different vi- sual information. While, nowadays, the split-screen technique is mostly Split-screen known from the TV news, it has its origins in filmmaking. A recent example of a movie that makes an extended use of split-screens is the TV series 24 (2001-2010). 24 demonstrates the conventional appli- cation of the split-screen technique: several active areas are divided by sharp boundaries (see Figures 4.17a-4.17b). 
These characteristics can be used for a visual-based detection of split screens. A possible workflow may include the detection of areas of independent motion and, following, an analysis of the boundaries between such areas (see Figures 4.17c-4.17d). Such an approach allows for the detection of split-screens despite their shape (e.g. rectangle, triangle, oval, etc.) and positioning (e.g. horizontal, vertical, diagonal, etc.). However, the pro- 60 visual-based computational media aesthetics Figure 4.16: An example for skip frames in Run Lola Run. Left: matched keypoints in two consecutive frames. Right: motion trajectories of the keypoints (motion jump visualized in green). Before as well as after the presented time moment, the motion trajectories show predominantly smooth motion progress. posed approach cannot deal with unconventional use of split-screens such as the the one applied in Run Lola Run (1998). Figure 4.18 shows two examples for split-screens with superimposed boundaries of the respective visual areas. Such superimposition does not allow for the detection of distinct boundaries between the areas. Since it is not unusual to have areas with independent motion in a conventional video shot, an automated approach that relies solely on the detection of such areas (i.e. without presented boundaries) leads to a high rate of falsely detected split-screens and is, thus, not feasible. Similarly, the detection of superimposition of two or more shots appearing inSuperimposition the same frame often requires for semantical knowledge and, thus, cannot be performed fully automatically. Such editing technique is a popular device in trick and experimental films (see for some examples Figure 4.19) [14]. Another well-established film technique suggesting temporal simul-Crosscutting taneity of two or more actions is crosscutting [14]. A typical example for such parallel editing is a crosscut between a captured victim in need and a person hurrying to save him/her. Crosscutting reveals high-level semantic information about cause and effect within a spe- cific video sequence [18]. From a technical point of view, a crosscut sequence results in a pattern of the type AB(A|B)∗, where A and B are crosscut scenes. Following, crosscut detection can be performed based on shot similarity (resulting in a simple text sequence of the corresponding shot labels) and text-based pattern analysis. Finally, to distinguish crosscutting from further film editing techniques such as shot-reverse-shot as often used in dialog scenes, it is required to as- sure that the alternating shots, A and B, are visually highly dissimilar 4.2 fundamental media elements 61 (a) Double split-screen of a phone con- versation between the CTU director Brian Hastings and Jack Bauer (close- ups, few motion content). (b) Double split-screen of a phone con- versation between Jack Bauer and his daughter Kim (medium shots, dis- tinct motion). (c) Clustering based on motion orientation and distance. White arrows show motion direction. (d) Clusters that are not separated by an uniform area are merged. Red color: non uniform area. Green color: uniform area surrounding a motion cluster. Figure 4.17: Examples for sharply defined split-screens from the TV series 24 (2001-2010). (a) Triple split-screen. The two top screens show Manni looking at the watch (shown at the bottom line) and Lola running towards Manni. (b) Double split-screen. Left: Manni in the foreground and Lola in the back- ground. Right: the same scene from 180 degree rotated view. 
Figure 4.18: Examples for split-screens in Run Lola Run (1998). 62 visual-based computational media aesthetics (a) Camera man walks on a camera. (b) Camera man in a beer glas. (c) Playing harmonica within a speaker. (d) Double superimposition in the open- ing scene of the Book of Utopias. (e) Triple superimposition in the opening scenes of the Book of Mythologies. Figure 4.19: Examples for superimposition: 4.19a-4.19c A Man with a Movie Camera (1929); 4.19d-4.19e Prospero’s Books (1991). and, thus, are not different views of the same scene. Another decisive criterion can be the presence of an establishing shot (typical for dialog scenes). In general, it is not possible to distinguish between camera andCamera / object motion object motion reliably (see for an example Figure 4.20). However, from a film aesthetics point of view, the motion content of a sequence as a whole is more relevant than the differentiation between camera and object motion. The motion content of a shot (or a sequence of shots) plays a central role for several film techniques: it can create visual compositions and motifs, facilitates the creation of rhythmic montage and invisible editing (continuity) [14, 18]. All these aspects of motion will be discussed in the following section. 4.3 advanced media concepts 4.3.1 Composition The following Figure 4.21 shows the positioning of film composition within the media aesthetic elements as discussed in the previous section and maps the corresponding film editing techniques to basic computer vision approaches. One of the first characteristics the audience perceives in a particularFraming direction scene composition is the main direction of the framing. Figures 4.22- 4.23 show examples for the four main types of framing direction: horizontal (associated with normalcy and rest), vertical (dynamic, ex- 4.3 advanced media concepts 63 (a) Camera passing through construction. The Eleventh Year (1928). (b) Train passing the camera. A Man with a Movie Camera (1929). Figure 4.20: Examples for large motion in a shot (red arrows show motion direction). Despite the similar characteristics, the two examples have different motion origins: 4.20a camera motion and 4.20b object motion. Figure 4.21: Media element vs. computer vision: Composition. citement), horizontal/vertical (normalcy, reflects the everyday world), and tilted (disorientation, disturbance) [179]. The characteristics of the framing direction suggest an edge-based detection approach. How- ever, preliminary evaluations show strong dependency on the overall frame composition. Both horizontal and vertical framing are usually clearly defined and easily distinguishable from other framing types. However, the mixed (horizontal/vertical) framing and tilted framing can be easily falsely classified due to misleading edges in the frame composition (see Figures 4.23c-4.23d). As already discussed in Section 2.2.1 filmmakers usually try to Balance balance the composition of a scene by distributing objects of interest evenly around the frame [18]. Pleasing compositions are often associ- ated with the rule of thirds where the scene is divided horizontally and vertically into thirds (see for an example Figure 4.24) [179]. This 64 visual-based computational media aesthetics (a) Horizontal framing. (b) Vertical framing. (c) The MPEG-7 edge histogram of a horizontal-oriented frame shows pre- dominantly horizontal edges in all ar- eas of the image ... (d) ... similar to the notedly dominating vertical edges in a vertical framing. 
Figure 4.22: Examples for horizontal and vertical framing directions from The DarkKnight (2008) with the corresponding MPEG-7 edge his- tograms. 4.3 advanced media concepts 65 (a) Horizontal/vertical framing. (b) Tilted framing. (c) Despite the presence of all edge types in a horizontal/vertical framing, hor- izontal and/or vertical edges prevail in all areas of the image. (d) Although a tilted camera suggests predominantly angled edges in the frame, the overall frame composition strongly influence the detection of edges (e.g. the strong vertical road line, almost horizontal line of the win- dow frames in the back, etc.) Figure 4.23: Examples for horizontal/vertical and tilted framing directions from The DarkKnight (2008) with the corresponding MPEG-7 edge histograms. 66 visual-based computational media aesthetics knowledge can be used for a coarse estimate of the balance in a scene by, for example, investigating the distribution of textures and colors in the corresponding thirds and in the intersection regions. Recently, Obrador et al. measured layout homogeneity of relevant regions to in- vestigate the role of the overall visual balance in image aesthetics [114]. Since balanced shots are a common aesthetic desire and actually the rule in filmmaking, the detection of both extremes – near-perfect sym- metry and unbalanced compositions – is a more appealing task for an automated retrieval system. Despite several decades on research in the field, symmetry detec-Symmetrical compositions tion in real-world images remains a challenging task [118]. However, symmetrical compositions in film scenes exhibit characteristics that simplify the task of reliable detection to a certain degree. Recent ap- proaches search for single or multiple axes of symmetry produced by corresponding objects at any arbitrary position in an image. In con- trast, a symmetrical composition in a film scene refers to a symmetry at frame (and not at object) level and implies near-perfect mirroring of the right and the left halves of a frame. Hence, the detection of sym- metrical composition is reduced to the detection of a mirror-symmetry (or a bilateral reflection symmetry) with the axis of symmetry po- sitioned in the mid frame region. The challenge with the detection of symmetrical compositions in film scenes is that symmetry often occurs at higher level than a perfect symmetry within an object, such as a butterfly, a wheel of a car or a face. Corresponding halves may involve, for example, different objects at similar (but not identical) positions. Thus, symmetrical composition detection requires for the introduction of a looser approach which does not imply perfect shape and object mirroring. Figure 4.24 shows an example for near-perfect symmetrical composition in Citizen Kane (1941). Despite the rotation of Kane’s face on the background poster, Kane himself and all characters as well as the stage props (e.g. desk, drapes) are carefully positioned. An approach for the detection of such symmetrical compositions is the exploration of present correspondences between regions in the two halves of the scene (see Figure 4.25). The simplest region detec- tion method is the color-based clustering (in case there is no color information: intensity-based clustering). Following, each region can be described using (loose) shape descriptors, area, and location informa- tion. A pair-wise comparison with the regions of the second scene half results in the detection of possible correspondences. 
Eventually, if 1) the majority of the regions has been linked and 2) the correspondences allow for the determination of a steady symmetry line, the scene can be classified as a symmetrical composition. The detection of unbalanced compositions requires for the identi-Unabalanced compositions fication and tracking of salient regions (objects or characters). The challenge of salient object detection is that relevant objects usually do not possess homogeneous color or texture characteristics. How- 4.3 advanced media concepts 67 Figure 4.24: Rule of thirds. Citizen Kane (1941). (a) Intensity-based clustering. (b) Cluster correspondences. Figure 4.25: Approach for the detection of symmetrical film compositions. All regions, that are detected using color- or intensity-based clus- tering, are investigated for correspondences based on shape, area and location. White stars indicate the centroids of the detected regions, white lines detected correspondences. The red dotted lines border the predefined central region of the scene as the only relevant position for the search for a symmetry line. The red solid line shows the detected symmetry line for the scene (slightly off-center). Citizen Kane (1941). ever, in general, two features contribute essentially to the attraction of human attention and are often used to build theattention model of a scene: motion and human faces (e.g. [67, 87, 94]). Following, both features can be used for the detection of unbalanced compositions. For example, motion information and motion clustering methods can be applied to build a database of potentially salient regions. The more often the same region is identified in various shots (moving or not) the more salient it is. This saliency map can be used for the detection of unequally distributed salient regions and, as result, unbalanced scene compositions. Finally, some filmmakers are known for their characteristic compo- Composition retrievalsitions. Today, research tasks, such as the locating of all diagonal shots in a specific film by Dziga Vertov, are a tedious process performed manually by film experts. The retrieval of a specific composition can be performed using a predefined template. Mitrović et al. conducted 68 visual-based computational media aesthetics a user study on the performance of several low-level features in the retrieval of visual compositions [104, 173]. The authors show that the KANSEI Shape feature [74] outperforms some well-established MPEG-7 descriptors and, thus, best represents a specific visual template within the given evaluation settings. One drawback of the approach is that it does not account for any motion information but performs solely pair- wise comparisons between template and keyframes. However, motion can be essential for the perception of a specific composition (see for an example Figure 4.26). The reflection of motion can be realized by either using a motion-sensitive representation of a video sequence or by extended motion analysis. Zeppelzauer et al. propose an approach for the representation and retrieval of motion compositions based on the detection and tracking of homogeneous motion fields [177]. The description and matching of motion fields is based on spatial and directional information of the corresponding fields. (a) First frame of the shot (keyframe). (b) Average frame for the shot. Figure 4.26: Motion-aided visual composition. 
While a possible keyframe of the shot shows rather undefined and chaotic arrangement in 4.26a, a motion-sensitive representation clearly identifies a tilted composition of two motion fields on a crossroad (demonstration group vs. cars) in 4.26b. A Man with a Movie Camera (1929). 4.3.2 Continuity As discussed in Section 2.2.2 continuity takes different forms in rela- tion to narration, space, time, and motion (see Figure 4.27). Since the understanding of both narrative and temporal continuity requires for high-level semantical comprehension, their analysis is not feasible for computer vision methods. However, temporal and motion continuity bear some characteristics that can be explored fully automatically. Spatial continuity assures the spatial coherence among differentSpatial continuity shots and, thus, enables the creation and maintaining of the mental map of the audience [179]. In continuity editing, as opposite to the Russian montage, the space of a scene is established along the axis 4.3 advanced media concepts 69 Figure 4.27: Media element vs. computer vision: Continuity. of action line3 [18]. This virtual line defines a half circle where the camera(s) can be positioned to shoot the action (see Figure 4.28). It assures the consistency in action direction and character positioning. As result, the visual differences between consecutive shots of the same camera are relatively low. Additionally, shots from different cameras of the same scene bear some visual overlapping. In computer vision, these facts are often used for shot and scene detection. A special type of a scene is the dialog scene. In filmmaking dialogs are often shot using over-the-shoulder shots as visualized in Figure 4.28. Alternating over-the-shoulder shots are called reverse angle shooting [14]. Recently, various approaches have been proposed for the automated detection of dialog scenes. Existing algorithms exhibit high differences in fea- ture complexity and selected modality (visual, audio, or both). Early approaches explore solely basic visual similarity between shots (e.g. based on color information) and the resulting pattern analysis of the whole video sequence. Furthermore, they are mostly limited to the detection of dialog scenes with two actors. Since a dialog scene and crosscutting may result in a similar pattern, face detection can provide more reliability, e.g. in [36, 78, 181]. Finally, since it is the dia- log that characterizes the scene, recently several approaches consider audio analysis and speech detection for dialog scene detection (see e.g. [53, 75, 181]). Some filmmakers intentionally cross the axis of action to provoke disorientation. In general, the axis line is not fixed for the whole scene – as the characters or the action of the scene moves, the axis line changes accordingly. The large variety of possible motion combinations (cam- era, objects, characters) and the (current) unreliability in the distinction between camera and object motion and between significant and in- significant motion/object/character makes the automated detection of the axis of action (or the detection of present violation of the rule) not feasible for technical and performance reasons. 3 Also known as: vector line, principal vector, line of conversation and action, 180 degree line, scene axis, sight line, eye line, and center line [18, 179]. 70 visual-based computational media aesthetics Figure 4.28: Axis of action in a dialog scene (from [18]). 
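The pattern-based part of such an analysis can be sketched in a few lines. Given the sequence of similarity-based cluster labels assigned to consecutive shots (assumed to be computed upstream), the function below finds maximal runs that use only two distinct labels, i.e. candidates for the AB(A|B)* pattern produced by crosscutting or shot/reverse-shot editing; deciding between the two still requires the dissimilarity and establishing-shot checks discussed above.

def find_ab_sequences(shot_labels, min_len=4):
    """Find maximal runs of shots that alternate between two cluster labels."""
    runs, start = [], 0
    while start < len(shot_labels) - 1:
        labels = {shot_labels[start], shot_labels[start + 1]}
        if len(labels) < 2:           # identical neighbours: no alternation yet
            start += 1
            continue
        end = start + 2
        while end < len(shot_labels) and shot_labels[end] in labels:
            end += 1
        if end - start >= min_len:
            runs.append((start, end - 1, tuple(sorted(labels))))
        start = end - 1               # a run may share its last shot with the next one
    return runs

# Example: labels A A B A B B C A C  ->  one A/B run covering shots 1..5
print(find_ab_sequences(list("AABABBCAC")))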
Temporal continuity refers to the creation and preservation of timeTemporal continuity continuity between consecutive shots of the same scene despite any present time manipulations. A standard indicator for temporal conti- nuity is sound, e.g. a sound which is associated with a specific scene or a sound over a shot cut [18]. A more advanced tool for the creation and preservation of continuity is the application of matched cuts4. It describes a cut on identical points of action and creates and supports both spatial and temporal continuity [14, 18]. Figure 4.29 shows two examples for matched cutting (or cutting on motion) where the action from the first shot is smoothly continued in the second shot (despite different camera views or distance of shooting). Zeppelzauer et al. propose an algorithm for the detection of matched cuts based on matching of long-term motion segments [177]. However, the approach has mainly two limitations. First, it requires for user interaction since the user has to sketch the desired continuity. This presumes a priori knowledge about the specific continuity type and limits the retrieval of existing matching cuts. Second, a core descriptor for each motion segment is its median direction which cannot represent a complex motion such as a changing motion direction within the same motion segment. The core element to overcome these problems and to en- able a thorough (automated) analysis of present matching cuts is a more precise motion description that accurately describes the motion progress and allows for a scale invariant matching. Discontinuities in a video sequence are usually disturbing and dis-Discontinuities orienting for the viewer (if noticed). Discontinuities can be intentional (e.g. to provoke the audience) or non-intentional (i.e. movie errors in lighting, motion, camera position, props, etc.). For example, Sergei 4 A matched cut is also referred to as "invisible cut" since it is the motion (rather than the cut itself) that attracts the attention of the audience [14]. 4.3 advanced media concepts 71 (a) The first cut of the scene shows Kane turning around in the door frame of his bedroom (frames 1-2). The turn from the first shot is continued in the second shot from different camera view (frames 3-4). (b) The second cut shows Kane starting to tear his bedroom apart (frames 1-2). In contrast to the first cut, not the camera view but the distance changes which creates even smoother transition (frames 3-4). Figure 4.29: Examples for match cutting in Citizen Kane (1941). Each sequence shows two consecutive shots depicted by its first and last frame respectively. Eisenstein is known for his intentional disrupt of the conventions of continuity editing. Eisenstein argued for montage of collision, i.e. for contrast and conflict of shots in order to create a new concept5[14]. Non-intentional discontinuities are typically the result of the fact that shots originating at different times in the production process are merged together to appear to be continuous in time. Following, vari- ous continuity errors can appear in the final version such as misplaced items, different lighting and shadows, props can appear or disappear, etc.6 Recently, Pickup and Zisserman proposed an approach for the de- tection of continuity errors based on image registration methods [121]. Since moving objects and characters may cause an expected visual difference, the authors apply upper body detectors and trackers to suppress errors caused by motion. 
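Returning to match cuts, the basic idea of comparing the dominant motion immediately before and after a cut can be sketched as follows. This toy version deliberately uses a median direction and magnitude and therefore inherits exactly the limitation discussed above for complex, changing motion; both tolerances are illustrative assumptions.

import numpy as np

def dominant_motion(vectors):
    """Median direction (radians) and magnitude of a set of 2-D motion vectors,
    e.g. obtained from feature tracking in the last frames of a shot."""
    v = np.asarray(vectors, dtype=float)
    return (float(np.median(np.arctan2(v[:, 1], v[:, 0]))),
            float(np.median(np.hypot(v[:, 0], v[:, 1]))))

def match_cut_candidate(vectors_end_of_a, vectors_start_of_b,
                        max_angle=np.pi / 8, max_rel_mag=0.5):
    """Flag a cut as a match-cut candidate if the dominant motion just before
    and just after the cut agrees in direction and roughly in magnitude."""
    if len(vectors_end_of_a) == 0 or len(vectors_start_of_b) == 0:
        return False  # no motion information available
    ang_a, mag_a = dominant_motion(vectors_end_of_a)
    ang_b, mag_b = dominant_motion(vectors_start_of_b)
    angle_diff = abs((ang_a - ang_b + np.pi) % (2 * np.pi) - np.pi)
    mag_diff = abs(mag_a - mag_b) / max(mag_a, mag_b, 1e-6)
    return angle_diff < max_angle and mag_diff < max_rel_mag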
4.3.3 Motif Motifs take very different shapes such as the use of a specific color or sound in a given context, the particular movement of an object or character, camera position, composition type, story line, etc. Since a motif is the product of filmmaker vision, the list of possible motifs is neither exhaustive nor pre-definable. Figure 4.30 shows the mapping between some classical motifs such as color and object and possible computer vision methods for their detection. 5 A classical example of montage of collision is October (1927). 6 An example for a website that focusses solely on the locating and commenting on errors in movies is http://moviemistakes.com. 72 visual-based computational media aesthetics Figure 4.30: Media element vs. computer vision: Motif. The core characteristic of a motif is its recurrence. However, as already discussed in Section 2.2.3, motifs can have extremely varying appearance and often require for semantical understanding. Examples for such motifs are the time-element in Run Lola Run (1998), the words Jules recites before he executes someone in Pulp Fiction (1994), or the isolation of Kane stressed by composition and camera positioning in Citizen Kane (1941). Such cases cannot be analyzed fully automatically by means of computer vision methods. The lower the level on required semantical understanding and the higher similarity in the appearance (visual or auditory) the more feasible is the automated motif detection (see Figure 4.31). Figure 4.31: Automated motif detection. From a visual analysis point of view the closest recent research area in computer vision is nearest duplicate detection. This research field aims at the identification of identical or nearly identical images or video sequences. However, in general, a motif does not occupy a whole frame. Thus, a new method is required that builds on the fact that a motif is a recurring element. An automated detection of motifs is only possible within strict constrictions. Similar to the approach of dominant color detection, a dominant blob detection can be per- 4.3 advanced media concepts 73 formed to identify recurring blobs in a video sequence. Such blobs can be a person, an object, or just a part of them. As result recurring elements can be identified (including leading actors). However, no distinction can be made between significant and insignificant recur- rence. Figure 4.32 shows some example results for recurring element detection in two different movies: a contemporary thriller, Run Lola Run (1998), and an archive documentary, A Man with a Movie Camera (1929). Details on the implementation will be discussed in Section 5.4. (b) Run Lola Run (1998). (d) A Man with a Movie Camera (1929). Figure 4.32: Examples for detected recurring elements. Please note, that for better visualization all elements have been resized to the same hight, i.e. depicted sizes do not correspond to the actual ones. 74 visual-based computational media aesthetics 4.3.4 Rhythm, Tempo and Pace Figure 4.33 shows the mapping between the film concepts of rhythm tempo and pace and potential computer vision approaches for an automated analysis. As discussed in Section 2.2.4, rhythm depends on the interaction of manifold factors such as the narrative of the story, motion and visual intensity, etc. Although many of the forcing factors can be analyzed on their own, the analysis of their assembly is a slippery issue and, thus, not a well-definable or manageable task for computer vision algorithm. 
Therefore, in the following, we will focus on the automated analysis of tempo and pace only. Figure 4.33: Media elements vs. computer vision: Rhythm, Tempo and Pace. Both tempo and pace are closely related to the perception of time and speed in film. Visually, perceived time is mostly dependent on motion and editing style. Motion content can be measured globally or locally (cp. Section 3.3). Global motion measurement usually involves histogram intersection between consecutive frames. The sum over all shot frames yields information about the motion content of a shot7. Another approach to measure motion content is based on local feature tracking. This method allows for the estimation of the amount of motion in comparison to static features as well as the measurement of the magnitude of motion. Figures 4.34-4.35 show two examples for the difference in the motion content of two consecutive scenes from Quantum of Solace (2008)8. Independently of the considered motion indicator there are considerable differences between the scenes. The first scene, which is the opening chase scene of the movie, is characterized by quick shot changes (average shot length: 24 frames/shot) and an average content change between consecutive frames of a shot of 11%; 77% of all tracked features are classified as motion, and on average a motion vector travels a length of approx. 45 pixels. In contrast, the second scene of the movie is (predominantly) a dialog scene between Bond and "M" and is, in general, characterized by longer shots (on average 49 frames/shot) and less camera and character motion: solely 4% of the visual content changes between consecutive frames, 57% of all tracked features are classified as motion, and the average motion vector magnitude is approx. 4 pixels. 7 Or more precisely: the change in visual content. 8 Please note that for better visualization and easier comparison only the beginning of the first scene is shown. Adams and Venkatesh showed that this information can be used not only for scene classification but also for pace/tempo analysis and, subsequently, for the detection of significant event and dramatic section boundaries [6]. The authors define pace as:

P(n) = α · (med_s − s(n)) / σ_s + β · (m(n) − µ_m) / σ_m    (4.1)

where s is the shot length, m the motion magnitude, n the shot number, σ_s and σ_m the standard deviations of shot length and motion content respectively; and µ_m and med_s are the motion mean and shot length median respectively. α and β are weights indicating the contribution of shot length and motion to the perception of pace. In the performed experiments both weights are given values of 1. The pace function is smoothed with a Gaussian filter to ignore drastic pace changes in a single or small number of shots and to better reflect the human perception of pace. Finally, the authors detect significant event boundaries by edge detection using Deriche's recursive filtering algorithm [37]. Figure 4.36 shows the resulting pace function for the discussed sequence from Quantum of Solace (2008) and detected sections using multiscale analysis. (a) Opening chase sequence. (b) First scene after the credits. Figure 4.34: Keyframes of the first two scenes of Quantum of Solace (2008). (a) Motion indicators for the opening chase scene. Average shot length: 24 frames. Average content change: 0.11. Average ratio of motion vectors in a shot: 0.77. Average motion magnitude per shot: 44.97.
(b) Motion indicators for the first scene after the credits. Average shot length: 49 frames. Average content change: 0.04. Average ratio motion vectors in a shot: 0.57. Average motion magnitude per shot: 3.80 Figure 4.35: Motion content in the first two scenes of Quantum of Solace (2008). Gray lines indicate shot borders; red lines the motion content of the corresponding shot by means of histogram intersection; blue bars the ratio of motion vectors to static vectors in a shot by means of feature tracking. 4.3 advanced media concepts 77 (a) Pace function. (b) Detected sections 1-2: chase in the tunnel and leaving the tunnel. (c) Detected sections 3-4: police intervention into the chase scene and scene boundary to the dialog scene. Figure 4.36: Pace function analysis for the two scenes from Quantum of Solace (2008): blue line shows the pace of the first (chasing) scene, red line the dialog scene, gray dotted line the average pace mark for the sequence. Finally, vertical gray lines indicate the results of edge detection on pace function and corresponding story sections from the sequence. 78 visual-based computational media aesthetics 4.4 conclusion Computer vision approaches for film analysis usually involve both low- (e.g. color and edge detection) and high-level analysis (e.g. feature tracking and representation, object/face detection and recognition). Recent applications originate mostly from the needs and the require- ments of a general user. Examples for such applications include the efficient and reliable retrieval of relevant sequences from large media collections, detection of duplicates, summarization and automated classification, etc. Many computer vision approaches perform with- out any knowledge about film techniques or aesthetic elements. A simplified task, such as the classification of a limited number of se- quences in horror and non-horror sequences, can be performed using a basic decision algorithm. However, various applications require for the (intentional or non-intentional) consideration of film techniques. A very simple example is the shot detection which is a fundamental step for any advanced film analysis. Shot detection can be performed fully automatically and (to a certain degree) reliably based on the visual continuity condition within a shot. Another example applica- tion for the reflection of film techniques in computer vision is the event detection. Significant pace/tempo changes are often an indicator for the occurrence of a dramatic event (see for examples Figure 4.37). Finally, while semantics is still a hype concept in many recent research, its analysis is carried out on a very rudimentary level by e.g. the consideration of available meta data, transcripts, speech recognition, etc. Figure 4.37: Recent computer vision applications. Applications in italics will be discussed in detail in the following Chapter 5. The work presented in this section focused on the question: How far can we go using fully automated tools and where are the frontiers of computer vision. In order to answer this question, we explored a 4.4 conclusion 79 possible linking between fundamental computer vision approaches on the one side and the origins of a film on the other side. Despite the complexity of filmmaking and a certain degree of freedom and creativity that cannot be set into formulas, the analysis of media aesthetic elements and their application in film production by means of film techniques reveals new possibilities for automated film analysis. 
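For illustration, the visual-continuity condition behind shot detection can be reduced to a few lines: the dissimilarity between intensity histograms of consecutive frames is low within a shot and peaks at a hard cut. The following is a toy sketch assuming OpenCV, not a production shot detector, and the choice of 64 bins is an arbitrary example.

import cv2
import numpy as np

def histogram_cut_scores(frames, bins=64):
    """Per-transition dissimilarity between consecutive frames based on
    normalized intensity histograms; peaks are hard-cut candidates."""
    hists = []
    for f in frames:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        h = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
        hists.append(h / max(h.sum(), 1.0))
    # 1 - histogram intersection: near 0 within a shot, close to 1 at a hard cut
    return [1.0 - float(np.minimum(a, b).sum()) for a, b in zip(hists, hists[1:])]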
Many editing techniques require semantic understanding and are not feasible for an automated, computer vision based approach. Some examples include the detection of camera angle, the analysis of editing styles (e.g. Russian style, narrative montage, conceptual montage, montage of attraction), and the detection of time manipulations (see Table 4.1). However, the performed analysis and experiments identified a large set of achievable tasks that have not been subject to research so far. Possible queries for the automated retrieval of film techniques range from the detection of basic techniques, such as the application of Chiaroscuro or silhouette lighting or the detection of selective and racking focus, to higher level tasks such as the analysis of visual and motion composition or the detection of recurring elements. Methods implementing such techniques facilitate a fast and systematic retrieval of film material for advanced users, e.g. film archives, film studies, and filmmakers looking for specific footage.

Fundamental media elements
Light & Color: key light level; color; key lighting types; high-/low-key lighting
Space: presentational blocking; zoom; aspect ratio; camera angle; field of distance; focus type; camera lens; blocking
Time: reverse motion; shot cuts; freeze frames; time warping; crosscutting; skip frames; rhythmic montage; split-screen; superimposition; accelerated montage
Advanced media concepts
Composition: visual composition; framing orientation; motion composition; balance; symmetry
Continuity: axis of action; match cutting; temporal continuity; narrative continuity; dialog scenes; visual continuity; montage of collision; visual continuity errors; reverse-angle shooting; over-the-shoulder shots
Motif: semantical motif; recurring objects; narrative motif
Rhythm, tempo and pace: rhythm; tempo/pace

Table 4.1: Tasks in visual-based computational media aesthetics (infeasible tasks, active tasks, and open issues).

5 CASE STUDIES You've always got to try everything even if you know it's not going to work. — Anne V. Coates in [117] This chapter presents three case studies we conducted in the context of automated film analysis. All experiments are performed on a novel data set of archived documentaries bearing challenges from both an artistic and a technological point of view. The last experiment is additionally conducted on a contemporary movie. The archive data set is discussed in detail in Section 5.1. The film material at hand allows for the definition of novel research and application topics in the domain of film analysis and understanding. One example for such an application task is the reconstruction of the original camera takes as presented in Section 5.2. The knowledge about the original ordering of the film sequences (as recorded by the camera) facilitates an advanced film analysis such as the detection of montage patterns, the reconstruction of the original montage schema, and, as a result, the detection of missing or altered film sequences. The second case study is presented in Section 5.3. It focusses on the comparison of different film versions. This research requirement originates in the fact that many film archives and film museums often possess different versions of the same film material.
Differences between film versions can be a result of editorial changes such as different film cuts or a result of the preservation such as missing filmstrips and incomplete copies. Furthermore, the state and the nature of the film material impede the differentiation between very similar shots repeatedly appearing in the film sequence and identical shots having different appearance due to material-specific artifacts such as mold or film tears. In the last experiment, Section 5.4, we investigate recurrences on a more detailed level than a shot or a film sequence. In this case study, we explore the feasibility of detecting dominant characters or objects in a film sequence where dominance is defined by corresponding appearance frequency. 5.1 archive video data The archive video data set consists of historical artistic documentaries by the Soviet avant-garde filmmaker Dziga Vertov from the 1920s and 1930s. Vertov plays a major role in the history of experimental Artististry-related challengesfilms and, at the same time, he is considered as the forerunner of 81 82 case studies the cinema vérité movement in documentaries [35]. Vertov rejects theatrical artificiality such as studios, actors, and staging, and aims at capturing the raw truth with his camera [105]. His films do not contain any narrative structure, which makes them different from material that is usually analyzed in content-based research such as news broadcasts, sports videos, and feature films. Vertov often manipulates his films to demonstrate the artificiality and nonrealism of cinema [35]. He makes use of advanced montage techniques to create complex transitions, multiple exposures, and split screen compositions (see for examples Figure 5.1). Furthermore, his films exhibit a distinctive structure, which is characterized by a high number of short, repeating shots with high visual similarity (see for examples Figure 5.2). (a) The Eleventh Year (1928). (b) Man with a Movie Camera (1929). Figure 5.1: Vertov’s demonstration of cinema artificiality. Figure 5.2: Examples for highly similar repeating shots (each shot is repre- sented by its first frame). Man with a Movie Camera (1929). The source film material is 35mm monochrome and mostly silentTechnology-related challenges film which limits the set of available modalities and feasible tech- niques. The filmstrips were digitized frame-by-frame to make them processable. Existing filmstrips of archived films are usually multiple- generation copies that were never intended for other purposes but backups. Often, the original filmstrips do not exist any more and the available backup copies are the only existing source material left. The state of film material degrades significantly during storage, copying 5.2 camera take reconstruction 83 (a) Film tear. (b) Scratches and dirt. (c) Brightness error. (d) Scratches. (e) Underexposure. (f) Overexposure. Figure 5.3: Examples for artifacts in archive film material (all frames exhibit visible framelines). Film-Truth (1922-25). and playback over the decades (see for examples Figure 5.3). Important artifacts in archive film material include: • Scratches, which are usually introduced by dirt in the film pro- jector. • Dirt (dust, liquids, mold), which propagates and increases from one copy to the next. • Visible framelines, which result from copying misaligned film- strips and the shrinking of the film material. Since the filmstrips are made of organic material they contract over time. 
Contrac- tion occurs horizontally and vertically and results in shaking and misaligned frames. • Low contrast, which is a result of repeated copying. • Flicker, which results from the fact that film transports in early cameras was performed manually (variable exposure time). • Frame displacements, which result from shrinking of the film- strips. 5.2 camera take reconstruction In this section we present a new topic in the domain of video retrieval, Section outline namely the identification of editing techniques and montage patterns. Contemporary Hollywood-type movies and TV broadcasts usually follow specific editing rules, such as cross-cutting and shot reverse 84 case studies shot [18], resulting in well-established shot editing patterns within a scene. On the contrary, some documentaries, experimental, and art house films often challenge the conventional filmmaking by the use of unusual (non-narrative) camera and editing techniques [35]. Currently, the study of such techniques is a tedious manual process performed by film experts. We propose an automated approach for the analysis of editing techniques and montage sequences that is based on the reconstruction of the original film shooting sequence referred to as camera takes. After defining the terminology and discussing its advantages, we present the two stage algorithm for camera take detection. Following, the algorithm is evaluated on a challenging data set of experimental archive documentaries. Achieved results show the reliability of the algorithm and outline its applicability as supporting tool for the analysis of editing techniques and montage patterns. A camera take is defined as a single, continuously-recorded perfor-What is a camera take? mance with a given camera setup [14]. In the editing process camera takes are often cut into multiple shots and joined together to form a complete movie, i.e. a camera take is a sequence of one or more consecutively recorded video shots. Semantically related and tempo- rally adjacent shots build a video scene. However, shots originating from the same camera take can also be temporally distributed over the entire movie (see Figure 5.4). Examples for such an editing technique are the time jumps and the parallel development of two or more lines of action. Figure 5.4: Camera takes vs. film scenes. The reconstruction of camera takes yields relationships of shots thatWhy camera takes? proceed at the same place and time. This high-level structural infor- mation is beneficial for tasks such as scene segmentation and analysis of montage patterns, editing style, and motion rhythm. Furthermore, reconstructed camera takes allow for compact video representation and nonlinear browsing. The reconstruction is based on the temporal continuity of shots. It does not require the film content to be similar over the entire camera take. For example, several shots cut out from a camera take that contains a long camera pan can have highly dissimi- lar content. Methods based on keyframes and image features may not find similarities among the shots. The presented approach is able to associate the shots with each other. Various applications for film analysis and retrieval can benefit fromExample applications 5.2 camera take reconstruction 85 the camera takes reconstruction. Examples include: • Flashback / -forward detection: A flashback is defined as a shot that is presented out of chronological order [18]. 
The detection of camera takes implies the reconstruction of the original chronological order and, thus, allows for a straightforward flashback detection. • Montage pattern and rhythm analysis: The rhythmic relations between two shots indicate high-level semantic information. Cinematic rhythm derives from different film techniques such as shot duration, visual and motion content, sound rhythm, and montage patterns. For example, the use of alternating close-ups with shorter shots creates a more intense dialog or conflict sequence. • Film analysis and reconstruction: The reconstruction of the montage schema allows for the identification of incomplete copies and altered versions of the original film material. • Video summary: The association among shots of the same camera take can further be used to create a more compact video summary for non-linear browsing. 5.2.1 Camera Take Detection The core element of the algorithm for camera take detection is the motion smoothness analysis between different shots. However, since motion tracking in a long video can become computationally expensive, we introduce an intermediate step to limit the number of candidates for camera takes. To determine possible camera takes we use a fast and yet reliable similarity measure based on edge histograms. Following, we analyze the motion smoothness based on local feature tracking. Figure 5.5 gives an overview of the workflow of the algorithm. 5.2.1.1 Continuity Analysis For the detection of candidate camera takes we first construct the set of all continuity regions for a given shot Sx. The continuity region CR between two shots Sx and Sy is defined as the union of the last n frames of Sx and the first n frames of Sy:

CR_{Sx,Sy} = { f^{Sx}_{a−n+1}, f^{Sx}_{a−n+2}, ..., f^{Sx}_{a}, f^{Sy}_{1}, f^{Sy}_{2}, ..., f^{Sy}_{n} }    (5.1)

where a denotes the number of frames of Sx and f the respective frames in Sx and Sy. Sy represents any other shot from the film. Thus, for a given shot Sx a set of continuity regions (sharing the last frames of Sx) is constructed. In our evaluations, n is set to three, which results in a continuity region of length 6 between any two shots. Figure 5.5: Camera take detection workflow. For every frame from the regions an MPEG-7 edge histogram is computed (see Section 3.1 for details on the feature). Each frame f^{Sx}_{a−n+1}, f^{Sx}_{a−n+2}, ..., f^{Sx}_{a} is compared to every frame from the set of continuity regions for Sx that represents a shot different from Sx. Following, frames vote for the shot with the highest similarity score in terms of Euclidean distance. A shot Sy is accepted to be a following shot of Sx if: 1. the majority of frames from Sx vote for Sy, and 2. there is at least one reverse vote, i.e. at least one frame from Sy votes for Sx. If Sy is a following shot of Sx, both are assigned to a new candidate camera take: CT_i = <Sx, Sy>. For every last shot of the current CT_i the process is repeated until no more following shots are detected. 5.2.1.2 Motion Smoothness Analysis Motion vector fields estimated for consecutive video frames are slowly varying over both space and time. Therefore, we measure the variations of the motion vectors along the temporal direction in the continuity region of each candidate camera take. Figure 5.6 shows an example for consecutive shots. The difference between the respective motion vectors is very low and, thus, indicates high motion smoothness (a simplified sketch of such a check is given below).
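The following sketch illustrates the smoothness check on a candidate shot boundary. It substitutes pyramidal Lucas-Kanade tracking (OpenCV) for the SIFT matching used in the actual implementation, and it compares only median motion vectors, whereas Figures 5.6 and 5.7 visualize the full per-vector differences; the function and parameter names are illustrative.

import cv2
import numpy as np

def flow_vectors(frame_a, frame_b, max_features=500):
    """Track good features from frame_a to frame_b and return their motion vectors."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(gray_a, maxCorners=max_features,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.empty((0, 2))
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(gray_a, gray_b, pts, None)
    ok = status.ravel() == 1
    return (nxt[ok] - pts[ok]).reshape(-1, 2)

def motion_smoothness(last_frames_sx, first_frames_sy):
    """Compare motion across the candidate cut with motion inside shot Sx.
    A small value suggests that Sy continues the camera take started by Sx."""
    inside = flow_vectors(last_frames_sx[-2], last_frames_sx[-1])   # within Sx
    across = flow_vectors(last_frames_sx[-1], first_frames_sy[0])   # across the cut
    if len(inside) == 0 or len(across) == 0:
        return np.inf
    return float(np.linalg.norm(np.median(inside, axis=0) - np.median(across, axis=0)))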
On the contrary, Figure 5.7 depicts frames that are visually similar but belong to different, temporally non-consecutive shots. The slight movement of the girl's head results in significantly larger differences in the motion vectors. For motion detection and tracking we apply local feature tracking based on SIFT matching (see Section 3.2.2 for details on the feature); however, other motion tracking methods can be applied as well. We limit the number of extracted SIFT features per frame to 500. The resulting feature descriptors are matched by identifying the first two nearest neighbors in terms of Euclidean distance. A descriptor is accepted if the ratio of the distances to the first and second nearest neighbor is below a predefined threshold. The value of 0.8 was determined experimentally and used throughout the evaluation tests described in Section 5.2.2. Finally, only camera takes with smooth motion vectors are accepted.

Figure 5.6: Motion smoothness for frames of the same camera take from The Eleventh Year (1928): (a) feature tracking in consecutive frames; (b), (c) corresponding motion vectors for the first and second frame pair; (d) differences between the corresponding motion vectors.

Figure 5.7: Motion smoothness for frames of similar but not consecutive shots from The Eleventh Year (1928): (a) feature tracking in consecutive frames; (b), (c) corresponding motion vectors for the first and second frame pair; (d) differences between the corresponding motion vectors.

5.2.2 Experiments

In this section we describe the performed experiments on camera take detection and montage pattern reconstruction.

5.2.2.1 Camera Take Detection

The first experiment focusses on the evaluation of camera take detection. We explored two movies, Man with a Movie Camera (1929), consisting of 1,768 shots (95,678 frames), and The Eleventh Year (1928), consisting of 660 shots (63,123 frames). Our algorithm detected 186 camera takes of two or more shots. The results were evaluated manually. Approximately 93% of all detected camera takes were correct and the false positive rate was less than 5% (see Table 5.1).

                    MMC              EYE              Average
True positives      110   92.44%     63    94.02%     173   93.01%
False positives       5    4.20%      4     5.98%       9    4.84%
Ambiguous             4    3.36%      0     0.00%       4    2.15%
Detected takes      119  100.00%     67   100.00%     186  100.00%

Table 5.1: Performance results on camera take detection. MMC: Man with a Movie Camera (1929), EYE: The Eleventh Year (1928).

The lack of motion and the identical visual appearance of shots may cause false positive detection of camera takes for identical, static shots. Another reason for incorrectly detected camera takes is the dissolve editing technique. The gradual replacement and high degree of similarity between shots can lead to shots being falsely assigned to the same camera take (see Figure 5.8 for an example).

Figure 5.8: False positive camera take due to a dissolve (the same scene is shot from two different perspectives). Man with a Movie Camera (1929).

Additionally, four detected camera takes could not be verified due to ambiguity. An example of such shots is presented in Figure 5.9.
The shot depicts a figure in a shooting gallery on a fair. Due to the repetitive movement of the timbal in the right hand it is often not possible to definitely determine if 1) multiple shots are part of the same camera take or 2) it is always the same shot on different positions. Figure 5.9: Ambiguous shots. Man with a Movie Camera (1929). 5.2.2.2 Montage Reconstruction The next experiment addresses the reconstruction and the analysis of the original montage schemas. Montage schemas describe the assem- blage of a film through editing. They allow for the analysis of editing techniques and montage patterns. Furthermore, the reconstruction of montage schemas is essential for the analysis of archive film material where the original versions (filmstrips) do often no longer exist. As previously discussed, the remaining copies are usually backup copies from film archives that are often incomplete due to bad storage, mold, and film tears. We investigate different film sequences from three archive documen- taries. We first detect camera takes. In the next step, we assign labels to the shots of the same camera take. The first sequence from The Eleventh Year (1928) presents workers building a railway. The whole sequence of 19 shots (204 frames) origi- nates from three cross-cut camera takes. Our algorithm successfully detected and assigned the respective shots (see Figure 5.10). Since we investigate experimental video material, not all of the resulting montage schemas comply with conventional editing patterns. Figure 5.11 presents the detected montage pattern in a sequence of 66 shots (191 frames) from the Man with a Movie Camera (1929). It exhibits an unusual editing technique that is not reconstructable with other common scene detection algorithms. The interpretation of such patterns is a research subject for film experts. The evaluation of the third sequence (42 shots, 773 frames) from Kino Eye (1924) was motivated by the discovery of the original montage schema. It shows the experiment of the filmmaker to graphically chart the montage of shots within a scene (see Figure 5.12b). The 5.2 camera take reconstruction 91 Figure 5.10: Detected camera takes in the sequence from The Eleventh Year (1928) (white arrows show dominant motion). Figure 5.11: Detected camera takes in the sequence from Man with a Movie Camera (1929). reconstructed montage schema indicates missing and rearranged shots (see Figure 5.12a). Currently, it is not clear whether the original film complied with the discovered schema and the nowadays available copy is a full version of the original film. Notwithstanding, the results demonstrate the reliability of the algorithm and its applicability in a scenario where the original montage schema is not available. 5.2.3 Related Work A camera take reconstruction algorithm reveals information about the structure of a movie. This perspective on the film structure represents the film production point of view. In contrast, current work on video structure analysis focusses mainly on scene detection and classifica- tion. Recent approaches on scene detection and classification group shots into a scene if they are content-correlated and temporally close to each other [30, 113, 123, 124, 182]. Content correlation is usually determined based on color information. An essential disadvantage of 92 case studies (a) Reproduced schema. 
(b) Extract from the original schema from the mid 1920s (black box frames indicate some of the detected missing shots, oval frames: rearranged shots) Figure 5.12: Montage schemas for the sequence from Kino Eye (1924). this approach is that false color matches between shots of different scenes result in falsely combined shots. Dynamic scenes often possess different color information which impedes the process of keyframes selection for reliable shot representation. Motion information is often neglected within the process of scene detection. Ngo et al. use motion information for the selection and formation of keyframes as repre- sentative for the shot [113]. However, motion is no further used as a matching criterion. Rasheed et al. merge shots together that have high motion activity and small shot lengths to enable high scene dynam- ics [123]. However, the assumption that shots of the same scene follow the same dynamics holds only for very limited scenarios. Recently, Truong et al. addressed the extraction of film takes [155]. The authors applied merge-and-split clustering techniques to group similar shots based on color histograms. A substantial assumption of the approach is that at most one shot is presented from a single camera take. This assumption holds for a great part of Hollywood- 5.2 camera take reconstruction 93 type movies but fails for most documentary and experimental films. A further limitation of the approach is that it cannot be applied to shots with extensive camera and/or object motion (e.g. action shots) due to the restrictions of the selected shot representation. Eventually, the task of camera take extraction is reduced to a shot similarity detection. In contrast to existing approaches, we strongly rely on motion information. Motion smoothness between frames of the same camera take allows for the reliable recognition of consecutive shots. Thus, shots are linked together without the problem of appropriate keyframe selection or shot representation. Furthermore, since shots of the same camera take can be temporally apart from each other in the edited film, the reconstruction of camera takes captures information, which is lost by a scene detection algorithm. 5.2.4 Conclusion and Discussion We presented a novel application for the reconstruction of camera takes. We applied the proposed algorithm on a test set of experimental archive documentaries. Presented results demonstrate the reliability of the algorithm and outline its applicability for manifold application scenarios such as montage pattern analysis or the comparison of dif- ferent film cuts. This new topic in the field of video retrieval radically affects the analysis of archive documentaries for two reasons. Firstly, the evolution of the production process has changed essentially over the past decades. In the past, filmmaking was an expensive process where a scene was shot just a single or a few times which meant that acquired film material was used as completely as possible in the cur- rent final cut and possibly reused in further film compilations. On the contrary, today, a scene is shot until it fits the vision of the producer(s) which can result in raw film material up to 100 times the length of the final cut [110]. Secondly, documentaries often use unusual camera and editing techniques. Currently, the study of such patterns is a tedious manual process performed by film experts Reliable camera take detection provides a new perspective to the domain of film analysis. 
From a technical point of view, it allows for the comparison of different film cuts and the analysis of montage pat- terns that do not follow conventional editing rules. Moreover, further analysis of the motion smoothness between two shots can provide information about missing frames from the original camera take. From a semantical point of view, reconstructed camera takes capture information that can be missed by conventional scene detection algo- rithms. By the analysis of motion smoothness within a given continuity region, the proposed method does not require appropriate shot repre- sentations or keyframes and feature selection. Moreover, two shots to be grouped are not required to be visually similar for the whole shot length. Highly dynamical shots (e.g. action) or large camera motion 94 case studies often result in great dissimilarity in the visual perception. However, motion smoothness analysis can still detect consecutive shots due to the smooth transition present in a continuos camera take. This informa- tion can be further used to improve the process of video representation and retrieval. 5.3 film comparison By definition, a video copy is a transformed video sequence [83]. TheWhat is a video copy? transformation can be of technical nature (e.g. change in video format, frame rate, resizing, shifting, etc.) or editorial modification (frame insertion/deletion, background change, etc.). Video copy detection is an active research area driven by ever-growing video collections. The detection of video duplicates allows for the efficient search and retrieval of video content. Existing applications for content-based video copy detection comprise e.g. clip identification in a given video set [85, 170], copyright protection [69, 72], identification of duplicated news stories [180], and TV broadcast monitoring and detection of commercials [88, 132]. Presented experiments are often limited to high quality video clips of pre-defined fixed length and synthetically generated transformations such as resizing, frame shifting, contrast and gamma modification, Gaussian noise additions, etc. In contrast, film and video comparison reaches beyond the bound-Challenges aries of a single shot and aims at the identification of both reused and unique film material in two video versions. The compared videos can be two versions of the same feature film, e.g. director’s cut and original cut, or two different movies that share a particular amount of film material, such as documentary films and compilation films. Archive film material additionally challenges existing approaches for video analysis by the state and the nature of the material. Different versions vary significantly not only by the actual content (e.g. loss of frames/shots due to censorship or re-editing) but also due to material- specific artifacts such as mold, film tears, flicker, and low contrast. Furthermore, existing algorithms often provide only limited robust- ness to illumination changes, affine transformation, cropping, and partial occlusions, which restricts their applicability for low-quality archive films. Archive film material is well-suited for the evaluation of film comparison techniques since it contains a large number of natural (not synthetically generated) transformations among different film versions and represents a complex real world scenario for film comparison and copy detection. In practice, it is not always obvious whether or not two shots are identical. 
Figure 5.13 illustrates the challenge of identifying shot corre- spondences in film archives. Figure 5.13a depicts the first frames of two identical shots that possess different appearance due to visible cue marks and scratches in the first shot, frame shifting in the second shot, 5.3 film comparison 95 and contrast differences in both shots. On the contrary, Figure 5.13b shows the first frames of two similar and yet different shots with high perceptual similarity. (a) Identical shots but different appearance. (b) Different shots despite high perceptual similarity. Figure 5.13: Identical vs. similar shots. In general, a film comparison process passes well-defined steps from Approach shot boundary detection to shot representation and matching. At each step different algorithms can be applied. The combination of and the interaction between the selected methods are crucial for the overall comparison process. In this section we present an approach for auto- mated film comparison that accounts for the temporal and hierarchical structure of a video, i.e. the frame, shot, and video level. The approach allows for the selection of the appropriate hierarchy level for a given task and, thus, enables different application scenarios such as the iden- tification of missing shots or the reconstruction of the original film cut. Using the proposed methodology we evaluate the performance of established shot boundary detection algorithms and investigate the influence of keyframe selection and feature representation on the film comparison process. The results show that the approach presented yields high recognition rates for the investigated application scenar- ios. Furthermore, the integration of knowledge about the hierarchical structure of a video allows for the outstanding performance of a sim- ple, edge-based descriptor at much lower computational cost than state-of-the-art local feature-based approaches. 96 case studies This section is organized as follows. We describe the underlyingSection structure methodology of our approach for automated film comparison in Sec- tion 5.3.1. Section 5.3.2 outlines the methods for shot boundary de- tection, keyframe extraction, and feature representation used for the experiments presented in Section 5.3.3. We present current related work and discuss its limitations in Section 5.3.4. Finally, we conclude in Section 5.3.5. 5.3.1 Underlying Methodology From a technical point of view, a video consists of temporally aligned shots. Each video shot is a continuous sequence of frames recorded from a single camera. We present an approach which accounts for this logical structure of a video and does not require any additional information. Starting from a raw and unsegmented video, the first step is to determine shot boundaries automatically. Following, each shot is represented by a set of robust and distinctive features. Fi- nally, our matching and decision process is applied to the segmented video stream. Figure 5.14 visualizes the workflow and information propagation within the framework. Figure 5.14: Workflow and information propagation. 5.3.1.1 Frame Level Video frames are the basic building blocks of a video sequence. Since a single frame represents a still image, manifold global and local fea- tures can be applied to describe its content. For performance reasons, features are usually not extracted for each frame of the shot but only for selected keyframes. Each keyframe is represented by a set of fea- tures and compared to each frame of the second video. 
The similarity between frame features is used to assign a keyframe to a shot by means of frame voting. Depending on the selected feature, various distance metrics can be applied to measure the visual similarity. In our work, we use nearest neighbor ratio matching based on Euclidean distances, i.e. two features are considered a match if the distance to the nearest neighbor is sufficiently smaller than the distance to the second nearest neighbor (the ratio of the two distances lies below a predefined threshold). Furthermore, we introduce a frame confidence measure, c_f, based on the distance spreading of all matches, i.e. if all distances lie closely together, the corresponding match is considered less reliable. In contrast, an outlier suggests a high matching confidence:

c_f = 1 - \frac{d_m}{d},   (5.2)

where d_m is the mean matching distance of the matched features and d the mean matching distance of all descriptors. Finally, the comparison for each keyframe results in a quadruple holding the frame position in the current shot, the frame confidence, the frame voting, and the position of the frame within the matched shot.

5.3.1.2 Shot Level

Since a video shot consists of frames, three factors may influence the shot matching decision: 1) the frames' votes, 2) the corresponding frame confidences, and, optionally, 3) the temporal ordering of the frames. In our evaluation, at least two out of the three keyframes have to vote for the same shot; otherwise the shot is rejected and classified as unknown. The shot confidence, c_s, accounts for the confidence of the voting frames and is defined as the average of their scores:

c_s = \frac{\sum_{i=1}^{n} s_i \cdot c_{f_i} \cdot w_{f_i}}{\sum_{i=1}^{n} s_i}, \qquad s_i = \begin{cases} 1 & \text{for a voting frame} \\ 0 & \text{otherwise} \end{cases}   (5.3)

where c_{f_i} is the frame confidence of the i-th frame, n the number of keyframes in the shot, and w_{f_i} the weight factor for the corresponding temporal position. If matched keyframes have corresponding temporal positions within the respective shots, w_{f_i} = 1, otherwise w_{f_i} = 0.8. The consideration of the temporal ordering of the frames increased the precision score by up to 10% in experimental tests. Additionally, we apply the majority rule, i.e. the majority of the keyframes have to vote for the same shot, otherwise the shot is rejected and classified as unknown.

5.3.1.3 Video Level

The video level represents the highest layer in the framework. Given the domain of video comparison, the corresponding shots in different videos form a well-defined sequence. This additional knowledge is used to eliminate matched shots that do not fit into the overall ordering sequence. To detect outliers we apply local minima and maxima suppression on the sequence formed by the respective shot ids (see Figure 5.15). Finally, the average confidence score of matched shots is defined as the video confidence c_v.

Figure 5.15: An example of minima/maxima suppression at video level: (a) shot id correspondences after shot-level analysis; (b) maxima suppression; (c) minima suppression and final correspondences.

A further approach to increase the performance of film comparison is the investigation of shots that have no corresponding shots in the second video. Such shots can be either missed shots (due to failed comparison) or unique shots. Figure 5.16 visualizes a scenario with both matched and unknown shots in different video versions. Since the search field for corresponding shots is limited to a well-defined area, the matching performance can be further improved while the additional computational costs are low.

Figure 5.16: Video sequence with matched and unknown shots.
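To make the propagation of confidence scores across the hierarchy concrete, the following sketch implements the frame and shot confidence measures of Equations 5.2 and 5.3; the function and variable names are illustrative and not taken from the original implementation.

```python
import numpy as np

def frame_confidence(matched_distances, all_distances):
    """Frame confidence c_f = 1 - d_m / d (Eq. 5.2): mean distance of the
    matched descriptors relative to the mean distance of all descriptors."""
    d_m = np.mean(matched_distances)
    d = np.mean(all_distances)
    return 1.0 - d_m / d

def shot_confidence(votes, frame_confidences, temporal_match):
    """Shot confidence c_s (Eq. 5.3): weighted average of the confidences of
    the voting frames; frames whose temporal position does not correspond
    to the matched shot are down-weighted with w = 0.8."""
    s = np.asarray(votes, dtype=float)            # 1 for a voting frame, 0 otherwise
    c_f = np.asarray(frame_confidences, dtype=float)
    w = np.where(np.asarray(temporal_match), 1.0, 0.8)
    if s.sum() == 0:
        return 0.0                                # no voting frame: shot stays unknown
    return float(np.sum(s * c_f * w) / s.sum())
```

For three keyframes of which two vote for the same shot with confidences 0.7 and 0.5 and matching temporal positions, the shot confidence evaluates to 0.6.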
5.3.2 Methods Compared

The choice of the underlying technology is crucial at each step of the film comparison process. The proposed approach defines the logic of the comparison process but not the specific techniques used at each level; the user can select an adequate method at each step of the process. The three most important factors are:
1. the selection of the shot boundary detection algorithm;
2. the selection of keyframes as representatives for a given shot; and
3. the feature representation of the keyframes.
In this section we give a brief description of different approaches for shot boundary detection, keyframe selection, and feature extraction.

5.3.2.1 Shot Boundary Detection

Shot boundary detection is a basic preprocessing step for most high-level video analysis tasks such as scene segmentation and video summarization. Many different techniques have been proposed over the last decades. The principle behind the different approaches is similar: usually, differences between consecutive frames are computed, and if the differences exceed a certain threshold, a shot cut is identified. In shot cut detection the single frames are usually represented by compact features, which are based on color, intensity, edges, motion, and frequency information. We evaluate three standard methods for cut detection, which are based on edges and intensity information. The fourth method (self-similarity matrix) is based on edge and frequency information and was specifically adapted to low-quality archive material.

Intensity histogram. This method is based on the bin-wise differences of the intensity histograms between two consecutive frames [50]. To include spatial information, each frame is divided into M non-overlapping sub-images:

D_{IH}(F_1, F_2) = \frac{1}{MN} \sum_{j=1}^{M} \sum_{i=1}^{N} |h_{1,j}[i] - h_{2,j}[i]|,   (5.4)

where h_{1,j} and h_{2,j} are the histograms of the j-th sub-image of the two frames and N is the number of bins. A shot cut is detected if the difference exceeds a predefined threshold.

Adaptive threshold. Instead of using a global threshold on the histogram differences, Truong et al. propose the use of a simple adaptive thresholding method to detect peaks in the histogram difference curve [153]. An adaptive threshold usually adapts better to local properties of the difference curve such as motion and flicker. The authors consider a sliding window along the temporal axis. They detect a shot cut if a given histogram difference 1) has the maximum value within the window, and 2) is α times greater than the mean of the remaining histogram differences within the window.

Edge change fraction. The basic idea of this approach is that the positions of edges change considerably at shot boundaries: existing edges disappear and new edges appear where there were no edges before [172]. The authors count the entering, \rho_{in}, and vanishing, \rho_{out}, edge pixels between two frames and define the edge change fraction as

\rho = \max(\rho_{in}, \rho_{out}).   (5.5)

Peaks in the edge change fraction indicate shot cuts. By analyzing the spatial distribution and relative values of entering and exiting edge pixels, the authors classify the transition type as a cut, dissolve, fade, or wipe.

Self-similarity matrix. This method is based on the self-similarity between adjacent video frames [176]. First, each frame is split uniformly into blocks and for each block an edge histogram and the low-frequency DCT coefficients are extracted.
The edge histogram captures the orientations of the edges and is robust to frame displacements and flicker. The DCT feature represents the coarse spatial intensity distribution across a frame and is robust against dirt and scratches. The features are well-suited for combination, since they capture complementary information. Next, the similarity matrices for both features are computed separately. Sequences of similar frames produce bright squares along the diagonal of the matrix. Shot cuts are detected by moving a Gaussian-weighted checkerboard kernel along the diagonal of the similarity matrix. The checkerboard kernel yields high correlation at the shot cuts and low correlation at other positions. Finally, the two similarity matrices result in two kernel correlation functions that are linearly combined. Shot boundaries are located by means of peak detection.

5.3.2.2 Keyframe Selection

There are different approaches for the selection of keyframes from a given video shot. Simple techniques do not account for the shot content but rather select the keyframes according to a predefined pattern. More sophisticated methods consider the visual dynamics of a shot and perform keyframe selection based on visual characteristics (e.g. color histograms) or motion information. In a comparative study of video copy detection algorithms, Law-To et al. select keyframes corresponding to extrema of the global intensity of motion [83]. Originally, this approach was proposed by Eickeler and Müller [42]. To explore the influence of keyframe selection on film comparison, we evaluate the following approaches:
1. KS1: always select the first frame as a keyframe [112],
2. KS2: the first and the last frames as keyframes [126],
3. KS3: the first, middle, and last frames as keyframes [174], and
4. KS4: motion-based selection of keyframes [42].
Eickeler and Müller [42] define the intensity of motion as:

i(t) = \frac{\sum_{x,y} d(x,y,t)}{XY},   (5.6)

where d(x,y,t) is the difference image of the gray values of adjacent frames and XY the frame size in pixels. To overcome the problem of abrupt visual changes caused by e.g. flashes, the authors propose to use the smaller value of the motion intensities for the frames (t, t+1) and (t−1, t+2).

5.3.2.3 Feature Extraction

Different features can be used for the representation of keyframes. We compare three different types of features with a complementary structure: the MPEG-7 Edge Histogram is a global statistical descriptor, the SIFT features are local image descriptors, and the differential-based descriptors capture representative information for an entire shot (for background on the features see Chapter 3).

The MPEG-7 Edge Histogram descriptor is an effective feature for image similarity retrieval [97]. Furthermore, it possesses promising characteristics for the comparison of archive films. The feature captures global information within each block and, thus, is highly robust against frame displacements and invariant to flicker. Since it captures high-frequency information, it is prone to local artifacts (e.g. scratches, dirt) and reflects global artifacts such as tears across the entire frame.

The SIFT descriptors are highly discriminative local features. They are invariant to changes in translation, scale, and rotation and partially invariant to changes in illumination and affine distortions. Thus, frame displacement and flicker have no influence on the features.
Artifacts, which result in loss of visual information (scratches, dirt, tears), auto- matically lead to loss of potential keypoints. However, since there is a large number of keypoints per frame, their fraction does not impede the matching process significantly. Finally, recently, several authors reported outstanding performance Differential-based descriptorsof differential-based descriptors in the context of video copy detec- tion [70, 82, 83]. In this evaluation we follow the approach proposed by Law-To et al. in [82], which was reported as top-performing in a comparative study on video copy detection algorithms [83]. Unlike the original approach, we do not distinguish between motion and background trajectories. Bouncy and unsteady video sequences often exhibit high motion characteristics, which may lead to mislabeling and, thus, misclassification. 5.3.3 Experiments In this section we present the performed evaluations. We selected three case studies that cover, next to the artistic challenges of the 102 case studies film material, different issues and challenges in the film comparison process from a technical point of view. Examples include different film source material, resultant artifacts, as well as various technical and editorial modifications. The case studies are described in Section 5.3.3.1. For the presented data set we evaluate the influence of the choice of underlying technology at each step of the film comparison process: shot boundary detection (Section 5.3.3.2), and keyframe selection and feature representation (Section 5.3.3.3). Eventually, our last experiment addresses a novel application that aims at the identification of unique shots (Section 5.3.3.4). 5.3.3.1 Video Data and Case Studies We explore ten historical artistic documentaries (see Section 5.1) grouped into three case studies: CS1 The first case study investigates two films by Dziga Vertov in two versions respectively: Man with a Movie Camera (1929) and Enthusiasm (1930). All films originate from tape-based analog sources. In one case, the copies were derived from the original source with several decades in between. They differ greatly in image quality and censored content. In the other case, the two versions originate from the same analog copy. One copy is the result of an effort to manually reconstruct the original film by manually adding and removing shots. CS2 The second case study investigates again two films by Dziga Vertov – Cinema Eye (1924) and Three Songs About Lenin (1934) – in two different versions whereas the second copies originate from unknown sourced DVDs. In addition to the differences in image quality and content, digitization artifacts further impede the process of film comparison. CS3 The last case study compares two different but related analog films: an original documentary by Dziga Vertov – The Eleventh (1928) – and a compilation film by A.V. Blum – In The Shadow of The Machine (1928) – where a number of shots from Vertov have been used. The length of the films ranges from 20 to 90 mins. Similarly, the length of the shots is strongly varying from 1 to over 1500 frames/shot. 5.3.3.2 Shot Boundary Detection The first step of an automated film comparison process is shot bound- ary detection. Shot boundary detection is widely seen as solved for contemporary film material. 
However, experiments with shot boundary detection demonstrate that the task is still challenging in the context of archive film material (see Table 5.2 for a summary of the achieved results). Unintended alterations of the content, such as artifacts that generate abrupt visual changes (e.g. dirt, scratches, and film tears, see Figure 5.3), interfere with established algorithms that are based on pixel differences (Intensity Histogram, IH, and Adaptive Threshold, AT) and edge information (Edge Change Fraction, ECF). Furthermore, these methods are sensitive to global motion such as camera shaking and to large object and camera motion. In such cases, preceding motion compensation can be performed. However, preliminary tests showed that prior motion compensation of archive film material introduces new artifacts that impede subsequent feature detection and analysis. The Self-Similarity Matrix (SSM) significantly outperforms the other tested methods in terms of recall and precision and proves robust to the complex spatio-temporal structure and manifold artifacts of archive film material. The use of robust image features and the larger analysis window (checkerboard window size of up to 8 frames) significantly increases the robustness of the method. Thus, all following experiments are performed based on the shots detected by the SSM method.

                                      R      P
CS1   Intensity Histogram (IH)       0.68   0.68
      Adaptive Threshold (AT)        0.67   0.65
      Edge Change Fraction (ECF)     0.67   0.67
      Self-Similarity Matrix (SSM)   0.91   0.91
CS2   Intensity Histogram (IH)       0.82   0.77
      Adaptive Threshold (AT)        0.88   0.92
      Edge Change Fraction (ECF)     0.77   0.76
      Self-Similarity Matrix (SSM)   0.95   0.95
CS3   Intensity Histogram (IH)       0.77   0.77
      Adaptive Threshold (AT)        0.80   0.81
      Edge Change Fraction (ECF)     0.75   0.76
      Self-Similarity Matrix (SSM)   0.93   0.93

Table 5.2: Recall (R) / Precision (P) results for shot boundary detection.

5.3.3.3 Keyframe Selection and Feature Representation

In this section we present the evaluation results of different keyframe selection methods (KS1-4) in combination with the MPEG-7 edge histogram descriptor (EHD) and the SIFT features. Since the differential descriptors (DD) are based on feature trajectories and, thus, process each frame of a shot, their performance is reported only at the shot and video level.

EHD features are matched using the simple Euclidean distance. Local feature descriptors (SIFT and DD) additionally identify the first two nearest neighbors in terms of Euclidean distance. A descriptor is accepted if the nearest neighbor distance ratio is below a predefined threshold of 0.8. Since the local descriptors represent the characteristics of a small area around a point of interest, these approaches usually result in a high number of matching descriptors. Given the partially high similarity between different shots, the total number of matches is often misleading. To increase the reliability of detected matches we ignore all ambiguous matches, i.e. all descriptors are eliminated that match several features in the other frame. Additionally, the RANdom SAmple Consensus (RANSAC) algorithm is applied to remove outliers that do not fit a homography transformation [48]. Finally, a frame votes for the shot containing the frame with the most matches. However, if fewer than 5% of the descriptors are matched or if the frame confidence is below 50-60% (depending on the features extracted), the match is considered unreliable and is ignored, i.e. the frame is classified as unknown.
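The following sketch illustrates how such a frame-level matching step (ratio test plus RANSAC-based homography filtering) could be realized with OpenCV; the ratio threshold follows the value quoted above, while the function itself, including its omission of the ambiguity check, is an illustrative reconstruction rather than the original implementation.

```python
import cv2
import numpy as np

def match_frames(desc_a, kps_a, desc_b, kps_b, ratio=0.8):
    """Nearest-neighbor ratio matching of SIFT descriptors between two
    keyframes, followed by RANSAC-based removal of matches that do not
    fit a homography. Returns the surviving matches."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(desc_a, desc_b, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    if len(good) < 4:                       # a homography needs at least 4 points
        return []
    src = np.float32([kps_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kps_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if mask is None:
        return []
    return [m for m, keep in zip(good, mask.ravel()) if keep]
```

A frame would then vote for the shot whose keyframe yields the most surviving matches, subject to the 5% match and frame confidence thresholds described above.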
At shot-level, the required shot confidence score is set initially to 60- 70%. All shots with lower confidence score are rejected and classified as unknown. At video-level, we account for the temporal alignment of matched shots and discard shots that do not fit in the ordering sequence by applying peak (local minima and maxima) detection. Finally, all unknown shots are re-evaluated by reducing the required confidence score for a positive match by 10%. case study 1 (cs1). The first case study focusses on the compari- son of analog sourced films. The different film versions share around 90% of all shots. In general, the remaining (unique) shots are the result of loss in the process of storage or copying during the years. However, corresponding shots bear also partially large differences due to e.g. film tears, contrast differences and removed frames (see for examples Figure 5.17). Table 5.3 summarizes the experimental results in terms of recall- precision measures. In general, the SIFT features outperform the re- maining descriptors independently of the keyframe selection method and on all three levels of the comparison. Surprisingly, the perfor- mance difference to MPEG-7 edge histogram (EHD) is very low. EHD proves to be a very competitive descriptor and as performant as the computationally more expensive SIFT algorithm. In terms of recall and precision, EHD scores 0.90 and 0.98 respectively whereas SIFT achieves 0.92 and 0.99. Although the differential descriptors (DD) build on information from each frame of a given shot, they show very low performance. An analysis of the extracted features shows very low variance, which results in low distinctiveness of the computed descriptors. Thus, such descriptors are only applicable for highly dis- criminative data. In a scenario of multiple low quality shots with high 5.3 film comparison 105 Figure 5.17: Examples for differences in corresponding shots in different film versions (each shot is represented by its first frame). First shot: film tear and illumination differences. Second shot: additional frame displacement (see the black lines on the corresponding frame borders. Third shot: frame mark removed in the second film version. Fourth shot: high contrast difference. visual similarity, this approach fails to correctly assign corresponding shots. Frame-Level Shot-Level Video-Level R P R P R P MPEG-7 KS-1 0.89 0.90 0.89 0.90 0.90 0.98 EHD KS-2 0.85 0.87 0.84 0.95 0.87 0.97 KS-3 0.88 0.87 0.89 0.93 0.90 0.96 KS-4 0.83 0.85 0.89 0.90 0.90 0.96 SIFT KS-1 0.92 0.90 0.92 0.90 0.92 0.99 KS-2 0.86 0.90 0.82 0.96 0.85 0.98 KS-3 0.86 0.91 0.89 0.96 0.90 0.97 KS-4 0.87 0.83 0.89 0.91 0.90 0.94 DD – – – 0.58 0.62 0.59 0.96 Table 5.3: Recall (R) / Precision (P) results for CS1. The comparison of the keyframe selection methods shows that – for the given case study – KS1 (first frame is a representative for the shot) outperforms the remaining methods closely followed by KS3 (selection of the first, middle, and last frame for a shot) and KS4 (motion-based selection of keyframes). KS2 (first and last frames are keyframes for a shot) results in lower performance due to the majority decision rule on the frame-level of the framework, i.e. if both keyframes vote for different shots, the votes are discarded and the shot is classified as unknown even if one of the frames is assigned correctly. In general, the performances of all keyframe selection methods lie closely together. 106 case studies However, the computational costs differ significantly. 
KS1 results in a single frame as representative for each shot. In contrast, within the given data set, KS4 results in the selection of 1 to 30 keyframes per shot, which increases the number of required comparisons significantly. An analysis of the false positives reveals two facts. First, the in-False positives vestigated films contain a large number of static, repeating shots. In general, such shots are assigned correctly outside of the context of a complete film. Within the given context, they are often assigned to an identical shot that appears on a different position in the film sequence. Thus, a generally correct match is classified as a false positive. Sec- ond, the large number of shots with high perceptual similarity also increases the false positives at frame- and shot-level (see Figure 5.18 for examples). Since the video-level of the framework accounts for the temporal ordering of the shots, such false positives are easily identified and correctly re-assigned. Figure 5.18: Examples for false positives. The assigned shots bear high visual similarity. In the first three examples the shots present the same scene settings with slightly different motives (e.g. people walking by or different workers). Despite the different subject in the last example, both shots have identical composition and action flow. Figure 5.19 shows resulting correspondences from the film compari- son for the Man with a Movie Camera (1929). The results show missing shots in Film A (shots 366 and 367 from Film B) which were filled with black frames (shot 386 from Film A). Especially noteworthy is the loss on information due to the introduction of an optical track at the left side of the film A as well as the contrast differences in the two sequences. case study 2 (cs2). The second case study compares different film versions of different origin: analog and digital. The film versions share around 70% of all shots. In addition to the already discussed arti- facts in archive film material, artifacts that result from a preprocessing of the material (e.g. noise reduction, stabilization, contrast enhance- ment) as well as the coding technology lead to high differences in the visual perception of different shots (see Figure 5.20). 5.3 film comparison 107 #365 #366 #367 #368 ...... ... ...F ilm A F ilm B #385 #386 #387 Optical track Figure 5.19: Experimental results from the automated film comparison. Figure 5.20: Examples for differences in corresponding shots resulting from a preprocessing step (e.g. noise reduction and contrast enhance- ment) and coding and compression technology. The results of the experiments are presented in Table 5.4. Again, SIFT features are the top performing approach. In contrast to the first case study, the selection of three keyframes as representatives for the shot proves to be the best keyframe selection method. The manifold artifacts presented in this case study require for a robust decision rule at shot- and video-level of the framework. In general, all recall-precision scores are slightly lower than those achieved in the first case study because of the intensification of the presented artifacts as well as the introduction of new artifacts due to preprocessing and video coding. MPEG-7 edge histogram bears higher sensitivity to motion artifacts in shots (see for examples the last two shots pictured in Figure 5.20) and, thus, often fails to classify shots with large motion. 
The differential- based descriptors completely fail to achieve any reasonable results.The low performance of the method on the shot-level of the framework does not allow for further evaluation on video-level. The video-level involves peak detection in the ordering sequence of shots and requires a precision of at least 51% for the detection of a reliable sequence. 108 case studies Frame-Level Shot-Level Video-Level R P R P R P MPEG-7 KS-1 0.65 0.56 0.65 0.56 0.81 0.82 EHD KS-2 0.57 0.57 0.58 0.62 0.81 0.78 KS-3 0.69 0.60 0.69 0.75 0.85 0.82 KS-4 0.63 0.57 0.68 0.58 0.79 0.79 SIFT KS-1 0.81 0.88 0.81 0.88 0.87 0.85 KS-2 0.76 0.75 0.76 0.77 0.76 0.88 KS-3 0.79 0.74 0.83 0.84 0.91 0.88 KS-4 0.79 0.80 0.81 0.83 0.86 0.84 DD – – – 0.03 0.02 – – Table 5.4: Recall (R) / Precision (P) results for CS2. case study 3 (cs3). The last case study compares an original documentary and a compilation film that uses less than 5% of the shots. This case study clearly outlines the limitations of the MPEG-7 edge histogram and the differential-based descriptors: both approaches fail to find sufficient corresponding shots. However, SIFT features also achieve very low performance. Best recall and precision scores are 40% and 70% respectively using the KS1 keyframe selection method. 5.3.3.4 Unique Shot Detection Our last experiment investigates the identification of shots that are unique. The SIFT descriptor correctly identifies 83% of all unique shots followed by the MPEG-7 Edge Histogram (EHD) with 67% and the differential-based descriptors with 29%. The large differences on a percentage basis are due to the low number of unique shots in our data set. Out of 44 unique shots only 24 are longer than three frames which is a basic requirement in our framework. Thus, the absolute difference of 4 shots between the performance of the MPEG-7 Edge Histogram and the SIFT descriptors is relatively low (see Table 5.5 for details). Unique shots MPEG-7 EHD SIFT DD Man with a Movie Camera (1929) 2 2 2 1 Enthusiasm (1930) 22 14 18 6 Aver. 0.67 0.83 0.29 Table 5.5: Unique shot detection. 5.3 film comparison 109 5.3.4 Related Work Existing approaches for video copy detection usually rely on the ex- traction of local and/or global features that are matched against a video reference set. In general, algorithms based on global features Global feature-based approachesallow for efficient computation, search, and indexing. Typical features include color, edge, and motion information. Lie et al. propose a compact binary signature based on color histogram for the recogni- tion of TV commercials [88]. Zhang et al. use color moments and a stochastic attributed relational graph matching to detect duplicated news videos [180]. Kim et al. apply color and motion information to describe video content [73]. Video similarity is measured using a group-based record linkage technique. Leon et al. apply video to- mography to create spatio-temporal signatures (see Section 3.2.2) [85]. Bertini et al. propose video fingerprints based on MPEG-7 color and edge descriptors and use edit distance, defined as the minimal cost of insertions, deletions, and substitutions of symbols to make two fingerprints equal, for measuring video similarity [16]. However, such methods align two videos for the entire length and, thus, are inefficient if only a small part of the reference or query video is a copy or if there is a single sequence appearing multiple times in the reference video. This limitation is improved by the approach proposed by Yeh et al. [167]. 
The authors extend the MSF as proposed in [86] by imple- menting color histograms. Following, they extend the edit distance to find local alignments of two videos based on the Smith-Waterman algorithm [140]. In general such global feature based methods result in a compact feature representation and, thus, enable efficient search and indexing. However, they are not robust to illumination changes, cropping, and partial occlusions. Local feature-based methods overcome these limitations and often Local feature-based approachesachieve better performance. Sivic et al. use a combination of affine covariant regions and SIFT for object and scene matching and re- trieval [137]. Laptev et al. propose spatio-temporal fingerprints for event detection [81]. Joly et al. apply the Harris corner detector and a differential description of the local region around each interest point [69]. This approach was shown to be superior over further meth- ods employed in the literature such as ordinal intensity signatures or space-time interest points [83]. Zhou et al. present a video similar- ity model which combines a normalized chromaticity histogram and shot duration [184]. The proposed model requires color information and uses each frame of a shot to build its visual feature. A further limitation of the approach is that it is only robust to low-level trans- formations such as frame rate conversion and compression format. Sand and Teller propose an image-registration method for aligning two videos recorded at different times into the spatio-temporal do- main [127]. The authors combine interest point based matching and 110 case studies local motion estimation (based on Kanade-Lucas-Tomasi (KLT) frame tracker) for frame alignment. The proposed method has low invari- ance to affine transformation and high computational costs of several minutes per second of video. Recently, Douze et al. applied a combi- nation of Hessian-Affine detector and SIFT descriptor and integrate it into a bag-of-features framework [40]. The authors report best results on the TRECVID 2008 copy detection task providing manifold video modifications such as contrast change, blur and noise introduction, occlusions and cropping. Our work shows some similarities to the approach by Ng et al. [112].Current limitations and scope of the work The authors propose a tree matching algorithm based on the hierar- chical structure of a video. Unlike our work, the authors define the video shot as the lowest hierarchical level of a video structure whereas each shot is represented by its first frame. Following, similarity is computed by a combination of color histogram and shot style (cam- era motion and the length of a shot). A significant limitation of this work is that it is only useful for comparison of videos which exhibit high distinctive patterns of motions and have not undergone strong modification (e.g. illumination changes or frame deletion/insertion). Furthermore, recent evaluations bear mainly two limitations: First, the used data sets comprise video clips of high quality and pre-defined fixed length between 5 and 60 sec. Second, the experiments are usually performed using synthetic transformations such as resizing, frame shifting, contrast and gamma modification, gaussian noise additions, etc. Our work differs from previous research in the area of video copy detection in several aspects: 1. We aim at the comparison of complete film versions. 
The addi- tional knowledge about the video structure allows for the easy integration of temporal constraints of matched video frames and shots and increases the overall matching performance. 2. We evaluate the combination and influence of different state- of-the-art algorithms for shot boundary detection, keyframe selection and feature representation. 3. We perform the evaluation on a real-world video data set of archive film material exposing challenging artifacts such as: • artifacts originating from the analog filmstrips, e.g. contrast and exposure changes, blurring, frame shift, dirt, film tears; • digitization artifacts, e.g. coding transformations; • technical transformations, e.g. changes in video format, resizing, cropping; and • editorial operations such as frame/shot insertion and frame/ shot deletion. 5.4 recurring element detection 111 5.3.5 Conclusion We presented an approach for film comparison, which accounts for the overall video structure. Within this framework we compared state- of-the-art methods for shot boundary detection as well as feature representation and investigated the influence of keyframe selection on the performance of the film comparison process. We presented the results of the evaluation based on a real world scenario on challenging archive film material. The results of the performed experiments put the competition between global and local descriptors into perspective. SIFT features are very discriminative and reliable and thus the amount of data to be explored can be reduced significantly. Despite the low quality and partially large differences between corresponding shots, just three frames per shot are sufficient to correctly assign them. However, MPEG-7 Edge Histogram is more competitive than expected. Where SIFT is starting to be more and more the universal weapon with which to attack such problems, MPEG-7 edge histogram proves to be almost as performant as the computationally much more expensive SIFT. Despite the low video quality and partially large differences between corresponding shots, MPEG-7 edge histogram descriptors achieve outstanding performance in terms of recall and precision that is only marginally lower than those of SIFT features . 5.4 recurring element detection Near-duplicate detection is a rapid emerging research field focused at the identification of identical or near-identical video sequences. Its vast development is mostly driven by requirements of large media providers, advertising agencies, and commercial companies. Near- duplicate detection facilitates application scenarios such as improved search and retrieval of videos by reducing the number of duplicated videos, the monitoring of commercials broadcastings, and copyright protection [131]. While currently near-duplicate detection explores video sequences as a whole, it does not allow for search on more detailed level such as the investigation of duplicated characters or objects. The detection of such recurring elements is a new requirement for automated film analysis stated by art and film experts. Recurring elements are a common tool in visual arts such as paint- ing, photography, and filmmaking. Examples for such elements can be found in the paintings by Salvador Dali (the piano is typical for his Surrealist compositions) or the films by Dziga Vertov (rails, spinning wheels, etc.) or Alfred Hitchcock (birds, cameo appearances, etc.), etc. Recurring elements are often applied to convey a certain message, theme, or mood. 
They are usually called motifs and take very different shapes such as the use of a specific color or sound in a given context, a particular movement of an object or character, camera position, com- 112 case studies position, or even a story line [14, 18]. Motifs are often highly symbolic. Thus, their detection requires for semantic understanding and relies on the experience and attentiveness of the audience. An example for such a visually highly varying motif is the X-motif in The Departed (2006) by Martin Scorsese (see Figure 5.21a). The X appears whenever a character is in mortal danger and takes very different shapes by means of lighting, color, and material. However, motifs can also be easily recognizable such as the ring in the Lord of the Rings (2001-2003) by Peter Jackson (see Figure 5.21b). (a) The X-motif in The Departed (2006). (b) The easily recognizable ring-motif in the Lord of the Rings (2001-2003). Figure 5.21: Examples for motifs in movies. In this section we explore the feasibility of the detection of recur- ring elements by means of automated computer vision methods. A clear restriction for such an approach is the requirement for certain similarities in the visual appearance of present recurring elements. We propose a method based on local features that are robust to changes in illumination, rotation, and scaling. Furthermore, the system automati- cally learns recurring regions and creates links between related views of one and the same object. Following, detected elements may differ significantly in their position, orientation, and scale. In our approach, a region (or element) can be an object, a part of it, or a recurring char- acter (usually the main actors). Finally, the proposed method allows for the detection of recurring elements not only in a single movie but also among different works from the same author and, thus, can support a high-level film analysis currently performed tediously and manually by film experts. In summary, the main contributions of this section are: • We define a new research task motivated by the requirements of film experts. • We propose an automated method to detect recurring elements in movies independently of their position, orientation, and scale. • The output of the proposed system allows for a variety of sum- marizing visualizations of semantically related information. 5.4 recurring element detection 113 This section is organized as follows. Section 5.4.1 describes the algorithm for recurring region detection. Section 5.4.2 presents exper- iments we performed as proof-of-concept for the evaluation of the proposed algorithm. In Section 5.4.3 we give an overview over related research. We conclude in Section 5.4.4 and give an outlook for further research. 5.4.1 Approach The aim of the proposed system is to detect recurring regions within a video sequence. A video sequence can be a shot, a scene, or a whole movie. Detected regions have to meet two essential requirements. First, they have to be distinctive and not homogeneous regions such as the sky, or a wall. Second, detected regions should allow for a multiple view representation of the captured element. Thus, the proposed system includes two critical components, region detection and region representation, which will be discussed in the following sections (see Figure 5.22). Figure 5.22: Algorithm workflow. 
Given a video sequence for recurring object detection, the first step is, as in any general video analysis approach, the detection of shot cuts and the extraction of keyframes as shot representation. Both topics are well-investigated research areas resulting in numerous existing methods. For shot boundary detection we employed the method pro- posed by Truong et al. [153]. It is a simple adaptive thresholding technique detecting peaks in the histogram difference curve of consec- utive frames. For each detected shot, we extract the first, middle, and last frames as keyframes. Despite the simplicity of both methods, they proved to work efficiently with the involved data set and achieved satisfactory results in the performed experiments. Since the input for the proposed system is a sequence of keyframes, they can be easily replaced by more sophisticated methods if needed. 5.4.1.1 Region Detection For each keyframe KSj , where Sj is the corresponding shot, we detect distinct interest points and extract local features based on SIFT [92]. SIFT features are invariant to changes in translation, scale, and rotation and 114 case studies partially invariant to changes in illumination and affine distortions and, thus, allow for matching across different viewing conditions1. Each feature F is described by a quadruple {K Sj i , x,y,D}, where Ki is the associated keyframe id, x and y the corresponding coordinates, and D the local feature descriptor. Following, we perform initial, coarse region detection based on fea- ture matching. Each keyframe is compared to each following keyframe in the input video sequence. Feature descriptors are matched by iden- tifying the first two nearest neighbors in terms of Euclidean distances. A descriptor is accepted if the nearest neighbor distance is below a predefined threshold. The value of 0.8 was determined experimentally and used through the evaluation tests described in Section 5.4.2. To reduce the number of false matches we introduce a loose spatial con- straint. Each match is considered within the cluster of its three nearest neighboring feature points. A match is accepted if there is at least one further match present in the cluster. Finally, all accepted clusters are set as initial regions. Figure 5.23 shows an example for an initial region detection from the movie Run Lola Run (1998) by Tom Tykwer. Compared are two frames from two different shots showing Lola and her boyfriend on the run from the police. The scene is shot from two different viewpoints (see Figures 5.23a-5.23b). Although the matching process produces a number of false positives (see the red lines in Figure 5.23c), most of the false matches are dropped due to the spatial constraint on the next stage of the algorithm (see Figure 5.23d). 5.4.1.2 Region Analysis and Representation The first stage of the algorithm, coarse region detection, results in numerous regions. To reduce their number we first remove all regions with dimensions and area below a given threshold. Following, we perform region growing by detecting and merging all overlapping regions. Finally, each region R is defined as {K Sj i , x,y,w,h,RM}, where w is the width of the region, h its height, and RM is a set of links to matched regions {R1,R2, ...,RN}. Figure 5.24 visualizes the process of region dropping and merging for the previous example from the movie Run Lola Run. From 28 initially detected regions, more than 50% were dropped due to the dimensional restriction (see Figure 5.24a). 
In our experiments we set the minimum for both width and height of a detected region to 5 px. In this way, a region can still be visually perceived and interpreted by the viewer even if it only depicts a small part of an object. Subsequently, all overlapping regions are merged, building preliminary final regions (see Figure 5.24b). Since the whole process of region detection for the starting frame is repeated for all following keyframes, detected regions are constantly updated in size, quantity, and the set of linked regions. For the detection of final recurring elements, all linked regions can be recursively traversed. Eventually, some regions show only a few repetitions over the whole video sequence, while others indicate recurring elements (see Figure 5.24c).

Figure 5.23: Example of initial region detection (for better visualization some spacing is introduced within the detected regions). (a) Starting frame. (b) Keyframe from a following shot. (c) Matched features: white dots identify detected interest points in the corresponding frame; red lines indicate false matches; green lines indicate correctly matched features. (d) Detected initial regions in the starting frame. Red dots indicate features dropped due to the spatial constraint.

Figure 5.24: Example of region dropping and merging. (a) Region dropping: white regions are removed due to the dimensional constraint. (b) Region merging: overlapping regions are merged together. (c) Final region linking: red borders indicate false positive links; yellow borders show templates with similar parts of the same object; green borders indicate correctly linked templates. Dotted lines show elements with very few repetitions over the whole video sequence. Solid lines indicate detected recurring elements for the investigated video sequence.

5.4.2 Experiments

As proof-of-concept for the proposed algorithm we perform two experiments. The first one focuses on the detection of recurring elements in a single, contemporary movie, and the second one explores the reuse of elements in and among several archived documentaries by the same filmmaker.

5.4.2.1 Contemporary Movie

For the first experiment we employed the German movie Run Lola Run (1998). The story follows Lola, who has 20 minutes to raise 100,000 German marks and save her boyfriend's life. The film sequentially presents three possible scenarios for the development of the story and its outcome. All three scenarios share the same locations and characters. Consequently, the film involves many recurring elements (objects as well as characters), which makes it extremely suitable for our experimental tests.

Figure 5.25 depicts the distribution of the number of linked shots per detected region, sorted in decreasing order. For better visualization we only show the top 2% of regions, which have been linked to 7 or more shots. Approximately 98% of all detected regions are linked to fewer than 7 shots and are thus considered insignificant for our application scenario.

Figure 5.25: Number of detected recurring elements vs. corresponding number of linked shots for the top 2% of detected regions.

The definition of ground truth for recurring objects in a movie is a tedious process that is probably feasible only for the filmmaker. Therefore, we focus on the precision performance of the conducted experiments. In our evaluations, we define precision as the ratio of correctly linked regions to all linked regions.
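To make the evaluation procedure explicit: the linked regions (RM) of a detected region can be traversed, as described in Section 5.4.1.2, to collect all views that belong to one candidate recurring element, and precision is then computed over the resulting links. The following is a small illustrative sketch with hypothetical structures; it is not the actual thesis implementation and uses an iterative traversal equivalent to the recursive one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Region:
    rid: int                                         # region id
    shot: int                                        # shot the region was detected in
    links: List[int] = field(default_factory=list)   # ids of matched regions (RM)

def collect_element(start: Region, index: Dict[int, Region]) -> Set[int]:
    """Follow the RM links transitively and return all region ids reachable
    from `start`; together they form one candidate recurring element."""
    visited: Set[int] = set()
    stack = [start.rid]
    while stack:
        rid = stack.pop()
        if rid in visited:
            continue
        visited.add(rid)
        stack.extend(index[rid].links)
    return visited

def precision(correct_links: int, all_links: int) -> float:
    """Precision as defined above: correctly linked regions / all linked regions."""
    return correct_links / all_links if all_links else 0.0

# Toy example: regions 1 -> 2 -> 3 are linked across three shots.
regions = {1: Region(1, shot=4, links=[2]),
           2: Region(2, shot=9, links=[1, 3]),
           3: Region(3, shot=15, links=[2])}
assert collect_element(regions[1], regions) == {1, 2, 3}
```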
The precision for the top 2% of all detected regions is approximately 75%, which confirms the potential of the algorithm. In summary, we investigated over 200 regions together with their associated regions. On average, 10 shots (or 17 regions, since multiple keyframes per shot are possible) have been linked to each detected region. The average area per region is 38% of the frame size (see Figure 5.26). It turns out that detected regions should not be too small. Anything below 5% of the frame size is not really a meaningful region but rather a part of an object such as a patch of skin, a shirt detail, or a wall texture. Consequently, such a region cannot be tracked reliably, since it is found in a large number of frames that are not semantically related.

Figure 5.26: Distribution of the size of detected recurring elements.

Currently, a few falsely linked regions reduce the overall performance. It is an implication of the approach that if a newly detected region is matched to an existing one, the new region inherits all established links of the second region.

Figure 5.27 shows four examples of detected recurring regions. The first example depicts two main characters, Lola and her father, from various scenes in the movie. The remaining examples show recurring objects: a huge dollar bill on the wall of the office of Lola's father, a phone, and a flying bag. The last example also demonstrates a false link to Lola's hair, since the texture of the bag and the texture of Lola's hair are highly similar. Horizontal lines illustrate the level of linked elements. Especially noteworthy is the visual variance within the same level. While in the example with the dollar bill there is a high degree of visual similarity on the level below the top region, in the first example Lola and her father are matched separately and the linked regions do not share any common visual information although they share the same semantic topic.

Figure 5.27: Examples of detected recurring elements: solid lines depict directly linked regions; dashed lines show indirectly associated successors of the same region.

Figure 5.28 shows a detected recurring element with its complete set of linked regions. The example shows Lola, running to raise money, her boyfriend looking at the clock on the wall, and a close-up of the clock, a scene that occurs frequently in the movie. All three objects (the two characters and the clock) are repeatedly found and linked together in various degrees of detail: from the initial close-up, via a long shot of Lola, to an extreme close-up of her trousers. Based on the region characteristics, the next task could be the classification of regions according to shot scale into, e.g., a close-up, a medium shot, and a long shot. The example illustrates the two main characteristics of the approach. First, due to the local features applied in the one-to-one keyframe comparison, directly linked regions (depicted by directional solid lines) share some common visual information. This does not necessarily hold true for the successors of a given region. Note the green highlighted regions in Figure 5.28: although they are both successors of the same region, they do not share common visual information. However, they both belong to the same semantic topic. Second, linked regions can be distributed over the entire movie.
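The link inheritance just described can be sketched as follows. The helper below is a hypothetical illustration, not the thesis implementation; it makes explicit both why successors of a region may share only a semantic topic and no visual information, and why a single false direct match propagates its links and thus lowers precision.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class Region:
    rid: int
    direct_links: Set[int] = field(default_factory=set)     # one-to-one keyframe matches (share visual information)
    inherited_links: Set[int] = field(default_factory=set)  # links taken over from matched regions (successors)

def link_new_region(new: Region, existing: Region) -> None:
    """When a newly detected region is matched to an existing one, it is
    linked directly to that region and inherits all links the existing
    region has already established."""
    new.direct_links.add(existing.rid)
    existing.direct_links.add(new.rid)
    # Inherited links: these successors need not share any visual
    # information with `new`, only the same semantic topic -- and a
    # single false direct match therefore propagates its links as well.
    new.inherited_links |= existing.direct_links | existing.inherited_links
    new.inherited_links.discard(new.rid)
```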
In addition to the tree representations used in the discussed examples, a variety of visualization methods can be applied to represent semantically related information based on detected recurring elements, such as MPEG-7 collections, hierarchical and sequential summaries, etc.

5.4.2.2 Archived Documentaries

The second experiment we performed in the context of recurring element detection investigates three archived documentaries by Dziga Vertov: Man with a Movie Camera (1929), Enthusiasm (1931), and Three Songs about Lenin (1934). The reason for choosing these three documentaries is a motif, suggested by film experts, that is shared by all three movies: the rails. Hence, we first explore the movies separately and compare the results to those of the contemporary material. Subsequently, we verify whether or not the proposed algorithm is able to detect recurring elements among different works of the same filmmaker.

Figure 5.28: A complete example of a detected region and the corresponding linked elements. Yellow boxes depict corresponding shots, blue lines indicate keyframe positions within the shots. Solid lines between the regions show a direct linkage between regions. The two green highlighted regions are an example of siblings of the same region.

In contrast to the contemporary movie from the previous experiment, the explored archived documentaries exhibit fewer recurring elements with a much lower number of linked shots per detected region (see Figure 5.29). This is mainly due to the fact that most documentaries care less about narration and actors: they often change locations, and characters do not necessarily recur. As a result, detected recurring elements mostly stem from a long camera take that was cross-cut with a second scene; hence, such shots exhibit a high visual similarity. Figure 5.30 shows an example of detected recurring elements in the movie Enthusiasm (1931); some examples from Man with a Movie Camera (1929) are depicted in Figure 4.32.

Figure 5.29: Number of detected recurring elements vs. corresponding number of linked shots.

Figure 5.30: An example of a detected recurring element in Enthusiasm (1931).

In general, documentaries turn out to exhibit, to a greater extent, recurring scenes or sets rather than recurring elements (objects or characters). Consequently, detected regions predominantly occupy (nearly) the full frame size (see Figure 5.31).

Figure 5.31: Distribution of the size of detected recurring elements.

Finally, the precision is comparable to the results achieved in the first experiment: the average precision for the archived documentaries is approximately 70%.

The starting point for the cross-movie analysis is the set of previously detected recurring regions in each movie. Similar to region tracking within a single movie, corresponding regions are matched using local features and a nearest-neighbor ratio matching strategy. For our evaluations, at least five matches per region are required to define a reasonable match. Figure 5.32 shows the top three elements detected in the explored movies: rails, eye, and crowd. Despite the partially high visual dissimilarity, all three detected elements represent typical Vertov motifs applied across different works.

5.4.3 Related Work

To the best of our knowledge, recurring element detection has not been subject to research so far. Related research areas comprise near-duplicate detection and object detection and tracking.
Figure 5.32: Cross-movie analysis. (a)-(b): rails motif (Man with a Movie Camera, Enthusiasm). (c)-(e): eye motif (Man with a Movie Camera, Enthusiasm, Three Songs about Lenin). (f)-(h): crowd motif (Man with a Movie Camera, Enthusiasm, Three Songs about Lenin). All detected regions are embedded into the original frame for better visualization.

Near-duplicate detection aims at identifying images or video sequences showing slight variations due to editing or changes in lighting, viewpoint, motion, etc. [40, 70, 183]. This research area has emerged in recent years for a variety of applications such as the recognition of TV commercials, the detection of duplicated news videos, media linking, and copyright infringement detection. Recently, Huang et al. proposed a method for scene recognition based on near-duplicate object detection [64]. The authors argue that shots of the same scene most probably share a large number of similar objects or background. However, the authors do not perform actual object detection but only simple keypoint detection and tracking. A shot is then represented by an averaged space-time feature, somewhat imprecisely called an object key feature. In contrast to recent research in near-duplicate detection, our work operates at a more detailed level. While existing approaches detect duplicated or reused media (images or video) as a whole, we aim at the identification of recurring elements within a given medium and their reuse among different media.

Object detection and tracking usually require a predefined appearance model of the salient object or a priori information about the scene for reliable background subtraction and motion tracking [44, 58, 84]. The application scenarios are manifold, ranging from traffic control and surveillance to sports video analysis and the recognition of human action. Recently, Celik et al. proposed a method for unsupervised object detection in unlabeled surveillance video data [28, 29]. The authors first detect salient objects based on motion information and simple dimensional features (e.g. height). In the next step, similarity-based clustering allows for the grouping of objects according to the category they belong to. The approach is only applicable in a restricted scene with a static camera. Salient objects have to be moving and may exhibit only a limited degree of perspective deformation due to the dimensional features used in the initial step. Our approach differs significantly from existing methods for object detection and tracking with respect to the available knowledge about both the object and the scene, and with respect to the degree of detection, i.e. a general category (a person, a car, etc.) vs. a specific subject.

5.4.4 Conclusion

In this section we presented a new approach for the detection of recurring elements in movies. Since detected regions can be an object, a part of it, or a character, the system allows for the detection of visually similar motifs and recurring characters. The linking between detected regions exposes different views of the recurring elements and facilitates the quick retrieval of relevant sequences. Performed experiments with different works by the same filmmaker demonstrate the potential of the proposed algorithm to assist experts in film analysis and film studies.

Part III: Summary

6 Summary and Conclusion

The best way out is always through.
— Robert Frost

Existing approaches for automated film and video analysis share, for the most part, two essential characteristics:

1. Relevant features are identified in comparison with other movies, e.g. in the context of genre recognition: a horror movie exhibits a darker color distribution than a comedy, and
2. Consumer-driven applications aim at an improved retrieval and handling of media data, e.g. video summarization, genre and event recognition, copy detection, etc.

In contrast, this work addressed the task of automated film understanding from a filmmaking point of view. Instead of the final product, we explored the process of film creation, editing, and presentation as a source for relevant features. Every choice of a given film technique or setting has a purpose. While the automated detection of some well-established film techniques requires additional knowledge and is not feasible at the current state of research, many other techniques can be analyzed fully automatically using computer vision methods. The identified research tasks mainly address the requirements of film experts and improve the process of film understanding and film studies. However, applications for the broader audience can also make use of the acquired knowledge. An example is the use of recurring element detection. The proposed method can reveal a common semantic topic between visually almost dissimilar regions. In a next step, this information can be applied for enriched visualization and summarization methods (e.g. in the context of MPEG-7).

6.1 Achievements

This work investigated the possibilities for building a common ground between the requirements of film experts and existing computer vision methods for automated film analysis and understanding. Within the scope of this study we presented a mapping between media aesthetic elements and concepts that influence the production, presentation, and perception of films, their application by means of well-established film techniques, and existing methods and features in computer vision. This new view on film analysis allowed for the exploration of the boundaries of current research in computer vision and for the identification of open research tasks. Finally, we presented three novel research questions and their solutions in the context of automated film analysis: camera take reconstruction, film comparison, and recurring element detection. The performed experiments reveal two significant potentials:

1. The proposed algorithms can assist film experts by providing support for tasks that are currently performed manually (e.g. for film archives and museums, for film studies, and for the filmmaker looking for specific footage).
2. The proposed algorithms provide a roadmap and pave the way for further application scenarios such as montage pattern analysis, the comparison of different film cuts, the identification of missing shots, the reconstruction of the original film cut, or the detection of recurring elements in the works of the same filmmaker.

6.2 Future Development

• This thesis presents initial fundamental research in the context of automated film understanding. It is an attempt to provide a mapping between features that are of particular interest for the film study community and computer vision approaches.
Further research and, in particular, intensified communication and a common vocabulary between film experts and computer vision experts are required to complement the proposed mapping and to identify further research questions in the context of automated film analysis and understanding.

• Another major issue is missing ground truth. To enable further research and the comparison of different approaches, a common data set is required. However, the definition of ground truth in this context is a notably tedious process that is probably feasible only for film experts. Furthermore, the resulting ground truth is often shaped by individual judgements and subjectivity.

• The algorithms proposed in this thesis are designed as proofs of concept. Further work is required to improve the overall performance and computation time in order to provide practical tools for the film study community.

• The performed study identified a large set of achievable research tasks that have not been subject to research so far, e.g. the analysis of film compositions and continuity techniques (see Table 4.1).

• In this work, the differentiation between feasible and infeasible tasks has been made based on the current state of the art in computer vision and under the assumption that no a priori knowledge is available (e.g. about characters, objects, shapes, etc.). However, tasks such as rhythm analysis or motif detection are of particular interest to the film study community (even within predefined constraints) and bear high potential for further research.

• This thesis focused on the study of the formal features of film style. We explored films in relation to their filmmakers and in terms of their construction and applied technologies. Film analysis can also be approached from a very different direction, e.g. how films are perceived and responded to by the audience. Such approaches are closely dependent on users' experiences and preferences. Furthermore, they assume that certain film techniques trigger well-defined, distinct emotions and that filmmakers apply such film techniques according to these predefined associations. However, as already discussed in Chapter 2, the psychological interpretation of film techniques is not yet fully explored. Recent research in affective content analysis focuses mainly on the detection and classification of emotions in distinct movie types, such as horror and comedy, using a limited number of film techniques (e.g. color distribution, key lighting, shot length, motion, and audio features). This thesis broadens the horizon for affective content analysis by discussing a wide set of fundamental film techniques and their intended purpose. Further research is required for the identification of the most effective film techniques for affective content analysis. Nevertheless, such research is mostly applicable to mainstream cinema and not to avant-garde and art movies, where the artistry of the filmmaker plays a central role in the applied film techniques and in the shaping of the movie.

• The development and, primarily, the establishment of new technologies involve changes in the filmmaking process: new film techniques appear and existing ones may no longer be applicable. An example of such a technology is stereoscopic or 3D cinema. Besides the strong perception of depth in 3D, notable adaptations include fine-tuned compositions due to the limitations of stereoscopic perception and a slower pace due to increased visual complexity.
In general, research in the context of auto- mated film analyses stand to substantially benefit from such technology developments. Fundamental film techniques, such as those discussed in this thesis, are still present and essential in filmmaking. However, advanced research tasks, such as re- curring element detection that require for object tracking and 130 summary and conclusion modeling, can make use of additional information available in stereoscopic images. B I B L I O G R A P H Y [1] American Standard Acoustical Terminology. American Standards Association, 1960. [2] A. E. Abdel-Hakim and A. A. Farag. Csift: A sift descriptor with color invariant characteristics. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1978–1983, 2006. [3] B. Adams, C. Dorai, and S. Venkatesh. Automated film rhythm extraction for scene analysis. In IEEE International Conference on Multimedia and Expo, pages 1056–1059, 2001. [4] B. Adams, C. Dorai, and S. Venkatesh. Finding the beat: An analysis of the rhythmic elements of motion pictures. In Asian Conference on Computer Vision, 2002. [5] B. Adams, C. Dorai, and S. Venkatesh. Media computing: compu- tational media aesthetics, chapter Formulating Film Tempo: The Computational Media Aesthetics Methodology in Practice, pages 57–79. Kluwer Academic Publishers, 2002. [6] B. Adams, C. Dorai, and S. Venkatesh. Toward automatic extrac- tion of expressive elements from motion pictures: tempo. IEEE Transactions on Multimedia, 4(4):472–481, 2002. [7] P. Aigrain, P. Joly, and V. Longueville. Medium knowledge- based macro-segmentation of video into sequences. Intelligent Multimedia Information Retrieval, pages 159–173, 1997. [8] J. Anderson and B. Anderson. The myth of persistence of vision revisited. Journal of film and video, 45(1):3–12, 1993. [9] J. Annesley, J. Orwell, and J.-P. Renno. Evaluation of mpeg7 color descriptors for visual surveillance retrieval. In Second Joint IEEE International Workshop on Visual Surveillance and Per- formance Evaluation of Tracking and Surveillance, pages 105–112, 2005. [10] R. Arnheim. Art and Visual Perception: A Psychology of the Creative Eye. University of California Press, 1974. [11] D. Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981. [12] A. Baumberg. Reliable feature matching across widely sepa- rated views. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 774–781, 2000. 131 132 bibliography [13] H. Bay, T. Tuytelaars, and L. V. Gool. Surf: Speeded up robust features. In European Conference on Computer Vision, volume 3951/2006 of LNCS, pages 404–417. Springer, 2006. [14] F. E. Beaver. Dictionary of film terms: the aesthetic companion to film art. Peter Lang Publishing, 2009. [15] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002. [16] M. Bertini, A. D. Bimbo, and W. Nunziati. Video clip matching using mpeg-7 descriptors and edit distance. In International Conference on Image and Video Retrieval, volume 4071 of LNCS, pages 133–142, 2006. [17] R. A. Block. Cognitive models of psychological time, chapter Models of psychological time, pages 1–35. Lawrence Erlbaum Associates, 1990. [18] D. Bordwell and K. Thompson. Film art: an introduction. McGraw-Hill, 8th edition, 2008. [19] A. Bosch, A. Zisserman, and X. Mu noz. Scene classification using a hybrid generative/discriminative approach. 
IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 30(4):712–727, 2008. [20] E. Britannica. motion picture. Encyclopædia Britan- nica Online. http://www.britannica.com/EBchecked/topic/ 394107/motion-picture (last checked: 2011-09-15), 2011. [21] E. Britannica. time perception. Encyclopædia Britan- nica Online. http://www.britannica.com/EBchecked/topic/ 596177/time-perception (last checked: 2011-09-15), 2011. [22] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In Eu- ropean Conference on Computer Vision, volume 4 of LNCS, pages 25–36. Springer, 2003. [23] G. J. Burghouts and J.-M. Geusebroek. Performance evaluation of local colour invariants. Computer Vision and Image Understanding, 113(1):48–62, 2009. [24] J. B. Burns, A. R. Hanson, and E. M. Riseman. Extracting straight lines. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 8(4):425–455, july 1986. [25] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679– 698, 1986. bibliography 133 [26] P. Cavanagh. Short-range vs long-range motion: Not a valid distinction. Spatial Vision, 5(4):303–309, 1991. [27] P. Cavanagh and G. Mather. Motion: The long and short of it. Spatial Vision, pages 103–129, 1989. [28] H. Celik, A. Hanjalic, and E. A. Hendriks. Unsupervised and si- multaneous training of multiple object detectors from unlabeled surveillance video. Computer Vision and Image Understanding, 113(10):1076–1094, 2009. [29] H. Celik, A. Hanjalic, E. A. Hendriks, and S. Boughorbel. Online training of object detectors from unlabeled surveillance video. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–7, 2008. [30] V. T. Chasanis, A. C. Likas, and N. P. Galatsanos. Scene detection in videos using shot clustering and sequence alignment. IEEE Transactions on Multimedia, 11(1):89–100, 2009. [31] L. Chen and M. T. Özsu. Rule-based scene extraction from video. In International Conference on Image Processing, volume 2, pages 737–740, 2001. [32] L. Chen, S. J. Rizvi, and M. T. Özsu. Incorporating audio cues into dialog and action scene extraction. In SPIE Storage and Retrieval for Multimedia Databases, pages 252–264, 2003. [33] I. Cherif, V. Solachidis, and I. Pitas. Shot type identification of movie content. In International Symposium on Signal Processing and Its Applications, pages 1–4, 2007. [34] Y. Cui, J. S. Jin, S. Zhang, S. Luo, and Q. Tian. Music video affective understanding using feature importance analysis. In ACM International Conference on Image and Video Retrieval, pages 213–219, 2010. [35] K. Dancynger. The technique of film and video editing: history, theory, and practice. Focal Press, 4th edition, 2007. [36] M. De Santo, G. Percannella, C. Sansone, and M. Vento. Dialogue scenes detection in mpeg movies: A multi-expert approach. In Multimedia Databases and Image Communication, pages 192–201, 2001. [37] R. Deriche. Recursively implementing the gaussian and its derivatives. In International Conference on Image Processing, pages 263–267, 1992. [38] A. Desolneux, L. Moisan, and J.-M. Morel. Meaningful align- ments. International Journal of Computer Vision, 40:7–23, 2000. 134 bibliography [39] C. Dorai and S. Venkatesh, editors. Media computing: computa- tional media aesthetics. Kluwer Academic Publishers, 2002. [40] M. Douze, H. Jégou, and C. Schmid. An image-based approach to video copy detection with spatio-temporal post-filtering. 
IEEE Transactions on Multimedia, pages 257–266, 2010. [41] R. Dyer. Film studies: critical approaches, chapter Intorduction to film studies. Oxford University Press, 2000. [42] S. Eickeler and S. Müller. Content-based video indexing of tv broadcast news using hidden markov models. In IEEE Interna- tional Conference on Acoustics, Speech, and Signal Processing, pages 2997–3000, 1999. [43] A. Ekin, A. Tekalp, and R. Mehrotra. Automatic soccer video analysis and summarization. IEEE Transactions on Image Process- ing, 12(7):796–807, 2003. [44] A. Ess, K. Schindler, B. Leibe, and L. V. Gool. Object detection and tracking for autonomous navigation in dynamic environ- ments. International Journal of Robotics Research, 29(14):1707–1725, 2010. [45] H. Fastl and E. Zwicker. Psychoacoustics: Facts and Models. Springer, 3 edition, 2007. [46] C. Feng. Code for vanishing point detection using jlinkage and lsd. http://code.google.com/p/vpdetection (last checked: 2011-09-15), 2011. [47] V. Ferrari, T. Tuytelaars, and L. V. Gool. Simultaneous object recognition and segmentation from single or multiple model views. International Journal of Computer Vision, 67(2):159–188, 2006. [48] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image anal- ysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981. [49] S. C. Gaddam. Code for calculating color clouds. http://cns. bu.edu/~gsc/ColorHistograms.html (last checked: 2011-09-15), 2011. [50] U. Gargi, R. Kasturi, and S. H. Strayer. Performance characteriza- tion of video-shot-change detection methods. IEEE Transactions on Circuits and Systems for Video Technology, 10(1):1–13, 2000. [51] J. M. Gauch and A. Shivadas. Identification of new commercials using repeated video sequence detection. In IEEE International bibliography 135 Conference on Image Processing, volume 3, pages II–1252–1255, 2005. [52] M. Gavrielides, E. Sikudova, and I. Pitas. Color-based descrip- tors for image fingerprinting. IEEE Transactions on Multimedia, 8(4):740–748, 2006. [53] Y. Geng, D. Xu, and A. Wu. Effective video scene detection ap- proach based on cinematic rules. In Knowledge-Based Intelligent Information and Engineering Systems, pages 165–165, 2005. [54] J.-M. Geusebroek, R. van den Boomgaard, A. Smeulders, and H. Geerts. Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12):1338 –1350, 2001. [55] T. Gevers. Image segmentation and similarity of color-texture objects. IEEE Transactions on Multimedia, 4(4):509–516, 2002. [56] X. Giro-i Nieto, R. Salla, and X. Vives. Digimatge, a rich in- ternet application for video retrieval from a multimedia asset management system. In International Conference on Multimedia Information Retrieval, pages 425–428, 2010. [57] M. Grabner, H. Grabner, and H. Bischof. Fast approximated sift. In Asian Conference on Computer Vision, volume 3851/2006 of LNCS, pages 918–927. Springer, 2006. [58] W. Guo, C. Xu, S. Ma, and M. Xu. Visual attention based motion object detection and trajectory tracking. In PCM 2010, pages 462–470, 2011. [59] A. Hampapur and R. Bolle. Feature based indexing for media tracking. In IEEE International Conference on Multimedia and Expo, pages 67–70, 2000. [60] A. Hanjalic. Shot-boundary detection: unraveled and resolved? IEEE Transactions on Circuits and Systems for Video Technology, 12(2):90–105, 2002. [61] C. Harris and M. Stephens. A combined corner and edge detec- tor. In Alvey Conference, pages 147–152, 1988. [62] M. 
Hershenson. Visual space perception: a primer. MIT Press, 2000. [63] B. K. Horn and B. G. Schunck. Determining optical flow. Artifi- cial Intelligence, 17(1–3):185–203, 1981. [64] C.-R. Huang and C.-S. Chen. Video scene detection by link- constrained affinity-propagation. In IEEE International Sympo- sium on Circuits and Systems, pages 2834–2837, 2009. 136 bibliography [65] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. IEEE Conference on Computer Vision and Pattern Recognition, pages 762–768, 1997. [66] ISO/IEC. Information Technology - Multimedia Content Description Interface - part 3: Visual. Number 15938-3. ISO/IEC. Moving Pictures Expert Group, 2002. [67] L. Itti. Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Transactions on Image Processing, 13(10):1304–1318, oct. 2004. [68] A. Jacquot, P. Sturm, and O. Ruch. Adaptive tracking of non- rigid objects based on color histograms and automatic parameter selection. In IEEE Workshop on Motion and Video Computing, vol- ume 2, pages 103–109, 2005. [69] A. Joly, C. Frélicot, and O. Buisson. Robust content-based video copy identification in a large reference database. In International Conference on Image and Video Retrieval, volume 2728/2003, pages 414–424. LNCS, 2003. [70] A. Joly, C. Frélicot, and O. Buisson. Content-based copy detec- tion using distortion-based probabilistic similarity search. IEEE Transactions on Multimedia, 9(2):293–306, 2007. [71] M. Kampel and M. Zaharieva. Recognizing ancient coins based on local features. In Advances in Visual Computing, volume 5358/2008 of LNCS, pages 11–22. Springer, 2008. [72] Y. Ke and R. Sukthankar. Pca-sift: A more distinctive representa- tion for local image descriptors. In Computer Vision and Pattern Recognition, volume 2, pages 506–513, 2004. [73] H.-s. Kim, J. Lee, H. Liu, and D. Lee. Video linkage: group based copied video detection. In ACM International Conference on Image and Video Retrieval, pages 397–406, 2008. [74] H. Kobayashi, Y. Okouchi, and S. Ota. Image retrieval system using kansei features. In 5th Pacific Rim International Conference on Artificial Intelligence: Topics in Artificial Intelligence, pages 626– 635, 1998. [75] M. Kotti, D. Ververidis, G. Evangelopoulos, I. Panagakis, C. Kotropoulos, P. Maragos, and I. Pitas. Audio-assisted movie dialogue detection. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1618–1627, 2008. [76] P. Kovesi. Image features from phase congruency. Videre: Journal of Computer Vision Research, 1(3):2–26, 1999. bibliography 137 [77] P. Kovesi. Code for calculating phase congruency and phase symmetry/asymmetry. http://www.csse.uwa.edu.au/ ~pk/Research/research.html (last checked: 2011-09-15), 2011. [78] B. Kroon, J. Nesvadba, and A. Hanjalic. Dialog detection in narrative video by shot and face analysis. In SPIE Proceedings. Multimedia Content Access: Algorithms and Systems, volume 6506, 2007. [79] R. Laganière, R. Bacco, A. Hocevar, P. Lambert, G. Païs, and B. E. Ionescu. Video summarization from spatio-temporal features. In Proceedings of the 2nd ACM TRECVid Video Summarization Work- shop, pages 144–148, 2008. [80] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005. [81] I. Laptev and T. Lindeberg. Space-time interest points. In IEEE International Conference on Computer Vision, pages 432–439, 2003. [82] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. 
Ro- bust voting algorithm based on labels of behavior for video copy detection. In ACM International Conference on Multimedia, pages 835–844, 2006. [83] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet- Brunet, N. Boujemaa, and F. Stentiford. Video copy detection: a comparative study. In ACM International Conference on Image and Video Retrieval, pages 371–378, 2007. [84] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled object detection and tracking from static cameras and moving vehicles. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 30(10):1683–1698, 2008. [85] G. Leon, H. Kalva, and B. Furht. Video identification using video tomography. In IEEE International Conference on Multimedia and Expo, pages 1030–1033, 2009. [86] J. Li, W. Wu, T. Wang, and Y. Zhang. One step beyond his- tograms: Image representation using markov stationary features. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008. [87] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In Computer Vision – ECCV 2002, volume 2353 of LNCS, pages 117–121, 2006. [88] Y. Li, J. Jin, and X. Zhou. Video matching using binary signa- ture. In International Symposium on Intelligent Signal Processing and Communication Systems, pages 317–320, 2005. 138 bibliography [89] T. Lin and H.-J. Zhang. Automatic video scene extraction by shot grouping. In International Conference on Pattern Recognition, volume 4, pages 39–42, 2000. [90] D. Lowe. Demo software: Sift keypoint detector. http://www. cs.ubc.ca/~lowe/keypoints/ (last checked: 2011-09-15), 2011. [91] D. G. Lowe. Object recognition from local schale-invariant fea- tures. In International Conference on Computer Vision, volume 2, pages 1150–1157, 1999. [92] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. [93] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, volume 2, pages 674–679, 1981. [94] Y.-F. Ma, X.-S. Hua, L. Lu, and H.-J. Zhang. A generic framework of user attention model and its application in video summariza- tion. IEEE Transactions on Multimedia, 7(5):907–919, 2005. [95] R. Maltsy. Hollywood Cinema. Wiley-Blackwell, 2 edition, 2003. [96] B. Mamer. Film Production Technique: Creating the Accomplished Image. Number 5. Wadsworth Publishing, 2008. [97] B. Manjunath, J.-R. Ohm, V. Vasudevan, and A. Yamada. Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):703–715, 2001. [98] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In British Machine Vision Conference, volume 1, pages 384–393, 2002. [99] E. McKean, editor. The New Oxford Amerdican Dictionary. Oxford University Press, 2nd edition, 2005. [100] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In International Conference on Computer Vision, pages 525–531, 2001. [101] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63– 86, 2004. [102] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 27(10):1615–1630, 2005. bibliography 139 [103] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. 
Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1–2):43–72, 2005. [104] D. Mitrović, M. Zeppelzauer, M. Zaharieva, and C. Breiteneder. Retrieval of visual composition in film analysis. In Interna- tional Workshop on Image Analysis for Multimedia Interactive Ser- vices, 2011. [105] J. Mitry. The aesthetics and psychology of the cinema. Indiana University Press, 2000. [106] S. Moncrieff, C. Dorai, and S. Venkatesh. Affect computing in film through sound energy dynamics. In ACM international conference on Multimedia, pages 525–527, 2001. [107] S. Moncrieff, C. Dorai, and S. Venkatesh. Detecting indexical signs in film audio for scene interpretation. In IEEE International Conference on Multimedia and Expo, pages 989–992, 2001. [108] S. Moncrieff, C. Dorai, and S. Venkatesh. Media computing: com- putational media aesthetics, chapter Determining Affective Events through Film Audio, pages 131–155. Kluwer Academic Publish- ers, 2002. [109] B. C. J. Moore. An Introduction to the psychology of hearing. Aca- demic Press, 5 edition, 2004. [110] W. Murch. In the blink of an eye: a perspective on film editing. Silman-James Press, 2001. [111] F. Nack, C. Dorai, and S. Venkatesh. Computational media aesthetics: finding meaning beautiful. IEEE Multimedia, 8(4):10– 12, 2001. [112] C. W. Ng, I. King, and M. R. Lyu. Video comparison using tree matching algorithm. In Proceedings of the International Conference on Imaging Science, Systems and Technology, pages 184–190, 2001. [113] C.-W. Ngo, T.-C. Pong, and H.-J. Zhang. Motion-based video rep- resentation for scene change classification. International Journal of Computer Vision, 50(2):127–142, 2002. [114] P. Obrador, L. Schmidt-Hackenberg, and N. Oliver. The role of image composition in image aesthetics. In IEEE International Conference on Image Processing, pages 3185–3188, 2010. [115] A. Oikonomopoulos, I. Patras, and M. Pantic. Spatiotemporal salient points for visual recognition of human actions. IEEE Transactions on Systems, Man, and Cybernetics, 36(3), 2006. 140 bibliography [116] T. Ojala, M. Aittola, and E. Matinmikko. Empirical evaluation of mpeg-7 xm color descriptors in content-based retrieval of semantic image categories. In International Conference on Pattern Recognition, volume 2, pages 1021–1024, 2002. [117] G. Oldham. First Cut: Conversations with Film Editors. University of California Press, 1995. [118] M. Park, S. Leey, P.-C. Cheny, S. Kashyap, A. Butty, and Y. Liuy. Performance evaluation of state-of-the-art discrete symmetry detection algorithms. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008. [119] G. Pass, R. Zabih, and J. Miller. Comparing images using color coherence vectors. In ACM International Conference on Multime- dia, pages 65–73, 1996. [120] S. Pfeiffer and U. Srinivasan. Media computing: computational media aesthetics, chapter Scene Determination Using Auditive Segmentation Models of Edited Video, pages 105–123. Kluwer Academic Publishers, 2002. [121] L. Pickup and A. Zisserman. Automatic retrieval of visual con- tinuity errors in movies. In Proceeding of the ACM International Conference on Image and Video Retrieval, pages 7:1–7:8, 2009. [122] I. Radev, N. Pissinou, and K. Makki. Film video modeling. In Workshop on Knowledge and Data Engineering Exchange, page 122, 1999. [123] Z. Rasheed and M. Shah. Scene detection in holywood movies and tv shows. 
In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 343–348, 2003. [124] Z. Rasheed and M. Shah. Detection and representation of scenes in videos. IEEE Transactions on Multimedia, 7(6):1097–1105, 2005. [125] Z. Rasheed, Y. Sheikh, and M. Shah. On the use of computable features for film classification. IEEE Transactions on Circuits and Systems for Video Technology, 15(1):52–64, 2005. [126] Y. Rui, T. S. Huang, and S. Mehrotra. Constructing table-of- content for videos. Multimedia Systems, 7(5):359–368, 1999. [127] P. Sand and S. Teller. Video matching. ACM Transactions on Graphics, pages 592–599, 2004. [128] F. Schaffalitzky and A. Zisserman. Automated location matching in movies. Computer Vision and Image Understanding, 92(2–3):236– 264, 2003. bibliography 141 [129] C. Schmid, R. Mohr, and C. Baukhage. Evaluation of interest point detectors. International Journal of Computer Vision, 2(37):151– 172, 2000. [130] R. Sekuler, S. N. Watamaniuk, and R. Blake. Steven’s handbook of experimental psychology: Sensation and Perception, volume 1, chap- ter Perception of visual motion, pages 121–176. John Wiley & Sons, Inc., 3 edition, 2002. [131] H. T. Shen, J. Liu, Z. Huang, C. W. Ngo, and W. Wang. Near- duplicate video retrieval: Current research and future trends. IEEE Multimedia, 2011. [132] H. T. Shen, X. Zhou, Z. Huang, J. Shao, and X. Zhou. Uqlips: a real-time near-duplicate video clip detection system. In In- ternational Conference on Very Large Data Bases, pages 1374–1377, 2007. [133] A. Shivadas and J. Gauch. Real-time commercial recognition using color moments and hashing. In Fourth Canadian Conference on Computer and Robot Vision, pages 465–472, 2007. [134] P. Shivakumara, W. Huang, and C. L. Tan. Efficient video text detection using edge features. In International Conference on Pattern Recognition, pages 1–4, 2008. [135] E. Sikov. Film studies: an introduction. Film and culture. Columbia University Press, 2009. [136] J. Sivic, F. Schaffalitzky, and A. Zisserman. Object level grouping for video shots. International Journal of Computer Vision, 67(2):189– 210, 2006. [137] J. Sivic and A. Zisserman. Video google: a text retrieval approach to object matching in videos. In Ninth IEEE International Confer- ence on Computer Vision (ICCV’03), volume 2, pages 1470–1477, 2003. [138] A. F. Smeaton, P. Over, and A. R. Doherty. Video shot boundary detection: Seven years of trecvid activity. Computer Vision and Image Understanding, 114(4):411–418, 2010. [139] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000. [140] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 1(147):195– 197, 1981. 142 bibliography [141] M. Soleymani, G. Chanel, J. J. Kierkels, and T. Pun. Affective ranking of movie scenes using physiological signals and content analysis. In ACM Workshop on Multimedia semantics, pages 32–39, 2008. [142] M. Sonka, V. Hlavac, and R. Boyle. Image Processing, Analysis, and Machine Vision. Thomson Learning, 3rd edition, 2007. [143] M. Stark and B. Schiele. How good are local features for classes of geometric objects. In 11th International Conference on Computer Vision, pages 1–8, 2007. [144] A. N. Stein and M. Hebert. Local detection of occlusion bound- aries in video. Image and Vision Computing, 27(5):514–522, 2009. [145] M. Stricker and M. Orengo. 
Similarity of color images. In SPIE Conference on Storage and Retrieval for Image and Video Databases III, volume 2, pages 381–392, 1995. [146] M. J. Swain and D. H. Ballard. Indexing via color histograms. In International Conference on Computer Vision, pages 390–393, 1990. [147] J.-P. Tardif. Non-iterative approach for fast and accurate van- ishing point detection. In International Conference on Computer Vision, pages 1250–1257, 2009. [148] T. Tayama. The minimum temporal thresholds for motion detec- tion of grading patterns. Perception, 29(7):761–769, 2000. [149] R. Teixeira, T. Yamasaki, and K. Aizawa. Comparative analysis of low-level visual features for affective determination of video clips. In International Conference on Future Information Technology (FutureTech), pages 1–6, 2010. [150] K. Terasawa, T. Nagasaki, and T. Kawashima. Robust match- ing method for scale and rotation invariant local descriptors and its application to image indexing. In Information Retrieval Technology, volume 3689/2005 of LNCS, pages 601–615. Springer, 2005. [151] D. W. Tjondronegoro and Y.-P. P. Chen. Knowledge-discounted event detection in sports video. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(5):1009– 1024, 2010. [152] B. T. Truong and C. Dorai. Automatic genre identification for content-based video categorization. In 15th International Confer- ence on Pattern Recognition, volume 4, pages 230–233, 2000. [153] B. T. Truong, C. Dorai, and S. Venkatesh. New enhancements to cut, fade, and dissolve detection processes in video segmentation. bibliography 143 In ACM international Conference on Multimedia, pages 219–227, 2000. [154] B. T. Truong, S. Venkatesh, and C. Dorai. Application of com- putational media aesthetics methodology to extracting color semantics in film. In ACM International Conference on Multime- dia, pages 339–342, 2002. [155] B. T. Truong, S. Venkatesh, and C. Dorai. Extraction of film takes for cinematic analysis. Multimedia Tools and Applications, 26(3):277–298, 2005. [156] T. Tuytelaars and L. V. Gool. Matching widely separated views based on affine invariant regions. International Journal of Com- puter Vision, 59(1):61–85, 2004. [157] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detec- tors: a survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008. [158] K. E. van de Sande, T. Gevers, and C. G. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582–1596, 2010. [159] J. van de Weijer, T. Gevers, and A. Bagdanov. Boosting color saliency in image feature detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):150 –156, 2006. [160] R. G. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall. Lsd: A fast line segment detector with a false detection con- trol. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 32(4):722–732, 2010. [161] H. L. Wang and L.-F. Cheong. Affective understanding in film. IEEE Transactions on Circuits and Systems for Video Technology, 16(6):689–704, 2006. [162] C.-Y. Wei, N. Dimitrova, and S.-F. Chang. Color-mood analysis of films based on syntactic and psychological models. In IEEE International Conference on Multimedia and Expo, volume 2, pages 831–834, 2004. [163] G. Willems, T. Tuytelaars, and L. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In European Conference on Computer Vision, pages 650–663, 2008. [164] G. 
Willems, T. Tuytelaars, and L. V. Gool. Spatio-temporal fea- tures for robust content-based video copy detection. In ACM In- ternational Conference on Multimedia Information Retrieval, pages 183–190, 2008. 144 bibliography [165] C. Xu, J. Wang, H. Lu, and Y. Zhang. A novel framework for semantic annotation and personalized retrieval of sports video. IEEE Transactions on Multimedia, 10(3):421–436, 2008. [166] M. Xu, X. He, J. Jin, Y. Peng, C. Xu, and W. Guo. Using scripts for affective content retrieval. In Advances in Multimedia Information Processing – PCM 2010, pages 43–51. 2011. [167] M.-C. Yeh and K.-T. Cheng. Video copy detection by fast se- quence matching. In ACM International Conference on Image and Video Retrieval, pages 1–7, 2009. [168] B.-J. Yi, J.-T. Lee, H.-W. Woo, and H.-C. Rim. Contextual video advertising system using scene information inferred from video scripts. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 771–772, 2010. [169] A. Yoshitaka, T. Ishii, M. Hirakawa, and T. Ichikawa. Content- based retrieval of video data by the grammar of film. In IEEE Symposium on Visual Languages, page 310, 1997. [170] J. Yuan, L.-Y. Duan, Q. Tian, and C. Xu. Fast and robust short video clip search using an index structure. In ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 61–68, 2004. [171] J. Yuan, B. Wei, W. Lu, and L. Wang. A new video text detection method. In ACM/IEEE Joint Conference on Digital Libraries, pages 359–362, 2011. [172] R. Zabih, J. Miller, and K. Mai. A feature-based algorithm for detecting and classifying scene breaks. In ACM International Conference on Multimedia, pages 189–200, 1995. [173] M. Zaharieva, D. Mitrović, M. Zeppelzauer, and C. Breiteneder. Film analysis in archive documentaries. IEEE Mutlimedia, 18:38– 47, 2011. [174] M. Zaharieva, M. Zeppelzauer, D. Mitrović, and C. Breiteneder. Finding the missing piece: Content-based video comparison. In IEEE International Symposium on Multimedia, pages 330–335, 2009. [175] X. Zeng, X. Zhang, W. Hu, and W. Li. Video scene segmentation using time constraint dominant-set clustering. In International Multimedia Modeling Conference, pages 637–643, 2010. [176] M. Zeppelzauer, D. Mitrović, and C. Breiteneder. Analysis of historical artistic documentaries. In International Workshop on Image Analysis for Multimedia Interactive Services, pages 201–106, 2008. bibliography 145 [177] M. Zeppelzauer, M. Zaharieva, D. Mitrović, and C. Breiteneder. Retrieval of motion composition in film. Digital Creativity, 2011. [178] H. Zettl. Media computing: computational media aesthetics, chapter Essentials of Applied Media Aesthetics, pages 11–38. Kluwer Academic Publishers, 2002. [179] H. Zettl. Sight, Sound, Motion: Applied Media Aesthetics. Wadsworth Publishing Co Inc, 6th edition, 2010. [180] D.-Q. Zhang and S.-F. Chang. Detecting image near-duplicate by stochastic attributed relational graph matching with learning. In ACM International Conference on Multimedia, pages 877–884, 2004. [181] S. Zhang, W. Hu, T. Wang, J. Liu, and Y. Zhang. Speaker cluster- ing aided by visual dialogue analysis. In Advances in Multimedia Information Processing – PCM 2008, pages 693–702, 2008. [182] L. Zhao, S.-Q. Yang, and B. Feng. Video scene detection using slide windows method based on temporal constrain shot sim- ilarity. In IEEE International Conference on Multimedia and Expo, pages 1171–1174, 2001. [183] W.-L. Zhao, C.-W. Ngo, H.-K. Tan, and X. Wu. 
Near-duplicate keyframe identification with interest point matching and pattern learning. IEEE Transactions on Multimedia, 9(5):1037–1048, 2007. [184] J. Zhou and X.-P. Zhang. Automatic identification of digital video based on shot-level sequence matching. In ACM Interna- tional Conference on Multimedia, pages 515–518, 2005. S H O RT C U R R I C U L U M V I TA E Contact Information Name: Maia ZAHARIEVA Address: Vienna University of Technology, IMS Favoritenstr. 9-11/188-2, A-1040 Vienna, Austria Phone: +43-1-58801-18857 eMail: zaharieva@ims.tuwien.ac.at Education 2007 – 2011 Vienna University of Technology, Austria PhD in business informatics (Dr.rer.soc.oec.) Thesis title: Features in visual media analysis 1998 – 2003 University of Vienna, Austria MSc. in business informatics (Mag.rer.soc.oec.) Thesis title: Efficient description of multimedia learning objects 1995 – 1996 University of National and World Economy, Sofia, Bulgaria Study of business informatics Work experience (selection) 2008 – present Interactive Media Systems (IMS) Group, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria Teaching and research assistant 2007 – 2008 Pattern Recognition and Image Processing (PRIP) Group, Institute of Computer Aided Automation, Vienna University of Technology, Austria Project assistant 2003 – 2007 Multimedia Information Systems (MIS) Group, In- stitute of Distributed and Multimedia Systems, University of Vienna, Austria Teaching and research assistant 147 148 curriculum vitae Project experience 08.2008 – 01.2010 VERTOV | Digital Formalism: The Vienna Ver- tov Collection (WWTF project, "5 senses" call 2006), http://www.digitalformalism.org 02.2007 – 08.2008 COINS | COmbatting Illicit Numismatic Sales (6th EU Framework Programme, STREP), http://www.coins-project.eu 12.2005 – 01.2007 PROLIX | Process Oriented Learning and Information Exchange (6th EU Framework Programme, 5th call, IST), http//www.prolixproject.org 01.2005 – 09.2006 BRICKS | Building Resources for Integrated Cul- tural Knowledge Services (6th EU Framework Programme, 1st call, IST), http://www.brickscommunity.org 05.2005 – 03.2007 ebInterface, ebInvoice, ebTransfer http://www.ebinterface.at 02.2003 – 12.2005 MobiLearn | Media Informatics Any-Time Any- Where (NML 2 Programme), http://www.mobilearn.at 08.2002 – 12.2003 LaMedica, http://www.lamedica.de