MAGISTERARBEIT

Discrimination and Retrieval of Animal Sounds

ausgeführt am Institut für Softwaretechnik und Interaktive Systeme der Technischen Universität Wien

unter Anleitung von
ao. Univ. Prof. Mag. Dr. Horst Eidenberger
Univ. Prof. Dipl. Ing. Dr. Christian Breiteneder

durch
Matthias Zeppelzauer
Matr. Nr. 9926063
Wildenhaggasse 12
A-3423 St. Andrä Wördern

Wien, am 15. Oktober 2005

Abstract

Hearing is the second most important human sense after vision. In the past, research primarily concentrated on visual information retrieval; only little research took place in the area of audio information retrieval, where the main focus was speech recognition for a long time. Recently, information retrieval of music has gained importance through the distribution of digital music over the internet. Most sounds in our environment, however, are neither speech nor music. Environmental sounds contain important information that we permanently use for orientation. Environmental sound recognition is an emerging research area that enables a variety of new applications, such as life logging and automatic surveillance. So far, little research has been performed in the area of animal sound retrieval.

In this thesis, the author identifies state-of-the-art techniques in general-purpose sound recognition by a broad survey of the literature. Based on the findings, this thesis gives a thorough investigation of audio features and classifiers and their applicability in the domain of animal sounds. Techniques developed specifically for environmental sound recognition rarely exist. Therefore, techniques from other areas of audio processing are employed. Many features originally developed for speech recognition are applicable to environmental and animal sounds as well. Furthermore, music information retrieval methods are employed. Classification is performed by popular machine learning techniques. Due to the lack of publicly available data, a large database of animal sounds is built.

Additionally, the author introduces a set of novel audio descriptors. The features are time-based and characterize properties of an audio waveform that are significant for human perception. Their quality is compared with that of other popular audio features. Experiments show that the new descriptors perform comparably to state-of-the-art features. The results of animal sound retrieval are encouraging and motivate further research in this domain.

Zusammenfassung

Das Hören ist der zweitwichtigste menschliche Sinn nach dem Sehen. Forscher konzentrierten sich in der Vergangenheit hauptsächlich auf visual information retrieval. Bisher fand nur wenig Forschung im Bereich des audio information retrievals statt. Der Schwerpunkt lag lange Zeit auf der Erkennung von Sprache. In letzter Zeit gewann music information retrieval an Bedeutung durch die Verbreitung digitaler Musik über das Internet. Die meisten Geräusche in unserer Umgebung sind weder Sprache noch Musik. Umgebungsgeräusche enthalten wichtige Informationen, die wir ständig zur Orientierung verwenden.
Die Er- kennung von Umgebungsgeräuschen ist ein aufstrebendes Forschungs- gebiet, das neue Anwendungen wie etwa life logging und automatisier- te Überwachung ermöglicht. Bisher wurde wenig Forschung im Bereich der Erkennung von Tiergeräuschen betrieben. In dieser Magisterarbeit stellt der Author den aktuellen Stand der Technik in der Geräuscherkennung durch eine umfassende Literatur- studie dar. Auf Grundlage der gewonnenen Erkenntnisse präsentiert diese Arbeit eine gründliche Untersuchung von Audio-Merkmalen und Klassifikatoren und prüft deren Andwendbarkeit im Bereich der Tier- geräusche. Es existieren kaum Techniken, die speziell für die Erkennung von Umgebungsgeräuschen entwickelt wurden. Daher werden Techni- ken aus anderen Bereichen der Audioverarbeitung herangezogen. Viele Merkmale, die ursprünglich für Spracherkennung entwickelt wurden, sind auf Umgebungs- bzw. Tiergeräusche anwendbar. Weiters werden Methoden aus dem music information retrieval verwendet. Zur Klassi- fikation kommen Techniken aus dem Bereich des machine learning zum Einsatz. Da öffentlich nutzbaren Daten fehlen, wurde eine Datenbank von Tiergeräuschen aufgebaut. Weiters wird ein Satz von neuen Audio- Deskriptoren vorgestellt. Die Merkmale sind zeitbasiert und beschrei- ben Eigenschaften eines audio Signals, die für die menschliche Wahr- nehmung maßgeblich sind. Die Qualität der Deskriptoren wird mit je- ner von anderen beliebten Audio-Merkmalen verglichen. Experimente zeigen, dass die neuen Deskriptoren vergleichbare Leistungen bringen wie aktuelle Merkmale. Die Ergebnisse für animal sound retrieval sind vielversprechend und motivieren weitere Forschung in diesem Bereich. 3 Acknowledgments I extend my sincere gratitude and appreciation to many people who made this masters thesis possible. Especially I am grateful to my parents whose support enabled me to achieve my goals. Special thanks are due to Ms. Doris ”coffee queen” Divotkey and Ms. Karyn Laudisi. Furthermore, I want to thank my supervisors ao. Univ. Prof. Mag. Dr. Horst ”Mr. Iterative Process” Eidenberger and Univ. Prof. Dipl. Ing. Dr. Christian Breiteneder. 4 Contents 1 Introduction 8 1.1 Motivation & Problem Statement . . . . . . . . . . . . . . . . 8 1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2 Background 12 2.1 Information Retrieval . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 Content-Based Retrieval Systems . . . . . . . . . . . . . . . . 16 2.4 Content-Based Audio Retrieval . . . . . . . . . . . . . . . . . 17 2.5 Digital Audio . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 Experiments 21 3.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3 Test Environment . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 25 3.4.1 Spectral Flux . . . . . . . . . . . . . . . . . . . . . . . 26 3.4.2 Fourier Transform . . . . . . . . . . . . . . . . . . . . 26 3.4.3 Discrete Cosine Transform . . . . . . . . . . . . . . . . 27 3.4.4 Wavelet Transform . . . . . . . . . . . . . . . . . . . . 27 3.4.5 Constant Q Transform . . . . . . . . . . . . . . . . . . 28 3.4.6 Pitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
29 3.4.7 Sone . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4.8 Cepstral Coefficients . . . . . . . . . . . . . . . . . . . 30 3.4.9 Linear Predictive Coding . . . . . . . . . . . . . . . . 31 3.4.10 Perceptual Linear Prediction . . . . . . . . . . . . . . 32 3.4.11 RASTA-PLP . . . . . . . . . . . . . . . . . . . . . . . 33 3.4.12 Zero Crossing Rate . . . . . . . . . . . . . . . . . . . . 34 3.4.13 Short-Time Energy . . . . . . . . . . . . . . . . . . . . 34 3.4.14 LoHAS, LoLAS & AHA . . . . . . . . . . . . . . . . . 34 3.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5 3.5.1 K-Nearest Neighbor . . . . . . . . . . . . . . . . . . . 37 3.5.2 Learning Vector Quantization . . . . . . . . . . . . . . 38 3.5.3 Support Vector Machines . . . . . . . . . . . . . . . . 40 4 Results 44 4.1 Individual Features . . . . . . . . . . . . . . . . . . . . . . . . 44 4.1.1 Basic Signal Processing Transforms . . . . . . . . . . . 44 4.1.2 Spectral Flux and Short-Time Energy . . . . . . . . . 45 4.1.3 Zero Crossing Rate . . . . . . . . . . . . . . . . . . . . 46 4.1.4 Pitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.1.5 Constant Q Transform . . . . . . . . . . . . . . . . . . 47 4.1.6 Sone . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.1.7 Perceptual Linear Prediction . . . . . . . . . . . . . . 49 4.1.8 RASTA-PLP . . . . . . . . . . . . . . . . . . . . . . . 49 4.1.9 LPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.1.10 MFCC and BFCC . . . . . . . . . . . . . . . . . . . . 51 4.1.11 Amplitude Descriptor . . . . . . . . . . . . . . . . . . 53 4.2 Combined Features . . . . . . . . . . . . . . . . . . . . . . . . 54 4.3 Comparison of Classifiers . . . . . . . . . . . . . . . . . . . . 56 5 Related Work 58 6 Conclusion & Future Work 62 Appendix 64 A Implementation 64 A.1 Amplitude Descriptor - LoHAS, LoLAS, AHA . . . . . . . . . 64 A.1.1 Statistical Utility Functions . . . . . . . . . . . . . . 67 A.2 Short-Time Energy . . . . . . . . . . . . . . . . . . . . . . . . 69 A.3 Zero Crossing Rate . . . . . . . . . . . . . . . . . . . . . . . . 70 6 List of Tables 1 Mean recall and mean precision obtained with the features extracted from the signal processing transforms (DFT, DCT, and DWT). The rows show the results for different classifiers. 45 2 Results (recall and precision) of ZCR for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . 46 3 Results (recall and precision) of Pitch for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . 47 4 Results (recall and precision) of CQT for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . 48 5 Results (recall and precision) of Sone for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . 48 6 Results (recall and precision) of PLP for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . 49 7 Results (recall and precision) of RASTA-PLP for each class (rows), obtained by different classifiers (columns) . . . . . . . 50 8 Results (recall and precision) of LPC for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . 51 9 Results (recall and precision) of MFCC for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . 52 10 Results (recall and precision) of BFCC for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . 
52 11 Results (recall and precision) of AD (LoHAS, LoLAS, and AHA) for each class (rows), obtained by different classifiers (columns) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 12 Results (recall and precision) of the combined feature vector for each class (rows), obtained by different classifiers (columns) 55 7 1 Introduction Multimedia information retrieval is a growing research field that gained im- portance in recent years, due to the increasing number of available digital media. Traditionally, research focused on visual information retrieval (VIR). The rise of audio information retrieval was motivated by the development of efficient audio compression techniques that support the distribution of dig- ital audio. Audio retrieval is employed in multimodal information retrieval, where visual, textual, and acoustic information is combined to take advan- tage of synergetic effects. Audio recognition is also applied for automatic extraction of semantic annotations in multimedia databases. This thesis concerns with the retrieval of animal sounds. Animal sounds are a subset of environmental sounds. The investigation surveys a broad set of audio features and several classifiers. Additionally, the author introduces a new set of features and evaluates their quality by a selected set of classes of animal sounds. Due to the complexity and the nearly infinite domain of non-speech sounds, the quality of recognition is typically lower than the quality of speech recognition, which is already well understood. Retrieval results presented in this thesis for the domain of animal sounds are compa- rable to that of state-of-the-art research in the area of environmental sound recognition. 1.1 Motivation & Problem Statement Audio recognition and retrieval has been an important and challenging re- search field for more than fifteen years. Although the research community yielded great technical advances in the past, work in this area is still at a preliminary stage. The long-term goal is to achieve results comparable to the human sense of hearing. The human auditory sense provides optimal per- formance, since it is able to bridge the semantic gap. Audio recognition and retrieval techniques can at best narrow the semantic gap. Although there is a huge research community, publishing a vast amount of scientific papers every year, there are still a lot of unsolved problems. The representation of audio signals by numerical features is currently at a low level of abstrac- tion that does not consider semantic information. Measuring similarity of 8 audio signals is a very difficult task, still open to research. Audio retrieval is currently only applicable to a limited domain of sounds. In contrast to speech recognition, the domain of environmental sounds is nearly infinite. The retrieval quality decreases rapidly with an increasing number of classes that have to be distinguished. Besides, the quality of retrieval degrades with increasing inhomogeneity of the audio samples that belong to the same class. Furthermore, the partitioning of sounds into disjoint classes is am- biguous and subjective, due to cultural influences. Another challenge is the representation of queries for retrieval systems. Early approaches employed query-by-example techniques. Later, query-by-humming gained importance especially in the field of music retrieval [23]. A retrieval task is always a tradeoff between universality and assumptions - about the domain, about the media, and about the user. 
The problem of content-based audio retrieval can be stated as follows: Content-based audio retrieval concerns with searching in multimedia data- bases for audio samples specified by a query that describes properties of the desired audio samples. In general, retrieval is the task of deriving a paramet- ric model from raw data. From a given set of audio signals, each annotated with a class label, a more compact abstract numerical representation by fea- tures must be derived that characterizes the properties of the classes well. During the training phase a (parametric) model, the classifier, is fit to the feature-data. The goal of training is to correctly predict the class member- ship of all possible audio signals in scope of the defined classes. Based on the parametric model, retrieval is performed by defining and evaluating a query. 1.2 Contribution Animal sounds are a domain of environmental sounds that has not been in- vestigated in detail yet. Some investigations consider animal sounds among other classes of sound [30], [22]. These investigations concern with classes of environmental sounds in general. To the authors’ knowledge there is no prior work analyzing the discrimination of animal sounds from each other. Animal sound retrieval is a new domain of research. Hence, there are no 9 investigations the results in this thesis can be compared with directly. Fur- thermore, there is no official and commonly accepted reference database for animal sounds. This thesis addresses the identification of an efficient method for au- tomatically distinguishing between sounds of different animals. The con- tribution to this research field is represented by a thorough investigation of the applicability of state-of-the-art audio features in the domain of ani- mal sound recognition. Therefore a database consisting of several hundred animal sounds is built. Traditional features developed for speech recog- nition and features applied in audio segmentation and music retrieval are compared. Additionally, the author introduces a set of novel features and compares their performance with popular audio features. The introduced features are time-based and follow an intuitive approach to describe the waveform of a signal. The quality of the features investigated is evaluated by a representative set of popular classifiers. Besides, an extensive survey of state-of-the-art features and classifiers is presented. Additionally, a com- prehensive overview of related research in the field of content-based audio retrieval is given. 1.3 Applications Animal sound retrieval has a wide range of applications. It may play an im- portant role in applications for handicapped people. Such a technique could be part of a supporting system for the deaf, providing information about the surrounding environment. A deaf person is equipped with a microphone and a mobile device that is responsible for retrieval. The user is visually informed by the application about interesting or dangerous events, indicated by sounds. A popular application is automatic surveillance. It usually employs mul- tiple cameras and microphones to monitor an area of interest. Such a system produces huge amounts of data that contain only little information. Animal sound retrieval can be applied, for example to recognize barks of a watchdog that often signalize crucial events. 10 A traditional field of research is the annotation of time-dependent me- dia. 
Animal sound retrieval may be part of a system that automatically generates meta-information from audio and video streams. A related ap- plication is the annotation of movies in a multimedia database to improve search capabilities. Additionally, life logging could take advantage of such a technique. A life log accompanies human users during their working life and leisure time and automatically captures and annotates events of interest in a multimodal electronic diary. Usually life logging applications employ multiple different sensors, such as video cameras, microphones, GPS, accelerometers, and ther- mometers [1]. Information is extracted from the single signals and combined with data of other sensors. The resulting diary consists of retrieved annota- tions associated with a time stamp. Animal sound retrieval may be useful in a life logging application, imagine a visit to the zoo. A thorough survey of applications related to content-based audio retrieval and animal sound retrieval is given in Section 5. 1.4 Organization The remainder of the thesis is organized as follows: In Section 2 the princi- ples of pattern recognition, information retrieval and digital audio are given. Section 3 addresses the experiments and discusses features and classifiers. Results are depicted in Section 4. A survey of related work is performed in Section 5. Finally, in Section 6 conclusions and future work are presented. 11 2 Background In this Section the basic ideas of audio retrieval are discussed in general. First, the field of information retrieval is surveyed. Then the author presents the fundamentals of pattern recognition and finally, basics of digital audio are discussed. 2.1 Information Retrieval Information retrieval concerns with searching documents in a database by a textual query. Early applications basically focused on retrieval of text documents. Information retrieval is performed by searching in the docu- ments themselves or by searching for documents by annotated metadata. A popular application of information retrieval are search engines in the world wide web. Pioneers in the area of information retrieval are Salton [44] and Rijsbergen [50]. In the last decades the number of available media has grown. Audio and video have become available due to the development of efficient com- pression techniques and the distribution of multimedia over the internet. Traditional text based information retrieval is not appropriate to retrieve audio and video data. Manual creation of textual metadata from multime- dia objects by humans is not applicable, because it is too time consuming and error prone. The limitations of metadata-based retrieval techniques can be overcome by examining the content of media objects. Content-based in- formation retrieval is a separate branch of research of information retrieval, where information about audio and video documents is extracted directly from their content. There is no need for a priori knowledge concerning the documents. Depending on the media type concerned, content-based image retrieval (CBIR), content-based video retrieval (CBVR) and content-based audio retrieval (CBAR) are distinguished. 2.2 Pattern Recognition Approaches dealing directly with the content of multimedia documents are applications of pattern recognition. Pattern recognition is concerned with 12 analyzing and classifying data objects by contained patterns. A pattern recognition task consists of multiple parts. A sensor (e.g. 
a microphone or video camera) provides the system with the raw data of a signal. The size of the data is reduced by feature extraction. This results in a more abstract description that represents the most meaningful information that characterizes the signal well. Based on this representation, classification is performed. Classification is a process that groups similar patterns, repre- sented by features together. In a content-based retrieval application the user addresses queries to the retrieval system. Queries can be expressed in different ways. One approach is query-by-example, where the query is of the same media type as the documents in the database. Alternatively a textual description of the favored document (e.g. ”find sounds of dogs ” or ”find pictures of cars”) can be formulated as query. In the following the task of pattern recognition is considered in more detail. According to Watanabe, patterns are ”the opposite of chaos” [52]. A pattern has a structure that is characterized by features, which are numerical representations of that pattern, such as the height of a person or the pitch of a human voice. A feature is regarded as a mapping from pattern space (raw data) to feature space. The value of a feature is usually represented by a scalar. In practice, several features are combined into a feature vector. Feature extraction denotes the process of computing features. In context of content-based retrieval, features often represent the coefficients of basic signal processing transforms, such as the discrete Fourier Transform (DFT) or the discrete Cosine Transform (DCT). The advantage of such transforms is, that a few coefficients suffice to represent most of the original signal. Due to this property, these transforms are applied in signal compression techniques, such as JPEG and MPEG. Section 3.4 gives a thorough dis- cussion of a variety of audio features. The author presents mathematical foundations and describes details concerning the application of the features in content-based audio retrieval. As mentioned before, features are often combined to feature vectors. Feature selection is the process of choosing a maximal informative subset from a given set of features. Statistical methods, such as the Principal Com- ponent Analysis (PCA) that maximizes the variance among the features, are 13 often applied for feature selection. Besides, PCA can be used to generate new features based on the existing features. The objective of classification is to predict the class membership of a pattern represented by a corresponding feature vector. A class ωi is defined by a class label i ∈ N . Each pattern respectively each feature vector belongs to exactly one class. A classifier can be regarded as a function c(x) of a feature vector x with: c(x) = i⇔ x ∈ ωi (1) The output of a classifier are the predicted class labels of the feature vectors. Most classifiers have to be trained before they can be applied to arbitrary test patterns. During training the classifier determines the class boundaries based on training vectors contained in the training set. After training, the classifier is fit to the data and ready for classification. The quality of the classifier is evaluated using a test set. The test set contains feature vectors that are not contained in the training set. A classifier should be able to correctly classify not only the test and training vectors, but all arbitrary vectors that belong to one of the selected classes. This is the generalization ability of a classifier [15]. 
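To make the notions of a classifier c(x), training, and generalization concrete, the following minimal MATLAB sketch fits a simple nearest-centroid rule to a toy training set and evaluates it on held-out test vectors. This rule is not one of the classifiers used in this thesis (those are presented in Section 3.5); all data and names below are purely illustrative.

% Illustrative sketch: a classifier c(x) is fit to training data and then
% judged on unseen test vectors (generalization). Not a thesis classifier.

% Toy two-class data: columns are feature vectors, rows are feature dimensions.
Xtrain = [0.1 0.2 0.3 1.0 1.1 0.9;    % feature 1
          0.0 0.1 0.2 1.2 0.9 1.0];   % feature 2
ytrain = [1 1 1 2 2 2];               % class labels of the training vectors

Xtest  = [0.15 1.05;                  % two unseen test vectors
          0.05 1.10];
ytest  = [1 2];

% "Training": estimate one centroid per class from the training vectors.
classes   = unique(ytrain);
centroids = zeros(size(Xtrain, 1), numel(classes));
for c = 1:numel(classes)
    centroids(:, c) = mean(Xtrain(:, ytrain == classes(c)), 2);
end

% "Classification": c(x) = label of the nearest centroid (Euclidean distance).
ypred = zeros(1, size(Xtest, 2));
for n = 1:size(Xtest, 2)
    d = sum((centroids - repmat(Xtest(:, n), 1, numel(classes))).^2, 1);
    [dmin, idx] = min(d);
    ypred(n) = classes(idx);
end

% Generalization is judged on the test set, not on the training set.
accuracy = mean(ypred == ytest)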
In Section 3.5 three classifiers, employed in this thesis, are presented in detail. The quality of content-based retrieval depends on the features that rep- resent the signal and on the classifiers that discriminate between classes of signals. An optimal feature shows minimal variations inside a class and high variations beyond multiple classes. A good representation of data by fea- tures is a necessary condition for successful pattern recognition. Results of the classifiers basically depend on the quality of the features. No feature is a priori good or bad. The quality of a feature has to be analyzed in context of the input data, the application domain, and the classes that are distin- guished between. Analogously, classifiers cannot be evaluated in isolation. They have to be considered together with the features they operate on. Pattern recognition tasks (e.g. remote sensing, computer vision, im- age understanding, and content-based retrieval) are inversions of well-posed problems. For example, computer graphics is the well-posed inversion of pat- 14 tern recognition and content-based image retrieval. Similarly sound synthe- sis is the well-posed inversion of audio recognition respectively content-based audio retrieval. In general, an inverse problem concerns with the estimation of model parameters through the manipulation of observed data [53]. The inversion of a well-posed problem is often ill-posed. The term ill-posed means that the conditions mandatory for well-posed problems are not met. Condi- tions for well-posed problems are defined by Hadamard in [25]. According to Hadamard, a well-posed problem has the following properties: 1. A solution exists, 2. the solution is unique, and 3. the solution depends continuously on the data in some reasonable topology. Content-based retrieval is an ill-posed problem. In a retrieval task, model parameters are derived from input data (audio, image or video data). Model parameters are terms, properties and concepts that may represent class la- bels (e.g. terms like ”car” and ”cat”, properties like ”male” and ”female”, and concepts like ”outdoor” and ”indoor”). The semantic gap is related to the ill-posed nature of content-based retrieval. The semantic gap refers the mismatch between high-level concepts and low-level descriptions. In content-based retrieval the semantic gap is placed between the content of media and textual information describing the semantics of the content. The following expresses this circumstance: It is annoying trying to retrieve Hollywood kisses in a movie database by color, texture and shape features. On the technical level this fact is called ”semantic gap”. [16] The gap cannot be bridged due to the ill-posed nature of content-based re- trieval. Today the goal of the research community is to narrow the semantic gap as far as possible. All content-based retrieval branches, such as CBAR, CBVR, and CBIR suffer from the semantic gap and apply similar techniques to narrow it. 15 2.3 Content-Based Retrieval Systems The first content-based retrieval systems came up in the nineties. One of the first image retrieval systems was QBIC [20]. The QBIC system is able to query a multimedia database by example images or videos. Around the same time the first investigations on CBAR were performed. Pioneering work is presented by Wold, Blum, and Wheaton in [54]. The authors developed an audio retrieval system called Muscle Fish that is able to distinguish a wide range of sounds. Multimedia retrieval systems have a complex architecture. 
The core of a retrieval system is the database that stores the (multimedia) documents. In addition to the documents, it stores annotated metadata and extracted features. Features are automatically computed by a feature extraction mechanism. Traditionally, annotations were created manually by human users. Modern systems support automatic extraction of annotations.

A search engine that receives queries from the user is connected to the database. A retrieval system may support multiple types of queries. Query-by-example techniques directly use documents as query objects. The retrieval system computes features from the query documents and the search engine tries to find similar documents in the database by applying a similarity model. Another method is query-by-text, where the user defines the desired class of documents or terms that describe the documents. Query-by-text makes use of media annotations stored in the database. Another part of the retrieval system is responsible for the visualization of the retrieved documents. It provides the user with an interface to browse the returned media objects.

Different evaluation methods for retrieval systems exist. The most popular measures are recall and precision. Recall is the proportion of retrieved relevant documents out of all relevant documents in the database. Let Ret be the set of retrieved documents and Rel the set of relevant documents in the database. Recall R is defined as:

R = \frac{|Ret \cap Rel|}{|Rel|}    (2)

Precision is the percentage of relevant documents retrieved in relation to the total number of documents retrieved. Precision P is defined as:

P = \frac{|Ret \cap Rel|}{|Ret|}    (3)

Recall and precision are inversely related. Precision decreases with increasing recall and vice versa. The tradeoff between recall and precision is usually illustrated in a recall-precision graph. A typical example is given in Figure 1. The recall-precision graph shows the recall on the abscissa for different precisions on the ordinate. The recall-precision pairs are obtained by varying the number of retrieved documents.

Figure 1: A typical recall-precision graph, illustrating the tradeoff between the two measures.

2.4 Content-Based Audio Retrieval

The rising number of audio, video, and image databases creates the need for efficient retrieval. The exponential growth of computational power enables multiple applications for content-based retrieval, such as real-time surveillance, video analysis, and music information retrieval. These trends encourage research in this area. Today, several hundred scientific publications are published every year.

CBAR is a relatively young research area. The techniques applied are tightly coupled to CBIR. CBAR additionally employs methods of speech recognition. Speech recognition is a research field with a long tradition. It was one of the first challenges in digital audio analysis. Due to the similar nature of the approaches in both research areas, knowledge from speech recognition can be reused in CBAR. Today, speech recognition is well understood and well engineered. The results of CBAR cannot currently compare to those of speech recognition. The reason for this may be the significant impact of the semantic gap.

There are different branches of research in CBAR. Segmentation is concerned with the distinction of different types of sound such as speech, music, silence, and environmental sounds.
Segmentation is an important pre- processing step used to identify homogeneous parts in an audio stream. Based on segmentation the different audio types are further analyzed by more appropriate techniques such as speech recognition, music information retrieval and environmental sound recognition. Speech recognition is exten- sively surveyed by Rabiner and Juang in [41]. In the last decade analysis and retrieval of music became a popular research field [18]. On the one hand side research deals with retrieval of instruments, artists and musical genres. On the other hand side researchers concentrate on the extraction of semantic information in pieces of music. Another field of research is environmental sound retrieval which com- prises all types of sound that are not speech and music. Since the domain of environmental sounds is arbitrary in size, most investigations confine to a limited domain of sounds. A thorough investigation of related work is given in Section 5. 2.5 Digital Audio Prior to working with sound it is advantageous to become acquainted with its fundamentals. Sound in context of this work is defined as vibrations transmitted through an elastic media (be it solid, aeriform or liquid) that are detectable by the human auditory sense. These vibrations generally have frequencies ranging from 20 Hz to 20 000 Hz. 18 Since physical sound is analog it has to be digitized to be processed with digital hardware. Usually digitalization of sound means recording a num- ber of samples of that sound at certain time intervals. In order to enable a perfect reconstruction of the digital signal, the analog signal has to be sampled uniformly and at a frequency that is equivalent to at least twice its bandwidth. This theorem is known as the Nyquist-Shannon sampling theorem, illustrated in Figure 2. Pulse Code Modulation (PCM) is a standard technique for digitally encod- ing analog audio. It dates back to 1937 when a French engineer named Alec Reeves introduced PCM for the purpose of telephone transmission. The analog signal is sampled at uniform intervals and quantized into a digital code. The sampling rate defines the bandwidth of the encoded signal, ac- cording to Nyquist-Shannon sampling theorem. Besides, the quantization depth is a critical quality measure since it determines the resolution of the amplitude information. Quantization always introduces some noise, known as quantization noise, that is not necessarily audible. A widely known exam- ple for digitally encoded analog audio is the CD-Audio standard. It defines a sampling rate of 44 100 Hz and a quantization depth of 16 Bits. Such an en- coding preserves all perceivable frequencies and does not introduce audible quantization noise. 19 (a) oversampled signal (b) reconstructed oversampled signal (c) signal sampled with twice its bandwidth (Nyquist-Shannon limit) (d) reconstructed signal (e) undersampled signal (f) reconstructed undersampled sig- nal Figure 2: 2(a) to 2(d) illustrate, that no gain is achieved through over- sampling. The reconstructed signal in 2(b) is identical to the signal that was sampled at twice its bandwidth. 2(e) and 2(f) illustrate the devastating effect of undersampling. The signal cannot be reconstructed properly. 20 3 Experiments This thesis examines ways to distinguish between animal sounds. To the au- thors’ knowledge, distinction of animal sounds has not been investigated yet. Animal sound retrieval is a new branch of research in the field of environ- mental sound recognition. 
This section presents the experiments performed. Different features and classifiers are applied and compared. First, the scope and the objectives of the experiments are discussed. Then the test setup and the framework that supports the experiments are described. Finally, the features and classifiers employed in the tests are presented.

3.1 Scope

Four animals, namely birds, cats, cows, and dogs, are chosen for the investigation. Sounds of birds and cats, as well as sounds of cows and dogs, show significant similarity in the spectral domain. While establishing the ground truth of the data set, it becomes apparent that it is difficult even for human observers to correctly classify animals by their sound alone. For example, certain sounds of cats are perceived as similar to sounds of cows. These similarities on the technical and the perceptual level make the selected classes suitable for measuring the performance of features and classifiers without bias.

There is no publicly available reference database of animal sounds. The author built a custom database of sound samples from an internet search. The database contains 383 samples (99 birds, 110 cats, 90 cows, and 84 dogs). The data have a sample rate of 11 025 Hz, are quantized to 16 bit, and are single channel. A sound sample contains one or more repeated sounds of an animal (e.g. repeated barks of a dog). Additionally, some samples contain background noise of other animals. File lengths and loudness levels vary over the samples. Classification is performed on entire sound samples. Each sound sample is assigned to one of the four classes.

The goal of this thesis is to compare techniques in the context of the domain of animal sounds. A system should be developed that is able to correctly classify about 80% of the animal sounds contained in the test set. The system learns the differences characterizing the classes of animal sounds from a training set that is much smaller than the test set. Techniques applied for retrieval should be easy to compute, to meet the demands of mobile applications.

3.2 Setup

Numerous experiments are performed to test each feature with each classifier. All experiments have the same structure. An experiment consists of a number of inputs and outputs with corresponding parameters. The following inputs exist:

• data defines the directories where the test and training set are located. Optionally, a file can be specified where the test and training set are stored in binary form.

• feature specifies the features to compute. One or more features can be declared. Each feature may return a feature vector containing several components. For each feature the corresponding parameters are given.

• feature selection defines the components of the feature vectors that are used in the experiment.

• classifier denotes the technique that is used for classification together with its specific parameters.

Currently, the following outputs are defined:

• data file is a binary file where the whole data from the test and training set are stored. That includes the entire samples of all files, annotated with metadata such as sample rate, file size, file path, class name, and class label.

• feature file(s) are binary files that store the feature vectors extracted from the sound samples in the test and training set.

• retrieval evaluation defines a technique to identify the quality of classification. The current implementation supports recall and precision.

The inputs and outputs together with their parameters are stored in an experiment file; a hypothetical example of such a configuration is sketched below.
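The concrete format of the experiment files is not reproduced here. Purely as an illustration, the inputs and outputs listed above could be captured in a MATLAB struct as follows; all field names, paths, and values are hypothetical and do not reflect the framework's actual syntax.

% Hypothetical sketch of an experiment description as a MATLAB struct.
experiment.data.trainDir       = 'data/train';        % directories of the sound samples
experiment.data.testDir        = 'data/test';
experiment.data.binaryFile     = 'cache/dataset.mat'; % optional pre-built binary data file

experiment.feature(1).name     = 'mfcc';              % features to compute ...
experiment.feature(1).params   = struct('numCoeffs', 20);
experiment.feature(2).name     = 'zcr';               % ... several can be combined
experiment.feature(2).params   = struct();

experiment.featureSelection    = [1:20 41];           % feature vector components to use

experiment.classifier.name     = 'knn';               % classification technique ...
experiment.classifier.params   = struct('K', 1);      % ... with its specific parameters

experiment.output.dataFile     = 'results/data.mat';      % binary dump of test and training set
experiment.output.featureFiles = 'results/features.mat';  % extracted feature vectors
experiment.output.evaluation   = {'recall', 'precision'}; % retrieval evaluation measures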
The uniform structure of experiments enables efficient and consistent tests of arbitrary combinations of features and classifiers. All experiments are conducted in MATLAB using an extensible framework, introduced in Section 3.3, that supports experiment files defined as above.

Common to all experiments is the ground truth. The sample database is split into a test set and a training set. The training set comprises 12 samples per class. The training samples are chosen randomly to obtain an unbiased training set. The remaining samples form the test set: 87 bird samples, 98 cat samples, 78 cow samples, and 72 dog samples. The training set is deliberately kept small to test the generalization ability of the classifiers.

The experiments are split into two series. In the first run, all features are tested individually with the three selected classifiers. These are k-nearest neighbor (K-NN), learning vector quantization (LVQ), and a support vector machine (SVM). The results of these experiments are discussed in Section 4.1. In the second run, the features are combined to improve the quality of retrieval. The corresponding results are illustrated in Section 4.2. The large number of experiments enables an objective comparison of the employed classifiers in Section 4.3.

3.3 Test Environment

The author implemented an extensible framework that supports the definition of experiment setups by configuration files. Configuration files specify ground truth, test data, features, classifiers, and result output options as mentioned in Section 3.2. The author chose the MATLAB environment because it provides a comfortable interface for audio processing and a large number of basic audio algorithms. Furthermore, multiple toolboxes exist, such as [40], [5], [17], and [35], that deal with audio analysis, speech recognition, and classification.

The goal of the framework is to provide common interfaces for basic pattern recognition tasks such as feature extraction and classification. As a result, an experiment that comprises an entire retrieval process can be represented by a short description stored in a configuration file.

The MATLAB framework integrates the implementations of all features employed in the experiments. It encapsulates the feature implementations and provides standardized interfaces for them. The same functionality is provided for the classifiers. The framework operates on a few data structures that contain the feature data and the raw sample data. Interfaces to features and classifiers operate on these common data structures. Integration of new features and classifiers is done by implementing an interface that encapsulates the specific implementation. The framework provides a mechanism to store and import sample and feature data. This speeds up repeated experiments enormously and allows further analysis of feature data. The structure of the framework is depicted in Figure 3. Experiments are performed on a PC with an Athlon 64 3000+ and 512 MB of RAM. MATLAB version 6.5 is used as programming environment.

Figure 3: The MATLAB framework employed for the experiments, consisting of data import, feature extraction, feature selection, classification, and evaluation stages. Each experiment is represented by a configuration file that defines the parameters of the individual retrieval steps, such as feature extraction, feature selection, classification, and evaluation.
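As an illustration of the retrieval evaluation step, the following sketch computes per-class recall and precision (Equations 2 and 3) from true and predicted class labels. It is a simplified stand-in, not the framework's actual implementation.

function [recall, precision] = evaluate_retrieval(ytrue, ypred, numClasses)
% Per-class recall and precision for one classification run. In this setting,
% the "retrieved" documents of a class are the test samples the classifier
% assigned to it, and the "relevant" documents are those that truly belong to it.
recall    = zeros(numClasses, 1);
precision = zeros(numClasses, 1);
for c = 1:numClasses
    retrieved = (ypred == c);                      % samples predicted as class c
    relevant  = (ytrue == c);                      % samples that belong to class c
    hits      = sum(retrieved & relevant);         % |Ret intersect Rel|
    recall(c)    = hits / max(sum(relevant), 1);   % Equation 2
    precision(c) = hits / max(sum(retrieved), 1);  % Equation 3
end

The max(..., 1) guards merely avoid division by zero for empty classes; in the experiments every class contains test samples.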
3.4 Feature Extraction

Content-based retrieval usually does not operate on the original data; instead, features that represent the content more efficiently are computed. For illustration, consider one second of an audio file in CD quality: the original data contain 44 100 samples, while the first several hundred Mel-Frequency Cepstral Coefficients (MFCCs) of the same signal may suffice for retrieval. This is a crucial reduction of the amount of data that has to be processed.

There is no distinct, widely accepted taxonomy of audio features. A basic approach is to consider the domain of the feature: time-based features are extracted from the signal in the time domain. Spectral features are derived after the signal has been transformed using one of the basic signal processing transforms, such as the Fourier Transform, the Cosine Transform, and the Wavelet Transform. Another way to classify audio features is to analyze whether they aim to imitate properties of the human sense of hearing. Such features are called perceptual features. The author considers features as either time-based or spectral. The ability of a feature to imitate the human ear is regarded as a superordinate property. Time-based features in the investigation comprise Zero Crossing Rate and Short-Time Energy. Additionally, the author introduces a set of new time-based features that describe the shape of the waveform of the signal. They are Length of High Amplitude Sequence (LoHAS), Length of Low Amplitude Sequence (LoLAS), and Area of High Amplitude (AHA). Spectral features concerned are Spectral Flux, Fourier Transform, Cosine Transform, Wavelet Transform, Constant Q Transform, Pitch, Sone, Cepstral Coefficients, Linear Predictive Coding, Perceptual Linear Prediction (PLP), and RASTA-PLP. Perceptual features are Sone, Pitch, Bark- and Mel-scaled Cepstral Coefficients, PLP, and RASTA-PLP.

In the remainder of this section, popular audio features applied in speech recognition, music information retrieval, and environmental sound recognition are discussed. The goal is to identify suitable features for the domain of animal sounds.

3.4.1 Spectral Flux

The Spectral Flux (SF) is the summation of differences between adjacent samples of the signal spectrum in a single frame. It is computed as follows:

SF = \sum_{n} \bigl| \, |S[n]| - |S[n+1]| \, \bigr|    (4)

In the experiments, the statistical moments of first and second order of the SF for each file are employed.

3.4.2 Fourier Transform

The continuous Fourier Transform (FT), named after Joseph Fourier, is an integral transform that re-expresses a function in terms of sinusoidal basis functions, i.e. as a sum or integral of sinusoidal functions multiplied by some coefficients (amplitudes). It offers a frequency domain representation of the signal. The coefficients of the FT may directly be used as a feature. They are also the basis for computations of more complex features (for example MFCCs, see Section 3.4.8). The FT of a signal is given by Equation 5 and sometimes called the forward FT.

F(k) = \int_{-\infty}^{\infty} s(n)\, e^{-2\pi i k n}\, dn    (5)

Equation 6 is called the inverse FT and is used to obtain a reconstruction of the signal in the time domain.

s(n) = \int_{-\infty}^{\infty} F(k)\, e^{2\pi i k n}\, dk    (6)

For digital audio the discrete Fourier Transform (DFT) is needed. It is defined over discrete, finite or infinite domains. In 1965, Cooley and Tukey [10] first discussed the fast Fourier Transform (FFT), a DFT algorithm that reduces the complexity of the computation for N samples from O(N^2) to O(N log N).
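To illustrate how FFT magnitude spectra feed into the Spectral Flux of Equation 4, the following sketch computes the frame-wise SF of a signal and the per-file moments used as features. It is an illustrative approximation under assumed parameters (frame length, random stand-in signal), not the implementation used in the experiments.

% Illustrative sketch: frame-wise Spectral Flux (Equation 4) from FFT magnitudes,
% with first- and second-order moments over all frames as the file-level feature.
s = randn(11025, 1);                     % stand-in for a mono sound sample at 11 025 Hz
frameLen  = 256;                         % assumed frame length
numFrames = floor(length(s) / frameLen);
sf = zeros(numFrames, 1);
for m = 1:numFrames
    frame = s((m-1)*frameLen+1 : m*frameLen);
    S = abs(fft(frame));                 % magnitude spectrum of the frame
    S = S(1:frameLen/2);                 % keep the non-redundant half
    sf(m) = sum(abs(S(1:end-1) - S(2:end)));  % Equation 4: adjacent-bin differences
end
sfFeature = [mean(sf), var(sf)];         % per-file moments used for retrieval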
Today the FFT is a standard technique to compute the FT of a digitized signal. The first 60 DFT coefficients are used to form a feature vector. Optionally, zero-padding is applied to equalize the length of the samples.

3.4.3 Discrete Cosine Transform

The discrete Cosine Transform (DCT) is closely related to the DFT. In contrast to the DFT, which uses complex numbers, the DCT is real-valued. The DCT approximates a signal by a weighted sum of cosine functions with different frequencies. There are several variants of the DCT with slightly modified definitions. The variant DCT-II in Equation 7 is commonly referred to as the DCT.

f_j = \sum_{n=0}^{N-1} s(n) \cdot \cos\left(\frac{j\pi}{N}\left(n + \frac{1}{2}\right)\right)    (7)

Equation 8 presents the variant DCT-III, which is commonly referred to as the inverse DCT (IDCT).

s_j = \frac{1}{2} f(0) + \sum_{k=1}^{N-1} f(k) \cdot \cos\left(\frac{k\pi}{N}\left(j + \frac{1}{2}\right)\right)    (8)

Similarly to the DFT, the DCT can be computed in O(N log N). In practice, the DCT is often used for lossy data compression (e.g. JPEG). A modified variant, the modified DCT, is used in MP3 and Vorbis audio compression. This area of application is motivated by the property of the DCT that most of the signal information tends to be concentrated in its low frequency components. Because of the lower computational complexity of the DCT, it is employed as an approximation of the Principal Component Analysis (PCA), a linear transform that optimally keeps the subspace with the largest variance. Selected DCT coefficients, especially the low frequency components, are often used as a feature for classification. Analogously to the DFT, the first 60 DCT coefficients are used for retrieval in the experiments.

3.4.4 Wavelet Transform

The Wavelet Transform (WT) is a time-frequency transform. It dates back to the early 20th century, when Alfred Haar, a Hungarian mathematician, introduced the first discrete Wavelet Transform. Generally, the WT aims at representing a signal by a finite-length or fast-decaying oscillating waveform that is scaled and translated to reproduce the signal. This waveform is called the mother wavelet. There is a large number of different mother wavelets; the most common ones are Haar and Daubechies, named after Alfred Haar and Ingrid Daubechies [14]. Selection of the optimal mother wavelet depends on the application. Recently, the WT started to replace the FT in several research and application areas, such as signal processing, speech recognition, and astrophysics.

Two types of WT exist: the discrete Wavelet Transform (DWT) and the continuous Wavelet Transform (CWT). The CWT applies all scales and translations of the mother wavelet. The CWT is given in Equation 9.

c(a, b) = \int_{-\infty}^{\infty} s(n)\, \psi(an + b)\, dn    (9)

The CWT is commonly used for signal analysis in scientific research. It is infinitely redundant but sometimes useful to comprehend certain signal properties. The DWT uses a specific subset of scale and translation values which fulfill the conditions in Equation 10.

\psi(2^k n + l) \quad \text{with } k, l \in \mathbb{Z}    (10)

The DWT is employed in computer science and engineering as a means of signal coding and compression. The computational complexity of the DWT is in O(N). In practice, the DWT is computed by the use of FIR filters. Similar to DCT and DFT coefficients, DWT coefficients are directly employed as features. In the experiments, feature vectors that contain the first 50 DWT coefficients are used. The mother wavelet employed is the Haar wavelet.
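The following sketch shows how a plain Haar DWT can be computed without toolbox support and how its first 50 coefficients form a feature vector. It is illustrative only and not the implementation used in the experiments; in particular, the coefficient ordering (final approximation first, then detail coefficients from coarse to fine) is an assumption.

% Illustrative Haar DWT sketch (orthonormal filter pair, computed level by level).
s = randn(1024, 1);                      % stand-in for a (zero-padded) sound sample
a = s;                                   % approximation coefficients of the current level
coeffs = [];                             % detail coefficients, coarse levels first
while numel(a) > 1
    if mod(numel(a), 2) == 1, a = [a; 0]; end          % pad odd-length levels
    d = (a(1:2:end) - a(2:2:end)) / sqrt(2);           % Haar detail coefficients
    a = (a(1:2:end) + a(2:2:end)) / sqrt(2);           % Haar approximation coefficients
    coeffs = [d; coeffs];                              % prepend: coarser levels end up first
end
coeffs = [a; coeffs];                    % final approximation (lowest frequencies) first
dwtFeature = coeffs(1:50);               % first 50 DWT coefficients as the feature vector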
3.4.5 Constant Q Transform

In order to overcome the shortcomings of the Fourier Transform for the analysis of Western music, Brown introduced the constant Q Transform (CQT) in [6]. The DFT yields frequency components that are separated by a constant frequency difference and with a constant resolution. These frequency components do not map efficiently to musical frequencies. The constant Q Transform is similar to the FT but has a constant ratio of center frequency f to resolution δf. Equation 11 illustrates the computation of the CQT:

X(k) = \frac{1}{M(k)} \sum_{n=0}^{M(k)-1} W(k, n)\, s(n) \exp\left(\frac{-2\pi i Q n}{M(k)}\right)    (11)

with:

• window: W(k, n) = \alpha + (1 - \alpha) \cos\left(\frac{2\pi n}{M(k)}\right)

• variable window width: M(k) = \frac{SamplingRate \cdot Q}{2^{k/24}}

• and Q = \lfloor f / \delta f \rfloor.

The constant Q Transform aims to convert the problem of instrument identification or fundamental frequency identification into a straightforward pattern recognition task. The CQT data are plotted against log frequency. In this view, sounds with harmonic frequency components show constant patterns in the low frequency space. Figures 4(b) and 4(a) illustrate the presence of this effect with the CQT and its absence with the DFT.

Figure 4: Fourier Transform and constant Q Transform of three complex sounds having 20 harmonics with equal amplitude [6]; (a) signals transformed with the FFT, (b) signals transformed with the CQT. Sounds with harmonic frequency components show constant patterns in the low frequency space of the CQT. The FFT lacks this property.

The author utilizes the implementation provided by Brown in [6] using default values to compute CQT coefficients. Mean and variance of the CQT coefficients over each transform window are applied as features.

3.4.6 Pitch

Pitch is the perceptual counterpart of the physical frequency. It is the perceived frequency of a sound. Pitch cannot be measured physically, since it is an auditory sensation. Two sounds with measurably different frequencies do not necessarily have different pitches, but a difference in perceived pitch implies different frequencies. The author employs a pitch detection algorithm devised by Sun in [46]. For the experiments, the maximum bandwidth the algorithm supports is used. Mean and variance of the time-dependent pitch are used as features.

3.4.7 Sone

Sone is a unit on a perceptually motivated loudness scale. Loudness is a subjective measure of sound pressure. One phon is defined as the loudness of a 1 kHz tone at 40 dB SPL (sound pressure level). One sone equals 40 phons. The sone scale is constructed such that a doubling of the sone value represents a doubling of perceived loudness: a sound with a loudness of two sone is perceived as twice as loud as a sound with a loudness of one sone. The loudness values of selected frequency bands mapped to sone may be used as features. For the experiments, the MATLAB toolbox of Pampalk is employed [40]. The author computes sone values for 40 frequency bands with a window size of 256 samples. Mean and variance serve as features.

3.4.8 Cepstral Coefficients

Cepstral Coefficients (CCs) are a popular feature in audio retrieval [33], [55]. The authors of [49] define the cepstrum as the Fourier Transform (FT) of the logarithm (log) of the spectrum of the original signal:

signal → FT → log → FT → cepstrum

In practice, CCs are derived from FFT or DCT coefficients or from linear predictive analysis [5]. CCs offer a compact and accurate high-order representation of signals. Peaks in the cepstrum correspond to harmonics in the power spectrum.
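A minimal sketch of the cepstrum chain defined above (signal → FT → log → FT) for a single analysis frame is given below; it is illustrative only. The MFCCs and BFCCs actually used in the experiments are computed with the VOICEBOX toolbox as described next. The inverse FFT is used for the second transform, which for the symmetric log magnitude spectrum differs from a forward FFT only by a scale factor.

% Illustrative cepstrum sketch for one analysis frame.
s = randn(256, 1);                       % stand-in for one windowed frame
spec     = abs(fft(s));                  % magnitude spectrum (FT)
logspec  = log(spec + eps);              % logarithm of the spectrum
cepstrum = real(ifft(logspec));          % second transform yields the (real) cepstrum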
MFCCs (Mel-Frequency Cepstral Coefficients) are an instance of CCs. Computation of MFCCs includes a conversion of the logarithmized Fourier coefficients to the Mel-scale. After conversion, the obtained vectors have to be decorrelated to remove redundant information. A DCT is applied to obtain a decorrelated, more compact representation. The following sequence illustrates the computation of MFCCs:

signal → FT → log → Mel → DCT → MFCCs

MFCCs are computed using VOICEBOX, a speech processing toolbox for MATLAB [5]. In the experiments, the first 20 MFCCs are combined into a feature vector. MFCCs are computed for small signal windows. Hence, mean and variance of each coefficient are calculated. Optionally, the author tries to enhance retrieval quality through the use of delta and double delta features.

BFCCs (Bark-Frequency Cepstral Coefficients) are computed similarly to MFCCs. They differ only in the applied scale (the Bark-scale):

signal → FT → log → Bark → DCT → BFCCs

Bark-scale and Mel-scale are perceptually motivated acoustical scales that nonlinearly map the signal frequency. Both nonlinear scales offer higher resolution for low frequencies than for high frequencies. Again, VOICEBOX is utilized to compute BFCCs. Analogously to above, the first 20 BFCCs are selected and their mean and variance are calculated. Additionally, the influence of delta and double delta features is examined.

3.4.9 Linear Predictive Coding

Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques [42], [48]. The goal of LPC is to estimate the basic parameters of a speech signal, e.g. pitch, formants, spectra, and vocal tract area functions. Formants describe the vocal tract (mouth, throat) of a speaker by its resonances. The formants are extracted by a linear predictor. The linear predictor tries to express the value of a sample by a linear combination of the values of previous samples. LPC estimates the predictor coefficients that minimize the mean square error (MSE) between the original signal and the predicted signal. The coefficients of the linear predictor represent the formants of a speech signal. LPC coefficients are employed in speech recognition to distinguish between phonemes. To the author's knowledge, LPC coefficients had not previously been applied to environmental sound recognition. In [38] the author successfully applies LPC coefficients to environmental sound recognition in the scope of animal sounds. The VOICEBOX implementation is used to obtain LPC coefficients. The first 20 coefficients computed by covariance LPC analysis are employed in the experiments.

3.4.10 Perceptual Linear Prediction

Perceptual linear prediction (PLP) was introduced by Hermansky in 1990 for speaker-independent speech recognition [26]. PLP is based on the concepts of linear predictive (LP) analysis and additionally emphasizes perceptual issues. LP analysis approximates the original signal in each frequency band equally well. This is not consistent with human hearing, where the resolution decreases with increasing frequency. PLP overcomes these problems by implementing several properties of human hearing.

In the first processing step of PLP, the windowed audio signal is Fourier transformed. The resulting power spectrum is warped to the Bark-scale. The warped spectrum is convolved with an asymmetric critical-band masking curve. The critical-band masking curve approximates the shape of auditory filters. It specifies the spectral resolution of human hearing for each frequency.
The resulting spectrum is sampled at approximately one Bark intervals. This results in 18 spectral samples for an analysis bandwidth of 0 to 5 kHz (0-16.9 Bark). The sampled values are weighted by an equal-loudness curve, that sim- ulates the sensitivity of human hearing at different frequencies. Cubic-root amplitude compression approximates the power law of hearing, that de- scribes the nonlinear relation between the intensity of sound and its per- ceived loudness. Finally, the spectral samples are approximated by an all-pole model, usually applied in LP analysis. The coefficients of the all-pole model can be used as features directly. Alternatively, they can be further transformed to 32 cepstral coefficients. The computational costs of PLP are similar to those of LP analysis. The author employs the MATLAB toolbox by Ellis that supports PLP and RASTA-PLP [17]. All 18 coefficients are used in the experiments. PLP coefficients are computed for entire files. 3.4.11 RASTA-PLP Relative spectral - perceptual linear prediction (RASTA-PLP) is an ex- tension of PLP introduced in [27]. The objective of RASTA-PLP is to make PLP more robust to spectral distortions of the communication chan- nel. RASTA-PLP considers the fact that human perception is sensitive to relative values (changes) and not to the absolute values of a signal. Human hearing is insensitive of slow variations in the input signal and constant noise introduced by the communication channel. The RASTA technique simulates this by band-pass filtering each frequency channel. From the Fourier Transform of the windowed speech signal, the critical- band spectrum is computed as with PLP. The spectral amplitudes are log- arithmized. The log critical-band spectrum is filtered by a band-pass filter. The effect of the band-pass filter is, that any constant or slowly-varying components in the spectrum are suppressed. Spectral changes below the low cut-off frequency of the filter are ignored in the output. This removes any constant or slowly-varying components from the spectrum. The high cut-off frequency is the upper limit of spectral changes which are preserved. Spectral changes above the high cut-off frequency of the band-pass filter are suppressed to smooth out artifacts (fast frame-to-frame spectral changes) caused by short-time analysis. After band-pass filtering the equal loudness curve and cubic-root ampli- tude compression is applied to the relative log spectrum, equivalent to PLP. Prior to the approximation of the spectrum by an all-pole model, the inverse logarithm of the spectrum is computed. Analogously to the PLP technique, the coefficients or their cepstral co- efficients may be employed as audio features. According to Hermansky [27], the RASTA-PLP technique outperforms the PLP technique in applications 33 where the communication channel introduces noise and spectral coloration to the signal (e.g. telephone line). The RASTA technique yields more robust results and decreases error rates in recognition. In the experiments RASTA-PLP coefficients are computed analogously to PLP coefficients. Again all 18 coefficients are selected for retrieval. 3.4.12 Zero Crossing Rate The Zero Crossing Rate (ZCR) is the number of zero-crossings in time do- main within one second. According to Kedem [29] the ZCR is a measure for the dominant frequency in a signal. The mean ZCR for entire sample files is used as feature. 3.4.13 Short-Time Energy The Short-Time Energy (STE) of an audio signal reflects the amplitude vari- ations over time. 
The main area of application of STE is the discrimination between silence and non-silence. Equation 12 illustrates the computation:

STE = \Delta t \sum_{n=1}^{N} |s[n]|^2    (12)

Mean and variance of the STE are computed for entire files.

3.4.14 LoHAS, LoLAS & AHA

The author introduces a set of new time-based low-level features for audio [38]. The features follow a simple, perceptually driven approach. A human observer distinguishes sounds, among other things, by the distribution of loud portions and silent portions. Sounds often consist of similar recurrent fragments. Animal sounds match this concept very well; for example, the tweets of a bird and the barks of a dog are repeating sounds. Human hearing uses this information to distinguish and recognize sounds. For example, the barks of a dog differ from the lowing of a cow because the repetition rate and the length of the single sounds are different. On the technical level this means that the high-energy segments differ in length. Analogously, the length of the pauses between high-energy segments contains valuable information. The introduced features are motivated by this observation. They describe characteristics of the waveform such as peaks and silence.

The features are computed based on an adaptive threshold. This threshold separates segments with high amplitudes from segments with low amplitudes in the waveform. The threshold for a particular sound sample is the sum of the mean and the standard deviation of the absolute sample values. Based on this threshold, the length of a high amplitude sequence (LoHAS) is computed. The length of a high amplitude sequence is the number of consecutive samples whose values are greater than or equal to the threshold. All LoHAS together represent the distribution of the lengths of high-energy segments in the signal. Figure 5(a) illustrates this feature. Analogously, the length of a low amplitude sequence (LoLAS) is defined as the number of consecutive samples whose values are below the threshold. The set of LoLAS describes the distribution of the lengths of silent segments in the signal. Details are depicted in Figure 5(b). The length of a high amplitude sequence contains temporal information but no information about the loudness of the signal in this section. Sequences with high amplitude can be further characterized by their area below the waveform. The area of high amplitudes (AHA) is the area between the threshold and the signal within a high amplitude sequence. In other words, the AHA feature represents the extent of high-energy segments in the signal. Figure 5(c) illustrates this concept.

Statistical properties of LoHAS, LoLAS, and AHA are considered to build features that describe entire sample files. The final features comprise mean, standard deviation, and median of LoHAS and LoLAS over the entire signal. Additionally, the mean of AHA is extracted. This results in a seven-dimensional feature vector which is used for classification.

3.5 Classification

In this section the classifiers employed in the experiments are described. There is a large number of classification techniques following different approaches. Statistical methods such as Bayes classification and Gaussian mixture models try to estimate the probability density function of the underlying data.
Figure 5: LoHAS, LoLAS, and AHA for a signal s(n) with threshold t(s(n)): (a) Length of High Amplitude Sequence (LoHAS); (b) Length of Low Amplitude Sequence (LoLAS); (c) Area of High Amplitude (AHA).

Another group of classifiers are learning algorithms that employ artificial intelligence techniques. Most algorithms fit a parametric model to the underlying data. There are supervised learning methods such as support vector machines (SVM) and neural networks, and non-supervised techniques such as self-organizing maps. A classification technique similar to self-organizing maps is learning vector quantization (LVQ). Beside parametric techniques (e.g. support vector machines) there are non-parametric techniques such as nearest neighbor.

Three supervised classifiers are selected for the experiments. The simplest way to classify feature vectors is the nearest neighbor rule. We employ a K-NN classifier, which is a generalization of the nearest neighbor classifier. In the experiments the implementation of Roger Jang is applied [28]. Additionally, the author implements learning vector quantization using standard MATLAB routines. Finally, an SVM is applied for classification with different kernels. For this purpose the OSU SVM MATLAB toolbox is used [35].

3.5.1 K-Nearest Neighbor

K-Nearest Neighbor (K-NN) is a popular non-parametric classifier. Details are given in [12]. In contrast to parametric techniques that fit a model to the data or that describe the probability distribution of the data, non-parametric techniques operate on the data directly. The data are a combination of a training set X = (x_1, ..., x_N) ∈ R^(d×N) containing N training vectors of dimension d and a vector y = (y_1, ..., y_N) ∈ R^(1×N) of corresponding class labels. The 1-NN (NN) algorithm assigns a new vector x the class label y_s of the nearest training vector x_s, where

s = \arg\min_i \|x - x_i\|, \quad 1 \le i \le N.    (13)

Similarity in nearest neighbor classification can be measured by any similarity (distance) measure; usually the Euclidean distance is used. This assignment scheme partitions the feature space according to a Voronoi tessellation. Each cell belongs to a class. Figure 6 illustrates a Voronoi tessellation in two-dimensional space. The union of all cells that are assigned to the same class is the decision region for this class.

The K-NN algorithm with K > 1 considers more than only the nearest neighbor for classification. K denotes the number of nearest neighbors of a new feature vector x that are considered for classification. From these K vectors, k_j vectors belong to class ω_j, with \sum_{j=1}^{c} k_j = K, where c is the number of classes. Vector x is assigned to the class ω_i with the greatest number of representatives in the set of K neighbors:

i = \arg\max_j k_j, \quad 1 \le j \le c.    (14)

During training the K-NN classifier learns the training set by rote. Hence, memory and computation costs grow linearly with the size of the training set (O(N)).

Figure 6: Voronoi tessellation in R^2 of a binary classification problem. Dots are feature vectors of class A, crosses are feature vectors of class B. The gray area is the decision region of class A.

In the experiments the K-NN classifier is applied with different values for K. The initial value is K = 1. K is incremented as long as the classification results improve. NN is considered to test the quality of the features.
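To make the nearest neighbor rule concrete, the following fragment sketches a minimal 1-NN classifier corresponding to Equation 13. It assumes feature vectors stored as plain float arrays and uses the squared Euclidean distance; it is an illustration only and not the toolbox implementation [28] used in the experiments.

// Minimal sketch of the nearest neighbor rule (Equation 13).
// Assumptions: all feature vectors have equal dimension, labels are ints;
// this is not the toolbox implementation used in the experiments.
public class NearestNeighbor {

    // Squared Euclidean distance between two feature vectors.
    private static float squaredDistance(float[] a, float[] b) {
        float sum = 0.0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Assigns the query vector x the class label of its nearest training vector.
    public static int classify(float[][] training, int[] labels, float[] x) {
        int nearest = 0;
        float bestDist = Float.MAX_VALUE;
        for (int i = 0; i < training.length; i++) {
            float dist = squaredDistance(training[i], x);
            if (dist < bestDist) {
                bestDist = dist;
                nearest = i;
            }
        }
        return labels[nearest];
    }
}

A K-NN variant with K > 1 would collect the K smallest distances and return the majority label among them, as described by Equation 14.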
Features that discriminate the classes well provide disjoint partitions of the feature space. Satisfactory results with the NN algorithm indicate such a partitioning of the feature space.

3.5.2 Learning Vector Quantization

Learning Vector Quantization (LVQ) is a classification technique belonging to the basic competitive neural networks. It was introduced by Kohonen [31] and is related to self-organizing maps, also by Kohonen [31]. The LVQ algorithms approximate class distributions of pattern vectors. According to their creator, LVQ algorithms define very good approximations of the optimal decision borders.

Let x be a sample vector and S_k be the k-th class of an N-class classification problem. We first assign a subset of codebook vectors to each class S_k and then search for the codebook vector m_i with the smallest Euclidean distance from x. It is possible to perform this assignment without intermixing codebook vectors that belong to different classes, even if the class distributions overlap. The sample x is thought to appertain to the same class as the closest m_i. The decision border is defined by the codebook vectors closest to the class border. The m_i have to be placed in the signal space in such a way that the nearest-neighbor rule minimizes the average expected misclassification probability. Let

c = \arg\min_i \{\|x - m_i\|\}    (15)

define the index of the m_i nearest to x. Let x = x(t) be a time-series sample of input, and let the m_i(t) represent sequential values of the m_i in the discrete-time domain. LVQ1, the basic learning vector quantization process, is given in Equations 16 to 18:

m_c(t+1) = m_c(t) + \alpha(t)[x(t) - m_c(t)], \quad x, m_c \in S_k    (16)
m_c(t+1) = m_c(t) - \alpha(t)[x(t) - m_c(t)], \quad x \in S_i, m_c \in S_j, i \ne j    (17)
m_i(t+1) = m_i(t), \quad i \ne c    (18)

The asymptotic values of m_i obtained in the above process define a vector quantization for which the rate of misclassification is approximately minimized. The learning rate α(t) is usually made to decrease monotonically with time. Kohonen recommends an α < 0.1. The exact law α = α(t) is not crucial. If only a restricted set of training samples is available, they may be applied cyclically, and α(t) may even be made to decrease linearly to zero. The basic LVQ algorithm is illustrated in Figure 7. The screenshots were made with the LVQ visualization tool developed by Borgelt [3].

The optimized-learning-rate LVQ1 (OLVQ1) is an improved version of the LVQ1 presented above. OLVQ1 differs from LVQ1 in that it uses an individual learning rate α_i(t) assigned to each m_i. Let c be defined as in Equation 15, and let f(x) = +1 and f(x) = -1 denote correct respectively incorrect classification of x. Equations 19 to 21 define the new process:

m_c(t+1) = m_c(t) + \alpha_c(t)[x(t) - m_c(t)], \quad f(x) = +1    (19)
m_c(t+1) = m_c(t) - \alpha_c(t)[x(t) - m_c(t)], \quad f(x) = -1    (20)
m_i(t+1) = m_i(t), \quad i \ne c    (21)

If all samples are used with equal weight, the statistical accuracy of the learned codebook vectors is approximately optimal. OLVQ1 is not the only derivative of the LVQ algorithm; several others exist (LVQ2, LVQ3, etc.).

Kohonen suggests the use of the same number of codebook vectors for each class. The upper limit of the total number of codebook vectors is determined by time and computational constraints. In the experiments an LVQ with 8 hidden neurons, a learning rate of 0.01, and 200 epochs is used. The classifier is supplied with the distribution of classes in the training set.
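The core of the LVQ1 process in Equations 16 to 18 can be sketched as a single update step. The following fragment is an illustrative sketch only: codebook initialization, the training loop, and the learning-rate schedule are assumed to exist elsewhere, and it is not the MATLAB implementation used in the experiments.

// Minimal sketch of one LVQ1 update step (Equations 16-18).
// Assumptions: codebook vectors and their class labels are already initialized,
// alpha is the current learning rate; not the MATLAB routine used in the experiments.
public class Lvq1 {

    // Performs one LVQ1 update for the training sample x with label xLabel.
    // The nearest codebook vector is moved towards x if the labels match and
    // away from x otherwise; all other codebook vectors remain unchanged.
    public static void updateStep(float[][] codebook, int[] codebookLabels,
                                  float[] x, int xLabel, float alpha) {
        // find the index c of the nearest codebook vector (Equation 15)
        int c = 0;
        float bestDist = Float.MAX_VALUE;
        for (int i = 0; i < codebook.length; i++) {
            float dist = 0.0f;
            for (int d = 0; d < x.length; d++) {
                float diff = x[d] - codebook[i][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                c = i;
            }
        }
        // move m_c towards x (Equation 16) or away from x (Equation 17)
        float sign = (codebookLabels[c] == xLabel) ? 1.0f : -1.0f;
        for (int d = 0; d < x.length; d++) {
            codebook[c][d] += sign * alpha * (x[d] - codebook[c][d]);
        }
    }
}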
3.5.3 Support Vector Machines

A popular classifier is the support vector machine (SVM) [4][51]. SVMs are supervised, statistical learning methods applicable to classification and regression. They are also known as maximum-margin classifiers.

Given two separable clouds of points (x_1, y_1), ..., (x_k, y_k), where x_i ∈ R^n and y_i ∈ {-1, +1}, an SVM constructs an optimal separating hyperplane w·x + b = 0 that maximizes the distance between the hyperplane and the nearest data point of each cloud (these points are the support vectors). The distance between the support vectors and the hyperplane is called the margin. Figure 8 depicts the difference between a suboptimal and an optimal separating hyperplane. The hyperplane is not constructed in feature space; instead, the saddle point of the following Lagrange functional is calculated:

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \{ y_i [(w \cdot x_i) + b] - 1 \},    (22)

where the α_i are the Lagrange multipliers.

Figure 7: The learning process of the LVQ classifier: (a) the original data set with color-coded class labels; (b) the circles are the initial codebook vectors for each of the 3 classes; (c) and (d) display the path of the codebook vectors after 25 and 60 learning steps while they move towards the group of training patterns that have the same class.

Figure 8: Optimal separating hyperplanes (OSH): (a) g, h, i are valid but not optimal hyperplanes. (b) k is the OSH; the distance between k and m1 respectively m2 is equal and maximal.

Equation (22) may be transformed into problem (23), which is easier to solve:

\bar{w} = \sum_{i=1}^{l} \bar{\alpha}_i y_i x_i, \qquad \bar{b} = -\frac{1}{2} \bar{w} \cdot [x_r + x_s],    (23)

where x_r and x_s are two arbitrary support vectors with \bar{\alpha}_r, \bar{\alpha}_s > 0, y_r = 1, y_s = -1.

Slack variables ζ_i and a penalty function F(ζ) = \sum_{i=1}^{l} ζ_i are the means by which SVMs become applicable to the non-separable case [11]. The separating hyperplane is constructed in such a manner that the number of falsely classified x_i is minimal, which consequently minimizes F(ζ). The slack variables only influence the Lagrange multipliers α_i; hence the solution of the optimization problem stays the same as for the separable case.

In practice most problems are not linearly separable. Instead of identifying a non-linear separating function, the data points are transformed into a higher-order space in which they become linearly separable. This is achieved by the use of kernels. Figure 9 illustrates the effect of a polynomial kernel that maps the input space into a feature space of higher order. Equation (24) describes the SVM classification, where K(x_i, x) is the kernel used:

f(x) = \mathrm{sign}\left( \sum_{\text{support vectors}} \bar{\alpha}_i y_i K(x_i, x) + \bar{b} \right)    (24)

Figure 9: The kernel maps the one-dimensional input space (a), which is not linearly separable, into a feature space of higher dimensionality, where the inputs become linearly separable (b).

There are three typical kernel functions:

1. polynomial: K(x_i, x_j) = [(x_i · x_j) + 1]^d,
2. Radial Basis Function (RBF): K(x_i, x_j) = exp(-(x_i - x_j)^2 / 2γ^2), and
3. sigmoid: K(x_i, x_j) = tanh(scl · (x_i · x_j) - off),

where scl (scale) and off (offset) are parameters that have to be chosen with care; the kernel becomes invalid for certain parameter values. Kernel functions are not limited to the ones mentioned above: any symmetric function that satisfies the conditions of Mercer's theorem is a valid kernel function [2].

Beside K-NN and LVQ, an SVM classifier is applied in the experiments.
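For illustration, the three kernel functions listed above can be written down directly. The following sketch uses the parameter names d, gamma, scl, and off from the text; it is only an illustration and not the OSU SVM toolbox implementation [35] used in the experiments.

// Minimal illustration of the three kernel functions listed above.
// Parameter names follow the text; this is not the toolbox implementation.
public class Kernels {

    // Dot product of two feature vectors.
    private static float dot(float[] a, float[] b) {
        float sum = 0.0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Polynomial kernel: K(xi, xj) = [(xi . xj) + 1]^d
    public static double polynomial(float[] xi, float[] xj, int d) {
        return Math.pow(dot(xi, xj) + 1.0, d);
    }

    // RBF kernel: K(xi, xj) = exp(-||xi - xj||^2 / (2 * gamma^2))
    public static double rbf(float[] xi, float[] xj, double gamma) {
        double squaredDist = 0.0;
        for (int i = 0; i < xi.length; i++) {
            double diff = xi[i] - xj[i];
            squaredDist += diff * diff;
        }
        return Math.exp(-squaredDist / (2.0 * gamma * gamma));
    }

    // Sigmoid kernel: K(xi, xj) = tanh(scl * (xi . xj) - off)
    public static double sigmoid(float[] xi, float[] xj, double scl, double off) {
        return Math.tanh(scl * dot(xi, xj) - off);
    }
}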
Since there is no method to determine the optimal kernel function in advance, different kernels are tested. Beside a linear kernel, polynomial kernels of second and third order and an RBF kernel are employed.

4 Results

In this section the results of the experiments are presented. First, the author discusses the performance of the individual features. Features that suboptimally discriminate the classes contained in the test set are considered first. The results of the individual features are depicted in more detail in Section 4.1. The results obtained by the combination of multiple features are illustrated in Section 4.2. Finally, the quality of the classifiers employed in the experiments is compared in Section 4.3.

4.1 Individual Features

The first test series is concerned with individual features. Each feature is computed from the test and training set described in Section 3.2. The three selected classifiers (K-NN, LVQ, and SVM) are employed to test each feature. The classifiers are trained on the training set to construct a model of the data. The test data serve to validate the model. After classification, recall and precision of the classifiers' results are determined. Data from the training set are not incorporated in the evaluation.

Some of the features in the experiments do not perform sufficiently well. They are discussed in Sections 4.1.1 and 4.1.2. The well-performing features are considered afterwards.

4.1.1 Basic Signal Processing Transforms

In the experiments the author employs basic signal processing transforms such as DFT, DCT, and DWT to compute features. The experiments show that the first few transform coefficients of DFT, DCT, DWT, and CQT insufficiently discriminate the animal sounds. The reason is that significant information in higher frequency bands is not contained in the first transform coefficients. In the case of animal sounds, high frequencies contain significant information (e.g. for cats and birds).

Table 1 shows mean recall and mean precision over all classes obtained by the classifiers in the experiments. The classifiers represent the rows of the table and the different feature sets are arranged in the columns. A recall of ≤ 0.25 denotes that the corresponding feature does not discriminate the classes: since the test data contains four classes, a recall of 0.25 is equivalent to guessing the class labels. In Table 1 such results are marked in italics.

            DFT                  DCT                  DWT
        mean R.  mean P.     mean R.  mean P.     mean R.  mean P.
K-NN     0.51     0.51        0.49     0.50        0.25     0.06
LVQ      0.25     0.05        0.25     0.06        0.25     0.06
SVM      0.48     0.49        0.29     0.37        0.25     0.05

Table 1: Mean recall and mean precision over all classes obtained with the features extracted from the signal processing transforms (DFT, DCT, and DWT). The rows show the results for the different classifiers.

The best results for the DFT feature are obtained by the K-NN classifier with K = 1. The SVM employed for the DFT feature uses a polynomial kernel of second order. The DCT feature is classified best by the K-NN classifier (again with K = 1). The feature data of the DWT cannot be explained by any of the three classifiers.

To rule out the possibility that the number of features, respectively the number of coefficients, is too small, the dimension of the feature vectors is increased. For example, the first 200 Fourier coefficients yield a mean recall of 0.44 over all classes and a mean precision of 0.5. Results degrade when the dimension of the feature vector is increased further. The same behavior can be observed for DCT coefficients.
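For reference, recall and precision are computed per class in the usual way; the following is a brief restatement of the standard definitions, which the experiments are assumed to follow:

\text{recall} = \frac{TP}{TP + FN}, \qquad \text{precision} = \frac{TP}{TP + FP}

Here TP denotes the samples of a class that are correctly retrieved, FN the samples of that class assigned to another class, and FP the samples of other classes wrongly assigned to the class under consideration.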
Contrary to expectations, the DWT performs worst, followed by the DCT. The DFT feature yields the best results in the context of the basic signal processing transforms.

4.1.2 Spectral Flux and Short-Time Energy

In contrast to the high-dimensional features discussed in Section 4.1.1, Spectral Flux (SF) and Short-Time Energy (STE) have a dimension of two. The performance of low-dimensional features is usually below that of high-dimensional features, because low-dimensional features are not able to sufficiently represent the samples.

Mean recall and mean precision over all classes of SF are about 0.47. This result is poor at first sight. Compared to other features such as the DCT feature described in Section 4.1.1, SF yields competitive results despite its much lower dimension. SF is more meaningful than the basic signal processing transforms, but SF alone cannot discriminate the sounds. In combination with other features SF may improve results, as illustrated in Section 4.2.

STE is only useful in classification based on frames. When STE is computed for entire files, it represents the average energy of the sound sample, which does not provide meaningful information.

4.1.3 Zero Crossing Rate

The Zero Crossing Rate (ZCR) is a measure for the dominant frequency in a signal. Despite its dimension of one, ZCR provides good discrimination of animal sounds, as illustrated in Table 2.

ZCR       K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.79     0.77       0        0          0.76     0.80
cat      0.53     0.59       0        0          0.65     0.67
cow      0.37     0.37       1        0.23       0.41     0.48
dog      0.40     0.36       0        0          0.57     0.45

Table 2: Results (recall and precision) of ZCR for each class (rows), obtained by different classifiers (columns).

Cats and birds are discriminated best by the ZCR. The distinction between dogs and cows causes problems for the ZCR. Analysis of the retrieval results of the SVM shows that 37% of the cows are recognized as dogs and 33% of the dogs are classified as cows. The reason for the poor separation of dog sounds and cow sounds may be the similar frequency characteristics of these two classes.

The SVM with a linear kernel provides the best overall results. LVQ is not able to distinguish between the classes at all; it recognizes all animal sounds as sounds of cows. The low dimension and the low computational complexity of the ZCR qualify this feature for the distinction of animal sounds in combination with other features.

4.1.4 Pitch

Pitch is a perceptual feature that represents the perceived frequency of a sound. The statistical moments of first and second order are used as features. The results are depicted in Table 3.

Pitch     K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.34     0.43       1        0.26       0.06     0.6
cat      0.63     0.71       0        0          0.71     0.78
cow      0.71     0.64       0        0          0.92     0.48
dog      0.68     0.53       0        0          0.68     0.58

Table 3: Results (recall and precision) of Pitch for each class (rows), obtained by different classifiers (columns).

The results are promising for cats, cows, and dogs. The class of birds is poorly discriminated. Contrary to expectations, a majority of 63% of the birds are classified as cows by the SVM. This results in a low recall for the class of birds. The K-NN classifier yields more balanced results but with a similar distribution. Again, LVQ is not applicable with this feature.

4.1.5 Constant Q Transform

The Constant Q Transform (CQT) is an extension of the FT developed for music information retrieval. Its coefficients are used as features.
Retrieval results of the CQT are given in Table 4.

CQT       K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.93     0.55       0        0          1        0.30
cat      0.38     0.95       0.55     0.40       0.04     1
cow      0.41     1          0.60     0.60       0.04     1
dog      0.86     0.53       0.90     0.53       0.24     0.50

Table 4: Results (recall and precision) of CQT for each class (rows), obtained by different classifiers (columns).

The CQT separates birds and dogs well but provides a low precision for these classes. This indicates that multiple samples of other classes are assigned to these classes. Analysis of the retrieval data confirms this assumption: a majority of cats and cows are recognized as dogs and birds. Although the mean recall of 0.65 over all classes and the mean precision of 0.75 obtained by the K-NN classifier appear to be a fair average, a more specific analysis of the data shows that the retrieval quality is not well balanced among the classes. The CQT does not qualify for animal sound retrieval.

4.1.6 Sone

The Sone feature contains perceived loudness information of 40 frequency bands. The results obtained by this feature are shown in Table 5.

Sone      K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.71     0.74       0.91     0.68       0.80     0.67
cat      0.39     0.60       0.45     0.83       0.37     0.61
cow      0.71     0.59       0        0          0.64     0.83
dog      0.74     0.56       0.97     0.42       0.90     0.58

Table 5: Results (recall and precision) of Sone for each class (rows), obtained by different classifiers (columns).

Sone performs well on the data set. All classes except cats yield relatively high recall and precision. The author remarks that a hit rate between 60% and 70% is a satisfactory retrieval result for environmental respectively animal sounds. The SVM with a linear kernel performs best on the Sone feature. Although it separates birds, cows, and dogs well, it is weak at recognizing cat sounds: 24% of the samples that contain cat sounds are predicted as birds and 30% are considered dogs. The results of the K-NN classifier are slightly lower than those of the SVM and are more balanced. K-NN shows similar weaknesses as the SVM for this feature. LVQ yields results below those of SVM and K-NN.

4.1.7 Perceptual Linear Prediction

Perceptual Linear Prediction (PLP) is a feature introduced in and successfully applied to speech recognition. As Table 6 shows, PLP may also be applied to the discrimination of animal sounds.

PLP       K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.89     0.81       0.94     0.64       0.69     0.65
cat      0.35     0.87       0.45     0.88       0.36     0.51
cow      0.64     0.58       0        0          0.59     0.65
dog      0.78     0.49       0.99     0.45       0.72     0.51

Table 6: Results (recall and precision) of PLP for each class (rows), obtained by different classifiers (columns).

The K-NN classifier with K = 5 performs best on the PLP feature data. It separates the class of birds very well (recall = 0.89 and precision = 0.81). The results for the classes cow and dog are moderate. Sounds of cats pose a problem for all three classifiers in the test: a majority of cat sounds are recognized as dogs. Dogs in turn are confused with cows and vice versa.

In the context of the test database of animal sounds, PLP is mainly applicable to the distinction of birdsong from other animal sounds. Besides, it must be pointed out that PLP is a speech recognition technique that was not designed to distinguish between animal sounds.

4.1.8 RASTA-PLP

RASTA-PLP is an extension of PLP that considers additional properties of human hearing. The RASTA technique improves the retrieval results, as illustrated in Table 7. The results of RASTA-PLP are similar to the results of PLP.
Again, the class of cat sounds is not discriminated well, while most birds are correctly classified.

RASTA-PLP K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.82     0.73       0        0          0.78     0.8
cat      0.48     0.58       0        0          0.47     0.67
cow      0.61     0.72       1        0.24       0.74     0.60
dog      0.81     0.64       0.08     1          0.78     0.65

Table 7: Results (recall and precision) of RASTA-PLP for each class (rows), obtained by different classifiers (columns).

In contrast to PLP, the RASTA technique improves the recognition of cows and dogs. SVM and K-NN perform equally well. LVQ does not produce feasible data. A mean recall of 0.69 and a mean precision of 0.68 are satisfactory values for a single feature in this domain. Whether the additional computational costs of the RASTA technique justify the improved retrieval quality depends on the application domain. With respect to the data set used, RASTA-PLP is more reasonable than PLP.

4.1.9 LPC

Similarly to the PLP technique, LPC is a popular feature in speech recognition. Furthermore, it is utilized for signal compression. LPC coefficients may be represented in many different ways, such as autoregressive coefficients, cepstral coefficients, and reflection coefficients [5]. For the data set used, the representation as impulse response is the best choice. 20 LPC coefficients are extracted from each sound sample. The results are illustrated in Table 8.

LPC       K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.74     0.81       0.94     0.92       0.91     0.92
cat      0.52     0.63       0.86     0.74       0.85     0.73
cow      0.76     0.76       0.73     0.95       0.76     0.88
dog      0.88     0.64       0.74     0.74       0.69     0.74

Table 8: Results (recall and precision) of LPC for each class (rows), obtained by different classifiers (columns).

LPC is the first feature in these experiments that yields average recall and precision above 70%. LPC coefficients discriminate all classes well. The best results are obtained by the SVM with an RBF kernel: the average recall and precision over all classes are about 80%. The SVM only experiences problems with the dog class, where 21% of the sounds are classified as cat sounds. All other classes are well separated. The NN classifier explains the data suboptimally in comparison with the SVM; especially the cat class cannot be distinguished well from the other classes with NN. Contrary to the experience gained in the previous experiments, LVQ demonstrates high performance, comparable to the other classifiers; its results in Table 8 are even better than those of the SVM. However, since LVQ chooses its initial codebook vectors randomly, the results in Table 8 are not as significant as the results of deterministic classifiers such as K-NN and SVM. In another run, LVQ obtained a mean recall and precision of about 70%.

The results of LPC for the distinction of animal sounds are surprising, because LPC is usually employed in speech recognition. The experiments show that LPC may be successfully applied in other domains as well and must not be excluded from environmental respectively animal sound recognition.

4.1.10 MFCC and BFCC

MFCC and BFCC are among the most popular features for audio analysis and speech recognition. The first 20 MFCCs and BFCCs are considered as features [5]. Delta and double delta cepstral features perform poorly and are not used. MFCCs and BFCCs perform nearly identically. This results from the fact that both are cepstral-domain features that only differ in the psycho-acoustical scaling. Tables 9 and 10 depict the retrieval results of MFCC and BFCC.
MFCCs deliver the best results with the K-NN classifier and K = 1 (mean recall of 0.81 and mean precision of 0.83). This indicates that MFCCs cluster the feature space according to the classes. The SVM with a linear kernel yields similar results for MFCCs. LVQ provides slightly lower performance for these features. MFCCs are well suited to discriminate the classes of animal sounds.

MFCC      K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.82     0.91       0.79     0.83       0.78     0.84
cat      0.70     0.83       0.79     0.69       0.80     0.68
cow      0.81     0.97       0.77     0.91       0.77     0.98
dog      0.90     0.60       0.72     0.70       0.82     0.75

Table 9: Results (recall and precision) of MFCC for each class (rows), obtained by different classifiers (columns).

Analysis of the misclassified test samples shows that a majority of the cat and cow sounds are assigned to the dog class by the NN classifier. The SVM detects 20% of the bird sounds as cats and only 12% of the cats as dogs. Most misclassified cow sounds are assigned to the class of cat sounds by the SVM.

BFCC      K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.84     0.87       0.83     0.77       0.81     0.77
cat      0.77     0.86       0.79     0.73       0.74     0.74
cow      0.73     0.89       0.74     0.85       0.73     0.98
dog      0.93     0.67       0.74     0.78       0.90     0.75

Table 10: Results (recall and precision) of BFCC for each class (rows), obtained by different classifiers (columns).

As expected, BFCCs perform equally well in comparison with MFCCs. Again, the K-NN classifier (K = 1) performs best, followed by SVM and LVQ. The best classification is obtained for the class of dogs. The distribution of the misclassified samples is similar to that of MFCC: bird sounds are confused with cat sounds, 13% of the cats are assigned to the dog class, and 12% of the cows are recognized as cats.

The experiments show that MFCC and BFCC are equally well applicable to the distinction of animal sounds. They can be computed efficiently, and a relatively low number of coefficients suffices to yield satisfactory results. An average recall and precision above 80% is obtained with the SVM using only the first seven MFCCs. This indicates that the succeeding coefficients do not contain information useful for classification by the SVM. For the first seven BFCCs, results degrade slightly to a mean recall of 0.77 and a mean precision of 0.79. Most information is contained in the first two MFCCs respectively BFCCs. Classification based on the first two coefficients yields results that explain about 70% of the data (mean recall and mean precision of about 0.7). These results are very good considering the low dimension of the feature.

The experiments do not show crucial advantages of either feature. The combination of both features does not increase retrieval quality because they characterize the same properties of the signal. Their computational complexity is equal. The choice between MFCC and BFCC depends on the application domain in which they are employed.

4.1.11 Amplitude Descriptor

The last feature tested is the Amplitude Descriptor (AD) introduced in [38]. The AD consists of LoHAS (mean, standard deviation, median), LoLAS (mean, standard deviation, median), and AHA (mean). The results obtained by the AD are given in Table 11.
AD        K-NN                 LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.80     0.79       0.84     0.65       0.90     0.76
cat      0.54     0.78       0.72     0.70       0.68     0.81
cow      0.74     0.78       0.49     0.93       0.72     0.78
dog      0.93     0.64       0.78     0.70       0.86     0.81

Table 11: Results (recall and precision) of AD (LoHAS, LoLAS, and AHA) for each class (rows), obtained by different classifiers (columns).

The simple but intuitive features contained in the AD capture the properties of the specific classes well. The combination of LoHAS, LoLAS, and AHA is able to explain most of the test data. Classification with the SVM and a linear kernel provides an average recall and precision of 0.79. All classes are separated well from each other. The minimum recall is 0.68 for cats. Dogs are separated best from the other classes: only 8% of the dogs are assigned to cows and 5% are classified as cats. Most misclassified cows (11%) are recognized as cats. Cats are separated suboptimally; 19% of all cat sounds are predicted to be bird sounds by the SVM. 90% of all birds are correctly classified, and the remaining 10% of bird sounds are uniformly distributed over the other classes. The precision is high for all classes (above 0.76). Satisfactory results are also achieved with LVQ and K-NN.

The results of the AD are comparable to other well-performing features such as MFCC, BFCC, and LPC. For the NN classifier, recall and precision of the AD lie between those of LPC and MFCC. The results of LVQ for the AD are slightly below those of the other well-performing features.

4.2 Combined Features

Up to now the author concentrated on individual features. In order to improve retrieval quality, several features are combined into one feature vector. This makes sense because the combination aggregates information present in separate features. In the second test series, the author experiments again with all features employed in the first series. The possibility cannot be ruled out that a weak feature attains synergy together with another feature. That is why even poorly performing features such as Zero Crossing Rate and Spectral Flux are considered.

An optimal solution to the retrieval problem is determined empirically. The author first combines the best-performing features and observes the synergy effects between them. Features that do not improve retrieval quality are not selected. Then weaker features are added to the selection. Again, only features that improve the results remain in the combination. An SVM is employed to evaluate the discriminative power of the combinations. The SVM is chosen because it is well applicable to high-dimensional feature vectors. Such feature vectors may emerge in this series of experiments when multiple features are combined into one high-dimensional vector.

This strategy finally yields a feature vector that comprises 26 components. The first six components are mean, standard deviation, and median of LoHAS respectively LoLAS. The first LPC coefficient is left out because it always contains a constant value; the succeeding four LPC coefficients are added to the selection. Additionally, the first 13 MFCCs are selected. Further components are the mean SF, the mean Pitch, the first RASTA-PLP coefficient, and the mean of Sone. The results of the feature combination are illustrated in Table 12.
Combination K-NN               LVQ                  SVM
        Recall  Precision   Recall  Precision   Recall  Precision
bird     0.86     0.94       0.98     0.91       0.97     0.94
cat      0.83     0.81       0.83     0.79       0.88     0.91
cow      0.82     0.96       0.72     0.88       0.81     0.97
dog      0.90     0.74       0.78     0.75       0.96     0.80

Table 12: Results (recall and precision) of the combined feature vector for each class (rows), obtained by different classifiers (columns).

Classification based on this feature vector yields an average precision and recall above 0.9 using the SVM with a linear kernel. This is a significant improvement over the results with individual features. There are only a few misclassifications worth mentioning: 8% of the cats are assigned to the class of dogs, 9% of the cow sounds are predicted as dog sounds and 8% as cat sounds. All other classes are well separated, and misclassifications are below 3%. LVQ and K-NN profit from the combined feature vector as well. LVQ correctly predicts more than 80% of the test samples' classes. K-NN obtains an average recall of 0.85 and an average precision of 0.86.

The combination of features increases retrieval quality. At the same time, the computational complexity increases because more features have to be computed. Especially in mobile applications there is always a trade-off between retrieval quality and computational costs.

4.3 Comparison of Classifiers

All classifiers in the experiments achieve satisfactory recall and precision (between 0.7 and 0.9). The classifiers do not perform equally well on different features. For very poor features such as DFT and DCT, only SVM and NN yield consistent results. LVQ is not able to explain any of the data obtained from poor features. Useful results with LVQ are mainly achieved for well-performing features such as MFCC and LPC.

SVM is different from the other classifiers used. K-NN and LVQ depend directly on the clustering of the samples in feature space. They deliver satisfactory results when the classes form disjoint clusters. In contrast, SVM constructs a more abstract parametric model, such as a linear or polynomial model, depending on the kernel used. As a consequence, SVM depends less on the distribution of the samples in feature space. A model of low order tends towards greater generalization ability, while with a model of high order classification runs the risk of overfitting.

SVM and K-NN perform comparably on the test data; we cannot identify a winner among the two classifiers. For high-dimensional feature vectors SVM usually outperforms K-NN. The feature vectors in the experiments have relatively low dimension, which may be the reason for the similar performance of SVM and K-NN in these investigations. In most experiments K-NN is used with K = 1. K-NN is tested with different K > 1 to find an optimal classification; in the majority of cases K = 1 yields the best results.

The author employs different SVM kernels for the discrimination of animal sounds. The linear kernel is well suited for most features, which indicates a clear structuring of the feature data. The RBF kernel demonstrates good performance on the LPC feature. Polynomial kernels are outperformed by linear and RBF kernels.

The computational complexities of the classifiers investigated differ. Measurements refer to the time the classifier takes for training and for classifying the test samples. Classification of the combined feature using the SVM with a linear kernel takes 63 milliseconds. The K-NN classifier takes 78 milliseconds for the same task. LVQ is much slower at classifying. Useful results are obtained from a minimum of 15 epochs.
The LVQ takes 1.38 seconds for 15 epochs, where most of the time is spent on training. All three classifiers are well suited for frame-based classification. Furthermore, they may be employed in mobile respectively realtime applications. In contrast to LVQ, SVM and K-NN can be (re)trained very quickly.

5 Related Work

There are different groups of audio retrieval techniques. Numerical representation of the signals by features is common to all methods. Approaches can be grouped by the way similarity among different signals is detected.

A straightforward technique is to apply a distance measure directly to the features. Pioneering work in this area concerning audio is performed in [54]. The authors develop a content-based audio retrieval system (Muscle Fish) that distinguishes classes such as animals, machines, musical instruments, telephone, etc. They extract features such as loudness, pitch, brightness, and bandwidth. Similarity is measured using a weighted Euclidean distance (Mahalanobis). Classification is accomplished by the nearest neighbor rule.

An alternative to the direct measurement of similarity is the use of artificial intelligence techniques such as Support Vector Machines (SVM) [11], Hidden Markov Models (HMM), or Artificial Neural Networks (ANN). An early example in the domain of audio processing is presented in [19]. The authors apply a self-organizing neural network to cluster similar sounds. Another way of classification is based on template matching [21]. The author extracts MFCC features from the audio signal and clusters the feature space into distinct cells with a quantization tree (Q-tree). Histograms are considered as templates; they represent the distribution of feature vectors over the partitions of the tree. Templates are compared by distance measures (e.g. Euclidean distance or cosine distance).

Segmentation is an important preprocessing step of audio analysis. It is employed to discriminate different types of sound such as speech, music, environmental sounds, and combinations of these. The authors of [45] separate music and speech with low-level features. They apply Spectral Centroid, Spectral Flux, Zero Crossing Rate, Spectral Roll-off, and Percentage of Low Energy Frames to represent the audio signal. Different classification techniques such as Gaussian mixture models (GMM) and nearest neighbor are used to separate speech from music based on these features. The same task is accomplished in [7] using a different set of features (e.g. Amplitude, Cepstra, and Pitch). A more comprehensive study of audio segmentation is necessary to separate environmental sounds from speech and music. In [58] the authors successfully separate speech, music, song, environmental sounds, and some selected combinations of these sound types. Features for this purpose include Energy, ZCR, Fundamental Frequency, and Spectral Peak.

Based on successful segmentation of an audio stream, the different audio types can be further analyzed. The most intensive research took place in the area of speech recognition. Beside classical recognition of speech [41], researchers focus on recognition of the spoken language [39]. Another field of research is classification of the speaker (e.g. for customization issues or authentication) [43]. In the area of multimodal dialog systems, recognition of human emotions from audio gained focus [8]. The different areas of speech processing are a source for surveying state-of-the-art audio features.
Beside speech recognition, music information retrieval (MIR) gained importance through the availability of huge amounts of digital music. MIR consists of classification and structural analysis. Classification concerns the recognition of instruments, artists, and genres. Multiple speech recognition features are applicable to the classification of music. In [33] the authors distinguish between instruments (e.g. Brass, Keyboard, and String) by extracting features such as ZCR, STE, Bandwidth, Pitch, Formant Frequencies, and MFCCs. These features are computed from short frames of the audio signal. The means and standard deviations of the features over all frames add up to the final feature vector that represents the signal. Classification is performed by GMM and NN. Instrument recognition is also proposed in [37]. The authors extract Pitch, Onset Asynchrony, and information about Tremolo and Vibrato of the audio sample. The Fisher projection method is used to build a hierarchical Fisher classifier. Music genre classification is addressed in [24]. In this paper the authors propose the discrete wavelet packet decomposition transform to distinguish music genres.

Structural music analysis tries to extract similarities and recurrences in a piece of music. A comprehensive structural analysis is performed in [36]. Autocorrelation is computed to extract Rhythm from the wavelet-decomposed signal. Pitch Class Profiles in combination with HMMs separate chords. Vocal and instrumental sections are characterized in terms of Octave-Scaled Cepstral Coefficients (OSCCs). An SVM trained with OSCC features separates vocal from instrumental sections.

Environmental sound recognition is concerned with the analysis of sounds that originate neither from speech nor from music. The range of environmental sounds is extremely wide. Hence, most investigations concentrate on a restricted domain. A popular research field is audio recognition in broadcasted video. In [34] the authors recognize the scene content of TV programs (e.g. weather reports, advertisements, basketball and football games) by analyzing the audio track of the video. They extract Pitch, Volume Distribution, Frequency Centroid, and Bandwidth to characterize TV programs. Classification is performed by an appropriate neural network for each class. A well investigated problem is highlight detection in sport videos. The authors of [47] retrieve crucial scenes in soccer games by analyzing play-breaks. Whistles, which often indicate play-breaks in sports, are detected using the Spectral Energy within an appropriate frequency band. Another indicator for highlights is the audience; excitement is quantified by Loudness, Silence, and Pitch. A similar approach is followed in [55]. The authors analyze keywords in commentator speech and audience reactions which are relevant to important actions of the game. They apply an HMM trained with low-level features (Energy and MFCCs including delta and double delta features) to recognize the keywords. The investigations presented in [56] address the extraction of highlights in baseball games. Beside visual features the authors extract audio features (e.g. MFCC, Pitch, Entropy). An SVM detects excitement of the audience. Template matching is applied for baseball hit detection. These two audio cues are combined to improve the quality of highlight detection. Another area of interest is surveillance and intruder detection. The authors of [9] detect intruders in a room by monitoring variations in a room-specific transfer function.
A broad survey of audio features and classification techniques in the context of automatic surveillance is given in [13]. In [57] a multilevel classification is proposed. First the authors apply a coarse-level segmentation to separate speech, music, and environmental sound. In a second step an HMM is considered to analyze environmental sounds (e.g. footsteps, laughter, rain, windstorm). The authors of [30] present an audio indexing system using MPEG-7 features. They apply Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP) descriptors to distinguish classes such as "Dog", "Bell", "Water", and "Baby" with HMMs. They show that MPEG-7 descriptors perform similarly to MFCC. SVMs are successfully applied to environmental sound recognition in [22]. The authors compare and combine cepstral features (MFCCs) with perceptual features (Total Spectrum Power, Subband Powers, Brightness, Bandwidth, and Pitch). In [22] perceptual features outperform cepstral features; the best results are obtained by a combination of both. Furthermore, in [22] SVM performs better than NN and K-NN.

A challenging area of environmental sound recognition is life logging. This research field is concerned with continuously analyzing the environmental sounds around a human user. From this information a diary is built in which major events and the user's activities are stored. Fundamental research in the domain of life logging is performed with the "Forget-me-not" system [32]. "Forget-me-not" is a mobile application that analyzes the activities of a user in his office. This includes monitoring the workstation, telephone, printer, and the location of the user. In [1], Aizawa presents a life logging system that captures video and audio. Audio information is used to detect human voice in order to recognize conversation scenes. The system supports GPS and provides inertial trackers to measure motion. Additionally it has access to documents, web pages, and emails. The applications discussed in this section prove the importance of environmental sound recognition for future information systems.

6 Conclusion & Future Work

Discrimination of animal sounds is a rarely considered area of environmental sound recognition. While some investigations on environmental sound recognition involve animal sounds among other sounds, there is little work on the discrimination of animal sounds from each other. This thesis was concerned with content-based retrieval of animal sounds. Due to the lack of a publicly available database of animal sounds, the author collected numerous audio samples from the internet and built a database that contains about 380 animal sounds arranged in four classes. A survey of widely used audio features and classifiers was presented. The research focus was the investigation of their applicability to animal sound recognition. The experiments show that popular features employed in speech recognition, such as LPC coefficients and MFCCs, separate the classes of animal sounds well. Furthermore, the author observes that low-complexity features such as Fourier coefficients and Wavelet coefficients perform poorly on animal sounds.

The author introduced a set of novel time-based audio features. They follow an intuitive way to describe the characteristic shape of a waveform. Despite their simplicity, they perform comparably to much more complex features such as MFCC and LPC.
The investigation shows that a combination of state-of-the-art features with the feature set introduced by the author makes it possible to correctly classify more than 90% of the animal sounds contained in the database.

The author employed three popular classifiers in the experiments. The SVM and the K-NN classifier perform equally well. Both achieve high precision and recall and even yield consistent results for poorly performing features. LVQ is more sensitive to the feature data than the other classifiers in the test. LVQ yields satisfactory results for well-discriminating features, while its results on weak features are poor.

The results of the investigation are promising for future research in this area. Frame-based classification may further improve on the results of file-based classification. For this purpose, context-sensitive classifiers such as Hidden Markov Models and Artificial Neural Networks will be employed. Further work will include the comparison of the features discussed in this thesis with MPEG-7 features in the domain of environmental and animal sounds.

Another future goal is the distinction of different sounds from the same species ("understanding animals"). Such a technique may be useful in analyzing animal behavior. It may also improve humans' understanding of their animals. Besides, a focus will be the design of new audio features for environmental sounds. An interesting area is mobile information systems such as life logging and supportive systems for handicapped people.

Appendix

In the appendix the author provides the source code of the features used for animal sound retrieval.

A Implementation

This section contains Java implementations of features employed in the investigations. In Section A.1 the Java source code of the Amplitude Descriptor introduced in [38] is presented. In Sections A.2 and A.3 the author provides source code for Short-Time Energy and Zero Crossing Rate.

A.1 Amplitude Descriptor - LoHAS, LoLAS, AHA

package org.vizir.audio.feature; /** * * Implementations of features LoHAS (Length of High Amplitude Sequence), * LoLAS (Length of Low Amplitude Sequence) and AHA (Area of High Amplitude). * The features were introduced in: * Discrimination and Retrieval of Animal Sounds, * Vienna University of Technology * TR-188-2-2005-05 * Mitrovic, D. and Zeppelzauer, M. * 2005.
* * After construction of the class, the get-functions may be used to retrieve * statistical properties such as mean, variance, and median of LoHAS, LoLAS and AHA * * (c) by Dalibor Mitrovic and Matthias Zeppelzauer */ import org.vizir.util.*; import java.util.ArrayList; public class AmplitudeDescriptor { private float[] mSignal = null; private float[] mLoHAS = null; private float[] mLoLAS = null; private float mAHA = 0.0f; 64 /** * Constructs a new Amplitude Descriptor and computes LoHAS, LoLAS and AHA * * @param signal the input signal */ public AmplitudeDescriptor(float[] signal) { this.mSignal = signal; mLoHAS = new float[3]; mLoLAS = new float[3]; //calculate absolute values of signal for (int i=0; i= treshold && new_LoHAS) { counter_HAS = counter_HAS + 1; //continue HAS accumulator_AHA = accumulator_AHA + (this.mSignal[i]-treshold); //increase AHA } else if (this.mSignal[i] >= treshold && !new_LoHAS) { // new HAS new_LoHAS = true; counter_HAS = 1; // end LAS new_LoLAS = false; list_LoLAS.add(new Integer(counter_LAS)); // init AHA accumulator_AHA = this.mSignal[i]-treshold; 65 } else if (this.mSignal[i] < treshold && new_LoLAS) { // continue with LAS counter_LAS = counter_LAS+1; } else if (this.mSignal[i] < treshold && !new_LoLAS) { if (new_LoHAS) { // end HAS list_LoHAS.add(new Integer(counter_HAS)); new_LoHAS = false; // end AHA list_AHA.add(new Float(accumulator_AHA)); } // new LAS new_LoLAS = true; counter_LAS = 1; } } //copy ArrayLists to float arrays: float[] array_LoHAS = convertIntegerListToFloatArray(list_LoHAS); float[] array_LoLAS = convertIntegerListToFloatArray(list_LoLAS); float[] array_AHA = convertFloatListToFloatArray(list_AHA); //calculate statistical properties from the float arrays: this.mLoHAS[0] = Statistics.mean(array_LoHAS); this.mLoHAS[1] = Statistics.variance(array_LoHAS); this.mLoHAS[2] = Statistics.median(array_LoHAS); this.mLoLAS[0] = Statistics.mean(array_LoLAS); this.mLoLAS[1] = Statistics.variance(array_LoLAS); this.mLoLAS[2] = Statistics.median(array_LoLAS); this.mAHA = Statistics.mean(array_AHA); } private float[] convertFloatListToFloatArray(ArrayList list) { float[] array = new float[list.size()]; for (int j=0; j < list.size(); j++) { array[j] = ((Float)list.get(j)).floatValue(); } return array; } private float[] convertIntegerListToFloatArray(ArrayList list) { 66 float[] array = new float[list.size()]; for (int j=0; j < list.size(); j++) { array[j] = ((Integer)list.get(j)).floatValue(); } return array; } /** * getLoHAS returns the statistical properties of LoHAS. * * @return an array with the mean (position [0]), * variance (position [1]) and median (position [2]) of LoHAS */ public float[] getLoHAS() { return mLoHAS; } /** * getLoLAS returns the statistical properties of LoLAS. * * @return @return an array with the mean (position [0]), * variance (position [1]) and median (position [2]) of LoLAS */ public float[] getLoLAS() { return mLoLAS; } /** * getLoHAS returns the mean of AHA. 
* * @return the mean of AHA */ public float getAHA() { return mAHA; } } A.1.1 Statistical Utility Functions package org.vizir.util; /** * * Utility class to calculate mean, variance and median of an array of float values */ public class Statistics { 67 /** * Compute the mean of the input array * @param values an array of float values * @return the mean of the input values */ public static float mean(float[] values) { float sum = 0.0f; for (int i=0; i 0) return sum/values.length; else return 0; } /** * Compute the variance of the input array * @param values an array of float values * @return the variance of the input values */ public static float variance(float[] values) { float meanValue = mean(values); float[] helper = new float[values.length]; for (int i=0; i 0) { int halfLen = (int)(sortedValues.length/2); if (sortedValues.length % 2 == 0) { // even length med = (float)(0.5 * (sortedValues[halfLen-1] + 68 sortedValues[halfLen])); } else { //odd length med = sortedValues[(int)((sortedValues.length+1)/2-1)]; } } return med; } /** * Simple sort algorithm in O(N^2) * @param an array of float values (unsorted) * @return the sorted input array array */ public static float[] sort(float[] values) { float helper = 0.0f; for (int i=0; i values[j+1]) { helper = values[j]; values[j] = values[j+1]; values[j+1] = helper; } } } return values; } } A.2 Short-Time Energy package org.vizir.audio.feature; /** * * Calculates the short-time energy of a framed audio signal */ public class ShortTimeEnergy { /** * getShortTimeEnergy returns the a float array containing * the shorttime-energy for each frame * * @param signal the input signal * @param samplingRate the samplingrate of the input signal * @param frameSize the desired framesize in ms (milliseconds) * @return the short time energy per frame 69 */ public static float[] getShortTimeEnergy(float[] signal, float samplingRate, float frameSize) { int samplesPerFrame = (int) Math.floor(samplingRate / 1000.0 * frameSize); int numOfFrames = signal.length / samplesPerFrame; float[] ste = new float[numOfFrames]; for(int j = 0; j < numOfFrames; j++) { for(int i = 0; i < samplesPerFrame; i++) { try { ste[j] += (Math.pow((signal[j * samplesPerFrame + i]), 2) / samplesPerFrame); } catch (ArrayIndexOutOfBoundsException ex) { ste[j] = -1; } } } return ste; } } A.3 Zero Crossing Rate /** * * Calculates the number of zero crossings in an audio signal */ public class ZeroCrossings { /** * getZeroCrossings calculates the zero crossings per second * of the input signal. This is a measure for the * fundamental frequency * @param signal the input signal * @param samplingFrequ the sampling frequency of the input signal * @return the number of zero crossings per second */ public static float getZeroCrossings(float[] signal, float samplingFrequ) { int numOfZeroCrossings = 0; int len = 0, idx = 0; float a = 0, b = 0; float factor = 0; factor = samplingFrequ / (float) signal.length; for(int i = 0; i < (signal.length - 1); i++) { 70 idx = i + 1; a = Math.signum(signal[i]); b = Math.signum(signal[idx]); if ( a != b) numOfZeroCrossings += 1; } return numOfZeroCrossings * factor; } } 71 References [1] K. Aizawa. Digitizing personal experiences: Capture and retrieval of life log. In Proceedings of the 11th International Multimedia Modelling Conference, pages 10–15, January 2005. [2] M. Aizerman, E. Braverman, and Rozonoer L. Theoretical founda- tions of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964. [3] C. Borgelt. 