Evaluation of a Hybrid Sound Model for 3D Audio Games with Real Walking

MASTER'S THESIS
submitted in partial fulfillment of the requirements for the degree of
Diplom-Ingenieur
in
Media Informatics
by
Michael Urbanek
Registration Number 1328186
to the Faculty of Informatics
at the Vienna University of Technology
Advisor: Privatdoz. Mag.rer.nat. Dr.techn. Hannes Kaufmann
Assistance: Projektass. Iana Podkosova, BSc MSc
Vienna, 25.08.2015
(Signature of Author) (Signature of Advisor)

Declaration of Authorship
Michael Urbanek
Neilreichgasse 85-89/2/2, 1100 Wien
I hereby declare that I have written this thesis independently, that I have fully cited all sources and aids used, and that I have marked all passages of the work - including tables, maps and figures - that are taken from other works or from the Internet, either verbatim or in substance, as borrowed by indicating the source.
(Place, Date) (Signature of Author)

Acknowledgements
I would like to thank everyone who voluntarily participated in the user study, and my advisors for supporting me throughout the whole thesis (and even before).

Abstract
Audio games are a subcategory of computer games in which spatialized audio is the only output that players receive. In order to provide a compelling impression of the environment, sound in audio games has to be of superior quality in terms of immersion and realism. To improve immersion and realism in audio games, complex sound models can be used to generate realistic sound effects, including reflections and reverberation. An implementation of a hybrid sound model similar to the ODEON approach is introduced and adapted for real-time sound calculations. This model is evaluated in a user study in a virtual reality environment and compared to a baseline model typically used in audio games. The results show that the implemented hybrid model allows players to adjust to the game faster and provides them with more support in avoiding virtual obstacles in simple room geometries than the baseline model. Complex sound models are beneficial in audio games, provided that enough computational resources are available to perform the calculations in real time.

Kurzfassung
Audio games are a subcategory of computer games that produce exclusively auditory output and thus dispense with visual output entirely.
Since sound is the only output that players receive, it has to be of outstanding quality in terms of immersion and realism. To achieve this quality, complex sound models that generate reflections and reverberation can be used. In this thesis, a hybrid sound model similar to the ODEON model is presented and adapted for real-time calculations in the software environment used. The implemented model is tested against a model commonly used in audio games (baseline model) in a virtual reality environment with voluntary participants. The results show that, compared to the existing approach, the implemented hybrid sound model supports players and allows them to avoid obstacles in simple virtual rooms. Provided that sufficient computing power is available, audio games should implement complex sound models such as the one presented in this thesis.

Contents
1 Introduction
1.1 Motivation
1.2 Problem statement
1.3 Aim of the work
1.4 Methodological approach
1.5 Structure of the work
2 State of the art
2.1 Sound
2.1.1 Fundamentals
2.1.2 Perception
2.1.3 Room Acoustics
2.1.4 Obstruction
2.1.5 Sound Simulation
2.1.5.1 Head-Related Transfer Functions
2.1.5.2 Sound Model Categories
2.1.6 Hybrid Model
2.1.6.1 Image Source Model
2.1.6.2 Secondary Sources
2.1.7 Other Sound Models
2.2 Audio Games
2.2.1 Definition
2.2.2 Fundamentals
2.2.2.1 Sonification
2.2.2.2 Interactions
2.2.2.3 Imagination
2.2.2.4 Game Genres & Sound Types
2.2.2.5 Environments
2.2.3 Games in Scientific Context
2.3 Software Tools
2.3.1 Unity3D
2.3.2 Wwise
3 Design
3.1 Test Environment
3.2 Requirements
3.2.1 Real-Time
3.2.2 Sound Spatialization
3.3 Sound Plug-ins
3.3.1 Plug-in Test Prototype
3.3.2 Results
3.3.3 Game Prototypes
3.4 Sound Models
3.4.1 Baseline Model
3.4.2 Adapted Hybrid Model
3.4.2.1 General Adaptions
3.4.2.2 Adaptions of the Secondary Sources Algorithm
3.4.2.3 Parameters
3.4.2.4 Testing Prototype
4 Implementation
4.1 Plug-in Integration
4.1.1 Integration of Wwise into Unity3D
4.1.2 Integration of the AstoundSound plug-in into Wwise
4.1.3 Calling Wwise Events in Unity3D
4.2 Sound Model Implementation
4.2.1 Code Design
4.2.2 Image Source Method
4.2.3 Modified Secondary Sources Approach
4.2.4 Spatialization
4.2.5 Obstruction
4.3 Parameters
4.4 Performance
4.4.1 Frames Per Second (FPS)
4.4.2 Sound Objects and CPU load
4.4.3 Discussion
4.4.4 Parameters for User Study
5 Evaluation
5.1 Hypotheses
5.2 Game Prototype
5.2.1 Levels
5.3 Study Design
5.3.1 Hardware Setup
5.3.2 Participants and Procedure
5.3.3 Measurements
5.3.3.1 Subjective Measurements: Questionnaires
5.3.3.2 Objective Measurements: Logging
5.4 Results
5.4.1 Difficulty
5.4.2 Realism and Immersion
5.4.3 Observations
6 Conclusion
Bibliography
A Questionnaires
A.1 Pre-Test-Questionnaire
A.2 Each-Test-Questionnaire Simple Room
A.3 Each-Test-Questionnaire Complex Room
A.4 Post-Test-Questionnaire

CHAPTER 1
Introduction

Computer games may be used for entertainment or educational purposes, although the industry focuses mainly on the former. The increasing power and performance of modern computers supports the development of more realistic computer games. Graphics processing units offer the performance needed for realistic rendering and visual output. Most computer games use visual output with additional audio support to give the player an immersive and realistic experience. However, there are people who will not or cannot rely on visuals, e.g. visually impaired people. Audio games can replace common types of computer games for these groups of users.

Audio games are a subcategory of computer games which focus on another human sense: hearing. Since sound is the only output and feedback that players get in audio games, it has to be of superior quality to provide a convincing experience. The sound component in these games serves as a replacement for the visuals of conventional computer games. To sound realistic, the audio in such games has to be spatialized in 3D. In this context, spatialization means that the sound a listener hears is calculated according to the position of the sound source in virtual (3D) space. Without spatialization, it is not possible to locate a sound source.

Sound propagation models that are used for room acoustics are able to produce geometry-dependent reverberation effects. Audio games using such models could provide more realistic auditory environments for players. Existing audio games are usually played in a desktop environment. Navigation, e.g. changing the position of the player's viewport, is done with input devices such as a mouse, keyboard or other controllers [28]. However, natural sound localization is based on binaural hearing, e.g. rotating the head to determine the sound direction, which is not possible when using the above-mentioned devices. A user's head position and rotation can be tracked in real time with dedicated tracking equipment and used for sound spatialization. If head tracking is available in a large enough physical space, navigation by real walking becomes possible. The possibility to localize a sound source by rotating one's head while walking in a virtual environment, just like in the real world, provides players with a much higher degree of immersion compared to a desktop environment [50].

1.1 Motivation

The motivation of this thesis is to enhance realism and immersion in audio games by using compelling and realistic sound provided by a complex sound simulation model in combination with natural navigation by real walking. No published research investigates such a combination. The real-time sound simulation model developed as an outcome of this thesis can be used by audio game designers and developers to enhance the quality of sound in their products.
1.2 Problem statement

The sound models used in existing audio games [28, 29, 45, 55, 58, 62] do not take the structure of the Virtual Environment (VE) into account. They provide spatialized sound, but do not simulate more complex sound phenomena like reflections or reverberation that can be generated by more sophisticated sound models. The simulation of these effects is included in an audio game in this thesis.

In a game environment, sound calculations need to be done in real time depending on a user's position. The spatialization of sound relative to a user's position can be done with existing technologies, e.g. an audio middleware solution. However, the real-time calculation of a sound model is a difficult task, and a suitable sound model needs to be chosen to work in an interactive environment in real time. To summarize, this thesis investigates how complex effects like sound reflections and reverberation can be included in a sound model used in an audio game and calculated in real time together with spatialized sound.

1.3 Aim of the work

The aim of this thesis is to implement a sound propagation model with real-time sound spatialization, reflections and reverberation and to integrate it into an audio game prototype with navigation by real walking.

1.4 Methodological approach

In this thesis, an already existing Virtual Reality (VR) game environment is used. In this setup, a game is implemented in the Unity3D game engine and runs on a laptop worn by the user. An Oculus Rift DK2 is used for rendering (or for blindfolding users in case of an audio game without visuals) and an optical tracking system for the calculation of the user's position. The audio middleware solution Wwise is chosen as the basis for all sound calculations. The thesis contains a comparison of sound spatialization plug-ins, with the AstoundSound plug-in being chosen for the final game prototype.

An investigation of existing sound models in the domain of room acoustics shows that a hybrid sound model can be adapted and used for real-time application in an audio game. With this sound model, it is possible to simulate sound reflections and reverberation. A baseline model, which only spatializes sound without any further effects, is also implemented as a means of comparison.

The evaluation of the implemented sound model is done in an extensive user study with 37 participants. The implemented hybrid model is compared to the baseline model in a between-subject user test. Subjective measurements (questionnaires) and objective measurements (logging) are used for the evaluation.

1.5 Structure of the work

The state of the art in the areas of audio games and sound models is described in Chapter 2. Chapter 3 introduces the main design principles of the presented work; the test environment, the requirements, the plug-ins and the sound models are discussed there. In Chapter 4, the integration of a hybrid sound model into the Unity3D game engine is described. In addition, the integration of the used audio middleware solution Wwise and of the spatialization plug-in AstoundSound is summarized. The evaluation of the implemented hybrid model is presented and discussed in Chapter 5, covering the proposed hypotheses, the game prototype, the study design and the results. Chapter 6 concludes the thesis. The questionnaires used in the user study can be found in Appendix A.
CHAPTER 2
State of the art

2.1 Sound

This chapter introduces the fundamentals of sound formation and sound-related phenomena and gives an insight into how humans perceive sound. Sound simulation methods and their suitability for audio games are discussed.

2.1.1 Fundamentals

Sound is a disturbance in the form of waves in a medium, caused by the vibration of a sound source [1]. A medium is necessary for the transportation of sound, such as air (usually), water (underwater sound) or earth (e.g. bass vibrations felt on the ground). Vibration in this context means that the size of the sound source oscillates between two extents, one a bit larger than normal and one a bit smaller [42]. While changing its size, the sound source stimulates the medium (e.g. air) around it, which causes compression (when the sound source grows bigger) and rarefaction (when the sound source gets smaller) of the medium, thus producing sound waves. In the domain of audio, sound waves are longitudinal, which means that the displacement of the medium is in the same direction as the energy transport. The rate at which particles of a medium oscillate around a given point in space is defined as frequency, which is measured in Hertz (Hz), i.e. cycles per second [42]. The amount of compression and rarefaction of the transport medium defines the amplitude of the wave, which is perceived as loudness when it arrives at the receiver (e.g. an ear) [42].

The sound pressure is the effect of the energy of a sound-emitting source on its surroundings [42]. Sound pressure is measured in newtons per square meter (N/m²) [42]. In practical acoustics, however, a logarithmic scale is used for the Sound Pressure Level (SPL): the decibel. 0 dB is near the hearing threshold and 130 dB near the threshold of pain [56]. This decibel scale is based on sound energy [56]. Sound intensity is the product of sound pressure and sound particle velocity [47].

Sound waves can be described with the wave equation. A sound wave is a space- and time-dependent function of the sound pressure p, which is defined for one dimension as follows [30]. Let p be the sound pressure, c the speed of sound (in air), t the time and x the respective position; then the wave equation is

\[
\frac{\partial^2 p}{\partial x^2} - \frac{1}{c^2}\frac{\partial^2 p}{\partial t^2} = 0. \qquad (2.1)
\]

Without proof, a solution is

\[
p = p_0 \sin(\omega t \mp kx) \qquad (2.2)
\]

where p_0 is the sound pressure amplitude, ω the angular frequency of the wave and k = 2π/λ, with λ being the wavelength [30].

Sound pressure can be influenced by obstacles, e.g. objects or walls on the path of a wave, through reflection, scattering and diffraction [56]. If the surface is completely smooth, a plane sound wave is reflected specularly, according to Snell's law [56]. Figure 2.1 shows this reflection; the angle of incidence equals the angle of reflection. Scattering is the reflection of sound waves on rough surfaces [56]. A surface is rough if the irregularities on it are much larger than the wavelength of the sound wave. Diffraction, the bending of sound, appears at edges or corners of an obstacle [56]. Diffraction waves are produced at the edges of the obstacle. The intensity of such a diffraction wave depends on the size of the object and the wavelength.

Figure 2.1: The reflection of a sound wave on a smooth plane.

2.1.2 Perception

Humans use their two ears for sound perception. This is called binaural hearing, which is the basis of sound localization [25]. The human brain is able to locate an audio source based on intensity as well as time differences between the arrival of the sound at both ears [25].
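To make the time-difference cue concrete, a commonly used spherical-head approximation (Woodworth's formula; it is not taken from the cited sources, and the head radius of 8.75 cm is a typical assumed value) estimates the interaural time difference (ITD) for a source at azimuth angle θ:

\[
\mathrm{ITD}(\theta) \approx \frac{a}{c}\left(\theta + \sin\theta\right), \qquad
\mathrm{ITD}\!\left(\frac{\pi}{2}\right) \approx \frac{0.0875\,\mathrm{m}}{343\,\mathrm{m/s}}\left(\frac{\pi}{2}+1\right) \approx 0.66\,\mathrm{ms}.
\]

The maximum time difference between the two ears is thus well below a millisecond; intensity differences caused by head shadowing complement this cue, particularly at higher frequencies.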
With this information, it is possible to determine two variables in the context of hearing: the direction and the distance of an occurring sound [25]. However, if there are subsequent sound waves (e.g. in a room where sound reflections are present), the human perceives the location of the sound event at the point where the first sound wave hits the ear [25].

The human ear consists of several parts that can be separated into the outer ear, the middle ear and the inner ear [42]. The pinna of the outer ear consists of the visible skin and bone structure. This area is responsible for focusing the sound waves that go through the ear canal towards the ear drum. This outer area is different for every human being. Three bones connected as a lever are located in the middle ear. They connect the ear drum with another membrane that lies between the middle ear and the inner ear. This membrane stimulates a fluid in the inner ear. The stimulation of this fluid is finally transmitted by the auditory nerve to the brain and perceived as hearing. Sound therefore has to travel through the whole hearing system to be perceived. In addition, the shape of the ear helps to distinguish directions.

There is a difference between static and dynamic sound localization. The authors of [25] describe an example in which the head is approximated by a sphere. Without moving the head, a human would not notice the difference between audio sources placed in front or behind, or above or below, because the differences in intensity and time are equal. When the head is not moved, this is called the static localization process. If a person is not sure which direction a sound comes from, he or she moves the head slightly to determine the location from the resulting changes in intensity and time. This is called dynamic localization. Dynamic localization, i.e. moving the head for localization purposes, is essential and recommended for an immersive virtual environment in audio games that emulate spatial sound [11]. It is made possible by continuously tracking the head's position and movement and adapting the sounds in the virtual environment accordingly [14].

The human hearing system has another ability which enables listeners to focus on one audio source while suppressing other sources like noise or voices. This is called the cocktail party effect [37]. Therefore, several audio sources in audio games can emit sound waves simultaneously, while the player retains the ability to focus on the sources he or she finds most interesting.

2.1.3 Room Acoustics

The scientific domain describing the rules of sound propagation in rooms is room acoustics. One major element of sound in rooms is the existence of reverberation. Reverberation is the remainder of sound in a space after the original source has stopped emitting sound [59]. Reverberation in rooms is caused by multiple reflections at the room's walls of the sound originating at the sound source [56]. It can be split into three parts: direct sound, early reflections and late reverberation [56].

Direct sound is the first sound wave that reaches the listener without having been reflected. It therefore travels the shortest distance between the sound source and the listener compared to early reflections and late reverberation. In an infinitely open space, direct sound is the only part which is perceivable. Direct sound indicates the direction of the sound source.
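A small worked example (the distances are illustrative and not taken from the cited sources) makes this time structure concrete. For a listener 5 m from the source whose shortest reflected path is 12 m long, the arrival times are

\[
t_{\mathrm{direct}} = \frac{5\,\mathrm{m}}{343\,\mathrm{m/s}} \approx 14.6\,\mathrm{ms}, \qquad
t_{\mathrm{reflection}} = \frac{12\,\mathrm{m}}{343\,\mathrm{m/s}} \approx 35.0\,\mathrm{ms},
\]

so the first reflection arrives roughly 20 ms after the direct sound and, because of the longer path and wall absorption, at a lower level.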
The precedence effect, which is related to direct sound, is a binaural psychoacoustic effect which states that the listener perceives the sound direction according to the first wave front recognized by the hearing system [57]. For example, consider two sound sources placed in an open space, both emitting the same sound. The first sound source emits the sound continuously and without any restrictions. The second sound source emits the same sound delayed but louder. The direction of the first sound source is perceived as dominant. In a room with several reflections, the precedence effect helps to identify the original sound source position even though reflected sound waves arrive from all directions.

Early reflections are the first subsequent reflections at obstacles (or room boundaries). Like direct sound, early reflections provide information about the position and distance of the sound source. In addition, these reflections provide clues about the room geometry. Early reflections are important for the direct sound impression, which Vorländer describes as follows: "They enhance the loudness, support the intelligibility of speech, the clarity of music and the impression of the auditory source width." [56]

Late reverberation contributes most to the overall reverberation and is responsible for the envelopment of the listener. It provides information about the room size and reflectivity; however, it no longer provides information about the direction of the sound. The combination of direct sound, early reflections and late reverberation forms the overall perceived reverberation in a room. Figure 2.2 shows the energetic room response.

Figure 2.2: An example of the energetic room response. The strongest impulse is the direct sound (red), followed by early reflections (blue) and late reverberation (green).

2.1.4 Obstruction

When an obstacle is between a listener and a sound source, the listener perceives the sound as attenuated. The reason is that the direct sound, which takes the shortest path between the sound source and the listener, travels through the obstacle and is therefore perceived as muffled. High-frequency sounds are reflected or absorbed by obstacles, while low frequencies are transmitted [16]. In sound simulation, this effect can be achieved by applying a low-pass filter to the sound source. Figure 2.3 shows the direct sound traveling through two walls. S is the sound source position, R the position of the receiver. When the direct sound wave first hits a wall, it is attenuated. It then travels with less power towards the receiver position and goes through the second wall, where it is attenuated again.

Figure 2.3: An example of how obstruction affects the sound coming from the sound source. S is the sound source position, R the position of the receiver. The more dots the line has, the stronger the attenuation.

2.1.5 Sound Simulation

Computer sounds can be pre-recorded or artificially generated and can be modified and played back through software. Pre-recorded sounds are stored in sound files. These files can be accessed with sound software at any playback position. Effects can be applied that e.g. modify or filter sound frequencies, and several parameters can be controlled through sound software, e.g. playback position, volume (loudness) or pitch (playback speed). Sound simulation models can be used to produce reverberation artificially in real time, taking the room geometry into account.
For these computations, the listener's position and rotation must be known; otherwise the spatialization and reverberation cannot be calculated. Sound spatialization is the ability to play a sound as if it were positioned at a specific point in three-dimensional space [34].

2.1.5.1 Head-Related Transfer Functions

As described in Section 2.1.2, humans perceive sound through their ears. Incoming sound waves are focused at the pinna and transmitted through the ear canal to the ear drum (outer ear). Absorptions at the pinna, the head and the torso are different for every individual and affect the perceived sound. To simulate these absorption effects, a Head-Related Transfer Function (HRTF) can be used. HRTFs describe how the human ear receives a sound in space and are basically a model of the human hearing system and its auditory interpretation [48]. They describe the effect and absorption of the head, outer ear and torso. Vorländer offers a comprehensive definition of the HRTF: "It [HRTF] is defined by the sound pressure measured at the eardrum or at the ear canal entrance divided by the sound pressure measured with a microphone at the centre of the head but with the head absent." [56]

HRTFs can be measured by placing tiny probe microphones inside both ear canals of a person and then playing sound through a loudspeaker at different azimuth and elevation angles but the same fixed distance from the person's head [7]. This way, as the sound source is moved in space and emits sound from different positions, different HRTFs are measured and stored for later use, e.g. sound spatialization. The three parameters are shown in Figure 2.4. The stored HRTFs can be applied to a sound source signal through convolution or (inverse) Fast Fourier Transform (FFT) to simulate the absorption effects [24]. HRTFs are only required when a listener is wearing headphones. In sound simulation, standard HRTFs are used, e.g. the HRTF measurements of a KEMAR dummy-head microphone [13].

Figure 2.4: The three parameters when measuring HRTFs: two angles (azimuth and elevation) and one distance parameter.

2.1.5.2 Sound Model Categories

This section covers different types of sound models that may be used to calculate and simulate room acoustics. However, not every method is appropriate in every situation. Existing approaches can be grouped into the following categories:

• wave-based approaches,
• ray-based approaches,
• and statistical approaches.

Wave-based approaches

Solving the wave equation for a room is one method that can be applied in room acoustics to generate realistic but artificial reverberation. Approaches that simulate room acoustics by solving the wave equation are called wave-based approaches. In wave-based approaches, the geometrical structure of the virtual environment must be discretized [56]. This discretization process divides the structure into small elements to allow the creation of a linear system of equations. The spatial discretization is done by creating a discretized grid (mesh) of the surface or of points in space [56]. The chosen discretization must be sufficient, relative to the wavelengths involved, to allow interpolation [56]; at high frequencies (and therefore small wavelengths), a correspondingly fine discretization is required. There are typically two methods that can be used in wave-based approaches, the Boundary Element Method (BEM) and the Finite Element Method (FEM) [56]. In the BEM, only the boundaries of the room geometry need discretization, e.g.
the walls, the floor and the ceiling. In the FEM, on the other hand, the whole room is divided into elements for which the linear system must be solved.

Wave-based approaches are more accurate than ray-based approaches [49]. In wave-based approaches, diffraction can be modelled without applying additional methods [49]; this phenomenon is mainly relevant at low frequencies. This makes wave-based sound models more accurate than ray-based approaches.

Wave-based approaches are computationally intensive [49]. The computational workload depends on the size of the structure and the number of frequencies to calculate. They are therefore recommended mainly for a small range of frequencies (e.g. the calculation of low frequencies only) [49]. Vorländer gives a calculation time example for the BEM in [56]: with a computer that can solve 8000 nodes in 60 seconds and a desired frequency resolution of 1 kHz, the calculation time equals about f/60 hours, with f being the number of frequencies. For the FEM with mesh sizes in the range of 100,000 nodes, calculation times are on the order of 5 minutes per frequency [56]. Due to their high computational costs, wave-based approaches can hardly be used for audio games or games in general, since sound models must do their calculations in real time when applied in interactive software. Wave-based approaches might see wider use with further improvements in processor power; however, recently published literature shows that it is already possible to use a wave-based approach in games [27].

Ray-based approaches

Ray-based approaches treat sound waves as rays, similar to rays of light. Rays can be reflected, refracted or diffracted [56]. There are two possible ways to construct rays in such approaches [56]: one way is forward geometric construction, which follows the ray from the sound source to the listener; the other is to backtrace the ray from the listener to the source. Both are equivalent due to the law of reciprocity [56]. Rays in these approaches carry sound energy that can be reduced by reflections or absorption effects; when a sound ray hits a volume or the listener in a scene, the energy at this point is known.

Two methods are introduced briefly in this section: the image source method and stochastic ray tracing. The image source method is a ray-based approach that is discussed extensively later in this thesis. In this method, the sound source is mirrored at the boundaries of the room, generating image sources. These image sources are mirrored again to create image sources of higher order. Further explanations can be found in Section 2.1.6.1. In stochastic ray tracing, sound is simulated as a large number of particles that follow rays' paths [56]. These particles are reflected in the same way rays are and lose energy due to reflections or absorption. At the receiver's position, the incoming particles are counted over time to provide the listener with reverberation.

Ray-based approaches are not as accurate as wave-based approaches; however, the calculation time for more complex structures is smaller in ray-based approaches than in wave-based ones [49]. The image source method, for example, can find all specular reflection paths, but at higher reflection orders these computations can be very costly [49]. The computation time of the image source method depends on the number of boundaries at which the sound source is mirrored and on the maximum image source order.
Let n be the number of room boundaries (walls + floor + ceiling) and x the maximum image source order; then the number of generated image sources i can be calculated by

\[
i = n^x. \qquad (2.3)
\]

The number of image sources generated at higher orders cannot be handled in real time in an interactive application. For this purpose, hybrid models (see Section 2.1.6) are introduced that approximate the results at lower computational costs. This limitation only applies to the image source method, however; the computation costs of stochastic ray tracing are smaller. Ray-based approaches are appropriate for audio games due to their better computational performance compared to wave-based approaches.

Statistical approaches

Statistical Energy Analysis (SEA) is a framework for the analysis of systems in which the energy is of interest [23]. It can be used for the prediction of vibration, and therefore sound, in acoustic systems, e.g. sound transmission through a wall. Statistical approaches can therefore be used in the domain of coupled systems for the prediction of noise levels, where transmission through structures plays a major role [30]. However, these applications are not relevant for audio games, and such approaches are therefore not used here.

2.1.6 Hybrid Model

A hybrid sound propagation model combines two methods in order to negate the disadvantage of one method with the advantage of another. In this thesis, the hybrid model combines the image source method (a ray-based approach) with secondary sources. The combination of these methods allows a realistic result [38]. An approach similar to the model that ODEON uses is adopted in this thesis and discussed in the next sections. ODEON is professional room acoustics software [33]. It uses secondary sources in combination with stochastic scattering to calculate late reverberation.

2.1.6.1 Image Source Model

The image source model calculates sound reflections under the assumption that the walls of the room are smooth and therefore produce specularly reflected sound waves. The energetic room response can be constructed with this model. Sound calculation in this model works as follows. To start the calculations, the position of the sound source and the structure of the room must be known. The position of the sound source is a vector in 3D space (x, y, z). First, the sound source is mirrored at every wall plane in the room. This procedure is done recursively, creating image sources of higher order, excluding repeated mirroring at the same wall (generating all permutations without having the same wall twice consecutively).

Figure 2.5 shows this mirroring process in a very simple room. The room consists of 4 walls (numbered from 0 to 3), while S represents the sound source and R the receiver. In this mirroring process, the mirrored sound source positions are of relevance. Starting clockwise, the sound source is first mirrored at wall 0, then at walls 1, 2 and finally 3. The generated mirrored sound sources are called image sources. These are first order image sources, illustrated in the rooms with green background. After the first order calculation, the mirroring process starts again, mirroring image source S0 at wall 0 (top, not shown), then S0 at wall 1, which generates image source S1,0, S0 at wall 2 (not allowed) and S0 at wall 3, which generates S3,0. This process may be repeated until a satisfying image source order is reached, e.g. two [38].

Figure 2.5: Calculation of image sources. Sx: image sources of first order; Sx,y: image sources of second order.
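The recursive mirroring described above can be sketched in a few lines of C#. The following is a minimal illustration, not the implementation presented in Chapter 4; the Wall type, the plane representation (a point on the plane plus its unit normal) and all names are assumptions, and the audibility test discussed below is omitted.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Minimal sketch of recursive image-source generation (illustrative names, not the thesis code).
public struct Wall
{
    public Vector3 Point;   // any point on the wall plane
    public Vector3 Normal;  // unit normal of the wall plane
    public int Index;       // wall index, used to avoid mirroring at the same wall twice in a row
}

public class ImageSourceGenerator
{
    // Mirrors a source position S at a wall plane: S_n = S - 2 * d * n, with d = (S - p) . n
    public static Vector3 Mirror(Vector3 source, Wall wall)
    {
        float d = Vector3.Dot(source - wall.Point, wall.Normal);
        return source - 2f * d * wall.Normal;
    }

    // Recursively creates image sources up to maxOrder, skipping consecutive mirroring at the same wall.
    public static List<Vector3> Generate(Vector3 source, IList<Wall> walls, int maxOrder)
    {
        var result = new List<Vector3>();
        Recurse(source, walls, maxOrder, -1, 1, result);
        return result;
    }

    static void Recurse(Vector3 parent, IList<Wall> walls, int maxOrder, int lastWall, int order,
                        List<Vector3> result)
    {
        if (order > maxOrder) return;
        foreach (var wall in walls)
        {
            if (wall.Index == lastWall) continue;   // mirroring at the same wall consecutively is excluded
            Vector3 image = Mirror(parent, wall);   // image source of the current order
            result.Add(image);
            Recurse(image, walls, maxOrder, wall.Index, order + 1, result);
        }
    }
}
```

For the four-wall room of Figure 2.5 and a maximum order of two, Generate would return 4 first order and 12 second order image source positions (16 in total), matching Equation (2.3).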
The higher the order, the more reflections are generated, producing more reverberation. However, the calculation time increases with every wall and must be limited by a maximum order (see the discussion of the computational costs of the image source method above). The second order reflections are highlighted in the purple areas. The positions of image sources of higher order can be calculated with the following formula. Let S be the original sound source position, S_n the position of the (mirrored) image source, n the normal vector of the respective wall (surface) and r the vector between the foot point of the wall normal and the sound source S; then

\[
\vec{S}_n = \vec{S} - 2d\vec{n} \qquad (2.4)
\]

with

\[
d = \vec{r} \cdot \vec{n} = |\vec{r}|\cos(\alpha) \qquad (2.5)
\]

gives the positions of image sources at higher orders [56]. To calculate e.g. order two, the positions of the first order reflections are taken as S.

Not all of the image sources calculated in this step are used further. An audibility test is performed to find the sources which are valid. For this calculation, the position of the receiver R is needed. The audibility test is a backward check; its starting point is the receiver's position R. The unique mirroring history of an image source, indicated by its indices (e.g. S0,1), defines the sequence of walls at which the image source was mirrored. These indices are used in the backward check. An audibility test is performed for every image source. A ray is shot from the receiver position towards an image source. If the image source has been produced by mirroring at the same wall that the ray intersects first, this image source is considered valid, and invalid otherwise. Valid image sources will emit sound and contribute to early reflections.

The audibility test is illustrated in detail in Figure 2.6. Here, R marks the receiver position and S is the original sound source. Two walls of the virtual environment are marked with 0 and 1. S0 and S1 are image sources of the first order; S0,1 and S1,0 are image sources of the second order. For the image sources of the second order, the last index indicates the wall at which the image source was produced. In the given example, the image source S1,0 is examined for validity. A ray is shot from the receiver position R towards S1,0. This ray first intersects wall 1. S1,0 was produced by mirroring against wall 0, as indicated by its last index. This last index does not coincide with the index of the wall that the ray intersects first. Therefore, the image source S1,0 is not valid. In comparison, S0,1 is a valid image source. Here, a new ray is started at the point of intersection with wall 1 to examine whether S0, which is the predecessor of S0,1, is also valid. The principle is the same as described above, and therefore S0 is also a valid image source. If there is no other wall between the last intersection point and the original sound source, the full path (R → S0,1 → S0 → S) is reconstructed and valid (blue line). This check has to be done for every image source. In the example of Figure 2.6, the path R → S0 → S is also possible and valid. The path R → S corresponds to direct sound. A mother image source is the predecessor or one of the predecessors of an image source. If such a source is invalid, then all its successors are also invalid and a valid path cannot be reconstructed.

Figure 2.6: An illustration of the audibility test. S is the position of the sound source, R the position of the receiver. The walls are numbered 0 and 1. First order image sources are S0 and S1, second order image sources are S0,1 and S1,0. S1,0 is the only image source that is invalid in this example.

Mechel published a paper in 2002 called Improved Mirror Source Method in Room Acoustics which lists 8 interrupt criteria for the calculation of image sources [26] that reduce the computation time of this method.
Mechel defines the field angle as one criterion that has to be fulfilled for each calculated image source. The field angle spans an area, with the image source position as its origin, in which the receiver must lie. This area is limited by the borders of the wall at which the image source was mirrored. If the receiver is not in this area, the image source is considered invalid, and neither it nor its successors (children) need to be calculated. Figure 2.7 shows a room where the sound source is mirrored at wall 5 and then at wall 0. A cone which goes through the corners of the last mirrored wall (in the case of S5,0, wall 0) spans the area in which the receiver has to be. If this criterion is met, the image source is considered valid. In the figure, three receiver positions are shown, but only R1 is valid: R2 is outside of the cone (gray area) and R3 is not in line of sight (audibility test) of S5,0 and therefore invalid. If this criterion is applied already at lower reflection orders, computation time can be saved.

Figure 2.7: The field angle for faster computation. The sound source is mirrored first at wall 5 and then at wall 0. The cone describes the area in which the receiver has to be for the image source to be recognized as valid. If the receiver is in the green area, the image source passes this test.

After the audibility check and the interrupt criteria are applied, the valid image sources are identified. At the image source positions, sound sources can be placed that emit sound. Without an appropriate delay, all these sound sources would play the same sound at the same playback position. A sound delay must be added to generate the reverberation effect. The sound delay is the distance between the image source and the receiver's position divided by the speed of sound [56]; the larger this distance, the larger the sound delay.

The above-mentioned process of image source creation, audibility check and interrupt criteria is valid for fixed positions and rotations of the receiver, the sound source and the underlying virtual structure. However, if the receiver, the sound source or the geometrical structure changes, several actions must be performed [46]. When the room geometry changes or the sound source moves, the whole model needs to be recalculated. This means that the locations of the image sources must be recalculated and the audibility check with the interrupt criteria must be applied again. If the sound source or the listener rotates, only the orientations for the sound spatialization of the listener need to be updated. However, if the listener moves, another audibility check must be performed.

2.1.6.2 Secondary Sources

In this hybrid approach, the image sources are calculated as described above. At a predefined point in the image source creation order, the secondary source algorithm takes over the calculation. This order is called the transition order. When the transition order is reached, no further image sources are created. Instead, a ray is shot from the last created image source towards the position where the image source of the next order would lie; at the point where this ray hits a wall, a secondary source is created.
Then, rays are emitted omnidirectionally from this secondary source to generate further secondary sources at every new collision (reflection) point. These rays are reflected at the room boundaries and generate other rays. Every ray is reflected according to Snell's law with direction v. In addition, a random vector r is calculated to simulate the surface's roughness, and the direction of the reflected ray is modified by r [38]. The scattering coefficient defines how much of r is added to v.

If N rays are generated, every ray starts with an N-th part of the energy. The energy at a point can be described by the following formula. Let j be the order of the secondary source, E_j the energy at the j-th secondary source, E_s the energy of the original sound source, N the number of rays and α_i the absorption coefficient of the respective wall; then

\[
E_j = \frac{E_s}{N}\prod_{i=1}^{j}(1-\alpha_i) \qquad (2.6)
\]

is the amount of energy at the j-th secondary source. Ideally, these secondary sources should emit a very large number of rays that create new secondary sources at the room's boundaries, which again shoot a large number of rays into the room, and so on. However, Naylor postulated that the number of rays would grow very large very quickly, resulting in too much computational load [32]. He states that using one ray that creates secondary sources at its reflection points is enough [32].

Figure 2.8 shows the procedure of the secondary source algorithm. S is the position of the sound source, R the position of the receiver. S0 is the first order image source, S0,2 the second order image source. The blue line is the reflection path generated through the image sources. The direction at the end of the blue line is given by the next image source position. At this point, however, the transition order is reached and the secondary source algorithm starts creating secondary sources. Every secondary source is marked as a green dot. The directions of the secondary rays depend on the incident angles. The sound delay at the receiver depends on the total path length of the ray plus the distance between the secondary source and the listener [32]. A secondary source is only valid if its position is visible and not occluded from the position of the listener; otherwise it does not emit any sound.

Figure 2.8: An example of how secondary sources are computed. The blue lines are the paths calculated by the image source method. Instead of creating the third order image source, a secondary source is created, which in turn creates additional secondary sources.

The intensity at the receiver's (listener's) position can be computed with the following formula. Let P be the power of the original sound source, N the number of rays, α_i the absorption coefficient of the respective wall, θ the angle between the surface normal and the direction to the receiver and r the distance between the secondary source and the receiver; then

\[
I = \frac{P}{N}\prod_{i=1}^{n}(1-\alpha_i)\,\frac{2\cos(\theta)}{2\pi r^2} \qquad (2.7)
\]

is the intensity I at the listener [32].

A hybrid model approach negates the high computational costs at higher order reflections by introducing secondary sources. The created image sources are used for early reflections; the secondary sources are placed on the walls for late reflections. The number of sound sources generated in the hybrid model is smaller than the number of higher-order image sources generated in a plain image source model; therefore, the hybrid model is faster to evaluate. This advantage makes the hybrid approach appropriate for audio games. The accuracy of the hybrid model is lower than that of a wave-based approach (see the discussion above); however, the advantage of faster computation outweighs the loss in accuracy. According to Rindel, 500 to 1000 rays produce a reliable result in an auditorium with the transition order set to two or three [38].
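Naylor's single-ray variant of the secondary source generation, together with the energy bookkeeping of Equation (2.6), can be sketched as follows in Unity/C#. This is a simplified illustration rather than the implementation from Chapter 4; the WallAcoustics component, all names and the termination criteria are assumptions.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical per-wall acoustic data, attached to each wall collider.
public class WallAcoustics : MonoBehaviour
{
    [Range(0f, 1f)] public float Absorption = 0.1f;   // alpha_i of this wall
    [Range(0f, 1f)] public float Scattering = 0.3f;   // how much the reflection direction is randomized
}

public static class SecondarySourceTracer
{
    public struct SecondarySource
    {
        public Vector3 Position;   // point on a wall where the ray hit
        public float Energy;       // E_j = (E_s / N) * prod(1 - alpha_i), cf. Equation (2.6)
        public float PathLength;   // total ray path length up to this point (used for the delay)
    }

    // Follows one ray from a start point (e.g. the wall hit at the transition order) and
    // creates a secondary source at every reflection point until the energy becomes negligible.
    public static List<SecondarySource> Trace(Vector3 origin, Vector3 direction, float sourceEnergy,
                                              int rayCount, int maxBounces, float minEnergy = 1e-4f)
    {
        var sources = new List<SecondarySource>();
        float energy = sourceEnergy / rayCount;    // each of the N rays starts with an N-th of the energy
        float pathLength = 0f;

        for (int bounce = 0; bounce < maxBounces && energy > minEnergy; bounce++)
        {
            if (!Physics.Raycast(origin, direction, out RaycastHit hit, 1000f))
                break;                                              // ray escaped the room geometry

            var wall = hit.collider.GetComponent<WallAcoustics>();
            float absorption = wall != null ? wall.Absorption : 0.1f;
            float scattering = wall != null ? wall.Scattering : 0.3f;

            pathLength += hit.distance;
            energy *= (1f - absorption);                            // multiply in (1 - alpha_i)

            sources.Add(new SecondarySource { Position = hit.point, Energy = energy, PathLength = pathLength });

            // Specular direction (Snell's law) blended with a random direction to model surface roughness.
            Vector3 specular = Vector3.Reflect(direction, hit.normal);
            Vector3 random = Random.onUnitSphere;
            if (Vector3.Dot(random, hit.normal) < 0f) random = -random;   // keep the random part above the surface
            direction = Vector3.Lerp(specular, random, scattering).normalized;
            origin = hit.point + hit.normal * 1e-3f;                // small offset to avoid re-hitting the same wall
        }
        return sources;
    }
}
```

The delay and intensity contribution of each secondary source at the listener would then be evaluated from its PathLength plus the distance to the listener, and from Equation (2.7), respectively.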
2.1.7 Other Sound Models

In addition to the ODEON-like approach described in the previous section, other sound models exist which take the geometry of the environment into account. RAMSETE [10] is one of them. It uses a pyramid tracer that has the advantage of covering the surface of a spherical source completely, avoiding multiple detections of the same image source and avoiding overlapping cones. The RAMSETE approach is a ray-based approach. RESound [52] is also a ray-based approach; it uses a combination of discrete ray tracing and frustum tracing for the calculation of the sound propagation paths. For late reflections, a statistical acoustic model is used. A GPU-based approach named iSound [31] uses a modified image source method and a multi-view ray casting algorithm that allows the parallel computation of image sources on the GPU. In addition, a scheme for reducing audio artifacts was developed. The iSound model is also a ray-based approach. Wave-based Acoustics for Virtual Environments (WAVE) [27] is a wave-based sound propagation model that uses a combination of precomputation and runtime calculations. In the preprocessing stage, the dynamic transfer operator of the scene is calculated using per-object and inter-object transfer functions. The duration of this stage depends on the number and detail of the objects. During runtime, the outputs of the player devices (controller and Head-Mounted Display (HMD)) are used as inputs for the WAVE sound system, which auralizes the scene in real time. The calculations are done on CPU and GPU. All the above-mentioned sound models could also be used in games and therefore in audio games. However, in this thesis a hybrid model approach similar to ODEON was chosen for testing sound models in audio games.

2.2 Audio Games

Audio games are described and discussed in this section. First a definition and then the fundamentals of audio games are discussed. Related work in the form of a history of audio games closes the section.

2.2.1 Definition

There is no existing scientifically reviewed definition of the term audio games. Wikipedia, as the only source that provides a definition, defines audio games as "electronic games played on a device such as a personal computer. It is similar to a video game save that there is audible and tactile feedback but not visual." [60] This can be used as a basis, since it states that the visual aspects are missing from audio games completely. There are several papers on this topic, e.g. [8, 9, 51], that do not define audio games but describe and use the term in their descriptions as a definition. The literature states that in audio games information is acquired through spatialized and iconic cues [8, 9], and that two types of audio games can be distinguished: those which use spoken descriptions of situations and those which use non-verbal audio cues [51]. To give a definition in a scientific context that includes the given references, this thesis defines audio games as:

Audio games are a subcategory of computer games played on electronic devices without visual output, having auditory or tactile output as feedback for the user. Auditory feedback is given in the form of spoken descriptions, non-verbal audio cues, spatialized cues or iconic cues.
The audio games described in this thesis do not focus primarily on text-to-speech games, although those are also games with auditory feedback.

2.2.2 Fundamentals

This section describes fundamentals of audio games for a further understanding of the creation process and the elements of an audio game. The description of sonification and interaction is followed by a discussion of the role of imagination in audio games.

2.2.2.1 Sonification

Due to the lack of visual feedback, interactions and objects in the virtual world of an audio game have to give auditory feedback to the user to signal their existence. Sonification is a technique used to convey information through non-verbal sound [18]. Examples of sonification in daily life are the beep of the stereo rangefinder in a car, the ticking of a Geiger counter or the beep of a fire detector. In the context of audio games, Röber and Masuch identified several information groups and summarized them in three questions [39]:

• Where is something?
• What is this?
• What can I do with it?

These questions do not only relate to things; they also relate to, and can be triggered by, interactions (see Section 2.2.2.2 for interaction types). The first question determines the place of an object in the auditory VE. Such objects help users with orientation and support them in building an image of the virtual world in their head; they even let them reproduce it in real life [8, 45]. In audio games, the sound component is the only source for orientation purposes. Without any audio source, the player is unable to locate, orientate and navigate himself or herself in a virtual world [39]. The second question (What is this?) asks for a description or an intuitive purpose of an object in the virtual environment. Without any response, the player is unable to identify objects, which makes them auditorily invisible. Röber and Masuch describe this phenomenon as inaudible and not perceivable [39]. As a result of that statement, they declare that every object in a virtual environment is interactable. They classify the interaction groups as obstacles, portals and interactables [39]. The third question, What can I do with it?, asks for the relationship between an object and the possible interactions a player can perform with it.

Obstacles are insuperable objects in the virtual environment. The authors describe such objects as elements which give shape to the environment, e.g. walls. Generally, interaction is not possible with those elements; the only possible interactions are collision and obstruction [39]. In an auditory environment, such elements also have to be detectable by the player. Therefore they must have sonification, like an increase in volume if the player comes closer [39] or even a sound if the player collides with the obstacle [9]. Those sounds could also be used to describe and represent the actual context (e.g. a wall made of stone could have a different sound than a wall made of wood) [41]. As an alternative to the sonification approach, haptic feedback is also a possibility to provide information about the virtual environment [40].

Portals are objects in an audio game that bring the player from point A to point B. This transport may or may not happen instantly. Examples of portals in games are doors, stairs and teleporters [39]. Feedback while passing through or walking on portals informs the player that they are being used.

Interactables are the third category described by Röber and Masuch.
These are objects in the virtual world that the player can interact with, such as a button, a phone or a machine. Users know the functionality of such objects, but the player must receive feedback about them as long as they are not yet active. When a game, for example, simulates a living room and the player's next mission goal is to phone Ms. X, the player must know where the phone is located in the room in order to make the call and proceed in the game. Even when the player knows the position of the phone, he or she needs feedback about whether the interaction with the object was successful or not. If the sound of an interactable is unclear, the authors [39] suggest describing it verbally at the first interaction. This should help the player recognize the object later. It is also possible to highlight interactables when the player focuses on them (e.g. with additional sounds as feedback or by lowering environmental sounds), to provide him or her with the necessary information about the possibility of interaction.

Figure 2.9 shows the discussed object types of an audio game. Green elements are obstacles that are insuperable and mostly define the borders of the scene. The blue elements are portals, e.g. the floor, the door and the stairs. The phone and the paper (red) on the table are interactables that the player can use. The next section describes the different interaction types and their challenges.

Figure 2.9: Examples of obstacles (green), portals (blue) and interactables (red) in an audio-only environment. Illustration taken from [39].

2.2.2.2 Interactions

Interaction is the relationship between the input of a player and the subsequent actions in the virtual environment triggered by it [2]. This section describes the different types of interaction in an audio game, along with some information on different input hardware devices and their usage. The interaction types in an audio game can be separated into navigation, interaction with objects and communication [39].

Navigation in audio games happens through an input device, which the player uses to move an avatar or a cursor and thus change its position. As stated in Section 2.2.2.1, the player must receive feedback according to the selected interaction type. With that information, the player knows whether the performed interaction was the intended one. If the audio game is played in front of a computer, Röber and Masuch suggest using navigation with a keyboard or joystick rather than with a mouse [39]. The reason is the difficulty of placing the virtual cursor (when movement is done by point and click) at the right position so that the character can move to it.¹ Another reason for this sub-optimality is the loss of concentration during the movement of this virtual cursor; in audio games, the player should devote his or her whole concentration to the game's audio sources [39]. The authors also recommend using a joystick with haptic feedback, which is easier and better suited than restrictive keyboard control [39].

The interaction with game objects is the second category. This interaction type depends on the interaction possibilities in the particular game. In AudioDoom, for instance, the player is able to interact with a door (which is a portal, see Section 2.2.2.1) to open it; the game therefore implements the interaction open a door [45]. One possible interaction in a classical audio-only Mastermind game is moving and placing the appropriate stone at the right position, which implements the interactions move and place stones [51].
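The kind of object feedback described above (a continuous sound that gets louder as the player approaches, plus a confirmation sound on interaction) can be prototyped in a few lines. The following sketch uses Unity's built-in AudioSource for brevity, whereas the prototype in this thesis routes audio through Wwise; all class and field names are illustrative assumptions.

```csharp
using UnityEngine;

// Minimal sketch of object sonification in Unity (illustrative, not the thesis prototype).
// The object hums continuously, gets louder as the player approaches, and plays a
// confirmation sound when the player interacts with it.
[RequireComponent(typeof(AudioSource))]
public class SonifiedInteractable : MonoBehaviour
{
    public Transform Player;            // the tracked listener / player
    public AudioClip InteractionSound;  // feedback clip confirming a successful interaction
    public float AudibleRange = 10f;    // distance at which the hum becomes silent

    AudioSource source;

    void Awake()
    {
        source = GetComponent<AudioSource>();
        source.loop = true;
        source.spatialBlend = 1f;       // fully 3D, so the engine spatializes the hum
        source.Play();
    }

    void Update()
    {
        // Louder when the player is closer (simple linear proximity sonification).
        float distance = Vector3.Distance(Player.position, transform.position);
        source.volume = Mathf.Clamp01(1f - distance / AudibleRange);
    }

    public void Interact()
    {
        // Called by the game logic when the player triggers the object, confirming the interaction.
        if (InteractionSound != null)
            source.PlayOneShot(InteractionSound);
    }
}
```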
As mentioned in Section 2.1.2, a realistic simulation of the virtual environment with the support of head tracking is a solid basis for a 3D environment. Two interaction techniques are described by the authors Röber and Masuch [39]. One is the auditory cursor that is an object in the 3D environment that describes — with speech, artificial or natural sounds — objects that intersect with it. With this metaphor, interaction is eased. The other technique is called radar that can be described as a cone of a flashlight. Every object that is hit by the light of the cone emits information about their position and functionality. The third interaction category is communication. This communication is done through speech. An interaction can be initialized by artificial characters or the player itself. In mul- tiplayer games, this interaction type can be used for chatting between players [39]. Bidirectional communication between the player and an artificial character may work on basis of predefined routines. 1An example of a point and click adventure is the game Day of the Tentacle. It was released in 1993. 22 2.2.2.3 Imagination The last element of audio games that every game has in common, is the visual imagination of the player. Liljedahl et al. introduced a term for this named Scary Shadow Syndrome [20]. They derive this concept from older horror films, where the budget as well as the advances in technology were limited. As example, they list the production of the film Jaws. Instead of filming a shark attack on a female swimmer, the director only filmed the reaction of the girl — due to technical reasons — which triggered the personal imagination [20]. Therefore they advice that a game designer should not tell or show everything to the player but let them construct their own interpretation and imagination of visual and auditive feedback. Another example which supports the advice of Liljedahl et al. is shown in the paper by Velleman et al. [55]. During their research in the development process of Drive, they asked themselves the question, which of the following examples make more fun: Steering but traveling with 5 mph or no steering but traveling with 700 mph? [55]. They concluded that it is the latter one because faster speed is more thrilling in the imagination of a player. 2.2.2.4 Game Genres & Sound Types The website audiogames.net shows a high variety of genres and types of audio games. The listed genres include action games, adult games, adventure games, arcade games, card games, compilations, educational games, first person shooters, gamebooks, interactive fiction, multi user dungeons / mmorpgs, puzzle games, racing games, role play games, side scroller, simulations, sports, strategy games, traditional games, trivia games and word games.2 This range of genres shows that the missing visual output which is typical in audio games gives no restrictions for developers to choose a genre. Another classification of audio games could be done by analyzing the used types of sounds, which are e.g. spoken texts or ambient sounds. Röber and Masuch did a classification on their so called auditory phenomena concerning au- dio games, where they decided to group different types of sounds into three categories: speech, music and natural or artificial sounds [39]. Speech is mainly used for uni- or bidirectional com- munication between elements in audio games in form of spoken words. The audio game Sleuth makes a heavy usage of speech for its interactive narrative [9]. 
There are different applications for using music as additional sound source in an audio game. Depending on the used type of mu- sic, it may influence the user’s mood and even the feeling during play [39]. A slow paced music can be used to communicate a chill-out ambiance to the player. It is also possible to use music for sonification of light intensity changes or to highlight someones attitude (e.g. usage of dark sounds for a virtual person in a game to state that this is one of the bad guys) [39]. The third cat- egory, natural or artificial sounds, has the biggest variety, postulated by Röber and Masuch [39]. The sound of flying birds or the splashing of water are examples for natural sounds. Artificial sounds therefore are e.g. signals that the player recognizes as a defined sound of something, like an item or an object (see Section 2.2.2.1 for examples of artificial agreed sounds). Friberg and Gärdenfors of Stockholm International Toy Research Center also did a categorization of sounds. This categorization was derived from their Tactile Interactive Multimedia project. They divided 2These genre types were taken from the audiogames.net website. This website has the author’s recommendation for people who are interested in audio games. 23 their findings into five categories: avatar sounds, object sounds, character sounds, ornamental sounds and instructions [12]. 2.2.2.5 Environments There are several environments in which audio games can be played. Those are listed as follows: • without sitting in front of a desktop computer e.g. lying on a couch [28], • in front of a desktop computer without headphones e.g. [45], • in front of a desktop computer with headphones e.g. [9], • and real walking in an open space e.g. [55]. The authors of [28] show that audio games may be played in a relaxed atmosphere, e.g. on a couch and provide an immersive feeling. Playing without headphones is not recommended [9], however, the authors of [45] showed that it is working with speakers though. Real walking in an open space supports the presence [50], therefore, the authors of [55] proofed that audio games are working in an open space environment. 2.2.3 Games in Scientific Context This section covers selected audio games developed in a scientific context published by members of the research community. However, this section does not cover every game or paper. Only audio games that show a new component compared to other papers are covered in this section. Most of the following games are designed for visual impaired users but are also playable and suitable for users without impairments [51,58]. At the end of the section, papers are listed which show the development of audio games in the scientific context. Lumbrears and Sánchez published a paper in 1999 called Virtual Environment Interaction through 3D Audio by Blind Children which was the first playable, in a scientific field built audio game in a 3D virtual environment [45]. In this paper, they describe and test their game prototype AudioDoom, which is an audio adaption of the classical Doom game with reduced functionality. In this context, the authors use hyperstories. Therefore, this virtual world has several environ- ments that are represented as node in the hyperstory context. These environments could involve dynamic objects or characters that react to the behavior of the player. However, these environ- ments are connected with each other through predefined links that are usually represented in the virtual world as doors or portals. 
In addition to that, they use narratives to generate a deeper immersion in the game. The players are able to steer their characters with a joystick while they receive auditive feedback through speakers. The authors test their hypothesis that such virtual environments (the audio game itself with all associated hyperstory components) can create mental images [45]. To verify this, they let blind people play the game several times. After the players have created their mental image, they are told to rebuild the played level with Lego blocks. Each block has a different meaning; for instance, a door in the virtual world is represented by a Lego window in the real world. The participants were able to rebuild the level with Lego blocks successfully, which confirmed the proposed hypothesis. Figure 2.10: GRAB haptic interface for in- and output. Picture taken from [62]. Drewes et al. published a paper in 2000 called Sleuth: An Audio Experience about their identically named game Sleuth [9]. Sleuth is an audio version of the classical game Clue in which the player takes the role of a detective who has to investigate a murder case and determine the killer, the weapon and the room by questioning people. The player can move the detective avatar through the different rooms of a house. Every room has its own sound sources. The player receives a detective's notebook, which contains the names of guests, weapons and rooms. The game thus provides the player with additional information on paper but does not provide any visual output. After gathering enough information, the player can make a guess and solve the murder case. If he or she is right, the game is won, otherwise he or she has to investigate further. The goal of the evaluation part was to examine the effectiveness of our [their] design decisions in a qualitative manner [9]. After the participants played the game, the authors asked them several questions about their clues and guesses to evaluate whether their design decisions were effective or not. Another paper called The Design and Evaluation of a Computer Game for the Blind in the GRAB Haptic Audio Virtual Environment was published in 2003 by Wood et al. [62]. They used, compared to the other papers presented so far, a combination of haptic in- and output as well as auditive output, which they call Haptic Audio Virtual Environment (HAVE). In this environment, the haptic device, called GRAB, was developed by PERCRO. The device is completely steerable with two forefingers and consists of two arms with three Degrees Of Freedom (DOF) and three DOF in mobility [62]. Figure 2.10 shows the GRAB interface in action, operated by a user. GRAB detects collisions between the player and virtual objects and returns them as haptic feedback. The authors decided to use a search and adventure game as the test genre for their study. They built a virtual world with two rooms, one (locked) door, a key (to unlock it), one attractive trap (which captures and immobilizes one finger), an attractive trap deactivator (which is near the trap itself), two bombs, one bomb deactivator, lives and points. If the player collects both points, the game is won. If a bomb explodes, the game is lost. To collect an item, the player has to tap it three times in a row. By pressing a button or opening the door, the player receives corresponding sounds like a clicking or knocking.
The participants in the playtests stated that the game was too simple in term of challenges, but the GRAB environment is suitable for gaming purposes. Velleman et al. published a compilation and results of three tested audio games in 2004 [55]. These games are called Drive, Curb Game and Demor, while two of them have special properties 25 which are highlighted in this paragraph. Drive is a racing game without the possibility to steer the car. The authors tried to focus completely on the sense of speed due to their assumption that they should sonify the essence of a game instead of visual feedback. The objective of this game is to collect boosters and use them to gain additional speed, but it gets harder to collect boosters due to higher speeds. After three minutes, the game is over and the score (reached distance and collected boosters) is calculated. Players then have the possibility to upload their scores and compare them to others. Until 2004, over 50.000 people have downloaded the game and as a conclusion of this study, the authors stated that blind users had reached higher scores than non handicapped people. The Curb Game is the second game in this paper which is very similar to Frogger. The third game Demor is a location based 3D audio first person shooter. The player wears a backpack with a laptop (including GPS functionality) and headphones with a head tracking module. In addition to that, the player also has a joystick. This game is supposed to be played outdoors. The GPS and the headphones provide the laptop with the required information. If the player hears an enemy, he or she has to look in the direction of it and pull the trigger of the joystick. Depending on the accuracy, the player gets points. This is the first recorded 3D outdoor audio game. A case study about Terraformers, an often cited computer game (f.i. [12, 40, 41]), was pub- lished in 2004 by Westin [58]. Terraformers itself was released in 2003 and was the first com- mercial hybrid 3D audio game for sighted as well as for blind people [58]. They use a mixture of 3D sound and voice feedback to substitute missing visual sense. In the game, there is a com- pass which is represented by spoken feedback, telling the player the rough direction. Another element which supports visually impaired people is a sonar. With this device, the player is able to estimate the distance to objects in the field of view. By pressing a key, the game describes the object. Every object in the game has voice feedback, 3D graphics as well as 3D sound icons. This game offers three different graphic modes. The first one is the standard one, where every object is rendered as intended. Then there is a second mode, called High Contrast Mode, where unimportant objects (like walls) are rendered in black or white while important objects are ren- dered as in the first game mode. The advantage of this rendering method is that the important objects get additional contrast and are therefore better to recognize. This mode offers low vi- sion gamers the possibility to play the game. The third mode is the No 3D Graphics Mode that completely disables graphic rendering. Even in this mode, the game offers possibilities which allows sighted as well as blind people to play this 3D game. The paper published by Heller et al. does not demonstrate the possibilities of 3D audio in form of games but rather in form of a historical installation in the Coronation Hall in Aachen called CORONA [17]. 
The audio installation places several virtual characters into the hall, which are calculated on mobile devices that visitors receive. Visitors also get headphones with a compass sensor for auditory output as well as for tracking of the head. People then have the possibility to move through this virtual environment and to interact with the virtual audio characters. The authors performed preliminary tests to identify the best technical solution within the restrictions of the hall. They also tested an optical tracking system as a possible technology but found that this method would need too much fine tuning and would not fit this special environment. The research team Merabet et al. published a paper named Teaching the blind to find their way by playing video games in 2012 [8,29]. In this paper, the authors transferred knowledge of a VE to the real world. The Audio-based Environment Simulator (AbES) allows a simulation of an existing building for navigation and exploration purposes in form of a game metaphor [29]. A building was recreated in virtual space and blind people were able to explore this space on a computer in order to get to know the environment. After some iterations of gaming (there are monsters, jewels and an exit), the blind participants were able to orientate themselves in the real-world building using the cognitive map they had generated. One application of this technology could be that blind people get to know an unfamiliar environment before they visit it in reality for the first time [8]. Since the start of audio games with the introduction of AudioDoom, technical as well as scientific progress has advanced the possibilities of games without visual feedback. The work of the authors Velleman et al. [55], Heller et al. [17] and Merabet et al. [29] indicates a trend of leaving the computer screen at home to enjoy audio games in the wild. Table 2.1 summarizes the scientific papers related to audio games, listing the authors, years, games and main characteristics. 2.3 Software Tools In this thesis, two software tools are used for implementing and testing a hybrid sound model. Both are introduced in this section. Unity3D is a 3D game engine that supports 3D audio. Wwise is an audio middleware solution that can completely replace the existing audio engine of e.g. a game engine like Unity3D. 2.3.1 Unity3D Unity3D [54] is a runtime and development environment for interactive real-time applications and computer games. The basic version is free of cost with reduced functionality, while the professional version requires an annual fee. Unity3D offers state-of-the-art graphics, animation, sound and programming features. Unity3D uses a scene graph for storing game objects. In this graph, objects, attributes and relationships can be modeled. A game object exists in virtual 3D space and has a position, rotation and scale. Game objects can be selected and modified in the scene view. Primitive game objects like cubes or spheres can be added through a menu. In Unity3D, components can be attached to game objects, e.g. information about the material of the game object, audio files or scripts. Unity3D supports C#, C++, Boo and UnityScript. The game view of Unity3D shows the application from the perspective of the application camera and allows simulation of the application at runtime. Figure 2.11 illustrates an example of the Unity3D layout. Unity3D uses the FMOD library, which allows spatialized 3D sound. FMOD is a sound playback and mixing engine.
Several effects, like Doppler or reverberation effects, can be applied to sound sources using this integrated library. However, it is not possible to calculate geometry-dependent echoes [53]. For sound, Unity3D provides an audio source and an audio listener component. The audio source component allows changes to volume, pitch, maximum distance and Doppler-related settings, and it is responsible for the playback of the sound source. The audio listener component, on the other hand, is attached to the game object at which the sound should be received, e.g. the player's avatar.
Figure 2.11: The Unity3D layout: (1) scene view, (2) game view, (3) scene graph, (4) project explorer, (5) inspector.
Authors | Year | Game | Special
Lumbreras & Rossi [21] | 1995 | Hypertext story | First 3D audio selectable hypertext story.
Lumbreras et al. [22] | 1996 | Hypertext story, grab-and-drop technique | Virtual reality glove to grab and drop objects.
Lumbreras & Sánchez [45] | 1999 | AudioDoom | First 3D audio game, proof with Lego blocks.
Drewes et al. [9] | 2000 | Sleuth | Combination of audio game with physical paper.
Winberg & Hellström [61] | 2000 | Towers of Hanoi | Only earcons are used as auditory feedback.
Targett & Fernström [51] | 2003 | Os & Xs, Mastermind | Use of earcons and auditory icons.
Wood et al. [62] | 2003 | GRAB HAVE | Haptic in- and output and auditive output.
Sánchez et al. [43] | 2003 | AudioBattleShip | First collaborative audio game.
Velleman et al. [55] | 2004 | Drive, Curb Game, Demor | Online highscore comparison; GPS, joystick, outdoor game.
Westin [58] | 2004 | Terraformers | Commercialized and awarded hybrid audio game.
Mendels & Frens [28] | 2008 | Audio Adventurer | Relaxed context and use of own physical input device.
Heller et al. [17] | 2009 | CORONA | Interactive audio installation in a coronation hall.
Merabet et al. [29] | 2012 | AbES | Knowledge of virtual world transferred to real world.
Sánchez et al. [44] | 2014 | Audiopolis | Simulation of a whole city with tactile feedback.
Table 2.1: Summary of published papers related to audio games in a scientific context (see Section 2.2.3).
Another principle that is used in this thesis in Unity3D is ray casting. Ray casting can be used to detect obstacles between two points in 3D virtual space. A ray is shot from a 3D position into a 3D direction. Depending on the cast method, one or more obstacles (game objects) that intersect with this ray are detected and returned. This principle is used in the image source method for sound source mirroring and for the audibility test. In the secondary source method, ray casting is used to calculate the reflection angle as well as the positions of the secondary sources. 2.3.2 Wwise Wwise is an audio middleware solution that provides an authoring application, a sound engine, a game simulator, a plug-in architecture and an interface that allows communication to and from external world builders [4]. Audio middleware solutions can be used as a replacement of an already existing audio engine. Wwise is used in modern computer games, e.g. Assassin's Creed Unity (Ubisoft Montreal), Borderlands (Gearbox Software), Destiny (Bungie) or Metal Gear Solid V (Konami) [3]. Wwise provides access to general sound settings, e.g. volume or pitch, allows 2D and 3D sound positioning and offers Real Time Parameter Control (RTPC). RTPC enables access to sound settings from outside the application during runtime, e.g. changing the pitch of an already playing sound file. These options can be used for the spatialization of sound and the integration of sound models.
In addition, Wwise provides several sound effects, e.g. delay, flanger or reverb effects. The provided reverb effects do not take any geometrical information about the environment into account. Wwise uses a bus system to which the sound events are routed. Several sound events can be directed to one bus. In these buses, settings including bus volume or bus effects can be added, and they are valid for every sound that is routed over the respective bus. When a sound plug-in is used, e.g. GenAudio's AstoundSound, it is attached to an audio bus. Therefore, every sound that is routed through this bus is spatialized in 3D, as the AstoundSound plug-in is responsible for sound spatialization. The elements of Wwise are discussed in the following paragraphs. The authoring application of Wwise is a standalone software in which sound files can be managed, modified, mixed and saved into SoundBanks. SoundBanks are collections of event and sound data that can be loaded at specific points during a game to increase performance. Events in Wwise can be used to e.g. start or stop a sound file when fired. These events need to be created in the authoring application first, otherwise they are not available. Figure 2.12 shows the layout of the authoring application. Figure 2.12: The Wwise layout: (1) project explorer, (2) sound properties, (3) event viewer. Wwise's sound engine manages audio and performs functions. The sound engine is optimized and available for different platforms: Windows, Mac, iOS, Android, Linux and Windows Phone. The game simulator consists of a Lua script interpreter that allows the reproduction of sound and motion behavior in a game in order to monitor specific behaviors and the performance of Wwise before the project is integrated into a game's sound engine. The plug-in architecture of Wwise allows an expansion of the available functions. Sound plug-ins can be used to produce artificial sounds or to create sound effects, e.g. reverb. It is also possible to use third-party plug-ins, e.g. GenAudio's AstoundSound. A discussion of the plug-ins tested for this thesis can be found in Section 3.3. SoundFrame is an interface that allows communication between Wwise and an external game world builder or 3D application. Through this interface it is possible to access the same functionalities as in the authoring application. Figure 2.13 shows the production pipeline of Wwise. In the middle of this image, the basic tools of Wwise (which are discussed above) and their connections are shown. In the outer circle, Author, Simulate, Integrate, Mix and Profile determine the Wwise pipeline. Author defines the first step, which includes the building of sound and its properties. Simulate is the first step of testing and simulation to check whether everything works as intended. Integrate describes the option that the integration can be done without additional programming. Mix stands for the possibility of mixing properties in game in real time. Profile is the possibility of profiling the game during runtime. Figure 2.13: The Wwise production pipeline. Author: building of sound and properties, Simulate: first testing and simulation, Integrate: integration without programming, Mix: mix properties in real time and Profile: profiling game during runtime. Image taken from [4]. A full overview of the functionalities of Wwise can be found in [4].
CHAPTER 3 Design This chapter presents the environment in which the audio game prototype was implemented and discusses the restrictions that the test environment imposed. After that, an audio middleware solution and the sound spatialization plug-ins available for it are tested and evaluated to find the most appropriate technology for the implementation of the hybrid sound model. Finally, the designs of the baseline model and the hybrid model are presented. 3.1 Test Environment As discussed above, audio games can be played in different environments (see Section 2.2.2.5), e.g. on a couch, in front of a computer or in an open space by using real walking. In this thesis, an audio game in which the player is able to use real walking for the movement of the virtual avatar is used. The player is able to move with full freedom in an open space; therefore, tracking is required to locate the player's position in the real environment. The position of the player must be known for the calculation of the sound model and for real-time spatialization. The test environment was already available at the beginning of the thesis and consists of a room, a virtual model of it and a tracking solution. The room in which the audio game takes place is modeled virtually in Unity3D. This model is shown in Figure 3.1. The room boundaries, pillars and other obstacles are modeled into the virtual environment. The optical tracking system has been developed at the Vienna University of Technology and uses a camera mounted on the Oculus Rift DK2 worn by the tracked user. On the ceiling of the room, markers are placed and detected by the camera to determine the player's position in the room. To calculate the rotation of the player's head, the tracking information is combined with the information of the Oculus Rift DK2 sensors. The size of the testing area is 30m x 7m. Figure 3.1: The 3D model of the test environment. 3.2 Requirements Sound processing software and sound propagation models have to be chosen in accordance with the following requirements. 3.2.1 Real-Time In a game, users have to experience the immediate results of their actions. This is also valid for audio games. Such actions could be a change of the view while moving the viewport. In this test environment, the movement of the viewport is done by real walking. In addition to that, the player must perceive changes in sound when moving or rotating the view, otherwise the experience would not be perceived as realistic and therefore immersive. In regular computer games, real-time calculations of graphics and sound are important to provide a realistic and immersive feeling. In audio games, however, there are no graphics at all. Therefore, only the sound component needs to be calculated in real time, including changes in sound when the player moves the body or rotates the head. A sound model in an interactive scene has to be calculated in real time, otherwise users will perceive a significant delay. This delay reduces the immersion and realism of the sound model. If real-time calculations during the experience are not possible at all, players might not be able to localize sounds, which will make an audio game experience less convincing and less enjoyable. In conclusion, sound software and particularly sound calculation methods should be able to perform in real time. This also concerns the sound spatialization process that is described in the next section. 3.2.2 Sound Spatialization In real life, humans are able to distinguish the direction of a sound event (see Section 2.1.2).
Computer-generated sound needs to be processed in a way that simulates the same cues that allow people to distinguish the directions from which sounds come in real life. For this purpose, HRTFs can be used (see Section 2.1.5.1). Sound spatialization depends on four parameters: the position and rotation of the listener, and the position and rotation of the sound source. The rotation of the sound source is only necessary when directed sound is used. When these positions and rotations are known, the most appropriate HRTF can be chosen. For this, the azimuth, elevation and distance between the sound source and the listener need to be calculated. With these three parameters the most appropriate HRTF is determined. The HRTF is then applied to the sound (via convolution or (inverse) FFT) to simulate the filtering of the outer ear, head and torso. This provides the player with spatialized, realistic sound. In audio games, sound spatialization is crucial. Without spatialization, sonification (see Section 2.2.2.1) cannot give an answer to the question Where is something?. In conclusion, software is necessary that calculates the sound spatialization in real time according to the position and rotation of the listener and the sound source. 3.3 Sound Plug-ins As stated in Section 2.3.2, Wwise is used as the audio middleware solution for the implementation of the sound effects in the proposed audio game prototype. However, there is additional software available for sound spatialization in real time. This section investigates possible combinations of the audio solution of the game engine (Unity3D), the audio middleware solution (Wwise) and additional audio plug-ins. The quality of sound provided by the tested combinations is compared to find the most appropriate software solution. When Wwise is tested, sound plug-ins can extend the software's functionality for sound spatialization. The following software configurations are tested: • Unity3D audio without any additional software, • Unity3D with Oculus Spatializer, • Wwise without any plug-in, • Wwise with AstoundSound plug-in, • and Wwise with Auro-3D plug-in. For testing the built-in sound engine, the audio middleware solution and the available plug-ins, a simple Unity3D project was developed. 3.3.1 Plug-in Test Prototype A simple prototype scene was created to test the listed sound software. In this test scene, a cube constantly moves on an orbit around the player and continuously emits sound. There are three rotation phases. In the first phase, the cube orbits the player horizontally. After it completes a 360-degree loop, the second phase starts and the cube orbits the player vertically. After completing this phase, the third phase starts, which is the same as the first one but at a different height, so that the player can perceive sound coming from above. With these three phases, the player can experience the sound of the cube coming from different directions. By pressing predefined keys, the sound calculation method can be switched between the different sound solutions. A mouse can be used to rotate the character controller and therefore the virtual listener, to perform an equivalent of dynamic sound localization. Figure 3.2: The prototype which was used for testing the different sound plug-ins and the built-in engine. In the image at the top, there is the actual game view, at the bottom the scene view. The cube represents the sound source moving around the player, represented as the capsule.
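As a rough illustration of how such a test scene can be driven, the following sketch (hypothetical names; the actual prototype additionally switches sound solutions via key presses) moves a sound-emitting object around the listener in the three phases described above:

```csharp
using UnityEngine;

// Rough sketch (hypothetical names) of the orbiting test source described above:
// the sound-emitting cube circles the listener horizontally, then vertically,
// then horizontally again at a greater height, one full 360-degree loop per phase.
public class OrbitingTestSource : MonoBehaviour
{
    public Transform listener;        // the player / character controller
    public float radius = 3f;
    public float degreesPerSecond = 30f;

    private float angle;              // accumulated angle within the current phase
    private int phase;                // 0: horizontal, 1: vertical, 2: horizontal, elevated

    void Update()
    {
        angle += degreesPerSecond * Time.deltaTime;
        if (angle >= 360f) { angle -= 360f; phase = (phase + 1) % 3; }

        float rad = angle * Mathf.Deg2Rad;
        Vector3 offset;
        switch (phase)
        {
            case 0:  offset = new Vector3(Mathf.Cos(rad), 0f, Mathf.Sin(rad)) * radius; break;
            case 1:  offset = new Vector3(Mathf.Cos(rad), Mathf.Sin(rad), 0f) * radius; break;
            default: offset = new Vector3(Mathf.Cos(rad), 0f, Mathf.Sin(rad)) * radius
                              + Vector3.up * 1.5f; break;
        }
        transform.position = listener.position + offset;
    }
}
```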
Figure 3.2 shows this prototype. In the top image, the game view can be seen. The cube in front is the rotating sound source. The active sound solution is displayed in the top left corner, the available sound solutions in the top right one. The bottom image shows the scene, including the player controller, the camera and the rotating cube. This test was conducted in a silent environment with closed headphones and a mouse. The different sound solutions were tested several times. We were able to draw the following conclusions. 3.3.2 Results Unity3D audio without any additional software The Unity3D sound engine uses the FMOD library. The game engine, and therefore the integrated sound library, is freely available. More details about Unity3D sound can be found in Section 2.3.1. The tested game engine and sound engine version was Unity3D 4.6.5f1. In general, Unity3D spatializes sound authentically; however, changes when moving the virtual head up and down were not perceivable. The calculated spatialized sound did not change. Unity3D with Oculus Spatializer This plug-in is freely available for sound spatialization. It can be integrated directly into Unity3D without using any audio middleware solution. Oculus Spatializer uses HRTFs for sound spatialization. The tested version was Oculus Audio SDK 0.9.2 Release. The spatialization of a sound source in front of the player felt unnatural when moving the head horizontally. In this situation, the sound felt like standard stereo output that does not seem to be spatialized. Wwise without any plug-in The audio middleware solution can spatialize sound without any additional sound plug-ins. The software is available for free as long as no additional plug-ins are used. The positioning of a sound object in Wwise can be set to 2D or 3D. The tested Wwise version was Wwise v2014.1 (64-bit) Build 5158. Without any additional spatialization plug-in, this software produced the worst results of the tested sound solutions. Compared to the other software, the sound did not feel realistically spatialized. Horizontal sound spatialization (changes in azimuth) was acceptable; however, vertical sound spatialization (changes in elevation) was not perceivable. Wwise with AstoundSound plug-in Wwise in combination with AstoundSound spatializes sound around the listener in real time and provides distance cues [15]. The plug-in can spatialize from 0 to 359 degrees of azimuth and -90 to 90 degrees of elevation. These calculations are done with AstoundSound's own transfer functions called Brain-Related Transfer Functions (BRTFs) [15]. These functions model how the brain perceives sound spatially. The sound was perceived realistically in all situations. Changes in azimuth as well as in elevation felt realistically spatialized and were clearly distinguishable. Compared to the other tested software, this was the most convincing one. The tested version was Mixer plug-in v1.5. Wwise with Auro-3D plug-in Auro-3D was originally developed for home theaters (up to 10.1 sound systems) and cinemas (up to 13.1 sound systems). It uses three different layers of sound [6]. The lower layer is responsible for elevations between -15 and 15 degrees. The height layer is responsible for sound spatialization between 15 and 75 degrees of elevation. Spatialized sound at an elevation between 75 and 90 degrees is localized at the top layer.
Depending on the vertical coordinates of a sound source, the sound gets assigned to one of these three layers. The used version of this plug-in was Plug-in Version 1.0. The output was similar to the spatialized sound produced by plain Wwise. Changes in the sound source elevation did not make any noticeable difference; however, if the sound source was moved significantly, a reduction in sound volume was perceived. The results show that the most appropriate technology for this thesis is the combination of Wwise with the AstoundSound plug-in, which spatializes sound realistically. To test this software in an audio game, the following two games were implemented and investigated. 3.3.3 Game Prototypes Two prototypes were implemented to investigate the appropriateness of Wwise in combination with AstoundSound in an audio game environment. Figure 3.3: Screenshots of Catch the Dot for testing Wwise with AstoundSound in an audio game environment. In the image on the top, there is the actual game view, on the bottom the scene view. The first game, Catch the Dot, is a simple audio game where the player is inside a room. There are three coins randomly placed inside this room which continuously emit sound. The main task is to collect coins. After the player has collected the first three coins, another three coins are spawned and a rain sound is played additionally. After collecting them, additional sound disturbances start playing and another three coins are spawned. After the last coins are collected, the game is completed. The game was played blindfolded in front of a desktop computer with headphones. Figure 3.3 shows a screenshot (debug output) of the scene and game view. In the top image, a coin is visible in front of the player. In the image at the bottom, the character controller as well as two coins are visible. Testing of this prototype showed that localization using this sound technology worked in this environment. The spatialization of the sound supported the player in finding the coins. Disturbances made the localization process harder; however, the game could be completed. In Catch the Dot, the sound sources are stationary. Therefore, further investigations were conducted to see if localization and spatialization work correctly when the sound sources and the listener are moving at the same time. For this purpose, another game prototype called Audio Frogger was created. Frogger (Atari, 1981) is a computer game where the player is in the role of a frog who wants to cross a street. On the street, there are several obstacles like moving cars that make crossing harder. If the frog is hit by a car, the game is over. The level is successfully completed when the other side of the street is reached. Figure 3.4: Screenshots of Audio Frogger for testing Wwise with AstoundSound in an audio game environment. In the image on the top, there is the actual game view, on the bottom the scene view. Audio Frogger is similar to the original Frogger game, but it is played in 3D space. Again, the game is played without any visuals at all. However, the debug output can be seen in Figure 3.4. There are several streets at regular intervals with cars driving from the right to the left side of the screen. These cars emit wheel noise sounds. The pitch is shifted depending on the distance between the object and the listener and the speeds of the listener and the sound source to simulate the Doppler effect.
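The exact pitch mapping used in Audio Frogger is not specified further; the following sketch (hypothetical names, a simplified Doppler factor, and assuming a “pitch“ game parameter registered as an RTPC in Wwise) illustrates one plausible way to drive such a pitch shift from the relative motion of a car and the listener:

```csharp
using UnityEngine;

// Illustrative sketch (hypothetical names) of a simple Doppler-style pitch shift
// for a passing car: the relative radial velocity between source and listener is
// mapped to a pitch factor and written to a "pitch" game parameter that is
// assumed to be registered as an RTPC in Wwise.
public class DopplerPitch : MonoBehaviour
{
    public Transform listener;
    private Vector3 lastSourcePos, lastListenerPos;
    private const float SpeedOfSound = 343f; // m/s

    void Start()
    {
        lastSourcePos = transform.position;
        lastListenerPos = listener.position;
    }

    void Update()
    {
        // Approach speed along the line between source and listener.
        Vector3 toListener = (listener.position - transform.position).normalized;
        Vector3 sourceVel = (transform.position - lastSourcePos) / Time.deltaTime;
        Vector3 listenerVel = (listener.position - lastListenerPos) / Time.deltaTime;
        float closingSpeed = Vector3.Dot(sourceVel - listenerVel, toListener);

        // Simplified Doppler factor: > 1 while approaching, < 1 while receding.
        float pitchFactor = SpeedOfSound / Mathf.Max(SpeedOfSound - closingSpeed, 1f);
        AkSoundEngine.SetRTPCValue("pitch", pitchFactor, gameObject);

        lastSourcePos = transform.position;
        lastListenerPos = listener.position;
    }
}
```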
Three different sounds are available: one of a regular car, one of a truck and the sound of a regular car paired with the sound of a police siren. With these sounds, the user is able to distinguish the direction of a car and can estimate the distance between the user and the sound source. In addition to the continuous wheel sounds, the cars honk randomly. Other sound disturbances appear from time to time, e.g. bird sounds or rain. With every passed street, the speed of the cars is slightly increased. As soon as the last street is passed, the game is completed successfully. However, if a car hits the player, the game is lost. This prototype proved that spatialization and localization work when the sound sources and the listener are moving in an audio game environment using Wwise with the AstoundSound plug-in. Spatialization with Wwise and AstoundSound supported the player during the game. The player was able to fulfill the tasks only by listening to sound. Therefore it can be concluded that this technology is appropriate for an audio game. Thus, this technology is used for implementing the sound models that are discussed in the next section. 3.4 Sound Models The tested sound plug-ins provide a possibility to spatialize sound in real time. However, these plug-ins do not provide any geometry-dependent reverberation. As already mentioned above, the reverb effects that are available in the audio middleware solution do not take geometry into account. Therefore, an implementation of a sound model is needed that can simulate those effects. The model that is implemented in this thesis is an adaption of the hybrid sound model introduced in Section 2.1.6. In this thesis, it is investigated whether these additional sound effects matter in audio games (see Section 5.1 for the proposed hypotheses). Therefore, a model that is generally used in audio games is implemented as a comparison. This model is hereinafter referred to as the baseline model and is introduced in the next section. Both sound models should meet the following requirements for being applicable in this thesis: • Spatialized sound calculated according to the position and rotation of the listener and sound source. • Sound attenuation according to the distance between the listener and the respective sound source. • Obstruction of sound sources through a low-pass filter applied on the respective sound source depending on the listener's position. 3.4.1 Baseline Model The review of audio games in a scientific context shows that all of them use similar sound models [9, 28, 29, 45, 55, 58, 62]. In this section, the common elements of these models are highlighted, discussed and adapted for the use in audio games. Sound spatialization is the key component in all of the above mentioned papers. Without sound spatialization, an object cannot be localized in virtual space. In [45], the authors created 3D sound by preprocessing the game audio files, combining different sets of HRTFs. In [28], stereo panning is used for sound spatialization. In this thesis, however, already existing sound technology that is able to spatialize sound in real time is used. Attenuation over distance is described in one of the above mentioned papers [9]. If the distance between a sound source and the listener becomes larger, the attenuation of the sound source has to be stronger and therefore the emitted sound loses power. This feature is essential for the perception of distance and is implemented in the baseline model.
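As an illustration of this requirement, the following sketch (hypothetical names; a simple linear rolloff is assumed, whereas the actual setup can also rely on Wwise attenuation settings or the spatialization plug-in) lowers a registered “volume“ RTPC with growing distance:

```csharp
using UnityEngine;

// Minimal sketch (hypothetical names) of distance-based attenuation as required
// by the baseline model: the source volume falls off with the distance between
// listener and sound source. A linear rolloff is only one plausible choice.
public class DistanceAttenuation : MonoBehaviour
{
    public Transform listener;
    public float maxDistance = 20f;   // beyond this distance the source is silent
    public float maxVolume = 100f;    // assumed range of the registered "volume" RTPC

    void Update()
    {
        float distance = Vector3.Distance(listener.position, transform.position);
        float volume = maxVolume * Mathf.Clamp01(1f - distance / maxDistance); // linear rolloff
        AkSoundEngine.SetRTPCValue("volume", volume, gameObject);
    }
}
```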
These studies [9, 28, 29, 45, 55, 58, 62] do not take the effect of obstruction into account. Obstruction might be appropriate for audio games where sound sources can be located behind walls. This effect can be simulated by applying a low-pass filter on the obstructed sound source [36]. The baseline model meets the above mentioned requirements. However, the next section discusses the hybrid model, which introduces another requirement: geometry-dependent reverberation. 3.4.2 Adapted Hybrid Model Like the baseline model, the hybrid model implements spatialized, attenuated sound and takes obstruction into account. In addition to these three basic principles, it reproduces basic principles of room acoustics. In room acoustics, the reflections and reverberation in a room depend on the structure of the room's geometry, as discussed above (see Section 2.1.6). This information needs to be taken into account in a hybrid sound model for audio games. Therefore, the above described requirements are extended by: • early reflections and late reverberation that are geometry-dependent. The basic principles of the ODEON approach are discussed in Section 2.1.6. This approach is ray-based and combines the image source method with secondary sources. An implementation of ODEON as it is described in [35] would not be possible in the given test environment: ODEON would generate too many sound sources, which cannot be handled by the software simultaneously. Therefore, the ODEON model is adapted for real time in the given environment. These adaptations are discussed in the following paragraphs. 3.4.2.1 General Adaptions Sound delay As discussed above, the sound delay for the image sources is calculated from the distance between the sound source and the listener. This means that every time the player changes position, the audio playback position needs to be changed as well. This is not possible in the given software setup without producing sound artifacts. These artifacts are produced by an output buffer underflow that can be perceived as clicking noise [46]. As a replacement for this distance effect, sound is additionally attenuated depending on the distance between the sound source and the listener. This sound attenuation is applied to image sources but also to secondary sources, since the delay is calculated in the same way. Amount of sound sources In [38], the author states that 500 to 1000 rays are enough to obtain reliable results producing realistic sound effects in an auditorium. If each ray creates 10 sound sources in total, there would be 5000 to 10000 sound sources in the audio game prototype at the same time. This amount cannot be handled by the software used in this thesis. To limit the amount of produced sound sources, the maximum image source order can be set to two or three, as recommended in [38]. However, the amount of secondary sources also has to be limited, otherwise real-time calculations would not be possible anymore. The adaptions of the secondary sources are discussed in the next section. 3.4.2.2 Adaptions of the Secondary Sources Algorithm In general, the secondary sources algorithm starts its calculations when the set transition order is reached. When the transition order is set to two, third order image sources are not produced anymore. Instead, the first secondary source is placed at the point of intersection with the wall, between the image source of second order and the point where an image source of the third order would be.
The direction of the secondary ray follows Snell's law of reflection. However, the secondary sources are only audible for the receiver if the corresponding image source is valid (during the audibility test, see Section 2.1.6.1). This might not always be the case, e.g. if the maximum image source order is too low or there are too many walls. Therefore, an approach similar but not identical to the secondary source approach is implemented, which is referred to as the Modified Secondary Sources Approach (MSSA). Instead of starting the secondary sources algorithm when the maximum image source order is reached, this approach shoots rays into the scene from the original sound source right at the beginning. These rays explore the room geometry and therefore produce reverberation even if there are no visible image sources. As long as the maximum length of the secondary ray is not reached, the reflection process is repeated iteratively, placing a secondary source at every point of intersection between a secondary ray and a wall. With every bounce, the ray loses energy (volume), and the delay of each secondary source is calculated from the distance covered between the original sound source and the secondary source. The last secondary source is created at the last wall for which the covered distance is still below the maximum length. Figure 3.5 shows this process; the green dots represent the created secondary sources. Formula 2.7 for calculating the energy of the secondary sources cannot be applied in the used setup. The last term of the proposed formula would take the attenuation over distance into account. However, the AstoundSound plug-in already calculates the attenuation of sound objects depending on the position and rotation of the listener. Therefore, the last term of Formula 2.7 is dropped. Sound energy absorption of obstacles is usually frequency dependent. This is not possible in the used sound software setup. Therefore, the same parameter is used for each frequency. See Section 4.3 for a discussion of the value chosen for the sound model implementation. 3.4.2.3 Parameters The proposed hybrid model approach introduces the following parameters: • maximum image source order, • maximum length of secondary rays, • number of rays emitted from the original sound source, • scattering coefficient, • absorption coefficient, • and obstruction level. Maximum image source order In the image source method, sound sources get mirrored at the room boundaries recursively. In a room with four walls, the sound source gets mirrored at all walls and generates four image sources. Figure 3.5: Illustration of image sources (S0) and secondary sources (green dots on the room borders) calculation for the implemented hybrid model. If the image source algorithm stopped now, the maximum image source order would have been set to one. When an image source, e.g. S0, again gets reflected at a wall, e.g. wall 1, it generates an image source of second order (S0,1). Rindel [38] recommends a maximum image source order of two or three. Maximum length of secondary rays As discussed above, the amount of sound sources needs to be limited, otherwise it would not be manageable by the used software. The maximum length of secondary rays is the total distance a ray has covered since its emission. In ODEON, a maximum reflection order parameter can be set that limits the amount of secondary sources. It defines how often a ray can get reflected. In ODEON, the maximum value for this parameter can be set to 2000 [35].
However, appropriate parameter values for the test environment are discussed in Section 4.4. Figure 3.6: The regular split of a sphere for calculating the amount and directions of the rays for MSSA. Rays (lines with black arrows) are shot from the sound source origin through the points of intersection (red circles). The more intersections, the more rays are emitted. Number of Rays emitted from the original sound source The number of rays emitted is a parameter that is introduced for MSSA. In the original secondary sources calculations, the secondary source algorithm starts after the transition order is reached. However, as discussed above, this is not applicable for a room with bigger obstacles in it. The number of rays emitted is defined as the amount of rays that are shot from the original sound source into the scene. For the calculation of the directions, a virtual sphere is placed at the sound source position and split into equal parts on its surface. When the parameter is set to 25, 25 rays are emitted from the origin of the sound source in the directions of the equally split parts of the sphere surface. Figure 3.6 shows an example of the regular split of the sphere. Scattering coefficient As described in Section 2.1.6.2, the scattering coefficient describes how much of a random vector is added to the vector of the specular reflection of a secondary ray. With this, the roughness of a surface can be simulated. The value must be between 0 and 1. When the parameter is set to 0, no random vector is added to the direction of the specular reflection; the surface would therefore be infinitely smooth. When the parameter is set to 1, the direction of the random vector becomes the new direction of the secondary ray. Ideally, this parameter should be set to 0.1 for large plane surfaces and 0.7 for highly irregular surfaces [38]. Absorption coefficient The absorption coefficient is needed for the calculation of the sound energy of secondary sources. It defines the amount of sound energy absorbed by a wall. With every reflection a secondary ray loses energy and therefore creates a less powerful secondary source. See Formula 2.6 for the calculation. A list of possible absorption coefficients can be found in [19]. Obstruction level Obstruction is a sound phenomenon that appears when the direct path between a sound source and the listener is obstructed by an obstacle. A more detailed discussion about obstruction can be found in Section 2.1.4. The used audio middleware solution Wwise offers an option to set the obstruction level of a sound source. This parameter defines the strength of the attenuation caused by a low-pass filter and a reduction in volume. The value must be between 0 and 1, where 0 means no obstruction and 1 full obstruction. 3.4.2.4 Testing Prototype A prototype was implemented for development and testing of the proposed hybrid sound model. Figure 3.7 shows the debug output of this version. The red capsule is the player, while the semi-transparent cube is the position of the sound source. Following the rules of image source creation (see Section 2.1.6.1), the pink lines show the reflections of sound rays at the walls. The higher the order, the more reflections one ray has. The green cubes are the positions of the secondary sources, which are generated as described above. Different types of sounds were tested to find optimal parameters.
Tests with German Hip-Hop (Deichkind - So‘ne Musik) have shown that this model is not appropriate for sound sources producing sound continuously. This might be the case due to computational reasons and limitations of the sound model. Speech, however, turned out to be very clear and rich in reflections. Sound sources that periodically produce discrete sounds are very suitable. Figure 3.7: A screenshot of the implemented version for testing the hybrid model. The pink lines show the image source model reflections and the green cubes are the positions of the secondary sources. CHAPTER 4 Implementation This chapter discusses the implementation of the hybrid sound model in the Unity3D game engine and describes the communication between Unity3D, Wwise and AstoundSound. 4.1 Plug-in Integration This section describes the configurations necessary to set up the communication between the used software. 4.1.1 Integration of Wwise into Unity3D Before Wwise can be integrated into Unity3D, it is advisable to first create a new project in Wwise and to select the Unity3D project's Assets folder as the target location (see Figure 4.1). When the new project is created, Unity3D automatically detects the changes in its folder structure and updates the content of its project explorer. Once the new project has been created, the Wwise Unity3D integration package can be imported into Unity3D. During the import process, the Wwise project should be detected automatically. Wwise is then successfully integrated into Unity3D. Figure 4.1: This image shows the new project dialog in Wwise. The location of the project should be inside of Unity3D's Assets folder so that Unity3D can automatically update changes done in Wwise. 4.1.2 Integration of the AstoundSound plug-in into Wwise As discussed above, the AstoundSound plug-in was chosen as the most appropriate technology for this thesis. Before the spatialization functionality of AstoundSound can be used, it must be integrated into Wwise. The plug-in is not free of cost and therefore not enabled by default in Wwise. To activate the plug-in, AkSoundEngine.dll needs to be recompiled. For this recompiling process, the Unity3D Integration Source must be downloaded from Audiokinetic's website. It can be opened with an appropriate Integrated Development Environment (IDE), e.g. Microsoft Visual Studio. In the SoundEngineStubs.cpp file, additional plug-ins can be activated. Figure 4.2 shows the uncommented lines of code that are necessary to activate plug-ins (in this figure, AstoundSound and Auro are activated). The old AkSoundEngine.dll needs to be replaced with the newly compiled one in order to have the plug-ins working properly. Figure 4.2: The Unity3D Integration Source for Wwise, opened in Visual Studio 2012 IDE. AstoundSound and Auro-3D plug-in are uncommented (activated) and ready for compiling. 4.1.3 Calling Wwise Events in Unity3D After the Wwise sound engine is integrated into Unity3D, Wwise can be used to play sound instead of Unity3D's built-in sound engine. Before a sound file can be played through the audio middleware solution, the sound must be imported into Wwise's authoring application. The sound is then available through Wwise's SoundBank structure. A SoundBank is a collection of event and sound data that can be loaded at specific points during a game to increase performance. Unity3D communicates with Wwise through events. These events can be used to trigger actions, e.g. the start or the stop of an audio file.
When a virtual character in an adventure game, for example, arrives at a village after having been in the woods before, a sound event (e.g. “stop_twittering“) could stop the twittering of birds and another sound event could start village-related sounds. An event name is unique within its SoundBank and can therefore only exist once. Before an event can be fired, it must be registered in Wwise's authoring application. By right-clicking the imported sound, a new event can be created. Available events are play, break, stop, pause, mute, bus volume, voice volume, voice pitch, voice low-pass filter, voice high-pass filter, state, set switch, bypass effect, seek, trigger, game parameter and release envelope. Events are also stored in Wwise's SoundBank structure. There are two possible ways to fire sound events in Unity3D so that they are played through the Wwise audio engine. One option is to use the existing scripts provided by Audiokinetic that, for example, load a SoundBank at start or automatically trigger events when collisions happen. The other option is to fire these events directly to Wwise through code. In this thesis, the second option was preferred over the first one. Regardless of how the event is fired, Wwise knows the sound-producing game object (and therefore its position and rotation) and the audio file that needs to be played. In Unity3D, the AkEvent script provided by Wwise can be used to fire events without programming. The script needs to be attached to the game object at which the sound should be played. Several triggers can be selected that fire the sound event, including CollisionEnter, Disable or MouseEnter. The name of the event which should be fired can be selected in a drop-down box through a Wwise explorer in Unity3D. Figure 4.3 shows the AkEvent component attached to a game object. Figure 4.3: The AkEvent script provided by Wwise can be used to fire events without programming. When the object to which this script is attached enters a collision, the “Play_Bomb“ event is fired. Sound events can also be fired through code. For this purpose, the PostEvent method of the AkSoundEngine class needs to be called in Unity3D. The parameters are the event name and the game object at which the event is triggered. The call of the method AkSoundEngine.PostEvent(“Play_Bomb“, gameObject) is equivalent to the settings in Figure 4.3, without using the script. Wwise also offers RTPCs that allow changing parameters of a sound object, e.g. volume or pitch, at runtime. They need to be registered in Wwise before they can be used, similar to the registration of events. RTPC parameters are set by calling the SetRTPCValue method of the AkSoundEngine class. For example, the call AkSoundEngine.SetRTPCValue(“volume“, 90, gameObject) sets the volume of the gameObject to 90.
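To make the two calls above concrete, the following minimal C# sketch (hypothetical class name; the event “Play_Bomb“ and the game parameter “volume“ are the ones used in the examples above and must be registered in the Wwise project) combines a collision-triggered event with an RTPC update:

```csharp
using UnityEngine;

// Hypothetical helper combining the two calls discussed above: it fires the
// "Play_Bomb" event on a collision and exposes a method for adjusting the
// registered "volume" RTPC of the same game object.
public class BombSound : MonoBehaviour
{
    void OnCollisionEnter(Collision collision)
    {
        // Equivalent to the AkEvent component configured in Figure 4.3,
        // but triggered from code (the option preferred in this thesis).
        AkSoundEngine.PostEvent("Play_Bomb", gameObject);
    }

    public void SetVolume(float value)
    {
        // "volume" must have been registered as a game parameter (RTPC)
        // in the Wwise authoring application beforehand.
        AkSoundEngine.SetRTPCValue("volume", value, gameObject);
    }
}
```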
Audio Master Script

The Audio Master script implements basic methods for loading a SoundBank, starting and stopping an audio file and reading or setting the playback position of an audio file. These methods use existing methods of AkSoundEngine. Every other class in the implementation that needs access to audio methods inherits from this class.

Check Obstruction Script

The Check Obstruction script is responsible for setting the correct level of obstruction on the sound source. It is attached to the original sound source object because this is the only type of sound source for which obstruction is possible. For a detailed description of the implementation, see Section 4.2.5.

Game Element Script

The Game Element script implements the hybrid sound model algorithm as described in Section 4.2.2. It calculates the image source positions as well as the secondary sources and performs the audibility test on both. In addition, it is responsible for enabling and disabling sound sources which are not valid during runtime. Sound-model-dependent parameters are set in this script. It does not inherit from Audio Master since it is not responsible for playing or stopping sound directly. This script is attached automatically to every sound source when it is instantiated.

Game Logic Script

The Game Logic script is the implementation of the context in which the model is used. When it is used for an audio game, this script implements the whole logic of the game, e.g. loading a new level, storing all sound sources or logging. It can but does not have to inherit from Audio Master as long as no sound needs to be played. This script takes several prefabs (sound source prefab, secondary source prefab, receiver, etc.) as parameters.

Reflection Cube Script

The Reflection Cube script inherits from Audio Master and is attached to every sound source. It implements methods for starting and stopping a sound source and for changing its volume, and it provides the possibility to start an audio file with a delay. Playing with a delay is necessary for the image sources and secondary sources. Figure 4.4 shows a schematic class diagram as an overview of the implemented classes.

Figure 4.4: The five implemented classes AudioMaster, CheckObstruction, GameElement, GameLogic and ReflectionCube. With these classes, a fully functional audio game with a hybrid model can be implemented.

Figure 4.5: This image illustrates the raycast at the creation of image sources in two dimensions.

4.2.2 Image Source Method

In this section, the implementation of the image source method in Unity3D is described. For a general description of the image source method, see Section 2.1.6.1. Every wall has an additional object called Mirrorplane attached that extends the wall on two axes (see the blue lines in Figure 4.5). At the beginning of the calculation, the positions of the sound source (S) and the player (R) are saved. Rays are shot from the sound source into six directions, two for each axis. In Figure 4.5, such raycasts are shown as green arrows. When a ray hits the mirrorplane of a wall, an image source is instantiated in the form of a new game object (S1) at the position corresponding to Formula 2.4. S1 shoots further rays in six directions, of which only five are valid: an image source is not allowed to be mirrored back at the same wall it was created from. S1,0, however, was mirrored at the virtual extension (mirrorplane) of wall 0.
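To make the mirroring step concrete, the following sketch computes an image source position for a wall described by a point on its plane and its normal. It is a simplified stand-in for the role of Formula 2.4 and uses illustrative names, not the actual thesis code:

using UnityEngine;

public static class ImageSourceUtil
{
    // Mirrors the original source position S across the (infinite) plane of a wall,
    // yielding the position of a first-order image source such as S1.
    public static Vector3 MirrorAcrossWall(Vector3 sourcePosition, Vector3 pointOnWall, Vector3 wallNormal)
    {
        Vector3 n = wallNormal.normalized;
        float signedDistance = Vector3.Dot(sourcePosition - pointOnWall, n); // signed distance of S from the wall plane
        return sourcePosition - 2f * signedDistance * n;                     // point reflection across the plane
    }
}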
The mirroring stops once the maximum image source order (see Section 2.1.6.1) is reached. After the image sources have been created, an audibility test needs to be performed whenever the player moves in order to enable or disable the sound cubes that are valid or invalid; for more information about the audibility test itself, see Section 2.1.6.1. The audibility test is illustrated in Figure 4.6, with image source positions the same as in Figure 4.5. The test starts at the receiver's position. A ray is emitted towards an image source of the first order (e.g. S0). If the wall that is hit (not the mirrorplane) equals the wall the source was mirrored at, the next step of the audibility test can be performed: a ray then starts at the position of intersection towards the predecessor of the current image source, which in the case of S0 is the original sound source. S0 is therefore successfully backtraced by the path R → S0 → S and the image source remains valid. Image sources S1 and S0,1 are also valid, via the paths R → S1 → S and R → S0,1 → S0 → S. S1,0, however, is invalid because the ray first hits wall 1 instead of wall 0 (red line). If a source is invalid, it and its children are disabled, because an invalid image source cannot produce valid sources of higher order.

Figure 4.6: The audibility test applied to the given example.

The audibility test is done at every frame in which the user moves. As soon as an image source becomes invalid, the volume parameter of the respective Wwise game object is set to zero. If the audibility test of an already disabled sound source succeeds again, the volume is restored.

4.2.3 Modified Secondary Sources Approach

There are two parameters (see Section 3.4.2.3 for an overview) affecting the outcome of this algorithm: the number of rays emitted and the maximum length of secondary rays. As described in Section 3.4.2, the origin of the emitted rays is the center of the original sound source. The direction as well as the amount of rays depend on a parameter called steps, which splits the sphere's surface into equal parts on the x- and y-axes, as described in Section 3.4.2.3. When the parameter is set to six, the algorithm creates 36 rays (6^2 = 36). After the directions are calculated, rays are shot through the sphere parts.

When a secondary ray hits a wall, it gets reflected according to Snell's law. To simulate the roughness of the surface, however, a random vector is added to the specular direction calculated through Snell's law. This random vector is generated during runtime and added to every reflected direction; its weight can be controlled by the scattering coefficient parameter (see Section 3.4.2.3). Figure 4.7 shows the vectors used in this approach. On the left, the secondary ray is incident on the surface. Without a random vector, the ray would be reflected according to Snell's law. The random vector points in a different direction; with a scattering coefficient of 0.5, the resulting direction of the secondary ray is the secondary direction shown in the figure. The sound energy (volume) of the secondary source is calculated using Formula 2.6: with every reflection, the energy of the next secondary source is reduced by the factor α_i, the absorption coefficient. When the sound energy reaches zero, the sound source is not spatialized anymore. Every time a ray is reflected, the total path it has covered so far is saved.
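A minimal sketch of this reflection step is given below (illustrative names, not the thesis code); it blends the specular direction with a random roughness term according to the scattering coefficient, so that a coefficient of 0.05, for example, keeps 95% of the specular direction:

using UnityEngine;

public static class SecondaryRayUtil
{
    // Direction of a reflected secondary ray on a rough surface: the specular (Snell's law)
    // direction is blended with a random direction according to the scattering coefficient.
    public static Vector3 ScatteredDirection(Vector3 incomingDirection, Vector3 surfaceNormal, float scattering)
    {
        Vector3 specular = Vector3.Reflect(incomingDirection.normalized, surfaceNormal.normalized);
        Vector3 random = Random.onUnitSphere; // roughness term, generated at runtime
        return ((1f - scattering) * specular + scattering * random).normalized;
    }
}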
As noted above, when the total covered path becomes larger than the maximum length of secondary rays, the algorithm stops creating additional secondary sources.

Figure 4.7: The secondary ray would be reflected according to Snell's law (specular) if the scattering coefficient were set to 0. The secondary direction shown in the figure results from setting the scattering coefficient to 0.5.

4.2.4 Spatialization

Spatialization is done by Wwise using the AstoundSound plug-in. The following objects are passed to Wwise:

• the player,
• the sound source,
• the image sources,
• and the secondary sources.

The AkGameObj script provided by Wwise is attached to these objects and transmits their position and rotation to the audio middleware. In Wwise, AstoundSound spatializes the sound of emitting sound sources. The sound signals are routed through an audio bus to which the AstoundSound plug-in is attached; every incoming sound signal is then spatialized. The AstoundSound plug-in provides several options that can be selected in Wwise's authoring application. The following features of the plug-in are enabled:

• Full 3D - enables 3D panning (azimuth and elevation).
• Spatial interpolation - allows interpolation for smooth transitions between filter points.
• Apply attenuation - calculates sound attenuation depending on the distance between a sound source and the listener.
• Get direction from game object - this setting is important in interactive applications.
• Get distance from game object - this setting is important in interactive applications.

The spatialization is done in real time.

4.2.5 Obstruction

When the sound source is obstructed by an obstacle, a low-pass filter and a reduction in volume are applied to the sound source. Wwise offers the method SetObjectObstructionAndOcclusion for this purpose. The first parameter of this method is the game object on which the low-pass filter should be applied, the second is the number of the listener in the scene, and the third can be between zero and one and sets the obstruction level: zero means no obstruction, one means fully obstructed. The same is possible for occlusion; however, no occluded sound sources are used in this thesis.

Only the original sound source can be obstructed. Image sources cannot be obstructed because obstructed image sources would already fail the audibility test. Secondary sources only emit sound if there is no obstacle in the line of sight between the player and the secondary source. If there is no obstacle between the source and the receiver (player), no low-pass filter is applied. If there is one wall between them, a low-pass filter is applied; when there are two walls between these two points, the strength of the low-pass filter is squared. The following formula is used to calculate the obstruction parameter for Wwise. Let o be the obstruction level and n the number of walls between the original sound source and the listener; then

p = 1 − o^n   (4.1)

is the parameter passed to the SetObjectObstructionAndOcclusion method.

4.3 Parameters

The following parameters are adjustable and can be changed:

• maximum image source order,
• number of rays emitted from the original sound source,
• maximum length of secondary rays,
• scattering coefficient,
• absorption coefficient,
• obstruction level,
• maximum audible distance,
• and speed of sound.

For the calculation of the sound delay, the speed of sound is set to 344 m/s.
For every meter between the image source and the original sound source, and for every meter of a secondary ray's length, a delay of 1/344 ≈ 0.0029 seconds is applied to the respective sound source. This parameter is not changed, because air is assumed as the sound medium in the tested audio game. The maximum values of the other parameters depend on several factors: computational power, the structure of the VE and whether the sound sources are able to move or not. Several settings were tested to find an optimal solution.

The maximum image source order limits the mirroring process during the creation of the image sources (see Section 2.1.6.1 for further explanations). Depending on the chosen value, performance can increase or decrease: higher values find more specular sound paths, but the model calculation time changes significantly when this parameter is modified. In the next section, this parameter is tested to visualize its impact on system performance during runtime. For the audio game, this parameter is set to three.

The amount of rays emitted for the calculation of secondary sources also affects system performance. The more rays are emitted, the more sound sources need to be managed and spatialized at the same time. For the tested audio game, this parameter is set to 25; the next section shows the performance of the system when it is changed.

Related to the amount of emitted rays, the maximum length of secondary rays also affects system performance. The higher this value, the more sound sources are generated that need to be managed and spatialized at the same time. The optimal value depends on the VE structure; in the tested audio game it is set to 60 meters. Different values of this parameter and their effects are shown in the next section.

The absorption coefficient is the fraction of energy the secondary ray loses with every reflection. This parameter is set to 0.3, so the ray loses 30% of its energy with every reflection. This is the absorption coefficient of hardwood, which was chosen for this implementation [19].

The scattering coefficient is used to simulate rough surfaces. It is set to 0.05, which means that 95% of the true reflection vector is combined with 5% of the random vector. Other values would have been possible too; however, in the simulated audio games the room's surfaces are assumed to be very smooth.

The obstruction level is set to 0.5, which sounds natural. Other values tested for this parameter were 0.9, 0.7 and 0.3, but 0.5 was found to be appropriate for the tested audio game.

The maximum audible distance defines the distance between the listener and a sound source within which the sound source is still perceivable. This distance depends on the VE structure and cannot be generalized, but for the tested audio game of this thesis a maximum audible distance of 20 meters was found to be appropriate. This also served as a replacement for the missing delay between the receiver and the image sources as well as between the receiver and the secondary sources.

The maximum image source order, the maximum length of a secondary ray, the amount of secondary rays and the maximum audible distance depend on the underlying VE structure and cannot be proposed in general. However, the next section discusses several combinations of the maximum image source order, the maximum length of the secondary rays and the amount of secondary rays in a test scene.
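Before turning to the performance measurements, the obstruction and delay calculations described above can be summarized in a short sketch (illustrative names and helper structure; only the formulas themselves are taken from the text):

using UnityEngine;

public static class SourceParameterUtil
{
    // Formula 4.1: o is the obstruction level (0.5 in the tested game) and
    // n is the number of walls between the original sound source and the listener.
    public static float ObstructionParameter(float o, int n)
    {
        return 1f - Mathf.Pow(o, n); // 0 walls -> 0.0, 1 wall -> 0.5, 2 walls -> 0.75 (for o = 0.5)
    }

    // Playback delay for an image source (distance to the original source) or a secondary
    // source (covered length of the secondary ray), assuming air with c = 344 m/s.
    public static float PlaybackDelaySeconds(float pathLengthMeters, float speedOfSound = 344f)
    {
        return pathLengthMeters / speedOfSound; // e.g. 10 m -> approximately 0.029 s
    }
}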
Figure 4.8: The area for testing different parameters to evaluate maximum values.

4.4 Performance

The performance of the implemented hybrid model is discussed in this section. For comparison, the maximum image source order, the maximum length of the secondary rays and the amount of secondary rays are modified, tested and compared. The test setup consists of an Intel i7-3770 CPU @ 3.4 GHz, 8 GB RAM and an Nvidia GeForce GTX 670 running 64-bit Windows 7.

The virtual test environment consists of ten reflection surfaces (one ceiling, one floor, eight walls). In the user study described in Section 5.2.1, a similar virtual scene is used. Figure 4.8 shows the testing area. The area is 20 m x 15 m x 5 m in size with an inner pillar of 7 m x 5 m x 5 m. The ceiling is made transparent but reflects sound waves and acts like a normal wall.

As mentioned above, three parameters are varied for testing system performance: the maximum image source order, the maximum length of the secondary rays and the amount of secondary rays. For the maximum image source order, three values were tested: two, three and four. A fifth image source order would have produced too many sound sources. For the maximum length of secondary rays, the following eight values are tested: 30, 40, 50, 60, 70, 80, 90 and 100 meters. The tested values for the amount of secondary rays are 4, 9, 16, 25, 36, 49, 64 and 81. Parameter combinations which produced sound starvation are marked with *. Sound starvation happens when the sound engine cannot provide data to the hardware buffer in a timely manner [5]. The reason for this could be excessive use of the CPU, with the audio thread at 100% load.

max. order = 2; rows: maximum length of secondary rays (m); columns: amount of rays emitted
          4        9       16       25       36       49       64       81
 30  169.87   146.29   139.10    94.44    85.64    63.87    56.30    44.59
 40  168.23   134.51   127.72    78.04    72.79    51.68    44.21   30.55*
 50  161.52   125.17   117.26    69.78    60.10    44.37    39.75   25.86*
 60  163.55   117.44   107.46    62.90    50.38    40.28   32.15*   22.53*
 70  151.66   110.43   101.83    57.32    46.70   31.98*   26.44*   20.18*
 80  146.24    97.68    97.54    52.38    41.13   28.32*   22.23*   16.93*
 90  152.11    95.98    87.65    46.08    37.67   24.97*   19.20*   15.32*
100  141.89    92.93    86.90    42.09    34.54   22.97*   17.82*   13.99*

Table 4.1: The measured performance on a testing desktop. The amount of rays in combination with the maximum length of secondary rays is varied and the output is given in average FPS measured over 30 seconds. The maximum image source order is set to two. Output marked with * produced sound starvation.

max. order = 3; rows: maximum length of secondary rays (m); columns: amount of rays emitted
          4        9       16       25       36       49       64       81
 30   73.59    67.79    67.85    55.42    50.00    41.87    38.97    33.32
 40   73.03    64.95    64.64    50.35    44.39    36.07    31.90   27.04*
 50   71.85    63.05    60.88    46.17    40.96    31.54    27.55   21.84*
 60   70.38    59.02    61.44    43.04    36.33    26.64   23.82*   19.75*
 70   69.22    57.88    55.25    39.22    33.40   25.19*   22.25*   17.06*
 80   69.00    55.34    55.19    37.52    30.54   23.53*   19.44*   15.63*
 90   68.17    55.44    52.05    33.55    27.46   20.48*   18.28*   13.83*
100   70.32    52.72    50.78    32.40    26.78   18.63*   16.27*   13.21*

Table 4.2: The measured performance on a testing desktop. The amount of rays in combination with the maximum length of secondary rays is varied and the output is given in average FPS measured over 30 seconds. The maximum image source order is set to three. Output marked with * produced sound starvation.
4.4.1 Frames Per Second (FPS)

Table 4.1 shows the effect of varying the amount of rays emitted and the maximum length of secondary rays with the maximum image source order set to two. Table 4.2 lists the output FPS with the maximum image source order set to three, Table 4.3 with the parameter set to four. The output is measured as averaged FPS over a duration of 30 seconds in the testing area.

max. order = 4; rows: maximum length of secondary rays (m); columns: amount of rays emitted
          4        9       16       25       36       49       64       81
 30   36.04    34.49    34.32    29.20    28.84    24.55    23.21    20.91
 40   35.27    33.99    33.33    27.94    26.19    22.76    21.34   17.43*
 50   35.46    32.15    32.61    27.21    24.19    20.54   18.16*   15.03*
 60   35.48    31.79    31.35    25.51    22.43    18.49   16.84*   14.18*
 70   34.86    31.28    31.49    23.34    21.27   17.45*   15.15*   12.57*
 80   35.11    30.85    29.63    22.31    19.26   15.57*   14.05*   11.59*
 90   35.07    29.95    29.83    20.75    18.46   14.48*   12.92*   10.68*
100   34.39    30.07    29.30    21.63    17.22   13.14*   12.18*   10.13*

Table 4.3: The measured performance on a testing desktop. The amount of rays in combination with the maximum length of secondary rays is varied and the output is given in average FPS measured over 30 seconds. The maximum image source order is set to four. Output marked with * produced sound starvation.

4.4.2 Sound Objects and CPU Load

This section shows three tables in which the three testing variables, the maximum image source order, the maximum length of the secondary rays and the amount of secondary rays, are varied. However, in comparison to Table 4.1, Table 4.2 and Table 4.3, the tables in this section (Table 4.4, Table 4.5 and Table 4.6) illustrate the amount of sound objects registered in Wwise, the actively playing sound objects and the corresponding CPU load of the audio thread. The content of the tables is discussed in the next section. For a better overview, two values for the amount of secondary rays (25 and 81) and three values for the maximum length of the secondary rays (30, 60 and 100 meters) are chosen. 25 secondary rays with a maximum length of 60 meters are set for the user study.

max. order = 2; rows: max. length of secondary rays (m)
       25 rays emitted                            81 rays emitted
       reg. objects  active objects  CPU load     reg. objects  active objects  CPU load
 30         352            273        26.40%           916            837        94.90%
 60         610            531        51.50%          1831           1752       100.00%
100        1033            954        78.08%          2843           2764       100.00%

Table 4.4: The registered sound objects in Wwise (in total, including emitting and non-emitting sound sources), the active sound objects (that are actively emitting sound) and the corresponding CPU load in percent. The maximum image source order is set to two.

max. order = 3; rows: max. length of secondary rays (m)
       25 rays emitted                            81 rays emitted
       reg. objects  active objects  CPU load     reg. objects  active objects  CPU load
 30        1164            320        30.41%          1688            844        90.83%
 60        1375            531        50.34%          2566           1722       100.00%
100        1724            880        77.88%          3582           2738       100.00%

Table 4.5: The registered sound objects in Wwise (in total, including emitting and non-emitting sound sources), the active sound objects (that are actively emitting sound) and the corresponding CPU load in percent. The maximum image source order is set to three.

max. order = 4; rows: max. length of secondary rays (m)
       25 rays emitted                            81 rays emitted
       reg. objects  active objects  CPU load     reg. objects  active objects  CPU load
 30        7707            310        35.97%          8269            872        96.41%
 60        7959            562        58.23%          9076           1679       100.00%
100        8271            874        78.02%         10203           2806       100.00%

Table 4.6: The registered sound objects in Wwise (in total, including emitting and non-emitting sound sources), the active sound objects (that are actively emitting sound) and the corresponding CPU load in percent. The maximum image source order is set to four.

4.4.3 Discussion

This section discusses the values of Table 4.1, Table 4.2, Table 4.3, Table 4.4, Table 4.5 and Table 4.6.

Maximum image source order

The maximum image source order has a significant effect on the overall system performance: the higher the maximum image source order, the higher the computational cost of the model calculation. The performance loss can be observed by comparing the top left values (30 m maximum length, 4 rays emitted) of Tables 4.1, 4.2 and 4.3. When the maximum image source order is set to two, the system performance is at 169.87 FPS. When the value is increased by one, the performance drops by more than half to 73.59 FPS. When the maximum image source order is set to four, the calculated average is 36.04 FPS.

When the values of the other two parameters (length and amount of secondary rays) are set to their maximum, the performance loss is less pronounced. When the maximum image source order is set to two, 13.99 FPS are reached; when it is set to three, 13.21 FPS are possible; when it is increased to four, 10.13 FPS were measured. The reason for this can be seen in Tables 4.4, 4.5 and 4.6: when the maximum parameters are chosen, the CPU load is already at 100%, so increasing the maximum image source order does not affect the performance anymore. However, the higher the value of this parameter, the more sound objects are registered in Wwise. In comparison, when both parameters (number of rays and maximum length) are set to their maximum value, 2843 (Table 4.4) sound objects are registered when the maximum image source order is set to two, 3582 (Table 4.5) when it is set to three and 10203 (Table 4.6) when it is set to four.

Registered objects do not cost additional CPU performance. This can be seen by comparing the registered objects when the maximum length of secondary rays is set to 60 meters, the amount of rays to 25 and the maximum image source order to two and three (Table 4.4 and Table 4.5): the amount of active sound objects is the same, yet there are 610 registered objects with maximum image source order two and 1375 with maximum image source order three, while the difference in CPU load is marginal (51.50% compared to 50.34%).

Amount of rays emitted

The amount of rays emitted also affects system performance. When the parameter is increased, more rays are emitted from the original sound source to be reflected on surfaces. The difference in CPU load can be seen by comparing the load when the maximum length of secondary rays is set to 30 meters and the amount of rays emitted is increased from 25 to 81: in Table 4.4 the difference in CPU load is 68.5%, in Table 4.5 it is 60.42% and in Table 4.6 it is 60.44%. The reason for this difference is the increase in the number of sound sources that actually emit sound in the scene.

Maximum length of the secondary rays

The maximum length of secondary rays also affects overall system performance, because the larger amount of generated and especially active sound objects needs additional management and spatialization. The amount of registered sound objects increases with the maximum length of secondary rays, which affects CPU load, as shown in Table 4.4.
When the amount of rays emitted is set to 25 and the maximum length of secondary rays is increased, the CPU load also increases (see Table 4.4). With a maximum length of 30 meters, the CPU load is at 26.40%; when the value is increased to 60 meters, 51.50% of the CPU is used; and 78.08% is used when the parameter is set to 100 meters. This shows that the maximum length affects the CPU load. When a secondary ray has a larger maximum length, it can be reflected more often and therefore produces a larger amount of secondary sources. When the maximum length is set to 30 meters, 352 registered objects are known in Wwise; if the value is set to 60 meters, the amount of registered sound objects is 610. When the amount of rays emitted is set to 81, 916 objects are registered with a maximum length of 30 meters and 1831 with a maximum length of 60 meters. Comparing these two increases shows how the amount of rays emitted affects the amount of objects when the maximum length is varied: when more rays are emitted from the original sound source, the effect of varying the maximum length of secondary rays is stronger, because more rays can produce more sound objects when the maximum length is increased than fewer rays can.

4.4.4 Parameters for User Study

A suitable combination of the parameters maximum image source order, maximum length of secondary rays and amount of rays emitted depends on the chosen hardware (more computational power enables more sound source management and spatialization) and the scene geometry. For the user study, the following values were chosen. The maximum image source order was set to three; this value is recommended by [38]. Therefore, the first, second and third order image sources are generated. The maximum length of the secondary rays was set to 60 meters and the amount of rays emitted was set to 25. The resulting CPU load of 50.34% in the testing scene was found to be a suitable value that still leaves computational reserves for the Oculus Rift DK2 and tracking.

CHAPTER 5
Evaluation

The developed hybrid sound model is evaluated by comparing it to a simpler sound model that is usually used in audio games, the baseline model. The design of the models is described in detail in Section 3.4. For the purpose of comparison, an audio game prototype was developed. In this game, participants had to reach several sound source positions with auditory but no visual feedback, navigating by real walking in a large room. The proposed hypotheses, the study procedure and the findings, including a discussion, are described in this chapter.

5.1 Hypotheses

The assumption was that the hybrid sound model reduces the difficulty of the tasks (H1 - H4) and contributes to a higher sense of realism (H5) and immersion (H6) compared to the baseline model. Therefore, the following hypotheses were formulated and investigated:

H1: The completion time in the hybrid model is lower than the completion time in the baseline model.
H2: The difficulty of finding sound sources is higher with the baseline model than with the hybrid model.
H3: Users are likely to better recognize virtual obstacles in a game using the hybrid model than in a game using the baseline model.
H4: Users can reconstruct the virtual geometry of a game better with the hybrid model than with the baseline model.
H5: The hybrid model sounds more realistic than the baseline model.
H6: Users playing a game with the hybrid model will experience a higher degree of immersion than those playing a game with the baseline model.

5.2 Game Prototype

The game prototype was developed to compare the implemented sound models. The localization of sound sources in a virtual room is the central element of this game prototype. The tasks of this audio game prototype are therefore to reach sound source locations in virtual space by moving and listening to sound in the real world. When a sound source location is reached, the next sound source starts playing. The game is over when all sound source positions have been reached by the player. To avoid any cheating, players have to fulfill the tasks blindfolded.

There are three levels in this audio game prototype that a player has to complete. The first level is an introductory virtual scene where a player can get comfortable with the virtual reality and the sound in it. In this level, the player is able to see the scene. As soon as the player feels comfortable, he or she starts the real experiment by stepping on a button on the floor (see the next section for a detailed description). In levels two and three, the player has to fulfill the tasks blindfolded. The tasks in both of these levels are the same. In the beginning, the player hears a ringing phone. The task is to reach the position of the ringing phone by moving around in the tracked area and listening to the audio. Every position and rotation change of the player is taken into account for real-time sound spatialization. When the position of the phone is reached, the player receives further instructions about the situation he or she is in: the voice on the telephone says that it has blindfolded the player and that the player has to find three bombs in this room before they explode. After the phone call has ended, the first bomb starts ticking. The task now is to find the location of the bomb by listening to its ticking. Only one bomb emits sound at a time. When the location of the first bomb is reached, the second bomb starts ticking. As soon as the player has reached all bomb locations, the level is completed.

Although there are virtual walls in the game, the player is able to go through them. There is no game feedback that forbids a player to continue moving through a virtual wall. This design allows comparing whether the hybrid model supports the player in avoiding obstacles or not. In the next section, the three levels are described in detail.

5.2.1 Levels

The game prototype that served as the basis for the user study consisted of three different levels.

Introductory Scene

The first level (Introductory Scene) is for experimenting and getting comfortable with the virtual reality and the sound in it. This is the first room the participants were in before actually testing the first audio game. In the room, there are three walls (excluding the walls at the border of the room), a water tap and a red button. In this level, participants are able to see the geometry and the sound source. The game area is 20 m x 5.2 m in size. The water tap is placed next to the center of the room. It is partially enclosed by two walls so that a player can move around the water tap to perceive sound changes when the sound source is obstructed. A third wall is placed next to it. The red button in the southeast corner starts the experiment.

Figure 5.1: A screenshot of the Introductory Scene. The water tap is enclosed by walls, while the red button in the corner starts the audio game.
Figure 5.1 shows this room. The water tap periodically emits the sound of dripping water; it is the sound origin of the scene. The red button on the floor triggers the event that starts the experiment.

Simple Room

The second level (shown in Figure 5.2), which is referred to as the simple room, has one wall and four sound sources in it. The turquoise cube marks the position of the phone, the pink cube the position of the first bomb, the gray cube the position of the second and the red cube the position of the third bomb. The starting position is marked with an X. The last bomb is behind a wall and therefore obstructed.

Figure 5.2: A screenshot of the simple room. The turquoise cube marks the position of the phone, the pink cube the position of the first bomb, the gray cube the position of the second and the red cube the position of the third bomb. However, the participant is not able to see them during the test.

Complex Room

The third level (shown in Figure 5.3), which is referred to as the complex room, has five walls and four sound sources in it. The order of the sound sources is the same as in the simple room (turquoise, pink, gray, red). The starting position is marked with an X.

Figure 5.3: A screenshot of the complex room. The turquoise cube marks the position of the phone, the pink cube the position of the first bomb, the gray cube the position of the second and the red cube the position of the third bomb.

5.3 Study Design

The study design is a between-subject experiment. Half of the participants played the game with the baseline model used to simulate sound, the other half had the hybrid model assigned. In this section, the hardware setup, the procedure and the measurements are described. The user study was performed during spring 2015.

5.3.1 Hardware Setup

The game was tested in a 30 m x 7 m tracked open space. The participants wore an Oculus Rift HMD (DK2) and 7.1 surround sound headphones (Logitech G35). Players' positions were tracked with a wide-area optical tracking system. The VR system had been developed in-house and has not been published yet. The prototype ran on a laptop with an Intel Core i7 processor and an Nvidia GeForce GTX 980M graphics card. The laptop was attached to a backpack frame and carried by participants on their backs.

5.3.2 Participants and Procedure

Participants were invited through a social media channel to attend this study. 37 people accepted this invitation; 20 of them were male and 17 female. The social media channel was a public Facebook group named "Foreigners in Vienna"; therefore, the cultures, nationalities and backgrounds were very diverse. Participants received some sweets as compensation for their time at the end of the study. The following paragraphs describe the procedure every participant went through.

After the greeting, the participants had to fill in two forms (a Non-Disclosure Agreement and a Health-Condition Agreement) and received instructions for the test. First, they had to fill in the Pre-Test-Questionnaire. After that, audio games as well as the hardware equipment were described. Then they put on the hardware and started the game in the Introductory Scene, where they were able to see and hear a dripping water tap to get comfortable with the sound in the VE, depending on the model they had been assigned.

Figure 5.4: A photo taken during the experiment. The participant wears the backpack, Oculus Rift and headphones.
As soon as they felt comfortable with the sound and the virtual reality, they were told to go to the red button to start the first experiment. When they hit the button, the screen turned black and a virtual phone started to ring. In the simple room, the participants had to fulfill the tasks as described above (see Section 5.2). After all bombs had been defused, the participants were able to see the first room they had been in at the beginning and were told to remove the hardware and go back to the test coordinator. Then they had to fill in the Each-Test-Questionnaire, where they had to draw the virtual scene they had been in (see Section 5.3.3.1 for an explanation of the drawing task). After they had filled in this questionnaire, they put the hardware on again and started looking for the red button. This triggered the start of the third audio game level (complex room). The complex room model was similar to the simple room model, but the room geometry and the sound source positions were different. As soon as they had defused all bombs in this level, they were able to see everything again and go back to the test coordinator so that the hardware could be removed. The participants then filled in the second Each-Test-Questionnaire, followed by the Post-Test-Questionnaire. After completing the Post-Test-Questionnaire, the participants had the option to talk about their experience, and the complex room with visible sound sources was shown to them.

5.3.3 Measurements

Subjective data in the form of questionnaires and objective data obtained from direct measurements were used to test the formulated hypotheses (see Section 5.1). These types of measurements are discussed in this section.

5.3.3.1 Subjective Measurements: Questionnaires

Four different questionnaires were used for the user test. They can be divided into three types. All questionnaires can be found in the appendix of this thesis (see Section A). Answers to most of the questions were given on a Likert scale in the range from 1 to 7.

Pre-Test-Questionnaire: The participants received this questionnaire before the user test. Questions regarding age, gender and experience with computers, VR and audio games were asked.

Each-Test-Questionnaire: The participants received this questionnaire immediately after each audio game (two in total). The task here was to draw the room and to assign the correct order to the sound sources, including the positions of the walls. They were asked to rate the difficulty of the tasks and how sure they were that their solution was correct.

Post-Test-Questionnaire: It consisted of a simulator sickness questionnaire (SSQ), questions regarding the locations of sound sources, and questions about immersion as well as realism.

The Pre-Test-Questionnaire was intended to obtain answers about the participants' prior experience. Gender and age were used to find possible differences in terms of e.g. completion time. The questions concerning experience with computers, VEs, computer games and audio games, and whether they play an instrument or make music, were used to get information about the participants' level of experience. The question concerning musical experience in particular was used to find differences between the groups that answered with "I have experience" or "I have no experience."

The Each-Test-Questionnaire consisted of two parts. First, the participants had to draw the virtual room they had been in into a map of the real environment, including the positions of the bombs and the phone.
After that, they had to fill in how difficult it was to draw that room as well as how confident they were that their solution was correct. In the second part, the border of the room was already drawn into the map, including the positions of the sound sources. The participants had to bring them into the correct order and draw the walls. Afterwards, the questions regarding difficulty and correctness were asked again, with an additional question about the difficulty of drawing the walls. This questionnaire existed in two versions, one for the simple room and one for the complex room; different versions were necessary because the sound sources were placed at different positions. The drawings as well as these questions were used for the investigation of H4.

The Post-Test-Questionnaire started with the Simulator Sickness Questionnaire (SSQ). Normally, participants of a study using the SSQ have to fill in one questionnaire at the beginning and one at the end of the study; in this thesis, however, only the post-study SSQ was used. SSQs are normally used to test interactive software that generates visual output, as visual disturbances usually contribute to simulator sickness. Elevated levels of sickness symptoms were not expected, because audio games do not have any visuals at all. The SSQ in this user study was used out of curiosity and to possibly provide a comparison for other studies.

After the participants had filled in the SSQ, they were asked whether they found the Introductory Scene useful. The following questions were asked twice in the Post-Test-Questionnaire, once for each room. The first six questions were related to immersion (H6). Participants were asked how aware they were of events occurring around them and how realistic their sense of moving in the virtual world was; the latter question was asked to evaluate the effects of real-time spatialization on immersion. The participants were also asked whether they felt confused or disorientated during or at the end of the session. This could have happened if a participant was not used to walking around blindfolded, which could decrease the degree of immersion. The last two questions were about how involved they were and how quickly they adjusted to the VE experience.

After the questions about immersion, four questions about the difficulty (H2) of finding sound sources in the 3D environment were asked: the difficulty of finding the ringing phone and the first, second and third bomb. These questions were used to assess the difficulty of finding sound sources when the different sound models were used. After that, participants were asked how much they liked the sound in general and how realistic it seemed to them. These questions were used to assess the hypothesis concerning realism (H5). The last questions concerned perceived sound reflections, their realism and their helpfulness. These questions were asked to evaluate whether the participants were able to recognize sound reflections correctly and how helpful they found them for task fulfillment.

5.3.3.2 Objective Measurements: Logging

During the experiment, events and the positions of the player were logged in Unity3D. An event is a situation in which the player reaches a game state or hits an object. A game state is reached when the audio game begins (Game start) and when the audio game is completed (Game over). A logged event consists of a timestamp and the corresponding event text.
When the player reached the position of a sound source (phone, first, second and third bomb) or hit a wall, this was also logged as an event. Since collisions between the player and walls are logged, it is possible to calculate the hit-wall metric, which defines how often a participant went through virtual walls. An example of an event log is shown below.

eventLog.txt
0.000 - ============ Game start ============
0.004 - Starting game of player with ID: 7259
0.005 - Introscene - no logging. COMPLEX Sound Model.
0.000 - ============ Game start ============
0.000 - Starting game of player with ID: 7259
0.001 - SIMPLE Room Model with selected COMPLEX Sound Model.
15.582 - Reached status: Phone
46.568 - Reached status: Bomb1
64.876 - Reached status: Bomb2
97.746 - Reached status: Bomb3
97.746 - ============ Game over ============
0.000 - ============ Game start ============
0.000 - Starting game of player with ID: 7259
0.001 - COMPLEX Room Model with selected COMPLEX Sound Model.
15.621 - Reached status: Phone
49.099 - Reached status: Bomb1
86.317 - Hit Wall WallInRoomBottomTopPhys
87.571 - Reached status: Bomb2
101.191 - Hit Wall WallInRoomBottomTopPhys
113.117 - Hit Wall WallInRoomMiddlePhys
121.359 - Reached status: Bomb3
121.359 - ============ Game over ============

The position of the player was logged periodically every 500 ms as a 3D vector. With this information it is possible to calculate the covered path and the distance traveled by a player. Below is an example of the position log.

positionLog.txt
0.001 - SIMPLE Room Model with selected COMPLEX Sound Model.
-2.9, 1.5, 6.0
-3.1, 1.5, 5.9
-3.1, 1.7, 5.8
-3.0, 1.8, 5.8
-3.1, 1.8, 5.8
...
0.001 - COMPLEX Room Model with selected COMPLEX Sound Model.
-3.3, 1.8, 5.5
-2.8, 1.6, 5.7
...

With the logged event information (timestamp and event text) it is possible to calculate the completion time and the time participants needed between the tasks. This information was used to assess H1. The hit-wall rate was used to investigate H3: the more often participants went through virtual walls, the less the assigned sound model supported them in avoiding obstacles.

5.4 Results

The results of the objective and subjective measurements of 34 participants were analyzed and are discussed in this section. The statistical software SPSS was used for the statistical analysis.

5.4.1 Difficulty

The participants with the assigned hybrid model completed the game faster in both the simple and the complex room. The mean values and standard deviations of the completion times are listed in Table 5.1. However, the difference between the mean completion times was not found to be statistically significant in either room. Therefore, there is not enough evidence to support H1.

Completion Time (s)           Mean       σ
Simple Room     Baseline    202.60   88.38
                Hybrid      180.00   72.02
Complex Room    Baseline    204.46  160.37
                Hybrid      178.26   53.09

Table 5.1: Mean values and standard deviations of the completion time.

Participants were asked to rate how quickly they adjusted to the VE experience on a scale from 1 to 7, where 1 corresponded to "very slowly" and 7 to "very quickly." According to the resulting scores, the participants adjusted to the hybrid sound model more quickly than to the baseline model. The results are listed in Table 5.2. The difference is statistically significant for the simple room (p = 0.03 in Student's t-test, α = 0.05); for the complex room, the difference narrowly missed significance (p = 0.08 in Student's t-test, α = 0.05).
The higher subjective adjustment scores are in accordance with the shorter game completion times for the hybrid model.

Adjustment to VE Experience    Mean     σ
Simple Room     Baseline       4.97  1.40
                Hybrid         5.94  1.10
Complex Room    Baseline       5.77  1.16
                Hybrid         6.41  0.89

Table 5.2: Mean values and standard deviations of the scores obtained for the questions about adjustment to the VE experience. 1 corresponds to "adjusted very slowly" and 7 to "adjusted very quickly."

The participants were asked to rate the difficulty of locating the sound sources in the VE. The results are shown in Table 5.3. There is little difference in the mean scores for each sound source when compared between the two sound models. H2 is therefore not supported.

Sound Source                P     σ     B1     σ     B2     σ     B3     σ
Simple Room   Baseline   3.24  1.65   3.71  1.92   3.85  1.59   3.50  1.80
              Hybrid     3.35  1.51   3.71  1.71   3.68  1.65   4.24  1.78
Complex Room  Baseline   2.77  1.72   2.59  1.63   3.44  1.85   4.12  1.79
              Hybrid     2.82  1.92   2.41  1.43   3.18  1.46   4.29  1.85

Table 5.3: Mean values and standard deviations of the resulting scores for the questions about the difficulty of finding the phone (P) and the three bombs (Bn). Answers were given on a scale from 1 to 7, where 1 meant "very easy" and 7 "very difficult."

The hit-wall rate measures how often a participant went through a virtual wall. The mean values, standard deviations, minima and maxima are listed in Table 5.4. In the simple room, the mean hit-wall rate was 1.06 with the baseline model and only 0.29 with the hybrid model. This difference is statistically significant (p = 0.00 in a Kruskal-Wallis test, α = 0.05) and supports H3 in the simple room. In fact, 13 out of 17 participants found all sound sources in the simple room with the hybrid model without hitting a wall, while only one out of 17 participants with the assigned baseline model did not hit a wall. In the complex room, the result was not statistically significant and therefore does not support H3 in this room. However, the mean hit-wall rate in the hybrid model is lower, and the maximum hit-wall rate of the baseline model in the complex room is almost twice as large as that of the hybrid model (baseline model: 11; hybrid model: 6).

Hit-Wall Rate                 Mean     σ   Min   Max
Simple Room     Baseline      1.06  0.42     0     2
                Hybrid        0.29  0.46     0     1
Complex Room    Baseline      4.59  2.47     3    11
                Hybrid        3.65  1.53     2     6

Table 5.4: Mean values, standard deviations, minima and maxima of the measured hit-wall rate (number of times a participant went through a virtual wall during an experiment).

None of the participants were able to reconstruct the virtual room correctly without given borders. An example of a participant's drawing with an overlay of the correct solution is shown in Figure 5.5. The start position was at the bottom right corner. The numbers indicate the positions of the sound sources, marked from 1 (phone) to 4 (third bomb). Participants were not able to mark the position of the last bomb correctly, although they removed the head-mounted display and the headphones at this position.

Only four participants reconstructed and ordered the sound sources of the simple room correctly. Eight could reconstruct parts of the complex room (no one was able to recreate it entirely), while only six assigned the order of the sound sources correctly. Two of them assigned the order of the sound sources correctly in both rooms. The corresponding numbers are shown in Table 5.5. The remaining 28 participants could not reconstruct the order in which they discovered the sound sources. H4 is not supported.
The results show that the reconstruction of a VE which is only audible but not visible is generally a difficult task. During the brief talks after the session, several participants reported that they had experienced a sudden change in loudness when they went through a wall but were not able to identify it as a wall during the test. This change was caused by a different outcome of the sound model in front of and behind the wall. In the baseline model, it is caused by the sound difference in the obstruction calculation; in the hybrid model, the differences in obstruction and, additionally, the different calculation of sound reverberation in front of and behind a wall are responsible for this effect. This shows that the participants were not familiar with sound behavior and did not think about possible obstacles in the VE. However, the hybrid model provides more precise guidance to the sound source in the case of a simple VE with obstacles (simple room), as demonstrated by the partial confirmation of H3.

Figure 5.5: An exemplary drawing of a participant, who had to draw the virtual room without given borders. The red overlay shows the correct solution.

Correctly ordered Sound Sources     0    1    2    4
Simple Room     Baseline            2    7    7    1
                Hybrid              4    7    3    3
Complex Room    Baseline            2   10    3    2
                Hybrid              1    5    7    4

Table 5.5: Number of participants who ordered n (0, 1, 2 or 4) sound sources correctly in the Each-Test-Questionnaires.

5.4.2 Realism and Immersion

This section describes the subjective results obtained from the evaluation of the questionnaires regarding realism and immersion for the confirmation or rejection of H5 and H6.

Subjective scores for sound realism were high for both models and are shown in Table 5.6. This result indicates that the spatialization and obstruction in the baseline model and the spatialization, obstruction and reverberation in the hybrid model provide realistic sound output but do not differ in perceived realism when compared. This could have two reasons: on the one hand, it could indicate that participants are not very aware of sound phenomena; on the other hand, it could indicate that the reverberation in the proposed hybrid sound model was not convincing enough. The baseline model, however, does not have any reverberation and should therefore be a less realistic sound model compared to a model with reverberation. The mean values of the baseline model are equal for both rooms (5.41 out of 7) and high. The mean values of the hybrid model are lower (5.06 in the simple room, 4.68 in the complex room) compared to the baseline model. The marginal difference could be caused by unfamiliarity with computer-generated sound. The subjective scores do not differ significantly between the two sound models. Therefore, H5 is not supported.

Realism of Sound              Mean     σ
Simple Room     Baseline      5.41  1.49
                Hybrid        5.06  1.46
Complex Room    Baseline      5.41  1.47
                Hybrid        4.68  1.54

Table 5.6: Mean values and standard deviations of the resulting scores for the questions about sound realism. Answers were given on a scale from 1 to 7, where 1 meant "not realistic at all" and 7 "very realistic."

The participants were asked whether they had experienced sound reflections or not. 59% of the users who had the baseline model (without reflections) assigned stated that they had heard reflections, 24% had not heard any reflections and 17% were unsure. The corresponding numbers for the hybrid model (with modeled reflections) are 35%, 29% and 36%. The values are listed in Table 5.7.
The confusion about sound reflections might be attributed to the fact that users are not trained in listening for specific sound effects. The simulated reverberation might also not be perceived as a real one, although the scores for the question regarding realism were high (see Table 5.6). A study with a within-subject design might show the difference in perceived sound realism between the models; in such a study, participants could be asked which sound model they find more realistic or which one they would prefer.

Reflections heard               No    Not Sure    Yes
Assigned Model    Baseline     24%         17%    59%
                  Hybrid       29%         36%    35%

Table 5.7: Sound reflections heard by the participants playing the game with the assigned baseline and hybrid model.

The participants were asked how aware they were of events occurring in the real world during the test and how involved they were in the VE experience. These questions were used to assess the users' immersion. The answers indicate low awareness of real events (see Table 5.8) and high involvement in the VE experience (see Table 5.9). Real events occurring around a player during the experiment could have been footsteps or the presence of the test coordinator, who was always next to the participants. However, as the scores indicate, participants were not aware of real events around them; all scores are smaller than three (see Table 5.8).

The tested prototype provides a highly immersive audio game experience independent of the used sound model. The mean values (as shown in Table 5.9) of both sound models in both rooms are larger than five, indicating a high degree of immersion; the mean value of the hybrid sound model in the complex room is even larger than six. However, the mean scores regarding awareness and involvement do not differ between the two sound models. Therefore, H6 is not supported.

Awareness of Real Events      Mean     σ
Simple Room     Baseline      2.88  1.68
                Hybrid        2.29  1.35
Complex Room    Baseline      2.94  1.65
                Hybrid        2.18  1.20

Table 5.8: Mean values and standard deviations of the resulting scores for the questions about awareness of events occurring during the test session. Answers were given on a scale from 1 to 7, where 1 meant "not aware at all" and 7 "very aware."

Involvement                   Mean     σ
Simple Room     Baseline      5.94  1.04
                Hybrid        5.47  1.56
Complex Room    Baseline      5.94  1.10
                Hybrid        6.06  1.13

Table 5.9: Mean values and standard deviations of the resulting scores for the questions about involvement during the test session. Answers were given on a scale from 1 to 7, where 1 meant "not involved at all" and 7 "very much involved."

In summary, the completion time with the hybrid sound model was shorter compared to the baseline model, but the difference was not found to be significant. Participants adjusted to the hybrid sound model faster than to the baseline sound model in the simple room. However, no difference in the difficulty of finding sound sources was found when comparing the subjective scores of the participants. The complex sound model supported the players in avoiding obstacles in simple room structures. The reconstruction of a virtual room by hearing alone is a difficult task. The sound of both models was rated as very realistic; however, participants were not able to correctly identify the missing reflections in the baseline model or the reflections in the hybrid model. The degree of immersion was found to be high for both tested sound models.

5.4.3 Observations

This section describes observations made by the test coordinator during the user study.
Participants' strategy

The optimal strategy for fulfilling the tasks in the proposed audio game would be that a participant first localizes the sound source direction by moving the head (dynamic sound localization, see Section 2.1.2) and then walks towards the sound source position. The nearer the participant gets to the sound source position, the louder the sound is. Participants used two strategies for finding the sound sources, which are described in the next paragraphs.

One strategy was to localize the sound source position by trying to walk in different directions. Participants who followed this strategy completed the game quickly, but they had problems finding the exact position that was necessary to fulfill a task. The other strategy was more cautious: participants who followed it moved slowly and stopped when they were not sure about the sound source direction. This strategy was slower, but these participants did not have problems finding the exact sound source location, in contrast to the first strategy.

With both strategies, the participants usually moved forward. However, two participants also moved backward. There are two possible reasons for this: on the one hand, they could have just passed a sound source position and wanted to go back to determine whether they had passed it; on the other hand, they could have gone through a wall and wanted to experience the significant sound differences again by moving back and forth.

Reactions

Participants reacted differently when they reached a sound source position. Some of them said "There it is" or "Oh, here." Others just moved on to the next task without any further reaction.

When the participants received the Each-Test-Questionnaire for the first time, after they had completed the first audio game, they smiled or laughed. The task in this questionnaire was to draw the virtual room they had been in, and the reason for this reaction was that they did not expect this task at all. Drawing a virtual room that is only audible is a difficult task, as shown in Section 5.4.1.

Participants also reacted differently to the virtual instructions. The phone voice was modulated so that it sounded like a kidnapper requesting money. This was intended to produce a more intense and thrilling feeling in this audio game. When the phone call started, participants also smiled or were a bit afraid.

At the end of the session, the test coordinator showed the complex room model to the participants. The majority were surprised that the room had such a complex structure.

Two participants showed indications of fear. One of them folded her arms and was apparently relieved when she completed the game; the other talked to herself during the experiment to calm herself down and repeated "Oh God, oh God, oh God." A reason for this could be the scary shadow syndrome described above, where participants imagine themselves in a horrible situation that makes them afraid [20]. However, no participant was harmed during the experiment.

CHAPTER 6
Conclusion

Audio games that do not have any visual output at all provide players with an immersive game experience through simulated sound alone. These sounds have to be compelling, otherwise the desired degree of immersion in the game will not be reachable. In this thesis, a real-time implementation of a hybrid sound model that calculates VE-geometry-dependent reverberation in an audio game is presented.
CHAPTER 6

Conclusion

Audio games that do not have any visual output at all provide players with an immersive game experience solely by simulating sounds. These sounds have to be compelling, otherwise the desired degree of immersion in the game cannot be reached. In this thesis, a real-time implementation of a hybrid sound model that calculates VE geometry-dependent reverberation in an audio game is presented. The hybrid model is implemented in the Unity3D game engine using Wwise with the AstoundSound plug-in for real-time sound spatialization. AstoundSound was compared to other sound spatialization tools and found to be the most appropriate one. The implemented hybrid sound model is compared to the baseline sound propagation model typically used in audio games. This comparison is conducted in a user study in which the participants had to play a 3D audio game while walking blindfolded in a large tracked space.

The results of the study indicate that both tested models provide a highly immersive game experience. The proposed hybrid approach provides players with more information about the VE. Although the differences in completion time were not significant, participants who had the hybrid sound model assigned completed the game tasks faster. In the implemented audio game, the participants with the hybrid sound model adjusted to the VE faster than the participants with the assigned baseline sound model.

Participants had to reconstruct the room and the positions of the perceived sound sources. As the results show, the reconstruction of a virtual, non-visual scene is a difficult task. Aspects influencing this difficulty will be investigated in further studies with sound models and audio games.

It can be concluded that a complex sound model supports the player in avoiding obstacles in simple room geometries. It is advisable for game designers and developers to integrate a complex sound model into their games, as long as the computational power is available and the player has to avoid obstacles.

In future work, the used sound model may be implemented with GPU-based methods. This model can then be compared to more accurate sound models. GPU-based calculations offer higher performance; however, a native implementation instead of an existing audio middleware solution might be the more performant option, especially when a large number of sound sources has to be calculated in real time.

Author's comment: When I started investigating the domain of audio games, I was really surprised that only a few scientists are doing research in this field. I hope that the scientific community will pay more attention to audio games, because, in my humble opinion, people without visual impairments rely too much on their visual sense and forget that they still have other senses.

Bibliography

[1] Fundamentals of Telephone Communication Systems. Western Electric Company, 1969.
[2] Ernest Adams. Fundamentals of Game Design. New Riders Publishing, Thousand Oaks, CA, USA, 2nd edition, 2009.
[3] Audiokinetic, Inc. Customers. https://www.audiokinetic.com/community/customers/, 2015. [Online; accessed 25-August-2015].
[4] Audiokinetic, Inc. Wwise fundamentals. https://www.audiokinetic.com/download/documents/Wwise_Fundamentals.pdf, 2015. [Online; accessed 25-August-2015].
[5] Audiokinetic, Inc. Wwise Help 2014.1 build 5158 – Capturing data from the sound engine, 2015. [Online; accessed 25-August-2015].
[6] Auro Technologies. Technology. http://www.auro-3d.com/consumer/technology/, 2015. [Online; accessed 25-August-2015].
[7] Corey I. Cheng and Gregory H. Wakefield. Introduction to head-related transfer functions (HRTFs): Representations of HRTFs in time, frequency, and space. In Audio Engineering Society Convention 107. Audio Engineering Society, 1999.
[8] Erin C. Connors. Audio-based virtual environments: Teaching spatial navigation skills to the blind.
[9] T. Drewes, E. Mynatt, Maribeth Gandy, et al.
Sleuth: An audio experience. In Proceedings of the International Conference on Auditory Display, 2000.
[10] Angelo Farina. RAMSETE – a new pyramid tracer for medium and large scale acoustic problems. In Proc. of Euro-Noise, volume 95, 1995.
[11] Maria Fellner and Robert Höldrich. Physiologische und psychoakustische Grundlagen des räumlichen Hörens. IEM-Report 03/98, 1998.
[12] Johnny Friberg and Dan Gärdenfors. Audio games: New perspectives on game audio. In Proceedings of the 2004 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, pages 148–154. ACM, 2004.
[13] Bill Gardner, Keith Martin, et al. HRTF measurements of a KEMAR dummy-head microphone. 1994.
[14] Lalya Gaye. A flexible 3D sound system for interactive applications. In CHI'02 Extended Abstracts on Human Factors in Computing Systems, pages 840–841. ACM, 2002.
[15] GenAudio, Inc. Our products. http://www.astoundholdings.com/products.php, 2015. [Online; accessed 25-August-2015].
[16] W. M. Hartmann. Principles of Musical Acoustics. Undergraduate Lecture Notes in Physics. Springer, 2013.
[17] Florian Heller, Thomas Knott, Malte Weiss, and Jan Borchers. Multi-user interaction in virtual audio spaces. In CHI'09 Extended Abstracts on Human Factors in Computing Systems, pages 4489–4494. ACM, 2009.
[18] Thomas Hermann, Andy Hunt, and John G. Neuhoff. Introduction. In Thomas Hermann, Andy Hunt, and John G. Neuhoff, editors, The Sonification Handbook, chapter 1, pages 1–6. Logos Publishing House, Berlin, Germany, 2011.
[19] R. C. Jaiswal. Audio-Video Engineering. Nirali Prakashan.
[20] Mats Liljedahl, Nigel Papworth, and Stefan Lindberg. Beowulf: An audio mostly game. In Proceedings of the International Conference on Advances in Computer Entertainment Technology, pages 200–203. ACM, 2007.
[21] Mauricio Lumbreras and Gustavo Rossi. A metaphor for the visually impaired: Browsing information in a 3D auditory environment. In Conference Companion on Human Factors in Computing Systems, pages 216–217. ACM, 1995.
[22] Mauricio Lumbreras, J. Sánchez, and M. Barcia. A 3D sound hypermedial system for the blind. In Proceedings of the First European Conference on Disability, Virtual Reality and Associated Technologies, pages 187–191, 1996.
[23] Richard H. Lyon. Theory and Application of Statistical Energy Analysis. Elsevier, 2014.
[24] Matija Marolt. A new approach to HRTF audio spatialization. In Proceedings of the International Computer Music Conference, pages 365–367. Citeseer, 1996.
[25] Christian Maschke and André Jakob. Psychoakustische Messtechnik. In Michael Möser, editor, Messtechnik der Akustik, chapter 11, pages 599–642. Springer-Verlag Berlin Heidelberg, Berlin, Germany, 2010.
[26] F. P. Mechel. Improved mirror source method in room acoustics. Journal of Sound and Vibration, 256(5):873–940, 2002.
[27] Ravish Mehra, Atul Rungta, Abhinav Golas, Ming Lin, and Dinesh Manocha. WAVE: Interactive wave-based sound propagation for virtual environments. IEEE Transactions on Visualization and Computer Graphics, 21(4):434–442, 2015.
[28] Philip Mendels and Joep Frens. The audio adventurer: Design of a portable audio adventure game. In Fun and Games, pages 46–58. Springer, 2008.
[29] Lotfi B. Merabet, Erin C. Connors, Mark A. Halko, and Jaime Sánchez. Teaching the blind to find their way by playing video games. PLoS ONE, 7(9):e44958, 2012.
[30] Matjaz Mihelj, Domen Novak, and Samo Begus. Virtual Reality Technology and Applications. Springer Publishing Company, Incorporated, 2013.
[31] M. Taylor, A. Chandak, Q. Mo, C. Lauterbach, C. Schissler, and D. Manocha. iSound: Interactive GPU-based sound auralization in dynamic scenes.
[32] G. M. Naylor. Treatment of early and late reflections in a hybrid computer model for room acoustics. In 124th ASA Meeting, 1992.
[33] Graham M. Naylor. ODEON – another hybrid room acoustical model. Applied Acoustics, 38(2):131–143, 1993.
[34] Oculus VR, LLC. 3D audio spatialization. https://developer.oculus.com/documentation/audiosdk/latest/concepts/audio-intro-spatialization/, 2015. [Online; accessed 25-August-2015].
[35] Odeon A/S. ODEON user manual version 13. http://www.odeon.dk/download/Version13/ODEONManual.pdf, 2015. [Online; accessed 25-August-2015].
[36] Vojin G. Oklobdzija. The Computer Engineering Handbook: Electrical Engineering Handbook. CRC Press, Inc., Boca Raton, FL, USA, 2001.
[37] Irwin Pollack and J. M. Pickett. Cocktail party effect. The Journal of the Acoustical Society of America, 29(11):1262–1262, 1957.
[38] Jens Holger Rindel. The use of computer modeling in room acoustics. Journal of Vibroengineering, 3(4):41–72, 2000.
[39] Niklas Röber and Maic Masuch. Interacting with sound: An interaction paradigm for virtual auditory worlds. In ICAD, 2004.
[40] Niklas Röber and Maic Masuch. Leaving the screen: New perspectives in audio-only gaming. In 11th Int. Conf. on Auditory Display (ICAD). Citeseer, 2005.
[41] Joseph D. Rogers and Marc E. Rogers. Three-dimensional virtual environment, October 23, 2013. US Patent App. 14/061,711.
[42] Francis Rumsey and Tim McCormick. Sound and Recording. 2014.
[43] Jaime Sánchez, Nelson Baloian, Tiago Hassler, and Ulrich Hoppe. AudioBattleship: Blind learners collaboration through sound. In CHI'03 Extended Abstracts on Human Factors in Computing Systems, pages 798–799. ACM, 2003.
[44] Jaime Sánchez, Marcia de Borba Campos, Matías Espinoza, and Lotfi B. Merabet. Audio haptic videogaming for developing wayfinding skills in learners who are blind. In Proceedings of the 19th International Conference on Intelligent User Interfaces, pages 199–208. ACM, 2014.
[45] Jaime Sánchez and Mauricio Lumbreras. Virtual environment interaction through 3D audio by blind children. CyberPsychology & Behavior, 2(2):101–111, 1999.
[46] Lauri Savioja, Jyri Huopaniemi, Tapio Lokki, and Riitta Väänänen. Creating interactive virtual acoustic environments. Journal of the Audio Engineering Society, 47(9):675–705, 1999.
[47] M. Schroeder, Thomas D. Rossing, F. Dunn, W. M. Hartmann, D. M. Campbell, and N. H. Fletcher. Springer Handbook of Acoustics. Springer Publishing Company, Incorporated, 1st edition, 2007.
[48] E. A. G. Shaw. External ear response and sound localization. Localization of Sound: Theory and Applications, pages 30–41, 1982.
[49] Samuel Siltanen, Tapio Lokki, and Lauri Savioja. Rays or waves? Understanding the strengths and weaknesses of computational room acoustics modeling techniques. In Proc. Int. Symposium on Room Acoustics, 2010.
[50] Mel Slater, Martin Usoh, and Anthony Steed. Taking steps: The influence of a walking technique on presence in virtual reality. ACM Transactions on Computer-Human Interaction, 2(3):201–219, September 1995.
[51] Sue Targett and Mikael Fernström. Audio games: Fun for all? All for fun! In ICAD, 2003.
[52] Micah T. Taylor, Anish Chandak, Lakulish Antani, and Dinesh Manocha. RESound: Interactive sound rendering for dynamic virtual environments. In Proceedings of the 17th ACM International Conference on Multimedia, pages 271–280. ACM, 2009.
[53] Unity Technologies. Audio overview.
http://docs.unity3d.com/Manual/AudioOverview.html, 2015. [Online; accessed 25-August-2015].
[54] Unity Technologies. Unity. https://unity3d.com/, 2015. [Online; accessed 25-August-2015].
[55] Eric Velleman, Richard van Tol, Sander Huiberts, and Hugo Verwey. 3D shooting games, multimodal games, sound games and more working examples of the future of games for the blind. Springer, 2004.
[56] Michael Vorländer. Auralization: Fundamentals of Acoustics, Modelling, Simulation, Algorithms and Acoustic Virtual Reality. Springer Science & Business Media, 2007.
[57] Hans Wallach, Edwin B. Newman, and Mark R. Rosenzweig. A precedence effect in sound localization. The Journal of the Acoustical Society of America, 21(4):468–468, 1949.
[58] Thomas Westin. Game accessibility case study: Terraformers – a real-time 3D graphic game. In Proceedings of the 5th International Conference on Disability, Virtual Reality and Associated Technologies, ICDVRAT, pages 95–100, 2004.
[59] G. White and G. J. Louie. The Audio Dictionary: Third Edition, Revised and Expanded. University of Washington Press, 2005.
[60] Wikipedia. Audio game — Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Audio_game, 2015. [Online; accessed 25-August-2015].
[61] Fredrik Winberg and Sten Olof Hellström. Investigating auditory direct manipulation: Sonifying the Towers of Hanoi. In CHI'00 Extended Abstracts on Human Factors in Computing Systems, pages 281–282. ACM, 2000.
[62] John Wood, Mark Magennis, Elena Francisca Cano Arias, Teresa Gutierrez, Helen Graupp, and Massimo Bergamasco. The design and evaluation of a computer game for the blind in the GRAB haptic audio virtual environment. Proceedings of Eurohaptics, 2003.

APPENDIX A

Questionnaires

A.1 Pre-Test-Questionnaire

Pre-test:

Your age:
Your gender: male / female / other

Please indicate your answer to the following questions on the scale from 1 to 7, where 1 means "not at all" and 7 means "very much":

How much experience of computer usage do you have?
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                   very much

How much experience of virtual reality do you have?
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                   very much

How much experience of playing computer games do you have?
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                   very much

How much experience of playing audio games do you have?
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                   very much

Do you play a musical instrument, sing or make music? yes / no
If yes, how professional are you at it?
1 |----------|----------|----------|----------|----------|----------| 7
not very professional                                 very professional

A.2 Each-Test-Questionnaire Simple Room

Each-test:

Please draw the virtual room you were in, including the phone and the bombs.

How difficult was it to draw the room? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                               very difficult

How correct do you think your drawing is? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                            absolutely correct

In the following image, you can see the room you were in with the positions of the phone and the bombs. Assign an order to the stars, starting with 1 for the phone and finishing with 4 for the last bomb. Put in the walls.

How difficult was it to assign the order?
Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                               very difficult

How difficult was it to draw the walls? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                               very difficult

How correct do you think your solution is? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                            absolutely correct

A.3 Each-Test-Questionnaire Complex Room

Each-test:

Please draw the virtual room you were in, including the phone and the bombs.

How difficult was it to draw the room? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                               very difficult

How correct do you think your drawing is? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                            absolutely correct

In the following image, you can see the room you were in with the positions of the phone and the bombs. Assign an order to the stars, starting with 1 for the phone and finishing with 4 for the last bomb. Put in the walls.

How difficult was it to assign the order? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                               very difficult

How difficult was it to draw the walls? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                               very difficult

How correct do you think your solution is? Please indicate your answer on the following scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                            absolutely correct

A.4 Post-Test-Questionnaire

Post-test:

Simulator-sickness questionnaire: Circle how much each symptom is affecting you right now:

General discomfort           None   Slight   Moderate   Severe
Fatigue                      None   Slight   Moderate   Severe
Headache                     None   Slight   Moderate   Severe
Eye strain                   None   Slight   Moderate   Severe
Difficulty focusing          None   Slight   Moderate   Severe
Increased salivation         None   Slight   Moderate   Severe
Sweating                     None   Slight   Moderate   Severe
Nausea                       None   Slight   Moderate   Severe
Difficulty concentrating     None   Slight   Moderate   Severe
"Fullness of the head"       None   Slight   Moderate   Severe
Blurred vision               None   Slight   Moderate   Severe
Dizziness with eyes open     None   Slight   Moderate   Severe
Dizziness with eyes closed   None   Slight   Moderate   Severe
Vertigo                      None   Slight   Moderate   Severe
Stomach awareness            None   Slight   Moderate   Severe
Burping                      None   Slight   Moderate   Severe

To what extent was the introductory session helpful for the task fulfilment? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                 very helpful

The following questions concern the first test session:

During the testing session, how aware were you of the events occurring in the real world around you? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not aware at all                                              very aware

How realistic was your sense of moving around inside the virtual environment? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not realistic at all                                       very realistic

To what degree did you feel confused or disoriented during the session?
Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                    very much

To what degree did you feel confused or disoriented at the end of the session? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                    very much

How involved were you in the virtual reality experience? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                    very much

How quickly did you adjust to the VE experience? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very slowly                                                 very quickly

How difficult was it for you to localize the ringing phone in the room? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very easy                                                 very difficult

How difficult was it for you to localize the first bomb? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very easy                                                 very difficult

How difficult was it for you to localize the second bomb? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very easy                                                 very difficult

How difficult was it for you to localize the third bomb? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very easy                                                 very difficult

How much did you like the sound in general? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                    very much

How realistic did the sound in general seem to you? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not realistic at all                                       very realistic

Did you hear any sound reflections? yes / no / not sure

If yes, how realistic did they seem to you? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not realistic at all                                       very realistic

How helpful or confusing did they seem to you? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very confusing                                               very helpful

The following questions concern the second test session:

During the second testing session, how aware were you of the events occurring in the real world around you? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not aware at all                                              very aware

How realistic was your sense of moving around inside the virtual environment? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not realistic at all                                       very realistic

To what degree did you feel confused or disoriented during the session? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                    very much

To what degree did you feel confused or disoriented at the end of the session? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                    very much

How involved were you in the virtual reality experience? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                    very much

How quickly did you adjust to the VE experience? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very slowly                                                 very quickly

How difficult was it for you to localize the ringing phone in the room?
Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very easy                                                 very difficult

How difficult was it for you to localize the first bomb? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very easy                                                 very difficult

How difficult was it for you to localize the second bomb? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very easy                                                 very difficult

How difficult was it for you to localize the third bomb? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very easy                                                 very difficult

How much did you like the sound in general? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not at all                                                    very much

How realistic did the sound in general seem to you? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not realistic at all                                       very realistic

Did you hear any sound reflections? yes / no / not sure

If yes, how realistic did they seem to you? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
not realistic at all                                       very realistic

How helpful or confusing did they seem to you? Please answer on the scale:
1 |----------|----------|----------|----------|----------|----------| 7
very confusing                                               very helpful

Please give us any type of feedback, your comments are very welcome: