Schindler, A. (2019). Multi-modal music information retrieval: augmenting audio-analysis with visual computing for improved music video analysis [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2019.72065
E194 - Institut für Information Systems Engineering
Music Information Retrieval; Multi-Modal Information Retrieval; Audio-Visual Analysis; Machine Learning
This thesis focuses on harnessing the information provided by the visual layer of music videos to augment and improve tasks in the research domain of Music Information Retrieval (MIR). The work is based on the observation that certain expressive categories, such as genre or theme, can be recognized solely from the visual content, without the sound being heard. This leads to the hypothesis that there exists a visual language that is used to express mood or genre. It can further be concluded that this visual information is music-related and should therefore be beneficial for the corresponding MIR tasks, such as music genre classification or mood recognition. The validation of these hypotheses is first based on a literature search in the research domains of Musicology and Music Psychology to identify production processes in music videos or visual branding in the music business. The analytical approach is based on a series of comprehensive experiments and evaluations of visual features with respect to their ability to describe music-related information. These evaluations range from low-level visual features to high-level concepts. Additionally, new visual features that capture rhythmic visual patterns are introduced. Experimental results showed that the developed audio-visual approaches improved over the audio-based benchmark in the conducted experiments for the three prominent MIR tasks of Artist Identification, Music Genre Classification and Cross-Genre Classification. Finally, the experimental results were compared to the findings from the literature review, which revealed correlations between the identified production processes and the quantitatively determined audio-visual correlations. Thus, well-known and documented visual stereotypes (e.g., cowboy hat/Country music, swimsuit/Dance, fire/Heavy Metal), as well as the choice of particular colours and theme-specific symbols, could be confirmed.