On the Impact and Interplay of Input Representations and Network Architectures for Automatic Music Tagging

Damböck, Maximilian; Vogl, Richard; Knees, Peter

doi:10.5281/zenodo.7343091

DC Field

Value

Language

dc.contributor.author

Damböck, Maximilian

dc.contributor.author

Vogl, Richard

dc.contributor.author

Knees, Peter

dc.contributor.editor

Rao, Preeti

dc.contributor.editor

Murphy, Hema

dc.contributor.editor

Srinivasamurthy, Ajay

dc.contributor.editor

Bittner, Rachel

dc.contributor.editor

Caro Repetto, Rafael

dc.contributor.editor

Goto, Masataka

dc.contributor.editor

Serra, Xavier

dc.contributor.editor

Miron, Marius

dc.date.accessioned

2023-06-15T08:57:18Z

dc.date.available

2023-06-15T08:57:18Z

dc.date.issued

2022-12-08

dc.identifier.citation

Damböck, M., Vogl, R., & Knees, P. (2022). On the Impact and Interplay of Input Representations and Network Architectures for Automatic Music Tagging. In P. Rao, H. Murphy, A. Srinivasamurthy, R. Bittner, R. Caro Repetto, M. Goto, X. Serra, & M. Miron (Eds.), Proceedings of the 23rd International Society for Music Information Retrieval Conference. ISMIR 2022 (pp. 941–948). International Society for Music Information Retrieval. https://doi.org/10.5281/zenodo.7343091

dc.identifier.uri

http://hdl.handle.net/20.500.12708/177655

dc.description.abstract

Automatic music tagging systems have regained relevance in recent years, not least through their use in applications such as music recommender systems. State-of-the-art systems pair a variant of convolutional neural network (CNN) with some type of time-frequency audio representation as input to predict semantic tags available through expert or crowd-based annotation. In this work we systematically compare five widely used audio input representations (STFT, CQT, Mel spectrograms, MFCCs, and raw audio waveform) using five convolutional neural network architectures (the established MusicCNN, VGG16, ResNet, and Squeeze and Excitation Network (SeNet), as well as a newly proposed MusicCNN variant using dilated convolutions) for the task of music tag prediction. The performance of all factor combinations is measured on two distinct tagging datasets, namely MagnaTagATune and MTG Jamendo. A two-way ANOVA shows that both input representation and model architecture significantly impact the classification results. Despite differently sized input representations and their practical impact on model training, we find that using STFT as the input representation provides the best results overall and on specific tag categories (genre, instrument, mood), while other representations show less consistent behavior in these regards. Furthermore, the proposed dilated convolutional architecture shows significant performance improvements for all input representations except raw waveform.

dc.description.sponsorship

Fonds zur Förderung der wissenschaftlichen Forschung (FWF)

dc.language.iso

dc.rights.uri

http://creativecommons.org/licenses/by/4.0/

dc.subject

Music Information Retrieval

dc.subject

Auto-tagging

dc.subject

Deep Learning

dc.subject

Audio signal representations

dc.title

On the Impact and Interplay of Input Representations and Network Architectures for Automatic Music Tagging

dc.type

Inproceedings

dc.type

Conference contribution

dc.rights.license

Creative Commons Namensnennung 4.0 International

dc.rights.license

Creative Commons Attribution 4.0 International

dc.contributor.editoraffiliation

Amazon (United States), United States of America (the)

dc.relation.isbn

978-1-7327299-2-6

dc.relation.doi

10.5281/zenodo.7676768

dc.description.startpage

941

dc.description.endpage

948

dc.relation.grantno

P 33526-N

dc.rights.holder

Maximilian Damböck, Richard Vogl, Peter Knees

dc.type.category

Full-Paper Contribution

tuw.booktitle

Proceedings of the 23rd International Society for Music Information Retrieval Conference. ISMIR 2022

tuw.peerreviewed

true

tuw.relation.publisher

International Society for Music Information Retrieval

tuw.project.title

Empfehlungssystem & Nutzer: Hin zu gegenseitigem Verständnis

tuw.researchTopic.id

I4a

tuw.researchTopic.name

Information Systems Engineering

tuw.researchTopic.value

100

tuw.publication.orgunit

E194-04 - Forschungsbereich Data Science

tuw.publisher.doi

10.5281/zenodo.7343091

dc.identifier.libraryid

AC17204552

dc.description.numberOfPages

tuw.author.orcid

0000-0003-3906-1292

dc.rights.identifier

CC BY 4.0

tuw.editor.orcid

0000-0002-9032-7909

tuw.editor.orcid

0000-0003-2251-2202

tuw.event.name

23rd International Society for Music Information Retrieval Conference

tuw.event.startdate

04-12-2022

tuw.event.enddate

08-12-2022

tuw.event.online

Hybrid

tuw.event.type

Event for scientific audience

tuw.event.place

Bengaluru

tuw.event.country

tuw.event.presenter

Vogl, Richard

tuw.event.track

Single Track

wb.sciencebranch

Computer Science

wb.sciencebranch.oefos

1020

wb.sciencebranch.value

100

item.grantfulltext

open

item.openairecristype

http://purl.org/coar/resource_type/c_5794

item.mimetype

application/pdf

item.openairetype

conference paper

item.openaccessfulltext

Open Access

item.languageiso639-1

item.cerifentitytype

Publications

item.fulltext

with Fulltext

crisitem.author.dept

TU Wien

crisitem.author.dept

E194-04 - Forschungsbereich Data Science

crisitem.author.dept

E194-04 - Forschungsbereich Data Science

crisitem.author.orcid

0000-0003-3906-1292

crisitem.author.parentorg

E194 - Institut für Information Systems Engineering

crisitem.author.parentorg

E194 - Institut für Information Systems Engineering

crisitem.project.funder

FWF - Österr. Wissenschaftsfonds

crisitem.project.grantno

P 33526-N

Appears in Collections:

Conference Paper

Fulltext (Version of Record (published version))

Adobe PDF

(654.63 kB)

CC BY 4.0

Page view(s)

370

checked on Nov 23, 2023

Download(s)

202

checked on Nov 23, 2023
