A comparison of audio preprocessing methods for music autotagging using CNN-architectures

Damböck, Maximilian

doi:10.34726/hss.2022.89400

Record link:

https://doi.org/10.34726/hss.2022.89400
http://hdl.handle.net/20.500.12708/20348

Title:

A comparison of audio preprocessing methods for music autotagging using CNN-architectures

Citation:

Damböck, M. (2022). A comparison of audio preprocessing methods for music autotagging using CNN-architectures [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.89400

reposiTUm DOI:

10.34726/hss.2022.89400

CatalogPlus:

AC16542830

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Damböck, Maximilian

Advisor:

Knees, Peter

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2022

Number of Pages:

100

Keywords:

Music Information Retrieval; Deep Learning; Convolutional Neural Networks; Music Autotagging; Audio-Signal Preprocessing; Dilated Convolutions

Abstract:

CNNs have become famous over the last decade for their outstanding results in various computer vision tasks. With the short-time Fourier transform (STFT) as a preprocessing method for audio signals, CNNs are used in Music Information Retrieval (MIR) as well. For music autotagging in particular, they have become state of the art. This thesis investigates how the most commonly used input representations of audio signals affect the classification performance of CNNs on music autotagging: raw waveform, STFT, Mel spectrogram, Mel frequency cepstral coefficients (MFCCs), and constant-Q transform (CQT). For this purpose, their performance is compared using five different CNN architectures on two datasets, MagnaTagATune (MTAT) and MTG-Jamendo (MTGJ). A two-way ANOVA shows that both model and input representation significantly impact the classification results. On average, the STFT has the best overall performance on both datasets. It also outperforms all other input representations in each of the tag categories genre, instrument, and mood. No particular trends can be observed in the classification performance of the different input representations across the respective tag categories. MFCCs provide good results while being four to twenty times smaller than the other input representations and consequently having an up to four times shorter epoch time during training. The CQT shows the worst results on MTAT but performs second-best on MTGJ. Therefore, more research on this preprocessing method is needed for the results to be more conclusive.
Apart from that, this study investigates the applicability of dilated convolutions to music autotagging models. Their ability to capture large receptive fields while keeping resource consumption low can be interesting for genre and mood classification. To test this conjecture, the existing CNN model Musicnn (cf. [Pons and Serra, 2019]) is extended with stacked parallel dilated convolutions and then compared to the original model on the MTAT dataset. The results show a significant improvement of the average ROC-AUC score from 90.99% to 91.49% and of the PR-AUC score from 36.74% to 37.78%, while reducing the average training epoch time by 59%.
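
The claim that dilated convolutions capture large receptive fields at low cost can be illustrated with a small calculation. The following is a minimal sketch, not the thesis architecture: the kernel size, the dilation rates, and the helper names `receptive_field` and `stacked_parallel_receptive_field` are assumptions chosen for illustration, and the block layout (parallel branches whose outputs are merged) is only one plausible reading of "stacked parallel dilated convolutions".

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in input steps, of stacked stride-1 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


def stacked_parallel_receptive_field(kernel_size, blocks):
    """Receptive field of stacked blocks of parallel dilated branches.

    Assumption (not taken from the thesis): all branches of a block see the
    same input and their outputs are merged, so a block reaches as far as
    its most dilated branch.
    """
    widest = [max(branch_dilations) for branch_dilations in blocks]
    return receptive_field(kernel_size, widest)


# Hypothetical settings: four layers with 3-tap kernels.
plain = receptive_field(3, [1, 1, 1, 1])      # no dilation
dilated = receptive_field(3, [1, 2, 4, 8])    # doubling dilation rates
parallel = stacked_parallel_receptive_field(3, [[1, 2, 4], [1, 2, 4]])

print("plain stack:", plain)
print("dilated stack:", dilated)
print("two parallel blocks:", parallel)
```

Under these assumptions, both four-layer stacks use the same number of kernel weights per channel pair, yet the dilated stack covers 31 input steps instead of 9, which is the property that makes dilation attractive for long-range cues such as genre or mood.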

Over the last decade, Convolutional Neural Networks (CNNs) have become known for their outstanding results in various subfields of computer vision. With the Short Time Fourier Transform (STFT) as a processing method for audio signals, CNNs also find a range of applications in Music Information Retrieval (MIR), particularly in music autotagging, where they regularly achieve new top results. The aim of this thesis is to determine how the most widely used input representations of audio signals affect the classification performance of CNNs for automatic music tagging. These are the raw waveform, STFT, Mel spectrogram, Mel Frequency Cepstral Coefficients (MFCCs), and Constant-Q Transform (CQT). To this end, performance is compared across five different CNN architectures on two different datasets, MagnaTagATune (MTAT) and MTG-Jamendo (MTGJ). The two-way ANOVA of the results shows that both the choice of CNN model and the choice of input representation have a significant influence on the classification results. On average, the STFT achieves the best results on both datasets. This also holds for the evaluation per tag category: genre, instrument, and mood. Furthermore, no generally valid differences in the classification performance of the individual input representations could be identified between the categories. MFCCs deliver consistently satisfactory results, although they are four to twenty times smaller than the other input formats and therefore also have an up to four times shorter training time. CQT has the second-worst performance on the MTAT dataset and the second-best on MTGJ. Further investigation is therefore needed to make more precise statements about its applicability.
In addition, this thesis examines to what extent dilated convolutions can improve the classification performance of music autotagging models. In particular, their resource-efficient feature detection is of interest for genre and mood classification of music. For this purpose, the existing CNN Musicnn (cf. [Pons and Serra, 2019]) is extended with stacked parallel dilated convolutions. The results for the MTAT dataset show a significant improvement of the ROC-AUC score from originally 90.99% to 91.49% and of the PR-AUC score from 36.74% to 37.78%, together with a 59% reduction in training epoch time.
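
The ROC-AUC and PR-AUC scores quoted in the abstract can be illustrated with a small, dependency-free sketch. Several points here are assumptions rather than details from the thesis: PR-AUC is approximated by average precision, scores are averaged over tags (macro averaging), tied scores are not specially handled in the precision computation, and the function names and toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive clip outscores a random negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return precision_sum / hits


def macro_average(metric, label_rows, score_rows):
    """Average a per-tag metric over all tags (rows are clips, columns are tags)."""
    per_tag = [
        metric(list(tag_labels), list(tag_scores))
        for tag_labels, tag_scores in zip(zip(*label_rows), zip(*score_rows))
    ]
    return sum(per_tag) / len(per_tag)


# Toy data: four clips, two tags (hypothetical values).
label_rows = [[1, 0], [0, 1], [1, 0], [0, 1]]
score_rows = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.1, 0.6]]

print("macro ROC-AUC:", macro_average(roc_auc, label_rows, score_rows))
print("macro PR-AUC (average precision):",
      macro_average(average_precision, label_rows, score_rows))
```

Average precision depends on how rare the positive class is, whereas ROC-AUC does not, which may help explain why PR-AUC values in autotagging tend to sit far below ROC-AUC values when most tags apply to only a few clips.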

License:

In Copyright

Appears in Collections:

Thesis