Dörömbözi, A. (2023). Audio tampering detection: Deep learning methodologies for multi-layered threat detection [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2023.96575
The thesis proposes and evaluates deep learning methodologies for tampering detection in multi-layered audio samples. Audio recordings can be tampered with by cutting, insertion, or reshuffling. State-of-the-art solutions provide promising results by detecting tampering events in audio samples based on acoustic environment analysis, microphone identification, or analysis of the trace fluctuation of the signal in frequency ranges. However, research studies rarely evaluate whether these solutions can detect tampering events in recordings that have been post-processed with additional layers, even though such a layer might hide the tampering traces. Such cases could be problematic: for example, if a tampered recording of a politician's speech is post-processed with a music layer, the tampering might not be detected in time. Such content could harm the trustworthiness of democratic institutions if it reaches many people, which, given the rapid growth of social media platforms, is already a realistic scenario. The methodologies proposed in this thesis rely on Transformer models, Multi-Layer Perceptrons, and Recurrent Neural Networks. Besides developing and applying the proposed methods, the thesis also implements the baseline tampering detection approach introduced in a state-of-the-art research paper and evaluates it against the multi-layered query audios. This evaluation is used to show how the performance of the baseline model is affected by additional post-processing techniques. The performance of the proposed methodologies is compared against that of the baseline approach to determine in which cases the proposed methodologies provide a better solution and to identify the disadvantages and bottlenecks of these solutions. The thesis demonstrates that the proposed approaches can outperform the baseline model when additional music or environment layers are applied to the recordings.
On the other hand, the baseline model noticeably outperforms the proposed methodologies in cases where conditions are optimal for capturing the electric network frequency (ENF) signal from the recordings, on which the model’s feature extraction relies.