Efficient and interpretable raw audio classification with diagonal state space models

Bittner, Matthias; Schnöll, Daniel; Wess, Matthias; Jantsch, Axel

doi:10.1007/s10994-025-06807-z

Record link:

http://hdl.handle.net/20.500.12708/217086

Title:

Efficient and interpretable raw audio classification with diagonal state space models

Citation:

Bittner, M., Schnöll, D., Wess, M., & Jantsch, A. (2025). Efficient and interpretable raw audio classification with diagonal state space models. Machine Learning, 114(8). https://doi.org/10.1007/s10994-025-06807-z

Publisher DOI:

10.1007/s10994-025-06807-z

Publication Type:

Article - Original Research Article

Language:

English

Authors:

Bittner, Matthias
Schnöll, Daniel
Wess, Matthias
Jantsch, Axel

Organisational Unit:

E384-02 - Forschungsbereich Systems on Chip

Journal:

Machine Learning

ISSN:

0885-6125

Date (published):

Aug-2025

Number of Pages:

Publisher:

SPRINGER

Peer reviewed:

Yes

Keywords:

Audio classification; State space models; TinyML; Interpretability

Abstract:

State space models have achieved good performance on long-sequence modeling tasks such as raw audio classification. Their definition in continuous time allows for discretization and operation of the network at different sampling rates. However, this property has not yet been utilized to decrease the computational demand on a per-layer basis. We propose a family of hardware-friendly S-Edge models with a layer-wise downsampling approach to adjust the temporal resolution between individual layers. Applying existing methods from linear control theory allows us to analyze state/memory dynamics and provides an understanding of how and where to downsample. Evaluated on the Google Speech Command dataset, our autoregressive/causal S-Edge models range from 8k to 141k parameters at 90–95% test accuracy, compared with a causal S5 model with 208k parameters at 95.8% test accuracy. Using our C++17 header-only implementation on an ARM Cortex-M4F, the largest model requires 103 s of inference time at 95.19% test accuracy, and the smallest model requires 0.29 s at 88.01% test accuracy. Our solutions cover a design space that spans 17x in model size, 358x in inference latency, and 7.18 percentage points in accuracy.
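
The abstract's point that a continuous-time definition permits operation at different sampling rates can be illustrated with a minimal sketch. It is not the authors' S-Edge implementation: it assumes a zero-order-hold discretization (the paper may use a different rule), and the function names, parameter values, 16 kHz sampling period, and downsampling factor of 4 are all hypothetical choices made for illustration.

```python
import cmath


def discretize_zoh(lam, b, dt):
    """Zero-order-hold discretization of one diagonal state x' = lam*x + b*u."""
    lam_bar = cmath.exp(lam * dt)
    b_bar = (lam_bar - 1.0) / lam * b
    return lam_bar, b_bar


def run_diagonal_ssm(lams, bs, cs, u, dt):
    """Run a single-input, single-output diagonal SSM over the sequence u."""
    params = [discretize_zoh(lam, b, dt) for lam, b in zip(lams, bs)]
    states = [0j] * len(lams)
    ys = []
    for u_k in u:
        # Each state evolves independently because the state matrix is diagonal.
        states = [lb * x + bb * u_k for (lb, bb), x in zip(params, states)]
        ys.append(sum(c * x for c, x in zip(cs, states)).real)
    return ys


# Hypothetical continuous-time parameters, chosen only for illustration.
lams = [-0.5 + 2.0j, -0.1 - 1.0j]
bs = [1.0 + 0j, 0.5 + 0j]
cs = [1.0 + 0j, 1.0 + 0j]

dt = 1.0 / 16000.0   # assumed 16 kHz input sampling period
factor = 4           # assumed downsampling factor
u = [1.0] * 64       # constant (step) input

# Full-rate run.
y_full = run_diagonal_ssm(lams, bs, cs, u, dt)

# Downsampled run: keep every 4th sample and rediscretize the same
# continuous-time parameters with a 4x larger step.
u_down = u[factor - 1::factor]
y_down = run_diagonal_ssm(lams, bs, cs, u_down, dt * factor)
```

Under these assumptions the same continuous-time parameters serve both rates; only the step size passed to the discretization changes, and the downsampled run needs a quarter of the state updates. For the constant input used here the zero-order hold is exact, so `y_down` coincides with every fourth sample of `y_full`. For general inputs the two outputs differ, which is why the choice of where to downsample matters.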

Research facilities:

Vienna Scientific Cluster

Project title:

CDL Embedded Machine Learning: 123456 (Christian Doppler Forschungsgesells)

Research Areas:

Sensor Systems: 50%
Computational System Design: 50%

Science Branch:

1020 - Informatik: 50%
2020 - Elektrotechnik, Elektronik, Informationstechnik: 50%

Appears in Collections:

Article
