Bittner, M., Schnöll, D., Wess, M., & Jantsch, A. (2025). Efficient and interpretable raw audio classification with diagonal state space models. Machine Learning, 114(8). https://doi.org/10.1007/s10994-025-06807-z
Audio classification; State space models; TinyML; Interpretability
en
Abstract:
State Space Models have achieved good performance on long sequence modeling tasks such as raw audio classification. Their definition in continuous time allows for discretization and operation of the network at different sampling rates. However, this property has not yet been utilized to decrease the computational demand on a per-layer basis. We propose a family of hardware-friendly S-Edge models with a layer-wise downsampling approach to adjust the temporal resolution between individual layers. Applying existing methods from linear control theory allows us to analyze state/memory dynamics and provides an understanding of how and where to downsample. Evaluated on the Google Speech Command dataset, our autoregressive/causal S-Edge models range from 8k to 141k parameters at 90–95% test accuracy, compared with a causal S5 model with 208k parameters at 95.8% test accuracy. Using our C++17 header-only implementation on an ARM Cortex-M4F, the largest model requires 103 s of inference time at 95.19% test accuracy, and the smallest model requires 0.29 s at 88.01% test accuracy. Our solutions cover a design space that spans 17x in model size, 358x in inference latency, and 7.18 percentage points in accuracy.
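
Illustrative sketch (not from the paper): the abstract's point that a continuous-time definition permits operation at different sampling rates can be shown on a single diagonal state. The snippet below assumes a zero-order-hold (ZOH) discretization, which is one common choice and not necessarily the one used for S-Edge; the eigenvalue `lam`, the input weight `b`, the 16 kHz step size, and the function names are all hypothetical. It shows that halving the temporal resolution of a layer amounts to doubling the step size, and that for an input held constant over each pair of fine steps the coarse state matches every second fine state.

```python
import cmath


def zoh_discretize(lam: complex, b: complex, dt: float) -> tuple[complex, complex]:
    """ZOH discretization of one diagonal state x'(t) = lam * x(t) + b * u(t)."""
    a_bar = cmath.exp(lam * dt)        # discrete state transition
    b_bar = (a_bar - 1.0) / lam * b    # discrete input weight (requires lam != 0)
    return a_bar, b_bar


def run_state(lam: complex, b: complex, dt: float, inputs: list[float]) -> list[complex]:
    """Run the recurrence x[k] = a_bar * x[k-1] + b_bar * u[k] from x = 0."""
    a_bar, b_bar = zoh_discretize(lam, b, dt)
    x = 0j
    states = []
    for u in inputs:
        x = a_bar * x + b_bar * u
        states.append(x)
    return states


# Hypothetical values chosen for illustration only.
lam = complex(-0.5, 2.0)   # stable eigenvalue (negative real part)
b = 1.0 + 0j
dt = 1.0 / 16000           # assumed 16 kHz input sampling rate

a_fine, _ = zoh_discretize(lam, b, dt)
a_coarse, _ = zoh_discretize(lam, b, 2 * dt)

# Input held constant over each pair of fine steps, i.e. already downsampled by 2.
coarse_inputs = [0.3, -1.0, 0.7, 0.2]
fine_inputs = [u for u in coarse_inputs for _ in range(2)]

fine_states = run_state(lam, b, dt, fine_inputs)
coarse_states = run_state(lam, b, 2 * dt, coarse_inputs)
max_gap = max(abs(c - f) for c, f in zip(coarse_states, fine_states[1::2]))
```

Here `a_coarse` agrees with `a_fine ** 2` and `max_gap` is zero up to floating-point rounding, so under these assumptions the coarse layer does half as many recurrence steps while tracking the same continuous-time dynamics.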