Contribution

Integrating Multiscale Representation And Re-Evaluating Channel Shuffling In Efficient Time-Frequency Separate Networks For Acoustic Scene Classification

* Presenting author
Day / Time: 19.03.2025, 15:20-16:00
Type: Poster
Information: The posters will be exhibited in Hall E North from Tuesday to Thursday, grouped by topic on the poster island indicated in the session title. During the poster session at the time given above, the authors are available for discussion.
Abstract: Acoustic Scene Classification (ASC) is a fundamental task in audio signal processing that aims to identify the recording location of an audio clip from its environmental sounds. Convolutional neural networks have proven highly effective for ASC, with recent approaches using 1D convolutional kernels to reduce model complexity and computational cost. One such architecture is the Time-Frequency Separate Network (TF-SepNet), which processes time and frequency features along separate paths. However, TF-SepNet extracts features at a fixed scale, which can limit its adaptability. To address this, we integrate Atrous Spatial Pyramid Pooling (ASPP) into TF-SepNet to enable multiscale feature extraction, proposing two architectural variants that apply ASPP at the max pooling layers and at the final convolutional layer. Furthermore, TF-SepNet adopts a shuffle unit, inspired by ShuffleNet, to rearrange information across channels. This study examines the influence of this shuffling step in detail, since it could disrupt the continuity of frequency and temporal features. We design an alternative architecture without the shuffle operation and compare its performance. Experimental results on the TAU Urban Acoustic Scenes 2022 dataset show that the proposed ASPP approach in a model without channel shuffling outperforms the original model.