Integrating Multiscale Representation and Re-Evaluating Channel Shuffling in Efficient Time-Frequency Separate Networks for Acoustic Scene Classification
Abstract:
Acoustic Scene Classification (ASC) is a fundamental task in audio signal processing that aims to identify the type of environment in which a recording was made, based on the environmental sounds it contains. Convolutional neural networks have proven highly effective for this task, with recent approaches using 1D convolutional kernels to reduce model complexity and computational cost. One such architecture is the Time-Frequency Separate Network (TF-SepNet), which processes time and frequency features along separate paths. However, TF-SepNet extracts features at a fixed scale, which can limit its adaptability. To address this, we integrate Atrous Spatial Pyramid Pooling (ASPP) into TF-SepNet, enabling the extraction of multiscale features, and propose two architectural variants: one enhancing the max pooling layers and one enhancing the final convolutional layer. Furthermore, inspired by ShuffleNet, TF-SepNet employs a shuffle unit to rearrange information across channels. We examine the influence of this shuffling step in detail, since it may disrupt the continuity of frequency and temporal features, and design an alternative architecture without the shuffle operation to compare its performance. Experimental results on the TAU Urban Acoustic Scenes 2022 dataset indicate that the proposed ASPP approach, applied to a model without channel shuffling, outperforms the original model.
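
For readers unfamiliar with the two building blocks discussed above, the following is a minimal PyTorch sketch of a ShuffleNet-style channel shuffle and a DeepLab-style ASPP module with parallel dilated convolutions. It is a generic rendition of the published operations, not the exact integration used in TF-SepNet; the module names, channel counts, and dilation rates are illustrative only.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: interleave channels across groups."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap group and per-group axes
    return x.view(b, c, h, w)                  # flatten back; channels now mixed


class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convs at several rates."""

    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            if r == 1:
                conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            else:
                # padding == dilation keeps the spatial size unchanged
                conv = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                                 padding=r, dilation=r, bias=False)
            self.branches.append(conv)
        # 1x1 conv fuses the concatenated multiscale responses
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 100)   # (batch, channels, freq, time)
    y = channel_shuffle(x, groups=2)
    z = ASPP(64, 64)(x)
    print(y.shape, z.shape)           # both torch.Size([2, 64, 32, 100])
```

Because each dilated branch uses padding equal to its dilation rate, every branch preserves the spatial dimensions, so the multiscale responses can be concatenated directly. The shuffle, by contrast, permutes the channel ordering, which is precisely the step whose effect on the separate time and frequency paths this study re-evaluates.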