Trainable encoders for robust and stable speech enhancement
Abstract:
Finite impulse response (FIR) filters are often used in neural networks to efficiently represent audio signals. In recent approaches, these filters are optimized by training them as a layer of a convolutional neural network (CNN), resulting in an adaptive filterbank encoder. In such a setting, however, stability issues often arise, leading to representations that are sensitive to noise and therefore vulnerable to adversarial attacks. This talk presents an approach to stabilize these encoder filterbanks by optimizing their condition number during training. We leverage mathematical frame theory as a quantitative tool to ensure both noise robustness and perfect reconstruction, which is achieved by an additional regularization term in the loss function. As an extension, we use hybrid filterbanks that combine an auditory filterbank with learned convolutional weights. Finally, we show that the stabilization can significantly improve the signal-to-noise ratio (SNR) and the perceptual evaluation of speech quality (PESQ) score in noise reduction tasks using a low-complexity encoder-mask-decoder model.
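To give a rough idea of the regularization described above, the sketch below computes a frame-theoretic condition number for a stride-1 (undecimated) FIR filterbank encoder from the kernels' frequency responses and adds it as a penalty to the training loss. The function names, the stride-1 assumption, the FFT grid size, and the penalty weight are illustrative choices, not details taken from the talk.

```python
import torch

def filterbank_condition_number(filters: torch.Tensor, n_fft: int = 1024) -> torch.Tensor:
    """Frame-theoretic condition number of an FIR filterbank.

    `filters` has shape (num_filters, filter_length) and holds the encoder's
    FIR kernels. For a stride-1 filterbank, the frame bounds A and B are the
    minimum and maximum of the summed power spectrum over frequency, so the
    condition number kappa = B / A follows directly from the FFT of the kernels.
    """
    # Frequency responses on a dense grid (zero-padded FFT).
    response = torch.fft.rfft(filters, n=n_fft, dim=-1)
    # Sum of squared magnitudes over all channels at each frequency bin.
    power = response.abs().pow(2).sum(dim=0)
    lower, upper = power.min(), power.max()  # frame bounds A and B
    return upper / lower                     # condition number kappa >= 1

def regularized_loss(task_loss: torch.Tensor,
                     encoder_filters: torch.Tensor,
                     weight: float = 1e-3) -> torch.Tensor:
    """Task loss plus a penalty pushing kappa towards 1 (a tight frame)."""
    kappa = filterbank_condition_number(encoder_filters)
    return task_loss + weight * (kappa - 1.0)
```

Driving kappa towards 1 makes the learned filterbank approximately tight, which bounds how much input noise can be amplified by the encoder and keeps a stable (pseudo-)inverse available for reconstruction.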