Machine Hearing with Auditory Images to Mimic Human Ratings on Product Sounds
Abstract:
Within Bosch, an application provides prediction models for product sounds based on listening tests. The underlying machine learning model takes so-called Mel-Frequency Cepstral Coefficients (MFCCs) as input features; MFCCs are a cepstral method based on Mel spectrograms originally developed for speech analysis. Machine learning applications frequently use MFCCs in audio analysis as if they were hearing-adequate features, which they are not.

We studied a method called auditory images, which promises a human-like representation of sound. We have successfully applied auditory images in several use cases in which conventional analysis methods did not deliver clear results (Kuka & Fischer, 2024). After a proof-of-concept study, we intensified the use of auditory images in the machine learning task described above, with the goal of improving the accuracy of the predictions.

We successfully applied auditory images in a machine learning regression task to mimic human ratings on product sounds. Despite their high input dimensionality, they outperformed the MFCC-based application on several datasets. Auditory images thus promise to be not only a powerful complement to psychoacoustic analysis, but also an adequate hearing-equivalent representation for machine hearing tasks.
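For readers unfamiliar with the MFCC pipeline criticized above, the following is a minimal NumPy sketch of how such features are typically computed (framing, Mel filterbank, log, DCT). The frame sizes, filter counts, and the synthetic test tone are illustrative assumptions, not the parameters of the Bosch application.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, n_fft=512, hop=256, n_mels=26, n_coeffs=13):
    """Sketch of a standard MFCC pipeline (illustrative parameters)."""
    # frame the signal, apply a Hann window, take the power spectrum
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2+1)

    # triangular Mel filterbank, equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    log_mel = np.log(power @ fb.T + 1e-10)  # log Mel spectrogram

    # DCT-II across the Mel bands yields the cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_mels)
    return log_mel @ dct.T  # (n_frames, n_coeffs)

# example: MFCCs of a 1 s, 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 440.0 * t), sr)
print(feats.shape)  # (61, 13): 61 frames, 13 coefficients each
```

The low dimensionality per frame (13 coefficients here) is what makes MFCCs convenient for machine learning, and it contrasts with the much higher-dimensional auditory-image representation discussed in this work.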