Contribution

Extension of the auditory information in time-frequency representations of audio signals for machine learning applications

Day / Time: 19.03.2025, 14:20-14:40
Room: Room 20
Type: Invited Lectures
Abstract: Machine learning models for audio signal processing typically use time-frequency representations of audio signals as input features, such as linear or Mel spectrograms. While these two-dimensional features have proved convenient for applying powerful machine vision models to audio tasks, they restrict the available auditory information to sound energy (or amplitude) averaged over a relatively coarse time-frequency grid. In contrast, the human auditory system tracks both the amplitude and the phase of the incoming sound with high frequency and time resolution, partly by detecting individual periods of the oscillation up to moderately high frequencies via so-called phase locking. In this work, we examine and discuss several possibilities for including this additional, perceptually relevant information in audio features for machine learning tasks, while keeping the computational burden acceptable for common applications and models.
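
One way to realize the idea sketched in the abstract is to augment the conventional magnitude spectrogram with phase-derived features. The following is a minimal sketch, assuming the librosa library; the instantaneous-frequency feature and all parameter values (n_fft, hop_length, n_mels) are illustrative choices, not the specific methods discussed in the talk. It computes a standard log-Mel spectrogram alongside a log-magnitude STFT and the time derivative of its unwrapped phase, which can be stacked as extra input channels.

    # Minimal sketch: phase-augmented time-frequency features.
    # Assumes librosa; feature choices are illustrative, not the talk's method.
    import numpy as np
    import librosa

    def phase_augmented_features(y, sr, n_fft=1024, hop_length=256, n_mels=128):
        """Return (log_mel, log_mag, inst_freq) for a mono signal y."""
        # Conventional feature: log-Mel spectrogram, i.e. sound energy
        # averaged over a relatively coarse time-frequency grid.
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)

        # The complex STFT retains both amplitude and phase.
        stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        log_mag = np.log1p(np.abs(stft))
        phase = np.angle(stft)

        # Instantaneous frequency: time derivative of the unwrapped phase,
        # one common way to expose phase information to a network.
        inst_freq = np.diff(np.unwrap(phase, axis=1), axis=1,
                            prepend=phase[:, :1])
        return log_mel, log_mag, inst_freq

    # Example: stack magnitude and phase-derivative as two input channels.
    y, sr = librosa.load(librosa.example("trumpet"), sr=None)
    log_mel, log_mag, inst_freq = phase_augmented_features(y, sr)
    features = np.stack([log_mag, inst_freq])  # shape: (2, freq_bins, frames)

In this sketch the phase channel has the same grid resolution as the magnitude channel, so the added computational cost is essentially one extra array of the same size per example; finer-grained, phase-locking-inspired representations would trade more memory and compute for higher temporal resolution.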