Contribution

Evaluating Timbre-Related Audio Descriptors Across Different Libraries and Multimodal Embeddings

* Presenting author
Day / Time: 18.03.2025, 15:20-16:00
Type: Poster
Information: The posters will be exhibited in Hall E north from Tuesday to Thursday, sorted by thematic context in the poster island indicated in the session title. The poster session at the specified time offers the opportunity to discuss the work with the authors.
Abstract: Timbre-related audio descriptors are widely used to study the relationship between perceptual phenomena and their physical correlates in the signal. While some links, such as timbral brightness and spectral centroid, are well established, the effectiveness of different descriptors for predicting timbre dimensions depends on implementation and parameter choices.

Ratings were collected for 20 different instruments (Vienna Symphonic Library) played at the same pitch (E4) as well as at different typical pitch ranges (40 stimuli total). Thirty-one participants rated each stimulus on “brightness,” “roughness,” and “percussiveness” using sliders. These ratings were compared to audio features extracted via Librosa, Essentia, Praat, MIRtoolbox, and the AudioCommons Timbral Models, mostly using the default parameters suggested by the libraries. Additionally, human ratings were explored in relation to multimodal (audio–text) embeddings based on LAION-CLAP (Wu et al., 2023).

While “brightness” ratings correlated primarily with F0 when comparing across different pitches, several sharpness models and spectral centroid were observed to be the most suitable descriptors across both conditions. “Roughness” was best predicted by the “Vassilakis” model (MIRtoolbox) in the current experiment, and “percussiveness” correlated well with loudness differences between percussive and harmonic components after separation via median filtering. The explored multimodal embedding models showed only partial alignment with human ratings.