Contribution

Objective comparison of an image-based machine learning algorithm for real-time lip synchronization with a rule-based audio signal algorithm using cortical speech tracking

Day / Time: 20.03.2025, 11:00-11:40
Type: Poster
Information: The posters will be exhibited in Hall E north from Tuesday to Thursday, grouped by topic on the poster island indicated in the session title. During the poster session at the specified time, the authors will be available for discussion.
Abstract: Animating virtual characters in a natural manner poses a challenge, particularly animating lip movements in relation to speech. We evaluate two algorithms that generate lip animation from real-time speech or video input. The animation is done using blendshapes, which are predefined mesh deformations of a 3D object. The first algorithm (Llorach Tó et al., 2016) is a rule-based approach that determines the blendshape values by computing the energy in four frequency bins of the smoothed short-term power spectral density of the audio signal. The second algorithm yields blendshapes by analysing corresponding images of the lips with an image-based machine learning algorithm.

We carried out a mobile EEG study in a controlled virtual environment, in which 20 participants were exposed to audio-visual scenarios. Each scenario featured one of six virtual characters narrating unscripted stories, either with background babble noise or in silence. We compared the conditions using cortical speech tracking. The results show a significant increase in speech-reconstruction performance from the EEG data for the video condition compared to audio only. This is not the case for either animation algorithm compared to audio only, suggesting that neither is yet in a state suitable for aiding cortical speech tracking.
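For orientation, the following is a minimal sketch of how a rule-based, frequency-bin lip-sync step of this kind can be structured: compute a smoothed short-term power spectral density per audio frame, sum the energy in a few frequency bins, and map those energies to blendshape weights. The bin boundaries, smoothing factor, and bin-to-blendshape mapping below are illustrative assumptions, not the published parameters of Llorach Tó et al. (2016).

```python
import numpy as np
from scipy.signal import welch


def lipsync_blendshapes(frame, fs=44100, smooth_state=None, alpha=0.6):
    """Illustrative rule-based lip-sync step for one audio frame.

    All numeric choices (bin edges, smoothing factor, mapping) are
    placeholder assumptions for the sketch.
    """
    # Short-term power spectral density of the current audio frame
    freqs, psd = welch(frame, fs=fs, nperseg=len(frame))

    # Exponential smoothing of the PSD across frames
    if smooth_state is not None:
        psd = alpha * smooth_state + (1 - alpha) * psd

    # Energy in four example frequency bins (Hz); boundaries are assumed
    edges = [0, 500, 1000, 2000, 4000]
    energy = np.array([
        psd[(freqs >= lo) & (freqs < hi)].sum()
        for lo, hi in zip(edges[:-1], edges[1:])
    ])

    # Map normalised bin energies to blendshape weights (kiss, lips closed,
    # jaw open); this mapping is a simplified stand-in for the paper's rules
    e = energy / (energy.sum() + 1e-12)
    blendshapes = {
        "kiss": float(np.clip(e[0] - e[2], 0.0, 1.0)),
        "lipsClosed": float(np.clip(e[1] - e[3], 0.0, 1.0)),
        "jawOpen": float(np.clip(e[2] + e[3], 0.0, 1.0)),
    }
    # Return the smoothed PSD so the caller can pass it back in as state
    return blendshapes, psd


# Example: process one 20 ms frame at 44.1 kHz (random data as placeholder)
fs = 44100
frame = np.random.randn(int(0.02 * fs))
weights, state = lipsync_blendshapes(frame, fs=fs)
```

In a real-time setting, such a function would be called once per audio frame, feeding the returned smoothed PSD back in as `smooth_state` and writing the resulting weights to the character's blendshape channels.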