The application of deep learning in microphone array single-source tracking systems
Abstract:
The purpose of this study is to construct a sound source tracking device suitable for meeting scenarios by pairing a microphone array with a camera. Various microphone array geometries were analyzed against three criteria: localization accuracy, computation time, and compactness. An octahedral microphone array was ultimately selected. Three commonly used localization algorithms (MPDR, SRP-PHAT, and MUSIC) and three deep learning-enhanced localization algorithms (Cross3D, IcoDOA, and Neural-SRP) were then compared under indoor reverberation and noise, taking into account the need for fast computation in real-time sound source tracking. Both IcoDOA and Neural-SRP achieved localization errors within 10 degrees in environments with signal-to-noise ratios from 5 dB to 30 dB and reverberation times (RT60) from 0.2 s to 1 s. However, IcoDOA showed the best computation time, averaging only 2.067 milliseconds per frame. The final sound source tracking device was therefore implemented using the octahedral microphone array with the IcoDOA algorithm. In a simulated meeting scenario with a single sound source, the localization results remained within the camera's field of view 91.11% of the time when the source played a speech signal and 87.77% of the time when it played a music signal.
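The two figures of merit quoted above, angular localization error and the percentage of time the source stays in the camera's view, can be illustrated with a minimal sketch. It is not the evaluation code used in the study. The function names `angular_error_deg` and `in_view_rate` are hypothetical, as are the assumptions that directions are given as 3-D vectors, that the camera is steered to the estimated direction, and that the field of view is a cone with half-angle `half_fov_deg`.

```python
import numpy as np

def angular_error_deg(est, ref):
    """Angle in degrees between estimated and reference direction vectors.

    Both inputs are 3-D vectors or arrays of shape (frames, 3); they need
    not be unit length because they are normalized here.
    """
    est = np.asarray(est, dtype=float)
    ref = np.asarray(ref, dtype=float)
    est = est / np.linalg.norm(est, axis=-1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=-1, keepdims=True)
    # Clip guards against rounding pushing the cosine outside [-1, 1].
    cos = np.clip(np.sum(est * ref, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def in_view_rate(est, ref, half_fov_deg):
    """Percentage of frames whose angular error is at most half_fov_deg.

    Assumption: the camera points at the estimated direction and has a
    conical field of view, so the true source is in view whenever the
    angular error does not exceed the half-angle.
    """
    err = angular_error_deg(est, ref)
    return 100.0 * np.mean(err <= half_fov_deg)
```

For example, calling `in_view_rate(estimates, ground_truth, half_fov_deg=30.0)` on per-frame direction arrays would yield a percentage comparable in kind to the 91.11% and 87.77% figures, although the actual camera geometry and in-view criterion of the study may differ.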