Perceptual synchronization scoring of dubbed content using phoneme-viseme agreement
2023
Recent works have shown great success in synchronizing lip movements in a given video with a dubbed audio stream. However, comparisons of the synchronization capabilities of these methods remain weakly substantiated due to the lack of a generalized, visually grounded evaluation method. This work proposes a simple and grounded algorithm, PhoVis, that measures synchronization and the perceived quality of a dubbed video at the utterance level. The approach generates expected visemes by considering a speaker's lip-pose history and the phoneme in the dubbed audio. A sync distance and a perceptual score are then derived by comparing the generated viseme with the clip's visemes using spatially grounded pose distances. PhoVis is built upon the most basic audio-video elements, i.e., phonemes and visemes, to compute agreement, which makes it a domain-independent algorithm that can score both original and lip-synthesized videos, allowing measurement and improvement of dubbing quality as well as of video-synthesis methods. We demonstrate that PhoVis achieves better language generalization, is aptly tailored for lip-sync measurement, and computes audio-lip correlation better than existing AV sync methods.
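The scoring idea described above can be sketched as follows. This is a minimal illustration, not the PhoVis implementation: the function names, the use of plain Euclidean pose distance, and the exponential mapping from distance to score are all assumptions made for clarity.

```python
import numpy as np

def sync_distance(expected_visemes, observed_visemes):
    """Utterance-level sync distance: mean Euclidean distance between
    expected viseme pose vectors (predicted from lip-pose history and
    the dubbed phoneme) and the visemes observed in the clip.
    A simplified stand-in for the paper's spatially grounded pose distance."""
    expected = np.asarray(expected_visemes, dtype=float)
    observed = np.asarray(observed_visemes, dtype=float)
    return float(np.linalg.norm(expected - observed, axis=1).mean())

def perceptual_score(distance, scale=1.0):
    """Map a non-negative sync distance to a score in (0, 1], where 1
    means perfect agreement. The exponential form is an illustrative
    choice, not the mapping used by PhoVis."""
    return float(np.exp(-distance / scale))
```

A perfectly synchronized utterance (observed visemes identical to the expected ones) yields a sync distance of 0 and a perceptual score of 1; the score decays toward 0 as phoneme-viseme disagreement grows.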