A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows
A robust and language-agnostic Voice Activity Detection (VAD) system is crucial for Digital Entertainment Content (DEC), of which movies and TV series are the primary examples. VAD systems are used in DEC creation for tasks such as augmenting subtitle creation, subtitle drift detection and correction, and audio diarisation. The majority of previous work on VAD focuses on scenarios that (a) have minimal background noise and (b) deliver audio content in English. However, movies and TV shows (a) can contain substantial amounts of non-voice background signal (e.g. musical score and environmental sounds) and (b) are released worldwide in a variety of languages. As a result, most standard VAD approaches are not readily applicable to DEC-related applications. Furthermore, there is no comprehensive analysis of the performance of Deep Neural Networks (DNNs) on the task of VAD applied to DEC. In this work, we present a thorough survey of DNN-based VADs on DEC data in terms of their accuracy, Area Under Curve (AUC), noise sensitivity, and language-agnostic behaviour. For our analysis we use 1100 proprietary DEC videos spanning 450 h of content in 9 languages and 5+ genres, making our study the largest of its kind ever published. The key findings of our analysis are: (a) Even high-quality timed-text or subtitle files contain significant levels of label noise (up to 15%). Despite this label noise, deep networks are robust and retain high AUCs (~0.94). (b) Using a larger labelled dataset can substantially increase a neural VAD model's True Positive Rate (TPR), with up to 1.3% and 18% relative improvement over the current state-of-the-art methods of Hebbar et al. (2019) and Chaudhuri et al. (2018), respectively. This effect is more pronounced in noisy environments such as music and environmental sounds.
This insight is particularly instructive when prioritizing between acquiring domain-specific labelled data and exploring model structure and complexity. (c) Currently available sequence-based neural models show similar levels of competence in their language-agnostic behaviour for VAD at high Signal-to-Noise Ratios (SNRs) and for clean speech. (d) Deep models exhibit varied performance across different SNRs, with CLDNN (Zazo et al., 2016) being the most robust. (e) Models with a comparatively large number of parameters (~2 M) are less robust to input noise than models with fewer parameters (~0.5 M).
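To make the reported metrics concrete, the following is a minimal sketch of how frame-level TPR, AUC, and relative improvement could be computed from binary speech labels and model scores. It is an illustration only and not the evaluation code used in this study: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions introduced here.

```python
from typing import Sequence


def true_positive_rate(labels: Sequence[int], scores: Sequence[float],
                       threshold: float = 0.5) -> float:
    """Fraction of speech frames (label 1) scored at or above the threshold.

    The 0.5 default threshold is an assumption for this sketch.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    if not positives:
        raise ValueError("no positive (speech) frames in labels")
    return sum(s >= threshold for s in positives) / len(positives)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random speech frame outscores a
    random non-speech frame, counting ties as one half.

    This pairwise form is quadratic in the number of frames, so it is only
    suitable for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one frame of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_improvement(new: float, baseline: float) -> float:
    """Relative change of a metric with respect to a baseline value."""
    return (new - baseline) / baseline


# Toy data (hypothetical): 1 = speech frame, 0 = non-speech frame.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]

toy_tpr = true_positive_rate(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
```

Under this definition, a model whose TPR rises from 0.50 to 0.59 shows an 18% relative improvement, which is the sense in which the relative figures above are read in this sketch.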