BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers

Eunjung Han; Chul Lee; Andreas Stolcke

2021

We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but uses an incremental Transformer encoder that attends only to its left context and applies block-level recurrence in the hidden states to carry information from block to block, making the algorithm's complexity linear in time. We propose two variants. For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, we show only moderate degradation for up to two speakers using a context size of 10 seconds, compared to offline EDA-EEND. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context size, and shows comparable accuracy with a context size of 10 seconds. For limited-latency BW-EDA-EEND, which produces diarization outputs block by block as audio arrives, we show accuracy comparable to the offline clustering-based system.
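The block-level recurrence described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `blockwise_encoder`, the single-head attention, the residual connection, and the parameters `block_size` and `context_blocks` are all assumptions chosen for illustration, and the attractor decoder is omitted entirely. Each layer caches the hidden states it received for earlier blocks and uses them as left context for the current block, so the work per block is bounded and the total cost grows linearly with the number of frames.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def blockwise_encoder(frames, layers, block_size, context_blocks):
    """Hypothetical block-wise self-attention encoder (illustrative only).

    frames:         array of shape (T, d), one feature vector per frame
    layers:         list of (w_q, w_k, w_v) tuples, each matrix of shape (d, d)
    block_size:     number of frames processed per step
    context_blocks: number of cached previous blocks used as left context
    """
    # memory[i] holds the inputs that layer i saw for recent blocks.
    memory = [[] for _ in layers]
    outputs = []
    for start in range(0, len(frames), block_size):
        h = frames[start:start + block_size]
        for i, (w_q, w_k, w_v) in enumerate(layers):
            left = memory[i]
            # Keys and values cover the cached left context plus the current block.
            context = np.concatenate(left + [h], axis=0)
            # Keep only the most recent blocks so per-block cost stays bounded.
            memory[i] = (left + [h])[-context_blocks:] if context_blocks > 0 else []
            q = h @ w_q
            k = context @ w_k
            v = context @ w_v
            scores = q @ k.T / np.sqrt(q.shape[-1])
            # Residual connection around the attention output (an assumption).
            h = h + softmax(scores) @ v
        outputs.append(h)
    return np.concatenate(outputs, axis=0)


# Usage with random weights: 20 frames, blocks of 5, two blocks of left context.
rng = np.random.default_rng(0)
d = 8
layers = [tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3)) for _ in range(2)]
frames = rng.standard_normal((20, d))
encoded = blockwise_encoder(frames, layers, block_size=5, context_blocks=2)
```

Because no block ever attends to frames on its right, the output for a block is final as soon as that block has been processed, which is the property a block-by-block streaming variant relies on.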
