Online speaker diarization using adapted i-vector transforms
Abstract
Many speaker diarization systems operate offline: they first find homogeneous segments and then cluster those segments by speaker. Clustering algorithms such as bottom-up clustering, k-means, or spectral clustering generally require all segments to be available before clustering can begin. However, real-time applications such as multi-person voice-interactive systems need online speaker assignment in a strict left-to-right fashion. In this paper we propose a novel maximum a posteriori (MAP) adapted transform within an i-vector speaker diarization framework that operates in a strict left-to-right fashion. Previous work has shown that the principal components of variation of fixed-dimensional i-vectors, learned across segments, provide a strong basis for separating speakers. However, this basis is difficult to estimate when there are few segments or when operating online. The proposed method blends the prior with the estimated subspace as more i-vectors are observed. Given oracle speech activity detection (SAD) segments, adaptation yields a 3.2% speaker diarization error under a strict left-to-right constraint on the LDC Callhome English Corpus, compared to 4.8% without adaptation.
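As a rough illustration of blending a prior with a subspace estimated from the i-vectors seen so far, the sketch below applies a generic relevance-factor MAP interpolation to a covariance matrix and then takes its leading eigenvectors. This is an assumption made for illustration only: the abstract does not give the update rule, and the function name `map_adapted_basis`, the `relevance` parameter, and the choice of a covariance prior are all hypothetical.

```python
import numpy as np

def map_adapted_basis(ivectors, prior_cov, relevance=10.0, n_components=3):
    """Return a PCA basis from a prior covariance blended with observed data.

    Hypothetical sketch: a relevance-factor MAP interpolation, not
    necessarily the update used in the paper.
    """
    prior_cov = np.asarray(prior_cov, dtype=float)
    dim = prior_cov.shape[0]
    x = np.asarray(ivectors, dtype=float).reshape(-1, dim)
    n = x.shape[0]

    # Scatter matrix of the i-vectors observed so far (zero if none yet).
    if n > 0:
        centered = x - x.mean(axis=0)
        scatter = centered.T @ centered
    else:
        scatter = np.zeros((dim, dim))

    # With few i-vectors the prior dominates; with many, the data dominates.
    blended = (relevance * prior_cov + scatter) / (relevance + n)
    data_weight = n / (relevance + n)

    # Leading eigenvectors of the blended covariance form the projection basis.
    eigvals, eigvecs = np.linalg.eigh(blended)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], data_weight
```

In an online setting, such a function could be called again each time a new segment's i-vector arrives, so that the basis drifts from the prior toward the recording-specific subspace as `data_weight` approaches 1.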