End-To-End Speaker Segmentation For Overlap-Aware Resegmentation
Figure 1: Actual outputs of our model on two 5s excerpts from the same conversation between two speakers (source: file DH_EVAL_0035.flac in DIHARD3 dataset). Top row shows the reference annotation. Middle row is the audio chunk ingested by the model. Bottom row depicts the raw speaker activations, as returned by the model. Thanks to permutation-invariant training, notice how the blue speaker corresponds to the orange activation on the left and to the green one on the right.
Figure 3: Effect of the proposed overlap-aware resegmentation approach (third row) on the VBx diarization baseline (second row). We highlight three regions where the heuristic performs better (t ≈ 100s), same (t ≈ 120s), or worse (t ≈ 115s) than the proposed approach (source: file DH_EVAL_0035.flac in DIHARD3 dataset).

$y = \{y_1, \ldots, y_T\}$ where $y_t \in \{0, 1\}^{K_{\max}}$ and $y_t^k = 1$ if speaker $k$ is active at frame $t$ and $y_t^k = 0$ otherwise. We may arbitrarily sort speakers by chronological order of their first activity but any permutation of the $K_{\max}$ dimensions is a valid representation of the reference segmentation. Therefore, the binary cross entropy loss function $\mathcal{L}_{\mathrm{BCE}}$ (classically used for such multi-label classification problems) has to be turned into a permutation-invariant loss function $\mathcal{L}$ by running over all possible permutations $\mathrm{perm}(y)$ of $y$ over its $K_{\max}$ dimensions:
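A minimal sketch of such a permutation-invariant loss, under the assumption that "running over all possible permutations" means keeping the smallest binary cross entropy across them. The names `bce`, `permutation_invariant_bce`, and `y_hat` (the model's activations) are hypothetical and chosen for illustration; NumPy stands in for whatever training framework is actually used.

```python
from itertools import permutations

import numpy as np


def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over all frames and speaker dimensions."""
    p = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def permutation_invariant_bce(y_true, y_pred):
    """Return (loss, perm): the smallest BCE over all column permutations of y_true.

    Assumption: the permutation-invariant loss is the minimum over permutations.
    The cost grows as K_max!, so this brute-force form only suits small K_max.
    """
    k_max = y_true.shape[1]
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(k_max)):
        loss = bce(y_true[:, list(perm)], y_pred)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy reference: T = 4 frames, K_max = 3 (third speaker never active).
y = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 0, 0]], dtype=float)

# Hypothetical activations in which the first two speakers come out swapped.
y_hat = np.array([[0.1, 0.9, 0.1],
                  [0.8, 0.9, 0.1],
                  [0.9, 0.1, 0.1],
                  [0.1, 0.1, 0.1]])

loss, perm = permutation_invariant_bce(y, y_hat)
print(perm, loss < bce(y, y_hat))
```

In this toy case the best permutation exchanges the first two reference columns, so the loss no longer penalizes the model for having assigned the speakers in a different order than the reference.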
Table 1: Voice activity detection. FA = false alarm rate (%); Miss. = missed detection rate (%).
Table 3: Resegmentation. FA = false alarm; Miss. = missed detection; Conf. = speaker confusion; DER = diarization error rate.

                                FA  Miss.  Conf.    DER       FA  Miss.  Conf.    DER       FA  Miss.  Conf.    DER
VBx [15] w/ our VAD            3.1   17.2    3.8   24.1      3.6   12.5    6.2   22.3      1.7    5.1    1.4    8.3
Heuristic [13] w/ our OSD      5.1    8.7    6.1   19.9      7.0    7.8    6.4   21.2      2.7    2.1    2.0    6.8
Ours – pyannote 2.0            4.3   10.9    4.7   19.9      4.7    9.7    4.9   19.3      2.7    2.6    1.8    7.1
Oracle Ours – pyannote 2.0     4.7   10.0    1.4   16.1      4.6    9.8    1.8   16.2      2.6    2.5    0.6    5.7
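Diarization error rate is conventionally the sum of its false alarm, missed detection, and speaker confusion components. The sketch below checks that relation against the Table 3 rows. It assumes that each row holds three blocks of (FA, Miss., Conf., DER); the names `TABLE_3`, `der_from_components`, and `max_rounding_gap` are hypothetical.

```python
# Rows transcribed from Table 3. Splitting each row into three blocks of
# (FA, Miss., Conf., DER) is an assumption about the flattened layout.
TABLE_3 = {
    "VBx [15] w/ our VAD": [
        (3.1, 17.2, 3.8, 24.1), (3.6, 12.5, 6.2, 22.3), (1.7, 5.1, 1.4, 8.3),
    ],
    "Heuristic [13] w/ our OSD": [
        (5.1, 8.7, 6.1, 19.9), (7.0, 7.8, 6.4, 21.2), (2.7, 2.1, 2.0, 6.8),
    ],
    "Ours - pyannote 2.0": [
        (4.3, 10.9, 4.7, 19.9), (4.7, 9.7, 4.9, 19.3), (2.7, 2.6, 1.8, 7.1),
    ],
    "Oracle Ours - pyannote 2.0": [
        (4.7, 10.0, 1.4, 16.1), (4.6, 9.8, 1.8, 16.2), (2.6, 2.5, 0.6, 5.7),
    ],
}


def der_from_components(fa, miss, conf):
    """Diarization error rate as the sum of its three error components."""
    return fa + miss + conf


def max_rounding_gap(table):
    """Largest absolute difference between summed components and reported DER."""
    return max(
        abs(der_from_components(fa, miss, conf) - der)
        for blocks in table.values()
        for (fa, miss, conf, der) in blocks
    )


gap = max_rounding_gap(TABLE_3)
print(f"largest |FA + Miss. + Conf. - DER| = {gap:.2f}")
```

Under that column grouping, the summed components differ from the reported DER by at most about 0.1, which is consistent with each value having been rounded to one decimal place independently.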