
End-to-end speaker segmentation for overlap-aware resegmentation

Hervé Bredin¹ & Antoine Laurent²

¹ IRIT, Université de Toulouse, CNRS, Toulouse, France
² LIUM, Université du Mans, France
[email protected], [email protected]

arXiv:2104.04045v2 [eess.AS] 10 Jun 2021

Abstract

Speaker segmentation consists in partitioning a conversation between one or more speakers into speaker turns. Usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection), we propose to train an end-to-end segmentation model that does it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), the task is modeled as a multi-label classification problem using permutation-invariant training. The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16ms). Experiments on multiple speaker diarization datasets conclude that our model can be used with great success on both voice activity detection and overlapped speech detection. Our proposed model can also be used as a post-processing step, to detect and correctly assign overlapped speech regions. Relative diarization error rate improvement over the best considered baseline (VBx) reaches 17% on AMI, 13% on DIHARD 3, and 13% on VoxConverse.

Index Terms: speaker diarization, speaker segmentation, voice activity detection, overlapped speech detection, resegmentation.

1. Introduction

The speech processing community relies on the term segmentation to describe a multitude of tasks: from classifying the audio signal into three classes {speech, music, other}, to detecting breath groups, localizing word boundaries, or even partitioning speech regions into phonetic units. On this coarse-to-fine time scale, speaker segmentation lies somewhere between {speech, non-speech} classification and breath group detection. It consists in partitioning speech regions into smaller chunks containing speech from a single speaker. It has been addressed in the past as the combination of several sub-tasks. First, voice activity detection (VAD) removes any region that does not contain speech. Then, speaker change detection (SCD) partitions remaining speech regions into speaker turns, by looking for time instants where a change of speaker occurs [1]. From a distance, this definition of speaker segmentation may appear clear and unambiguous. However, when looking more carefully, lots of complex phenomena happen in real-life spontaneous conversations – overlapped speech, interruptions, and backchannels being the most prominent ones. Therefore, researchers have started working on the overlapped speech detection (OSD) task as well [2, 3, 4].

End-to-end speaker segmentation. Instead of addressing voice activity detection, speaker change detection, and overlapped speech detection as three different tasks, our first contribution is to train a unique end-to-end speaker segmentation model whose output encompasses the aforementioned sub-tasks. This model is directly inspired by recent advances in end-to-end speaker diarization and, in particular, the growing End-to-End Neural Diarization (EEND) family of approaches developed by Hitachi [5, 6, 7]. The proposed segmentation model is better than (or at least on par with) several voice activity detection baselines, and sets a new state of the art for overlapped speech detection on all three considered datasets: AMI Mix-Headset [8], DIHARD 3 [9, 10], and VoxConverse [11]. We did not run speaker change detection experiments.

Overlap-aware resegmentation. Our second contribution relates to the problem of assigning detected overlapped speech regions to the right speakers. While a few attempts have been made in the past [4, 12], it remains a very difficult problem for which a simple heuristic baseline has yet to be beaten [13]. We show, through extensive experimentation, that our segmentation model consistently beats this heuristic when turned into an overlap-aware resegmentation module – setting a new state of the art on the AMI dataset when combined with the VBx approach.

Reproducible research. Last but not least, our final contribution consists in sharing the pretrained model and integrating it into the pyannote open-source library for reproducibility purposes: huggingface.co/pyannote/segmentation. Expected outputs of the proposed approaches (VAD, OSD, and resegmentation) are also available at this address in RTTM format to facilitate future comparison.

This work was granted access to the HPC resources of IDRIS under the allocation AD011012177 made by GENCI, and was partly funded by the French National Research Agency (ANR) through the PLUMCOT (ANR-16-CE92-0025) and the GEM (ANR-19-CE38-0012) projects.

2. End-to-end speaker segmentation

Like in the original EEND approach [5], the task is modeled as a multi-label classification problem using permutation-invariant training. As depicted in Figure 1, the main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (around every 16ms). Processing short audio chunks also implies that the number of speakers is much smaller and less variable than with the original EEND approach (dealing with whole conversations) – making the problem easier to address. For instance, we found that 99% of all possible 5s chunks in the training set (later defined in Section 3) contained fewer than Kmax = 4 speakers.

Figure 1: Actual outputs of our model on two 5s excerpts from the same conversation between two speakers (source: file DH_EVAL_0035.flac in the DIHARD3 dataset). Top row shows the reference annotation. Middle row is the audio chunk ingested by the model. Bottom row depicts the raw speaker activations, as returned by the model. Thanks to permutation-invariant training, notice how the blue speaker corresponds to the orange activation on the left and to the green one on the right.
2.1. Permutation-invariant training

Given an audio chunk X, its reference segmentation can be encoded into a sequence of Kmax-dimensional binary frames y = {y_1, ..., y_T} where y_t ∈ {0, 1}^Kmax and y_tk = 1 if speaker k is active at frame t and y_tk = 0 otherwise. We may arbitrarily sort speakers by chronological order of their first activity, but any permutation of the Kmax dimensions is a valid representation of the reference segmentation. Therefore, the binary cross entropy loss function L_BCE (classically used for such multi-label classification problems) has to be turned into a permutation-invariant loss function L by running over all possible permutations perm(y) of y over its Kmax dimensions:

L(y, ŷ) = min_{perm(y)} L_BCE(perm(y), ŷ)    (1)

with ŷ = f(X) where f is our segmentation model whose architecture is described later in the paper. In practice, for efficiency, we first compute the Kmax × Kmax binary cross entropy losses between all pairs of y and ŷ dimensions, and rely on the Hungarian algorithm to find the permutation that minimizes the overall binary cross entropy loss.
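As an illustration of Equation (1) and of the Hungarian-algorithm shortcut described above, here is a minimal PyTorch sketch; names such as permutation_invariant_bce are ours, and this is not the actual pyannote.audio implementation.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def permutation_invariant_bce(y_true, y_pred):
    """Binary cross entropy under the best permutation of reference speakers.

    y_true: (T, K) binary reference frames.
    y_pred: (T, K) speaker activations in [0, 1].
    """
    _, K = y_true.shape
    y_true = y_true.float()

    # Pairwise BCE between every reference dimension i and prediction dimension j.
    pairwise = torch.zeros(K, K)
    for i in range(K):
        for j in range(K):
            pairwise[i, j] = F.binary_cross_entropy(y_pred[:, j], y_true[:, i])

    # Hungarian algorithm: permutation of reference speakers minimizing the total BCE.
    ref_idx, pred_idx = linear_sum_assignment(pairwise.detach().numpy())

    # Loss of Equation (1), computed under the optimal permutation.
    return F.binary_cross_entropy(
        y_pred[:, torch.as_tensor(pred_idx)], y_true[:, torch.as_tensor(ref_idx)]
    )

In a real training loop, the pairwise losses would typically be computed in one vectorized pass over the whole mini-batch rather than with Python loops, but the logic is identical.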


2.2. On-the-fly data augmentation

For training, 5s audio chunks (and their reference segmentation) are cropped randomly from the training set. To increase variability even more, we rely on on-the-fly random data augmentation. The first type of augmentation is additive background noise with random signal-to-noise ratio. Inspired by our previous work on overlapped speech detection [4], the second type of augmentation consists in artificially increasing the amount of overlapping speech. To do that, we sum two random 5s audio chunks with random signal-to-signal ratio (and merge their reference segmentation accordingly). Resulting chunks whose number of speakers is higher than Kmax are not used for training.
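To make the second augmentation concrete, the sketch below sums two equal-length chunks at a random signal-to-signal ratio and merges their frame-level labels; the helper name mix_chunks and the energy-based gain computation are our own choices for this illustration, not the exact pyannote.audio recipe.

import torch


def mix_chunks(wav1, y1, wav2, y2, max_ratio_db=10.0, k_max=4):
    """Create an artificial overlapping-speech chunk from two training chunks.

    wav1, wav2: (num_samples,) waveforms of identical length.
    y1, y2: (T, K1) and (T, K2) binary frame-level labels of the two chunks.
    Returns (mixed_waveform, merged_labels) or None if the result has too many speakers.
    """
    # Random signal-to-signal ratio in [0, max_ratio_db] dB: rescale the second
    # chunk so that its energy sits ratio_db below the first chunk.
    ratio_db = torch.rand(1).item() * max_ratio_db
    e1 = wav1.pow(2).mean().clamp(min=1e-8)
    e2 = wav2.pow(2).mean().clamp(min=1e-8)
    gain = torch.sqrt(e1 / e2) * 10 ** (-ratio_db / 20)
    mixed = wav1 + gain * wav2

    # Speakers of the two chunks are assumed distinct: merge label columns.
    labels = torch.cat([y1, y2], dim=1)

    # Chunks that end up with more than k_max speakers are discarded.
    if int((labels.sum(dim=0) > 0).sum()) > k_max:
        return None
    return mixed, labels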
2.3. Segmentation

Once trained, the model can be used for segmentation purposes, or for any of the sub-tasks, by simple post-processing of its output speaker activations:

• for segmentation or speaker change detection, a single θ = 0.5 binarization threshold already gives decent results, but one can get even better performance by using a slightly more advanced post-processing borrowed from [14] and summarized in Figure 2;

• for voice activity detection, we start by computing the maximum activation over the Kmax speakers:

ŷ_t^VAD = max_k ŷ_tk    (2)

and, then only, apply the aforementioned post-processing on the resulting mono-dimensional ŷ^VAD;

• for overlapped speech detection, since at least two speakers need to be active simultaneously to indicate overlapping speech, we compute the second highest (denoted max2nd) activation:

ŷ_t^OSD = max2nd_k ŷ_tk    (3)

and post-process the resulting mono-dimensional ŷ^OSD with the same approach.

Figure 2: To obtain the final binary segmentation, speaker activations are post-processed with θon/θoff hysteresis thresholding, then filling gaps shorter than δoff (light green region in right example) and finally removing active regions shorter than δon (does not happen in these examples).
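The following sketch ties together Equations (2) and (3) with the post-processing of Figure 2 (hysteresis thresholding with θon/θoff, filling of short gaps, removal of short active regions). It is a simplified NumPy rendition under our own naming (vad_scores, osd_scores, binarize), assuming one activation every 16ms; the actual pyannote.audio code differs in its details.

import numpy as np

FRAME_DURATION = 0.016  # the model outputs one activation every 16ms


def vad_scores(activations):
    """Equation (2): highest activation over the K_max speakers, per frame."""
    return activations.max(axis=1)


def osd_scores(activations):
    """Equation (3): second highest activation over the K_max speakers, per frame."""
    return np.sort(activations, axis=1)[:, -2]


def binarize(scores, theta_on=0.5, theta_off=0.5, delta_on=0.0, delta_off=0.0):
    """Post-processing of Figure 2: hysteresis, gap filling, short-region removal."""
    # Hysteresis thresholding: switch on above theta_on, off below theta_off.
    active, is_on = np.zeros(len(scores), dtype=bool), False
    for t, score in enumerate(scores):
        if not is_on and score > theta_on:
            is_on = True
        elif is_on and score < theta_off:
            is_on = False
        active[t] = is_on

    # Turn the frame-level decision into a list of (start, end) frame regions.
    regions, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            regions.append((start, t))
            start = None
    if start is not None:
        regions.append((start, len(active)))

    # Fill gaps shorter than delta_off seconds.
    merged = []
    for region in regions:
        if merged and (region[0] - merged[-1][1]) * FRAME_DURATION < delta_off:
            merged[-1] = (merged[-1][0], region[1])
        else:
            merged.append(region)

    # Remove active regions shorter than delta_on seconds.
    return [(s, e) for (s, e) in merged if (e - s) * FRAME_DURATION >= delta_on]

binarize(vad_scores(activations)) then yields speech regions and binarize(osd_scores(activations)) overlapped speech regions, as lists of frame indices (multiply by FRAME_DURATION to get seconds).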
2.4. Overlap-aware resegmentation

While a growing number of diarization approaches do try and take overlapped speech into account [7], the most dependable ones (like the VBx approach [15] used in Figure 3) still assume internally that at most one speaker is active at any time. There is therefore a need for a post-processing step that assigns multiple speaker labels to overlapped speech regions [4, 17].

Given an existing speaker diarization output (with K speakers) encoded into a sequence of K-dimensional binary frames y_t^DIA, we propose to use the segmentation model as a local, overlap-aware, resegmentation module. The segmentation model is applied on a 5s-long window sliding over the whole file. At each step, we find the permutation of the speaker activations ŷ that minimizes the binary cross entropy loss with respect to y^DIA. Permuted sliding speaker activations are then aggregated over time and post-processed with the threshold-based approach introduced in Section 2.3.

Figure 3: Effect of the proposed overlap-aware resegmentation approach (third row) on the VBx diarization baseline (second row). We highlight three regions where the heuristic performs better (t ≈ 100s), same (t ≈ 120s), or worse (t ≈ 115s) than the proposed approach (source: file DH_EVAL_0035.flac in the DIHARD3 dataset).
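A minimal sketch of this resegmentation loop is given below. It assumes the existing diarization and the model activations are already rasterized to the same 16ms frames, reuses the Hungarian solver to align each local window with the diarization, and simply averages aligned activations over overlapping windows before the binarization of Section 2.3; the function names and the averaging detail are our assumptions, not the exact pyannote.audio implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

EPS = 1e-6


def resegment(y_dia, activations, window=312, step=156):
    """Overlap-aware resegmentation of an existing diarization.

    y_dia: (T, K) binary frames of the existing diarization (K speakers).
    activations: (T, K_max) raw speaker activations from the segmentation model.
    window, step: sliding window size and hop in frames (312 frames is about 5s at 16ms).
    Returns (T, K) aggregated, speaker-aligned activations (to be binarized as in Section 2.3).
    """
    T, K = y_dia.shape
    K_max = activations.shape[1]
    aggregated = np.zeros((T, K))
    counts = np.zeros((T, 1))

    for start in range(0, T, step):
        end = min(start + window, T)
        ref = y_dia[start:end]          # local diarization, shape (T_w, K)
        act = activations[start:end]    # local activations, shape (T_w, K_max)

        # Frame-wise binary cross entropy between every diarization speaker
        # and every local activation dimension.
        cost = np.zeros((K, K_max))
        for i in range(K):
            for j in range(K_max):
                p = np.clip(act[:, j], EPS, 1 - EPS)
                cost[i, j] = -np.mean(ref[:, i] * np.log(p) + (1 - ref[:, i]) * np.log(1 - p))

        # Hungarian algorithm: map each diarization speaker to the local activation
        # that matches it best (when K > K_max, only the best K_max speakers are updated).
        dia_idx, act_idx = linear_sum_assignment(cost)

        # Aggregate permuted activations over overlapping windows.
        aggregated[start:end, dia_idx] += act[:, act_idx]
        counts[start:end] += 1

    return aggregated / np.maximum(counts, 1)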

3. Experiments

Datasets and partitions. We ran experiments and report results on three speaker diarization datasets, covering a wide range of domains:

DIHARD3 corpus [9, 10] does not provide a training set. Therefore, we split its development set into two parts: 192 files used as training set, and the remaining 62 files used as a smaller development set. The latter is simply referred to as the development set in the rest of the paper. When defining this split (shared at huggingface.co/pyannote/segmentation), we made sure that the 11 domains were equally distributed between both subsets. The evaluation set is kept unchanged.
VoxConverse does not provide a training set either [11]. Therefore, we also split its development set into two parts: the first 144 files (abjxc to qouur, in alphabetical order) constitute the training set, leaving the remaining 72 files (qppll to zyffh) for the actual development set.

AMI provides an official {training, development, evaluation} partition of the Mix-Headset audio files [8]. While we kept the development and evaluation sets unchanged, we only used the first 10 minutes of each file of the training set, to end up with an actual training set similar in size (22 hours) to that of the DIHARD3 (25 hours) and VoxConverse (15 hours) training sets.
Experimental protocols. We trained a unique segmentation model using the composite training set (62 hours) made of the concatenation of all three training sets. The composite development set (24 hours) served as validation and was used to decrease the learning rate on plateau and eventually choose the best model checkpoint. At the end of this process, only one segmentation model is available (not one model per dataset) and used for all experiments.

However, detection thresholds (θon, θoff, δon, and δoff) were tuned specifically for each dataset using their own development set because the manual annotation guides differ from one dataset to another, especially regarding δoff which controls whether to bridge small intra-speaker pauses. For the same reasons, detection thresholds were optimized specifically for each task addressed in the paper:

• voice activity detection thresholds are chosen to minimize the detection error rate (i.e. the sum of the false alarm and missed detection rates), with no forgiveness collar around speech turn boundaries;

• overlapped speech detection thresholds are chosen to maximize the detection F1-score, with no forgiveness collar either;

• for resegmentation, detection thresholds are chosen to minimize the diarization error rate, without forgiveness collar but with overlapped speech regions. This is consistent with the DIHARD3 evaluation plan [10] and the AMI Full evaluation setup [15], but not with VoxConverse challenge rules that use a 250ms collar [11].

All metrics were computed using the pyannote.metrics [18] open-source Python library.

Implementation details. Our segmentation model ingests 5s audio chunks with a sampling rate of 16kHz (i.e. sequences of 80000 samples). The input sequence is passed to SincNet convolutional layers using the original configuration [19] – except for the stride of the very first layer which is set to 10 (so that SincNet frames are extracted every 16ms). Four bidirectional Long Short-Term Memory (LSTM) recurrent layers (each with 128 units in both forward and backward directions, and 50% dropout for the first three layers) are stacked on top of the SincNet layers, followed by two additional fully connected layers (each with 128 units and leaky ReLU activation) which also operate at frame level. A final fully connected classification layer with sigmoid activation outputs Kmax-dimensional speaker activations between 0 and 1 every 16ms. Overall, our model contains 1.5 million trainable parameters – most of which (1.4 million) come from the recurrent layers.
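The sketch below mirrors this description in plain PyTorch. The SincNet front-end is omitted and frames are assumed to be precomputed with a feature dimension of 60 (that dimension is our assumption for the sketch, not a figure from the paper), so this is an illustrative rendition rather than the actual PyanNet model shipped with pyannote.audio.

import torch
import torch.nn as nn

K_MAX = 4  # maximum number of speakers per 5s chunk


class SegmentationModel(nn.Module):
    """Illustrative rendition of the described architecture (SincNet front-end omitted)."""

    def __init__(self, feature_dim=60, hidden=128, k_max=K_MAX):
        super().__init__()
        # Four bidirectional LSTM layers with 128 units per direction; PyTorch
        # applies the 50% dropout after every layer but the last, i.e. after
        # the first three layers, as described above.
        self.lstm = nn.LSTM(feature_dim, hidden, num_layers=4,
                            bidirectional=True, dropout=0.5, batch_first=True)

        # Two additional fully connected layers with leaky ReLU, applied frame-wise.
        self.linear = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
        )

        # Final classification layer with sigmoid activation: one activation
        # in [0, 1] per speaker and per 16ms frame.
        self.classifier = nn.Sequential(nn.Linear(hidden, k_max), nn.Sigmoid())

    def forward(self, frames):
        # frames: (batch, num_frames, feature_dim), one frame every 16ms,
        # as would be produced by the (omitted) SincNet front-end.
        outputs, _ = self.lstm(frames)
        return self.classifier(self.linear(outputs))

Counting parameters of this sketch gives roughly 1.4 million in the LSTM stack and about 1.45 million overall, consistent with the figures quoted above.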
As introduced in Section 2.2, 50% of the training samples are made out of the weighted sum of two chunks, with a signal-to-signal ratio sampled uniformly between 0 and 10dB. We also use additive background noise from the MUSAN dataset [20] with a signal-to-noise ratio sampled uniformly between 5 and 15dB.

We train the model with the Adam optimizer with default PyTorch parameters and mini-batches of size 128. The learning rate is initialized at 10^-3 and reduced by a factor of 2 every time its performance on the development set reaches a plateau. It took around 3 days using 4 V100 GPUs to reach peak performance. While we do share the pretrained model at huggingface.co/pyannote/segmentation for reproducing the results, the whole training process is also reproducible as everything has been integrated into version 2.0 of the pyannote.audio open-source library [16].
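For readers who only want to reuse the shared checkpoint, a typical usage pattern with the pyannote.audio 2.x API looks like the sketch below (based on the model card at huggingface.co/pyannote/segmentation); exact class and hyper-parameter names may vary between releases, so treat it as an assumption to be checked against the installed version.

# Assumes `pip install pyannote.audio>=2.0` and a local file "audio.wav".
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection, OverlappedSpeechDetection

model = Model.from_pretrained("pyannote/segmentation")

# Voice activity detection pipeline built on top of the segmentation model.
vad = VoiceActivityDetection(segmentation=model)
vad.instantiate({
    "onset": 0.5, "offset": 0.5,   # hysteresis thresholds (theta_on / theta_off)
    "min_duration_on": 0.0,        # remove active regions shorter than delta_on (seconds)
    "min_duration_off": 0.0,       # fill gaps shorter than delta_off (seconds)
})
speech_regions = vad("audio.wav")

# Overlapped speech detection with the same post-processing hyper-parameters.
osd = OverlappedSpeechDetection(segmentation=model)
osd.instantiate({"onset": 0.5, "offset": 0.5,
                 "min_duration_on": 0.0, "min_duration_off": 0.0})
overlap_regions = osd("audio.wav")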

Table 1: Voice activity detection // FA = false alarm rate (%) / Miss. = missed detection rate (%)

                         AMI [8, 15]              DIHARD 3 [9]            VoxConverse [11]
VAD                    FA   Miss.  FA+Miss.    FA   Miss.  FA+Miss.    FA   Miss.  FA+Miss.
silero vad            9.4    1.7    11.0      17.0   4.0    21.0      3.0    1.1     4.2
dihard3 [9]            NA     NA      NA       4.0   4.2     8.2       NA     NA      NA
Landini et al. [12]    NA     NA      NA        NA    NA      NA      1.8    1.1     3.0
pyannote 1.1 [16]     6.5    1.7     8.2       4.1   3.8     7.9      4.5    0.3     4.8
Ours – pyannote 2.0   3.6    3.2     6.8       3.9   3.3     7.3      1.8    0.8     2.5
Table 2: Overlapped speech detection // FA = false alarm rate (%) / Miss. = missed detection rate (%) / F1 = F1-score (%)

                              AMI [8, 15]                        DIHARD 3 [9]                      VoxConverse [11]
OSD                    FA   Miss.  Prec.  Rec.   F1       FA   Miss.  Prec.  Rec.   F1       FA   Miss.  Prec.  Rec.   F1
Kunesova et al. [3]     NA    NA    71.5  46.1  56.0       NA    NA     NA    NA    NA        NA    NA     NA    NA    NA
Landini et al. [12]     NA    NA      NA    NA    NA       NA    NA     NA    NA    NA      10.4  71.8   73.0  28.2  40.7
pyannote 1.1 [16, 4]  51.1  12.1    63.2  87.9  73.5     48.2  45.2   53.2  54.8  54.0     130.4  17.7   38.7  82.3  52.6
Ours – pyannote 2.0   16.9  29.4    80.7  70.5  75.3     46.9  37.2   57.2  62.8  59.9      26.3  24.5   74.2  75.5  74.8

Table 3: Resegmentation // FA = false alarm / Miss. = missed detection / Conf. = speaker confusion / DER = diarization error rate

Baseline / overlap-aware resegmentation       AMI [8, 15]              DIHARD 3 [9]            VoxConverse [11]
                                          FA   Miss. Conf.  DER     FA   Miss. Conf.  DER     FA   Miss. Conf.  DER
pyannote 1.1 [16]
  none                                   5.0   16.2   8.5  29.7    3.4   13.2  12.6  29.2    2.0   10.1   9.5  21.5
  Heuristic [13] w/ our OSD              6.9    7.9  10.9  25.7    6.3    8.9  12.8  28.1    2.8    7.3  10.1  20.3
  Ours – pyannote 2.0                    4.0   13.0   9.1  26.1    5.1    9.8  10.3  25.2    2.4    3.1   9.8  15.4
dihard3 [9]
  none                                    NA     NA    NA    NA    3.6   13.3   8.4  25.4     NA     NA    NA    NA
  Heuristic [13] w/ our OSD               NA     NA    NA    NA    6.8    8.7   8.8  24.3     NA     NA    NA    NA
  Ours – pyannote 2.0                     NA     NA    NA    NA    4.6   10.2   7.5  22.2     NA     NA    NA    NA
VBx [15] w/ our VAD
  none                                   3.1   17.2   3.8  24.1    3.6   12.5   6.2  22.3    1.7    5.1   1.4   8.3
  Heuristic [13] w/ our OSD              5.1    8.7   6.1  19.9    7.0    7.8   6.4  21.2    2.7    2.1   2.0   6.8
  Ours – pyannote 2.0                    4.3   10.9   4.7  19.9    4.7    9.7   4.9  19.3    2.7    2.6   1.8   7.1
Oracle
  Ours – pyannote 2.0                    4.7   10.0   1.4  16.1    4.6    9.8   1.8  16.2    2.6    2.5   0.6   5.7

4. Results and discussions

Voice activity detection. Table 1 compares the performance of the proposed voice activity detection approach with the official dihard3 baseline [9], Landini's submission to the VoxConverse challenge [12], and pyannote 1.1 VAD models [16]. The main conclusion is that, despite being trained for segmentation, our model is better than other models trained specifically for voice activity detection. Note, however, that one should not draw hasty conclusions regarding the performance of the silero vad model [21] as it is an off-the-shelf model which was not trained specifically for these datasets.

Overlapped speech detection. Finding good and reproducible baselines for the overlapped speech detection task proved to be difficult. We thank Kunesova et al. [3] and Landini et al. [12] for sharing the output of their detection pipelines. Results are reported in Table 2, which shows that, like for voice activity detection, our segmentation model can be used successfully for overlapped speech detection, even though it was not initially trained for this particular task. It outperforms pyannote 1.1 overlapped speech detection, which we believe was the previous state of the art [4].

Overlap-aware resegmentation. While our segmentation model was found to be useful for both voice activity detection and overlapped speech detection, post-processing the output of existing speaker diarization pipelines is where it really shines. Table 3 summarizes the resegmentation experiments performed on top of three of them, ranked from worst to best baseline performance: pyannote 1.1 pretrained pipelines [16], the dihard3 official baseline [9], and BUT's VBx approach [15]. The (admittedly wrong) criterion used for selecting those baselines was their ease of use and reproducibility. Because results reported in [15] for the VBx baseline rely on oracle voice activity detection and the shared code base does not provide an official voice activity detection implementation, we used our own (marked as Ours in Table 1) and applied VBx on top of it. Our proposed resegmentation approach consistently improves the output of all baselines on all datasets. Relative diarization error rate improvement over the best baseline (VBx) reaches 17% on AMI, 13% on DIHARD, and 13% on VoxConverse.

For comparison purposes, we also implemented a heuristic that consists in assigning detected overlapped speech regions to the two nearest speakers in time [13]. Despite its simplicity, this heuristic happens to be a strong baseline, very difficult to beat in practice [12]. Yet, our proposed resegmentation approach outperforms it for all but two experimental conditions (for which the heuristic is better only by a small margin). A closer look at the speaker confusion error rates shows that our approach is significantly better at identifying overlapping speakers. This is confirmed by the low speaker confusion error rates obtained when we apply it on top of an oracle diarization (with y^DIA = y): only 1.4%, 1.8%, and 0.6% of speech are reassigned incorrectly on AMI, DIHARD and VoxConverse respectively. Figure 3 provides a qualitative sneak peek at their respective behavior on a short 20-second excerpt. In particular, it appears that the two (heuristic and proposed) approaches do behave differently and could complement each other.

5. Conclusions

The overall best pipeline reported in this paper is the combination of our voice activity detection, off-the-shelf VBx clustering, and our overlap-aware resegmentation approach, reaching DER = 19.9% on AMI Mix-Headset using the full evaluation setup introduced in [15], DER = 19.3% on the DIHARD 3 evaluation set (full condition, 2.6% behind the winning submission), and DER = 7.1% (or DER = 3.4% with a 250ms forgiveness collar) on the VoxConverse development set.

Even with a forgiveness collar, missed detection and false alarms are the main source of errors (twice as high as speaker confusion) for all three datasets – indicating that, despite progress, overlapped speech detection remains an unsolved (and sometimes ill-defined) problem.
6. References

[1] R. Yin, H. Bredin, and C. Barras, "Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks," in Proc. Interspeech 2017, 2017.
[2] D. Charlet, C. Barras, and J. Liénard, "Impact of overlapping speech detection on speaker diarization for broadcast news and debates," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 7707–7711.
[3] M. Kunešová, M. Hrúz, Z. Zajíc, and V. Radová, "Detection of overlapping speech for the purposes of speaker diarization," in Speech and Computer, A. A. Salah, A. Karpov, and R. Potapova, Eds. Cham: Springer International Publishing, 2019, pp. 247–257.
[4] L. Bullock, H. Bredin, and L. P. Garcia-Perera, "Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection," in Proc. ICASSP 2020, 2020.
[5] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, "End-to-End Neural Speaker Diarization with Permutation-free Objectives," in Proc. Interspeech 2019, 2019, pp. 4300–4304.
[6] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-end neural speaker diarization with self-attention," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 296–303.
[7] Y. Takashima, Y. Fujita, S. Watanabe, S. Horiguchi, P. García, and K. Nagamatsu, "End-to-end speaker diarization conditioned on speech activity and overlap detection," in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 849–856.
[8] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus," Language Resources and Evaluation, vol. 41, no. 2, 2007.
[9] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, "The Third DIHARD Diarization Challenge," arXiv preprint arXiv:2012.01477, 2020.
[10] N. Ryant, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, "Third DIHARD Challenge Evaluation Plan," arXiv preprint arXiv:2006.05815, 2020.
[11] J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, "Spot the Conversation: Speaker Diarisation in the Wild," in Proc. Interspeech 2020, 2020, pp. 299–303. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2020-2337
[12] F. Landini, O. Glembek, P. Matějka, J. Rohdin, L. Burget, M. Diez, and A. Silnova, "Analysis of the BUT Diarization System for VoxConverse Challenge," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[13] S. Otterson and M. Ostendorf, "Efficient use of overlap information in speaker diarization," in 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU). IEEE, 2007, pp. 683–686.
[14] G. Gelly and J.-L. Gauvain, "Optimization of RNN-Based Speech Activity Detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 3, pp. 646–656, March 2018.
[15] F. Landini, J. Profant, M. Diez, and L. Burget, "Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks," 2020.
[16] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, "pyannote.audio: neural building blocks for speaker diarization," in Proc. ICASSP 2020, 2020.
[17] S. Horiguchi, P. Garcia, Y. Fujita, S. Watanabe, and K. Nagamatsu, "End-to-end speaker diarization as post-processing," 2020.
[18] H. Bredin, "pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems," in Proc. Interspeech 2017, Stockholm, Sweden, August 2017. [Online]. Available: http://pyannote.github.io/pyannote-metrics
[19] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. SLT 2018, 2018.
[20] D. Snyder, G. Chen, and D. Povey, "MUSAN: A Music, Speech, and Noise Corpus," 2015.
[21] Silero Team, "Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier," https://github.com/snakers4/silero-vad, 2021.
