Text-Independent Speaker Recognition Using Two-Dimensional Information Entropy

Boško Božilović — Branislav M. Todorović — Miroslav Obradović

Journal of Electrical Engineering, Vol. 66, No. 3, 2015, 169–173
Speaker recognition is the process of automatically recognizing who is speaking on the basis of speaker-specific characteristics included in the speech signal. These speaker-specific characteristics are called features. Over the past decades, extensive research has been carried out on various possible speech signal features obtained from the signal in the time or frequency domain. The objective of this paper is to introduce two-dimensional information entropy as a new text-independent speaker recognition feature. Computations are performed in the time domain, using real numbers exclusively. Experimental results show that the two-dimensional information entropy is a speaker-specific characteristic, useful for speaker recognition.

Keywords: biometrics, speech, speaker recognition, feature extraction, information entropy
VLATACOM, R&D Center, Milutina Milankovića 5, 11070 Belgrade, Serbia, {Bosko; Miroslav.Obradovic}@vlatacom.com; RT-RK, Institute for Computer Based Systems, Narodnog Fronta 23A, 21000 Novi Sad, Serbia, [email protected]
Firstly, we calculate the histograms of the discrete random variables ∆s_i and ∆t_i within a certain time interval, which is called the frame duration and is denoted by t_0. Secondly, we calculate the information entropies H(∆s_i) and H(∆t_i)

$$H(\Delta s_i) = \sum_{i=1}^{I} P(\Delta s_i)\,\mathrm{ld}\,\frac{1}{P(\Delta s_i)}\,, \qquad (1)$$

$$H(\Delta t_i) = \sum_{i=1}^{I} P(\Delta t_i)\,\mathrm{ld}\,\frac{1}{P(\Delta t_i)}\,, \qquad (2)$$

where I denotes the number of intervals ∆t_i within a frame, ie

$$t_0 = \sum_{i=1}^{I} \Delta t_i\,. \qquad (3)$$

It will be shown that the proposed two-dimensional information entropy is a useful feature domain which represents a speaker-specific characteristic suitable for text-independent speaker recognition.
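A minimal sketch of relations (1) and (2), assuming the sequences of ∆s_i and ∆t_i values for a frame have already been extracted (their construction is defined earlier in the paper and is not repeated in this excerpt); ld denotes the base-2 logarithm.

```python
import numpy as np

def entropy(values):
    """Information entropy H = sum_i P_i * ld(1/P_i) of a sequence of
    discrete values, estimated from their histogram (relations (1)-(2))."""
    _, counts = np.unique(np.asarray(values), return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def entropy_feature(delta_s, delta_t):
    """Two-dimensional information entropy of one frame:
    the ordered pair (H(delta_s), H(delta_t))."""
    return entropy(delta_s), entropy(delta_t)
```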
Description of experimental testbed

The experimental testbed consists of a voice activity detector, an A/D converter and a two-dimensional information entropy extractor, as shown in Fig. 2.

Fig. 2. Experimental testbed (block diagram: voice activity detector, A/D converter, two-dimensional information entropy extractor)
The function of the voice activity detector is to extract speech segments from the speech signal. A simple method [26], based on two audio features (signal energy and spectral centroid), is used in the testbed to extract the speech segments by removing silence.
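The silence-removal routine of [26] is a MATLAB implementation; as a rough illustration of the same idea, the following sketch keeps only frames whose short-term energy and spectral centroid exceed simple data-driven thresholds. The frame length and the thresholding rule are assumptions made here for illustration, not details taken from [26] or from the paper.

```python
import numpy as np

def remove_silence(x, fs, frame_ms=50):
    """Keep frames whose short-term energy and spectral centroid both exceed
    simple data-driven thresholds (illustrative rule, not the method of [26])."""
    flen = int(fs * frame_ms / 1000)
    nfrm = len(x) // flen
    frames = np.asarray(x[:nfrm * flen], dtype=float).reshape(nfrm, flen)
    energy = np.mean(frames ** 2, axis=1)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(flen, d=1.0 / fs)
    centroid = (spectrum * freqs).sum(axis=1) / (spectrum.sum(axis=1) + 1e-12)
    # Illustrative thresholds: half the median of each feature over the recording.
    keep = (energy > 0.5 * np.median(energy)) & (centroid > 0.5 * np.median(centroid))
    return frames[keep].reshape(-1)  # concatenated speech frames
```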
Once the speech segments have been extracted, the speech signal is sampled at the sampling rate f_s = 1/T_s = 8 kHz with an 8-bit A/D converter, ie each sample is quantized into one of q = 256 levels.

The most important step in the speaker recognition process is to extract features from the analyzed signal. In the two-dimensional information entropy extractor, the speech signal is windowed into frames and processed sequentially. Calculations are performed according to relations (1) and (2).
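For illustration, a sketch of this frame-by-frame processing under the stated assumptions (8 kHz sampling, 8-bit quantization, frame duration t_0). The helper extract_deltas is only a placeholder for the construction of the ∆s_i and ∆t_i sequences, which is defined earlier in the paper and not reproduced in this excerpt; entropy_feature refers to the sketch given after relations (1)-(3).

```python
import numpy as np

FS = 8000      # sampling rate f_s = 8 kHz
LEVELS = 256   # 8-bit quantization, q = 256 levels

def quantize_8bit(x):
    """Map a signal in [-1, 1] onto q = 256 integer levels."""
    return np.clip(np.round((x + 1.0) * (LEVELS - 1) / 2.0), 0, LEVELS - 1).astype(int)

def extract_deltas(frame):
    """Placeholder for the construction of the delta_s and delta_t sequences,
    which is defined earlier in the paper and not shown in this excerpt."""
    raise NotImplementedError

def frame_features(x, t0=10.0):
    """Split the quantized signal into frames of duration t0 seconds and
    return one (H(delta_s), H(delta_t)) feature vector per frame."""
    q = quantize_8bit(x)
    frame_len = int(t0 * FS)
    features = []
    for start in range(0, len(q) - frame_len + 1, frame_len):
        frame = q[start:start + frame_len]
        delta_s, delta_t = extract_deltas(frame)   # see the paper for the definition
        features.append(entropy_feature(delta_s, delta_t))
    return features
```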
3 NUMERICAL RESULTS

The speech signal database is formed of the six most frequent speakers from the Serbian parliament; three of them are males (denoted M1, M2 and M3), while the remaining three are females (denoted F1, F2 and F3). The duration of the speech signal of each of them is slightly below 4 min.

Histograms of the discrete random variables ∆s_i and ∆t_i are calculated for all six speakers. Using these histograms, the information entropies H(∆s_i) and H(∆t_i) are calculated according to relations (1) and (2). H(∆s_i) and H(∆t_i) represent coordinates in the two-dimensional information entropy domain. Different frame durations for calculating the histograms and the information entropies H(∆s_i) and H(∆t_i) are considered: t_0 = 10 s, 20 s and 30 s.

The result obtained for each specific frame can be represented by a point defined by the ordered pair (H(∆s_i), H(∆t_i)) in the two-dimensional information entropy domain. Numerical results in the (H(∆s_i), H(∆t_i)) plane are presented in Fig. 3, subfigures (a), (b) and (c), for t_0 = 10 s, 20 s and 30 s, respectively. From this figure one can conclude that the two-dimensional information entropy points obtained for one speaker are clustered. In other words, the within-speaker variability from frame to frame is significantly smaller than the between-speaker variability. Following the terminology of the vector quantization (VQ) based approach [27, 28], the ordered pair (H(∆s_i), H(∆t_i)) is called the speaker's feature vector, and the speaker's model is formed by clustering the speaker's feature vectors. In the VQ-based approach, the models of K speakers are formed by clustering the speakers' feature vectors into K non-overlapping clusters.

The coordinates of the centre of each cluster are calculated as

$$\overline{H(\Delta s_i)} = \frac{1}{N}\sum_{i=1}^{N} H(\Delta s_i)\,, \qquad (4)$$

$$\overline{H(\Delta t_i)} = \frac{1}{N}\sum_{i=1}^{N} H(\Delta t_i)\,, \qquad (5)$$

where N denotes the number of points in the cluster. For t_0 = 10 s, N ≅ 20; for t_0 = 20 s, N ≅ 10; while for t_0 = 30 s, N ≅ 6. According to the terminology of [27, 28], each cluster is represented by a code vector, which is the centroid (average vector) of the cluster. The VQ model, also known as the centroid model, is one of the simplest text-independent speaker models.

The radius of each cluster, presented in Fig. 3, is calculated as the standard deviation of the distances between the points and the centre of the cluster

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl\|[H(\Delta s_i), H(\Delta t_i)] - [\overline{H(\Delta s_i)}, \overline{H(\Delta t_i)}]\bigr\|^{2}}\,. \qquad (6)$$

From Fig. 3(a), obtained for the frame duration of 10 s, one can see that the clusters are overlapping. The Gaussian Mixture Model (GMM) can be considered as an extension of the VQ model in which the clusters overlap [29]. A GMM is composed of a finite mixture of multivariate Gaussian components. Hence, a feature vector is not assigned to the nearest cluster as in the VQ model, but has a nonzero probability of originating from each cluster.
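A minimal sketch of the centroid model described above: a speaker's model is the centroid of relations (4) and (5) together with the radius of relation (6), computed over that speaker's feature vectors. The nearest-centroid decision in identify is the usual way such VQ-style models are used for identification [27, 28]; it is included here for illustration, since the paper itself only reports the clusters in Fig. 3.

```python
import numpy as np

def speaker_model(feature_vectors):
    """Centroid (relations (4)-(5)) and radius (relation (6)) of one
    speaker's cluster of (H(delta_s), H(delta_t)) feature vectors."""
    pts = np.asarray(feature_vectors, dtype=float)      # shape (N, 2)
    centroid = pts.mean(axis=0)
    radius = np.sqrt(np.mean(np.sum((pts - centroid) ** 2, axis=1)))
    return centroid, radius

def identify(feature_vector, models):
    """Assign a feature vector to the speaker with the nearest centroid."""
    fv = np.asarray(feature_vector, dtype=float)
    return min(models, key=lambda spk: np.linalg.norm(fv - models[spk][0]))
```

Here models would map speaker labels (M1, ..., F3) to the (centroid, radius) pairs produced by speaker_model.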
Fig. 3. Two-dimensional information entropy for six speakers (males M1, M2, M3; females F1, F2, F3): (a) frame duration 10 s, (b) frame duration 20 s, (c) frame duration 30 s
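For completeness, a sketch of how a scatter plot of this kind can be produced from per-speaker lists of feature vectors; the function name and plotting choices are illustrative and not taken from the paper.

```python
import matplotlib.pyplot as plt

def plot_entropy_plane(features_by_speaker):
    """Scatter the (H(delta_s), H(delta_t)) points of each speaker."""
    for speaker, points in features_by_speaker.items():
        hs, ht = zip(*points)
        plt.scatter(hs, ht, label=speaker)
    plt.xlabel("H(delta_s)")
    plt.ylabel("H(delta_t)")
    plt.legend()
    plt.show()
```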
From Fig. 3, one can see that the standard deviation of the two-dimensional information entropy of a speaker is reduced as the frame duration is increased. In addition, the standard deviation of the two-dimensional information entropy is higher for females than for males.

Although a frame duration of 10–30 s may seem long, it is comparable with actual systems. Recently, it was announced that Barclays Wealth was to use speaker recognition to verify the identity of telephone customers within 30 seconds of normal conversation [30].

4 CONCLUSION

Two-dimensional information entropy is a useful feature domain for text-independent speaker recognition. Although the validation is performed on a small dataset, the obtained results clearly show that this feature can be used to discriminate between speakers. Two-dimensional information entropy is very accurate in gender identification.

The most significant factor affecting automatic speaker recognition performance is the variability of signal characteristics from trial to trial, ie the between-trial variability. Variations arise from the speaker himself/herself, from differences in recording and transmission conditions, and from different noise environments. These topics are the subject of further research.
References

[1] Biometric Recognition: Challenges and Opportunities (Pato, J., Millett, L. I., eds.), National Academies Press, Washington, 2010.
[2] TOGNERI, R.—PULLELLA, D.: An Overview of Speaker Identification: Accuracy and Robustness Issues, IEEE Circuits and Systems Magazine, Second Quarter (2011), 23–61.
[3] CAMPBELL, J. P.: Speaker Recognition: A Tutorial, Proc. of the IEEE 85 No. 9 (1997), 1437–1462.
[4] ROSE, P.: Forensic Speaker Identification, Taylor & Francis, London, 2002.
[5] KINNUNEN, T.: Spectral Features for Automatic Text-Independent Speaker Recognition, Licentiate's thesis, University of Joensuu, Joensuu, Finland, 2003.
[6] KINNUNEN, T.: Optimizing Spectral Feature Based Text-Independent Speaker Recognition, PhD thesis, University of Joensuu, Joensuu, Finland, 2005.
[7] JIN, Q.—SCHULTZ, T.—WAIBEL, A.: Phonetic Speaker Identification, Proc. of the Int. Conference on Spoken Language Processing (ICSLP 2002), Denver, CO, Sep 2002, pp. 1345–1348.
[8] BACHOROWSKI, J. A.—OWREN, M. J.: Acoustic Correlates of Talker Sex and Individual Talker Identity are Present in a Short Vowel Segment Produced in Running Speech, Journal of Acoust. Soc. America 106 No. 2 (1999), 1054–1063.
[9] ADAMI, A. G.: Modeling Prosodic Differences for Speaker Recognition, Speech Communication 49 No. 4 (2007), 277–291.
[10] FURUI, S.: Digital Speech Processing, Synthesis, and Recognition, 2nd ed., Marcel Dekker, New York, 2001.
[11] SIVAKUMARAN, P.—ARIYAEEINIA, A.—LOOMES, M.: Sub-Band Based Text-Dependent Speaker Verification, Speech Communication 41 No. 2-3 (2003), 485–509.
[12] DAVIS, S.—MERMELSTEIN, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, IEEE Trans. Acoustics, Speech, Signal Processing 28 No. 4 (1980), 357–366.
[13] HERMANSKY, H.: Perceptual Linear Predictive (PLP) Analysis of Speech, Journal Acoust. Soc. America 87 No. 4 (1990), 1738–1752.
[14] MAMMONE, R.—ZHANG, X.—RAMACHANDRAN, R.: Robust Speaker Recognition: a Feature Based Approach, IEEE Signal Processing Magazine 13 No. 5 (1996), 58–71.
[15] NOLAN, F.: The Phonetic Bases of Speaker Recognition, Cambridge, 1983.
[16] ANTAL, M.: Phonetic Speaker Recognition, Proc. of the 7th Int. Conference Communications, Bucharest, Romania, June 2008, pp. 67–72.
[17] DEHAK, N.—KENNY, P.—DUMOUCHEL, P.: Modeling Prosodic Features with Joint Factor Analysis for Speaker Verification, IEEE Trans. Audio, Speech and Language Processing 15 No. 7 (2007), 2095–2103.
[18] MILIVOJEVIĆ, Z. N.—BRODIĆ, D.: Estimation of the Fundamental Frequency of the Speech Signal Compressed by G.723.1 Algorithm Applying PCC Interpolation, Journal of Electrical Engineering 62 No. 4 (2011), 181–189.
[19] KINNUNEN, T.—LI, H.: An Overview of Text-Independent Speaker Recognition: From Features to Supervectors, Speech Communication 52 No. 1 (2010), 12–40.
[20] SEDLAK, V.—DURACKOVA, D.—ZALUSKY—KOVACIK, T.: Intelligibility Assessment of Ideal Binary-Masked Noisy Speech with Acceptance of Room Acoustic, Journal of Electrical Engineering 65 No. 6 (2014), 325–332.
[21] PRIBIL, J.—PRIBILOVA, A.—DURACKOVA, D.: An Experiment with Spectral Analysis of Emotional Speech Affected by Orthodontic Appliances, Journal of Electrical Engineering 63 No. 5 (2012), 296–302.
[22] O'SHAUGHNESSY, D.: Automatic Speech Recognition: History, Methods and Challenges, Pattern Recognition 41 (2008), 2965–2979.
[23] VASEGHI, S. V.: Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications, John Wiley & Sons, 2007.
[24] BRUMMER, N.—du PREEZ, J.: Application Independent Evaluation of Speaker Detection, Computer Speech and Language 20 No. 2-3 (2006), 230–275.
[25] ARONOWITZ, H.—BURSHTEIN, D.: Efficient Speaker Recognition using Approximated Cross Entropy (ACE), IEEE Trans. Audio, Speech and Language Processing 15 No. 7 (Sep 2007), 2033–2043.
[26] GIANNAKOPOULOS, T.: Silence Removal in Speech Signals, MATLAB Central, March 2014, available from: http://www.mathworks.com/matlabcentral/fileexchange/28826-silence-removal-in-speech-signals, accessed on December 14, 2014.
[27] SOONG, F. K.—ROSENBERG, A. E.—JUANG, B. H.—RABINER, L. R.: A Vector Quantization Approach to Speaker Recognition, AT&T Technical Journal 66 No. 2 (Mar-Apr 1987), 14–26.
[28] LINDE, Y.—BUZO, A.—GRAY, R.: An Algorithm for Vector Quantizer Design, IEEE Trans. on Communications 28 No. 1 (1980), 84–95.
[29] REYNOLDS, D. A.—ROSE, R. C.: Robust Text Independent Speaker Identification using Gaussian Mixture Speaker Models, IEEE Trans. on Speech and Audio Processing 3 No. 1 (1995), 72–83.
[30] Barclays International Banking, available from: https://wealth.barclays.com/en_gb/internationalwealth/manage-your-money/banking-on-the-power-of-speech.html, accessed on December 14, 2014.

Received 26 February 2015

Boško Božilović was born in Belgrade, Serbia, in 1978. He received Dipl Eng and MSc degrees from the Faculty of Electrical Engineering, University of Belgrade, in 2003 and 2012, respectively. He is a Director of ICT at VLATACOM, R&D Center. His research interests are in the areas of biometrics, forensics and digital security. He has authored or co-authored several peer-reviewed journal and conference papers and holds one patent. Currently he is working towards his PhD degree.

Branislav M. Todorović was born in Belgrade, Serbia, in 1959. He received Dipl Eng and MSc degrees from the Faculty of Electrical Engineering, University of Belgrade, and a PhD degree from the Faculty of Technical Sciences, University of Novi Sad, in 1983, 1988 and 1997, respectively. He is a Senior Research Fellow at RT-RK, Institute for Computer Based Systems, and a Full Professor at the Military Academy, University of Defence, Belgrade. He is also with VLATACOM, R&D Center, Belgrade. Prior to joining RT-RK, he was with the Institute of Microwave Techniques and Electronics IMTEL-Komunikacije, Centre for Multidisciplinary Research, and the Military Technical Institute (VTI, Institute of Electrical Engineering) in Belgrade. His research interests are in the wide area of radio telecommunications and digital signal processing. He has authored or co-authored more than 100 peer-reviewed journal and conference papers and three books.

Miroslav Obradović was born in Belgrade, Serbia, in 1978. He is a Senior Software Developer at VLATACOM, R&D Center, Belgrade.