
Journal of ELECTRICAL ENGINEERING, VOL. 66, NO. 3, 2015, 169–173

COMMUNICATIONS

TEXT-INDEPENDENT SPEAKER RECOGNITION USING TWO-DIMENSIONAL INFORMATION ENTROPY

Boško Božilović* — Branislav M. Todorović** — Miroslav Obradović*

Speaker recognition is the process of automatically recognizing who is speaking on the basis of speaker-specific characteristics included in the speech signal. These speaker-specific characteristics are called features. Over the past decades, extensive research has been carried out on various possible speech signal features obtained from the signal in the time or frequency domain. The objective of this paper is to introduce two-dimensional information entropy as a new text-independent speaker recognition feature. Computations are performed in the time domain with real numbers exclusively. Experimental results show that the two-dimensional information entropy is a speaker-specific characteristic, useful for speaker recognition.

Keywords: biometrics, speech, speaker recognition, feature extraction, information entropy

1 INTRODUCTION

Biometric recognition systems are increasingly being deployed as a means for the recognition of people [1]. One of the most widely used biometric modalities is the human voice. Speaker recognition systems are technologies used to recognize a person from his/her speech signal by exploiting speaker-specific characteristics [2].

Speaker-specific characteristics are the result of a combination of anatomical differences inherent in the vocal tract and the learned speaking habits of different individuals. In speaker recognition systems, all these speaker-specific characteristics can be used to discriminate between speakers [3]. These speaker-specific characteristics are called features. The most important properties of a feature are large between-speaker variability and small within-speaker variability [4].

The speech signal is a complex time-varying signal which can be represented by many different features. There are different ways to categorize the features. From the viewpoint of their physical interpretation, we can divide them into spectral features [5, 6], phonetic features [7, 8] and prosodic features [9].

Spectral features are computed from short frames of about 20–30 ms in duration. Within this interval, the speech signal is assumed to remain stationary. Spectral features represent the most common way to characterize the speech signal. Fourier analysis provides a usual way of analyzing the spectral properties of a given signal in the frequency domain. In speech analysis, the phase spectrum is usually neglected, since it is generally believed that it has little effect on the perception of speech [10]. The simplest way of analyzing the spectral properties of a signal is by using filter banks. This approach to spectral feature extraction is so-called subband filtering, where the subband outputs are considered directly as the features [11]. The most frequently used spectral features for speaker recognition are mel-frequency cepstral coefficients [12], which are based on mel-scale filter banks. Linear prediction [13, 14] is an alternative spectrum estimation method.

Phonetic features depend on the speech content [15]. In order to extract phonetic features it is necessary to segment the speech signal into phonemes. Some broad phonetic classes are more speaker-specific than others. For example, using only vowels it is possible to obtain a very high recognition rate [16].

Prosodic features are related to non-segmental aspects of speech. They reflect differences in speaking style, language background, sentence type and emotions [17]. The most important prosodic parameter is the fundamental frequency [18]. Other prosodic features for speaker recognition include speaking rate, pause statistics and intonation patterns [19].

Depending on the algorithm used, the process of speaker recognition can be categorized as text-dependent or text-independent. Text-independent recognition is the much more challenging of the two tasks, since in text-independent systems there are no constraints on the words which the speakers are allowed to use.
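The subband-filtering idea described above (short quasi-stationary frames whose band energies are used directly as features) can be sketched as follows. This is a minimal illustration, not the mel-scale systems of [11, 12]: the frame lengths, the number of equal-width bands and the function names are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Split a signal into short frames (about 20-30 ms) over which
    speech is assumed to remain stationary."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def subband_energies(frames, n_bands=8):
    """Crude subband filtering: split each frame's power spectrum into
    equal-width bands and use the band energies directly as features."""
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(spec, n_bands, axis=1)
    return np.stack([b.sum(axis=1) for b in bands], axis=1)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)               # 1 s synthetic tone in place of speech
feats = subband_energies(frame_signal(x, fs))  # one feature vector per frame
```

A practical system would replace the equal-width bands with mel-scale filters, but the framing and per-band energy computation carry over unchanged.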

* VLATACOM, R&D Center, Milutina Milankovića 5, 11070 Belgrade, Serbia, {Bosko; Miroslav.Obradovic}@vlatacom.com; ** RT-RK, Institute for Computer Based Systems, Narodnog Fronta 23A, 21000 Novi Sad, Serbia, [email protected]

© 2015 FEI STU

DOI: 10.2478/jee-2015-0027, Print ISSN 1335-3632, On-line ISSN 1339-309X

In general, phonetic variability represents an adverse factor to accuracy in text-independent speaker recognition. Another adverse factor in text-independent speaker recognition is modeling the different levels of prosodic information (instantaneous, long-term) to capture speaker-specific differences [19]. Besides those, adverse factors in speaker recognition include differences in recording and transmission conditions, the influence of a noisy environment [20], the effect of orthodontic appliances on spectral properties [21], etc.

The speaker recognition process is realized in several steps. The first step is speech signal capture by a microphone. The second step is the extraction of speech segments by removing the silence from the captured speech signal; this step is performed by a voice activity detector. The next step is the choice of features that will represent the speech signal. The step which follows is the feature extraction process, aiming to compute discriminative speech features suitable for speaker recognition. Furthermore, speaker recognition follows a standard procedure which includes two different tasks: speaker identification and speaker verification. In the speaker identification task, an unknown speaker's features are compared against a database of known speakers, and the best matching speaker is identified. In the speaker verification task, an identity claim is given, and the speaker's voice sample is compared against the claimed speaker's voice template. If the similarity degree between the voice sample and the template exceeds a predefined decision threshold, the speaker is recognized, and otherwise rejected [19].

State-of-the-art speaker recognition systems use a number of features in parallel, attempting to cover these different aspects and employing them in a complementary way to achieve more accurate recognition [22].

Information entropy can be a useful feature for speaker recognition. In information theory, entropy is defined as a measure of the randomness (uncertainty, information content) of a process. The calculation of the entropy of speech is complex, as speech signals simultaneously carry various forms of information: phonemes, topic, intonation signals, accent, speaker voice and speaker stylistics. One can consider the entropy of a speech signal at several levels: the entropy of words contained in a sequence of speech, the entropy of intonation and the entropy of speech signal features [23].

Information entropy has already been used for speaker recognition. Empirical entropy was proposed in [24], while approximated cross entropy was analyzed in [25].

In this paper, we propose and analyze so-called two-dimensional information entropy as a new feature domain for text-independent speaker recognition. An algorithm for extraction of the two-dimensional information entropy from the speech signal is described. Experimental results show that the proposed feature domain can be useful to discriminate between speakers.

2 DESCRIPTION OF TWO-DIMENSIONAL INFORMATION ENTROPY

Speech is made up of about 40 basic acoustic symbols, known as phonemes, which are used to construct words, sentences, etc. Speech is an information-rich signal that can be represented in the frequency or time domain. All this information is conveyed primarily within the traditional telephone bandwidth of 4 kHz [23].

As a speaker-specific characteristic of the speech signal we use its amplitude-time trajectory. In order to quantify the information content of the speech signal in the time domain, we define the two-dimensional information entropy of the amplitude-time trajectory.

Let us consider the analog speech signal s(t) presented in Fig. 1. The maximum value of the signal is denoted by S_max, while the minimum value is denoted by S_min. One can notice local maxima and local minima of the signal amplitude, ie the time points (..., t_{i-1}, t_i, t_{i+1}, ...) where the first derivative of the signal is equal to zero.

[Figure omitted: a speech waveform between S_min and S_max, with local extrema at t_{i-1}, t_i, t_{i+1}, the amplitude difference ΔS_i, the time difference Δt_i and the sampling interval T_s marked.]

Fig. 1. Speech signal

Let us denote by t_{i-1} a time point where the signal has a local minimum, by t_i the subsequent time point where the signal has a local maximum, and by t_{i+1} the subsequent time point where the signal has a local minimum. Furthermore, let us denote by Δs_i the amplitude difference between the local maximum at the time point t_i and the previous local minimum at the time point t_{i-1}. Let us denote by Δt_i the time difference between the time point t_i and the time point t_{i-1}. Similarly, we can define Δs_{i+1} and Δt_{i+1} as the amplitude and time differences between the time point t_{i+1} and the previous time point t_i.

The speech signal is sampled with sampling interval T_s and quantized into q levels. The quantization step is Δq = (S_max − S_min)/q. It should be noted that Δs_i = mΔq and Δt_i = nT_s, where m, n are integers and m ≤ q.

We propose two-dimensional information entropy as a measure to quantify the randomness of Δs_i and Δt_i. The two-dimensional information entropy is actually made up of two marginal entropies, H(Δs_i) and H(Δt_i), assuming independence of the random variables Δs_i and Δt_i.

Firstly, we calculate histograms of the discrete random variables Δs_i and Δt_i within a certain time interval, which is called the frame duration and denoted by t_0. Secondly, we calculate the information entropies H(Δs_i) and H(Δt_i)

$$H(\Delta s_i) = \sum_{i=1}^{I} P(\Delta s_i)\,\mathrm{ld}\,\frac{1}{P(\Delta s_i)}\,, \tag{1}$$

$$H(\Delta t_i) = \sum_{i=1}^{I} P(\Delta t_i)\,\mathrm{ld}\,\frac{1}{P(\Delta t_i)}\,, \tag{2}$$

where ld denotes the binary (base-2) logarithm and I denotes the number of intervals Δt_i within a frame, ie

$$t_0 = \sum_{i=1}^{I} \Delta t_i\,. \tag{3}$$

It will be shown that the proposed two-dimensional information entropy is a useful feature domain which represents a speaker-specific characteristic suitable for text-independent speaker recognition.

Description of experimental testbed

The experimental testbed consists of a voice activity detector, an A/D converter and a two-dimensional information entropy extractor, as shown in Fig. 2.

[Figure omitted: block diagram of the chain voice activity detector → A/D converter → two-dimensional information entropy extractor.]

Fig. 2. Experimental testbed

The function of the voice activity detector is to extract speech segments from the speech signal. A simple method [26], based on two audio features (signal energy and spectral centroid), for extraction of speech segments by removing the silence is used in the testbed.

Once the speech segments have been extracted, the speech signal is sampled at the f_s = 1/T_s = 8 kHz sampling rate with an 8-bit A/D converter, ie each sample is quantized into one of q = 256 levels.

The most important step in the speaker recognition process is to extract features from the analyzed signal. In the two-dimensional information entropy extractor, the speech signal is windowed into frames and processed sequentially. Calculations are performed according to relations (1) and (2).

3 NUMERICAL RESULTS

The speech signal database is formed of the six most frequent speakers from the Serbian parliament; three of them are male (denoted M1, M2 and M3), while the remaining three are female (denoted F1, F2 and F3). The duration of the speech signal of each of them is slightly below 4 min.

Histograms of the discrete random variables Δs_i and Δt_i are calculated for all six speakers. Using these histograms, the information entropies H(Δs_i) and H(Δt_i) are calculated according to relations (1) and (2). H(Δs_i) and H(Δt_i) represent coordinates in the two-dimensional information entropy domain. Different frame durations for calculating the histograms and the information entropies H(Δs_i) and H(Δt_i) are considered: t_0 = 10 s, 20 s and 30 s.

The results obtained for each specific frame can be represented by a point defined by the ordered pair (H(Δs_i), H(Δt_i)) in the two-dimensional information entropy domain. Numerical results in the (H(Δs_i), H(Δt_i)) plane are presented in Fig. 3, subfigures (a), (b) and (c), for t_0 = 10 s, 20 s and 30 s, respectively. From this figure one can conclude that the two-dimensional information entropy points obtained for one speaker are clustered. In other words, the within-speaker variability from frame to frame is significantly smaller relative to the between-speaker variability. Following the terminology from the vector quantization (VQ) based approach [27, 28], the ordered pair (H(Δs_i), H(Δt_i)) is called the speaker's feature vector, and the speaker's model is formed by clustering the speaker's feature vectors. In the VQ-based approach, the speakers' models are formed by clustering the K speakers' feature vectors into K non-overlapping clusters.

The coordinates of the centre of the cluster are calculated as

$$\overline{H(\Delta s_i)} = \frac{1}{N}\sum_{i=1}^{N} H(\Delta s_i)\,, \tag{4}$$

$$\overline{H(\Delta t_i)} = \frac{1}{N}\sum_{i=1}^{N} H(\Delta t_i)\,, \tag{5}$$

where N denotes the number of points in the cluster. For t_0 = 10 s, N ≅ 20; for t_0 = 20 s, N ≅ 10; while for t_0 = 30 s, N ≅ 6. According to the terminology from [27, 28], each cluster is represented by a code vector which is the centroid (average vector) of the cluster. The VQ model, also known as the centroid model, is one of the simplest text-independent speaker models.

The radius of each cluster, presented in Fig. 3, is calculated as the standard deviation of the distances between the points and the centre of the cluster

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\Big\{\big[H(\Delta s_i), H(\Delta t_i)\big] - \big[\overline{H(\Delta s_i)}, \overline{H(\Delta t_i)}\big]\Big\}^2}\,. \tag{6}$$

From Fig. 3(a), obtained for the frame duration of 10 s, one can see that the clusters are overlapping. The Gaussian Mixture Model (GMM) can be considered as an extension of the VQ model in which the clusters are overlapping [29]. A GMM is composed of a finite mixture of multivariate Gaussian components. Hence, a feature vector is not assigned to the nearest cluster as in the VQ model, but has a nonzero probability of originating from each cluster.
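Relations (1)-(2) and (4)-(6) can be sketched as follows. This is a hedged illustration rather than the authors' exact implementation: the histogram bin count, the synthetic point cloud and all names are assumptions, and ld is taken as the base-2 logarithm.

```python
import numpy as np

def entropy_bits(values, bins=32):
    """Eqs. (1)/(2): H = sum of P * ld(1/P) over a histogram of the values,
    with ld taken as log base 2 (entropy in bits)."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def centroid_and_radius(points):
    """Eqs. (4)-(6): cluster centre as the mean feature vector, and the
    cluster radius as the root-mean-square distance of the points from
    that centre."""
    points = np.asarray(points)        # shape (N, 2): rows (H(ds), H(dt))
    centre = points.mean(axis=0)       # eqs. (4) and (5)
    dists = np.linalg.norm(points - centre, axis=1)
    radius = float(np.sqrt(np.mean(dists ** 2)))   # eq. (6)
    return centre, radius

rng = np.random.default_rng(0)
# one synthetic "speaker": a cloud of per-frame (H(ds), H(dt)) points,
# roughly in the value range shown in Fig. 3
pts = rng.normal(loc=[0.35, 0.55], scale=0.01, size=(20, 2))
centre, radius = centroid_and_radius(pts)
```

The tight cloud here mimics the clustering reported in the paper: the centre plays the role of the VQ code vector and the radius the plotted cluster radius.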

[Figure omitted: three scatter plots of the speakers' feature points in the (H(Δt_i), H(Δs_i)) plane, with H(Δt_i) spanning about 0.45–0.65 and H(Δs_i) about 0.15–0.40, one cluster per speaker.]

Fig. 3. Two-dimensional information entropy for six speakers; males are denoted M1, M2 and M3, females are denoted F1, F2 and F3: (a) frame duration 10 s, (b) frame duration 20 s, (c) frame duration 30 s

From Fig. 3, one can see that the standard deviation of the two-dimensional information entropy of a speaker is reduced as the frame duration is increased. In addition, the standard deviation of the two-dimensional information entropy is higher for females than for males.

Although the frame duration of 10–30 s seems long, it is comparable with actual systems. Recently, it was announced that Barclays Wealth was to use speaker recognition to verify the identity of telephone customers within 30 seconds of normal conversation [30].

4 CONCLUSION

Two-dimensional information entropy is a useful feature domain for text-independent speaker recognition. Although the validation is performed using a small dataset, the obtained results clearly show that this feature can be used to discriminate between speakers. Two-dimensional information entropy is very accurate in gender identification.

The most significant factor affecting automatic speaker recognition performance is the variability of signal characteristics from trial to trial, ie between-trial variability. Variations arise from the speaker him/herself, from differences in recording and transmission conditions, and from different noise environments. These topics are the subject of further research.

References

[1] PATO, J.—MILLETT, L. I. (eds): Biometric Recognition: Challenges and Opportunities, National Academies Press, Washington, 2010.
[2] TOGNERI, R.—PULLELLA, D.: An Overview of Speaker Identification: Accuracy and Robustness Issues, IEEE Circuits and Systems Magazine, Second quarter (2011), 23–61.
[3] CAMPBELL, J. P.: Speaker Recognition: A Tutorial, Proc. of the IEEE 85 No. 9 (1997), 1437–1462.
[4] ROSE, P.: Forensic Speaker Identification, Taylor & Francis, London, 2002.
[5] KINNUNEN, T.: Spectral Features for Automatic Text-Independent Speaker Recognition, Licentiate's thesis, University of Joensuu, Joensuu, Finland, 2003.
[6] KINNUNEN, T.: Optimizing Spectral Feature Based Text-Independent Speaker Recognition, PhD thesis, University of Joensuu, Joensuu, Finland, 2005.
[7] JIN, Q.—SCHULTZ, T.—WAIBEL, A.: Phonetic Speaker Identification, Proc. of the Int. Conference of Spoken Language Processing (ICSLP 2002), Denver, CO, Sep 2002, pp. 1345–1348.
[8] BACHOROWSKI, J. A.—OWREN, M. J.: Acoustic Correlates of Talker Sex and Individual Talker Identity are Present in a Short Vowel Segment Produced in Running Speech, Journal of Acoust. Soc. America 106 No. 2 (1999), 1054–1063.
[9] ADAMI, A. G.: Modeling Prosodic Differences for Speaker Recognition, Speech Communication 49 No. 4 (2007), 277–291.
[10] FURUI, S.: Digital Speech Processing, Synthesis, and Recognition, 2nd ed., Marcel Dekker, New York, 2001.
[11] SIVAKUMARAN, P.—ARIYAEEINIA, A.—LOOMES, M.: Sub-Band Based Text-Dependent Speaker Verification, Speech Communication 41 No. 2-3 (2003), 485–509.
[12] DAVIS, S.—MERMELSTEIN, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, IEEE Trans. Acoustics, Speech, Signal Processing 28 No. 4 (1980), 357–366.
[13] HERMANSKY, H.: Perceptual Linear Predictive (PLP) Analysis of Speech, Journal Acoust. Soc. America 87 No. 4 (1990), 1738–1752.
[14] MAMMONE, R.—ZHANG, X.—RAMACHANDRAN, R.: Robust Speaker Recognition: a Feature Based Approach, IEEE Signal Processing Magazine 13 No. 5 (1996), 58–71.
[15] NOLAN, F.: The Phonetic Bases of Speaker Recognition, Cambridge, 1983.
[16] ANTAL, M.: Phonetic Speaker Recognition, Proc. of 7th Int. Conference Communications, Bucharest, Romania, June 2008, pp. 67–72.
[17] DEHAK, N.—KENNY, P.—DUMOUCHEL, P.: Modeling Prosodic Features with Joint Factor Analysis for Speaker Verification, IEEE Trans. Audio, Speech and Language Processing 15 No. 7 (2007), 2095–2103.
[18] MILIVOJEVIĆ, Z. N.—BRODIĆ, D.: Estimation of the Fundamental Frequency of the Speech Signal Compressed by G.723.1 Algorithm Applying PCC Interpolation, Journal of Electrical Engineering 62 No. 4 (2011), 181–189.
[19] KINNUNEN, T.—LI, H.: An Overview of Text-Independent Speaker Recognition: From Features to Supervectors, Speech Communication 52 No. 1 (2010), 12–40.

[20] SEDLAK, V.—DURACKOVA, D.—ZALUSKY—KOVACIK, T.: Intelligibility Assessment of Ideal Binary-Masked Noisy Speech with Acceptance of Room Acoustic, Journal of Electrical Engineering 65 No. 6 (2014), 325–332.
[21] PRIBIL, J.—PRIBILOVA, A.—DURACKOVA, D.: An Experiment with Spectral Analysis of Emotional Speech Affected by Orthodontic Appliances, Journal of Electrical Engineering 63 No. 5 (2012), 296–302.
[22] O'SHAUGHNESSY, D.: Automatic Speech Recognition: History, Methods and Challenges, Pattern Recognition 41 (2008), 2965–2979.
[23] VASEGHI, S. V.: Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications, John Wiley & Sons, 2007.
[24] BRUMMER, N.—du PREEZ, J.: Application Independent Evaluation of Speaker Detection, Computer Speech and Language 20 No. 2-3 (2006), 230–275.
[25] ARONOWITZ, H.—BURSHTEIN, D.: Efficient Speaker Recognition using Approximated Cross Entropy (ACE), IEEE Transactions on Audio, Speech and Language Processing 15 No. 7 (Sep 2007), 2033–2043.
[26] GIANNAKOPOULOS, T.: Silence Removal in Speech Signals, MATLAB Central, March 2014, available from: http://www.mathworks.com/matlabcentral/fileexchange/28826-silence-removal-in-speech-signals, accessed on December 14, 2014.
[27] SOONG, F. K.—ROSENBERG, A. E.—JUANG, B. H.—RABINER, L. R.: A Vector Quantization Approach to Speaker Recognition, AT&T Technical Journal 66 No. 2 (Mar-Apr 1987), 14–26.
[28] LINDE, Y.—BUZO, A.—GRAY, R.: An Algorithm for Vector Quantizer Design, IEEE Trans. on Communications 28 No. 1 (1980), 84–95.
[29] REYNOLDS, D. A.—ROSE, R. C.: Robust Text Independent Speaker Identification using Gaussian Mixture Speaker Models, IEEE Trans. on Speech and Audio Processing 3 No. 1 (1995), 72–83.
[30] Barclays International Banking, available from: https://wealth.barclays.com/en gb/internationalwealth/manage-your-money/banking-on-the-power-of-speech.html, accessed on December 14, 2014.

Received 26 February 2015

Boško Božilović was born in Belgrade, Serbia, in 1978. He received Dipl Eng and MSc degrees from the Faculty of Electrical Engineering, University of Belgrade, in 2003 and 2012, respectively. He is a Director of ICT at VLATACOM, R&D Center. His research interests are in the areas of biometrics, forensics and digital security. He has authored or co-authored several peer-reviewed journal and conference papers and holds one patent. Currently he is working towards his PhD degree.

Branislav M. Todorović was born in Belgrade, Serbia, in 1959. He received Dipl Eng and MSc degrees from the Faculty of Electrical Engineering, University of Belgrade, and a PhD degree from the Faculty of Technical Sciences, University of Novi Sad, in 1983, 1988 and 1997, respectively. He is a Senior Research Fellow at the RT-RK, Institute for Computer Based Systems, and a Full Professor at the Military Academy, University of Defence, Belgrade. He is also with VLATACOM, R&D Center, Belgrade. Prior to joining RT-RK, he was with the Institute of Microwave Techniques and Electronics IMTEL-Komunikacije, Centre for Multidisciplinary Research, and the Military Technical Institute (VTI, Institute of Electrical Engineering) in Belgrade. His research interests are in the wide area of radio telecommunications and digital signal processing. He has authored or co-authored more than 100 peer-reviewed journal and conference papers and three books.

Miroslav Obradović was born in Belgrade, Serbia, in 1978. He is a Senior Software Developer at VLATACOM, R&D Center, Belgrade.
