Final Year Project Progress Report

Speaker Recognition/Verification Software for use in Security Applications

Matthew Byrne 05362725
Supervisor: Dr Edward Jones
Co-Supervisor: Dr Martin Glavin
12 January 2009
Contents
1. Introduction
   1.1 Project Background
      1.1.1 Applications of Speaker Recognition/Verification software
      1.1.2 Theory behind Speaker Recognition/Verification software
      1.1.3 Project Aim
2. Technology Background
   2.1 Matlab
3. Progress to date
   3.1 Front-end Processing
   3.2 Mel-Frequency Cepstral Coefficients
   3.3 Input
4. Remaining Project Goals
   4.1 Testing
   4.2 Investigation of Speaker Recognition and Verification over the Internet
   4.3 Investigation and Development of Real-time Implementation
   4.4 PC GUI
Appendix I - Code
   1. Program Code
   2. MFCC Code
Appendix II - Table of References
Chapter 1 Introduction
1.1 Project Background
1.1.1 Applications of Speaker Recognition/Verification software
Speaker Verification software, or voice biometrics, is technology that relies on the fact that each of us has a unique voice that identifies us. Speaker verification software is most commonly used in security applications, such as gaining access to restricted areas of a building, or to computer files, using one's voice. It is also used in call centres, where callers speak the menu option they wish to select. In many applications, speech may be the main or only means of transferring information, and so the voice would provide a simpler method of authentication. The telephone system provides a familiar network for obtaining and delivering the speech signal. For telephony-based applications, there would be no need for any new equipment or networks to be installed, as the fixed-line and mobile telephone networks would suffice. For non-telephone applications, sound cards and microphones are available as a cheap alternative.
1.1.2 Theory behind Speaker Recognition/Verification software

Speaker verification systems may be text-dependent, where the user must speak a known word or phrase, or text-independent. Text-dependent systems can greatly improve accuracy, though such constraints can be hard or impossible to enforce in certain cases. Text-independent systems analyse and use any spoken words to identify the user. A speaker verification system comprises two stages: a training stage and a test/verification stage. The first step of training is to transform the speech signal into a set of feature vectors, giving a new representation of the speech that is more suitable for statistical modelling. The speech signal is first broken up into short frames, which are then windowed to minimize distortion. The signal is then analysed and stored as that user's template. The first step in the testing stage is to extract features from the input speech in the same way as during training; the input speech is then compared against all stored templates, the most closely matching template is selected, and the speaker is identified. The figure below charts this process.
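As a rough illustration of the matching step in the verification stage, the following Matlab sketch selects the stored template closest to the input features using a simple Euclidean distance. The variable names (inputFeatures, templates) and the distance measure are illustrative assumptions rather than the final design:

% Sketch of template matching: each row of 'templates' holds one
% enrolled user's stored feature vector, and 'inputFeatures' holds
% the feature vector extracted from the test utterance.
nUsers = size(templates, 1);
distances = zeros(nUsers, 1);
for u = 1:nUsers
    % Euclidean distance between the input features and each template
    distances(u) = norm(inputFeatures - templates(u, :));
end
[bestDistance, speakerID] = min(distances);   % most closely matching template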
[Figure: time-domain plot of a speech signal frame (amplitude vs. sample number)]
The next step is to apply a Hamming window to minimize the signal discontinuities at the beginning and end of each frame. The window tapers the signal to zero at both ends of the frame, which minimizes spectral distortion in the subsequent frequency analysis.
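In Matlab, the framing and windowing step looks like the following sketch, using the 256-sample frame length and 128-sample overlap adopted in the program code in Appendix I (x holds the speech samples read from file):

frameLength = 256;                          % samples per frame
hop = 128;                                  % 50% overlap between frames
frame = x(1:frameLength);                   % first frame of the signal
windowed = hamming(frameLength) .* frame;   % taper both ends towards zero
plot(windowed);                             % compare with the figure below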
Figure: Signal with Hamming windowing

The signal is then converted to the frequency domain using a Fast Fourier Transform. Since the spectrum of a real signal is symmetric, only the first 128 of the 256 FFT points are kept; these are then weighted by the MFCC filter bank.
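In Matlab this step corresponds to the following sketch (melFilterMatrix is the filter bank routine listed in Appendix I):

spectrum = fft(windowed);                % 256-point FFT of the windowed frame
halfSpectrum = abs(spectrum(1:end/2));   % keep the first 128 points
W = melFilterMatrix(8000, 256, 20);      % 20-channel mel filter bank
filterOutputs = W * halfSpectrum;        % weight the spectrum by each filter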
Each triangular filter in the bank runs from the centre frequency of the previous filter to the centre of the following one (the first filter starts at 0 Hz and the last one ends at 4000 Hz). Suppose you are dealing with the 12th filter, which has a centre frequency of 1292 Hz. The lower slope of the triangle goes from 0 to 1 between 1137 Hz and 1292 Hz, while the upper slope of the triangle goes from 1 back to 0 between 1292 Hz and the centre frequency of the 13th filter.
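As a worked example, using the 1137 Hz and 1292 Hz figures quoted above, the weight the 12th filter applies to a spectral point lying on its lower slope can be computed as follows:

fLow = 1137;  fCentre = 1292;             % edges of the 12th filter's lower slope (Hz)
f = 1200;                                 % example frequency on the lower slope
weight = (f - fLow) / (fCentre - fLow);   % rises linearly from 0 at 1137 Hz to 1 at 1292 Hz
% here weight = 63/155, approximately 0.41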
3.3 Input
The project will initially involve algorithm development and simulation using Matlab, using pre-existing, publicly available databases of speech. However, the project will also involve the development of a system for real-time operation, using a commercially available embedded microprocessor development system or a system based on a dedicated Digital Signal Processor (depending on the choice of hardware, only a subset of the functionality of the full system may be implemented). Initially, a text-dependent speaker identification system will be developed. Input speech data is required for both training and testing; the speech used for initial training and testing was pre-recorded. In later stages, the system will use recorded audio from multiple users. This will require a microphone for input, with the signal stored temporarily while the system analyses it against the stored users. In the final stages of the project, the system will wait for a prompt to start recording, and will stop recording when it detects audio silence. This silence will then be removed from the signal, which is then analysed as usual.
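A minimal sketch of the planned silence removal, based on a simple frame-energy threshold (the frame size and threshold here are illustrative assumptions rather than final design choices):

frameLen = 256;
nFrames = floor(length(x) / frameLen);
threshold = 0.01 * frameLen * max(abs(x))^2;   % illustrative energy threshold
speech = [];
for k = 1:nFrames
    idx = (k-1)*frameLen + (1:frameLen);
    if sum(x(idx).^2) > threshold              % keep only non-silent frames
        speech = [speech; x(idx)];
    end
end
% 'speech' now holds the signal with the silent frames removed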
4.4 PC GUI
A program with a graphical user interface will be developed. Users will have the option to train or to test. Training will require a login from a previously trained user with a sufficient security profile, after which the training protocol will be run. When testing is selected, the user will be prompted to speak a given passphrase and, based on the analysis, will either pass through to the secured area or be blocked and have to try again.
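A minimal sketch of how this choice could be presented in Matlab, using the built-in questdlg dialog; trainUser and verifyUser are placeholder names for the training and testing routines described above, not existing functions:

% Present the train/test choice to the user
choice = questdlg('Select mode:', 'Speaker Verification', ...
                  'Train', 'Test', 'Test');
switch choice
    case 'Train'
        % a login check against the security profile would go here
        trainUser();
    case 'Test'
        % prompt for the passphrase and grant or deny access
        verifyUser();
end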
Appendix I - Code
1 - Program Code
i = 1;
fid = fopen('c0r0.end', 'rb');       % open the pre-recorded speech file
x = fread(fid, 'short');             % read 16-bit samples
fclose(fid);
[m, n] = size(x);
fsamp = 8000;                        % sampling frequency (Hz)
timeLength = m/8000;                 % about 1/2 a second
framelength = 256;                   % samples per frame
% overlapping frames (128-sample hop)
nFrames = floor((m-128)/(framelength-128));
figure(1); plot(x)                   % full input signal

% Mel filter bank (computed once, outside the frame loop)
W = melFilterMatrix(8000, 256, 20);

for frame = 1 : nFrames,
    acquiredData = x(i:i+256-1);     % current frame
    i = i + 128;                     % advance by the hop size
    figure(2); plot(acquiredData);
    %pause;

    % Hamming windowing
    windowing = (hamming(256).*acquiredData);
    figure(3); plot(windowing)

    % FFT
    FFTofData = fft(windowing);
    figure(4); plot(abs(FFTofData))
    FFTofData = FFTofData(1:end/2);  % first 128 points of the spectrum
    figure(5); plot(abs(FFTofData))

    % Mel filter bank outputs
    FilteredData = W*(abs(FFTofData));
    figure(6); plot(FilteredData)

    % MFCCs: DCT of the log of the filter bank outputs
    MFCC = dct(log(FilteredData));
    figure(7); plot(MFCC)
end
2 - MFCC Code
function W = melFilterMatrix(fs, N, nofChannels)
%for test, use these parameters:
%fs = 8000; N = 256; nofChannels = 19;

%compute resolution etc.
df = fs/N;                    %frequency resolution
Nmax = N/2;                   %Nyquist frequency index
fmax = fs/2;                  %Nyquist frequency
melmax = freq2mel(fmax);      %maximum mel frequency
%mel frequency increment generating 'nofChannels' filters
melinc = melmax / (nofChannels + 1);
%vector of centre frequencies on mel scale
melcenters = (1:nofChannels) .* melinc;
%vector of centre frequencies [Hz]
fcenters = mel2freq(melcenters);
%quantize into FFT indices
indexcenter = round(fcenters ./ df);
%compute start frequency, stop frequency and bandwidth in indices
indexstart = [1, indexcenter(1:nofChannels-1)];
indexstop = [indexcenter(2:nofChannels), Nmax];
%compute matrix of triangle-shaped filter coefficients
W = zeros(nofChannels, Nmax);
for c = 1:nofChannels
    %left ramp
    increment = 1.0/(indexcenter(c) - indexstart(c));
    for i = indexstart(c):indexcenter(c)
        W(c,i) = (i - indexstart(c))*increment;
    end %i
    %right ramp
    decrement = 1.0/(indexstop(c) - indexcenter(c));
    for i = indexcenter(c):indexstop(c)
        W(c,i) = 1.0 - (i - indexcenter(c))*decrement;
    end %i
end %c

%standard mel-scale conversions used above
function mel = freq2mel(f)
mel = 2595 .* log10(1 + f./700);

function f = mel2freq(mel)
f = 700 .* (10.^(mel./2595) - 1);
Appendix II - Table of References

Frank J. Owens, Signal Processing of Speech.
http://www.lumenvox.com/resources/tips/HowLVsoftwareUsed.aspx
http://www.blueface.ie
http://www.speech-recognition.de/matlab-examples2.html
http://tldp.org/HOWTO/VoIP-HOWTO-4.html
http://www.ifp.uiuc.edu/~minhdo/teaching/speaker_recognition/
http://www.hindawi.com/GetArticle.aspx?doi=10.1155/S1110865704310024
http://cslu.cse.ogi.edu/HLTsurvey/ch1node9.html