Deep Learning Techniques Notes
Feature Representation Learning in Deep Neural Networks
Through many layers of nonlinear processing, DNNs transform the raw
input feature to a more invariant and discriminative representation that
can be better classified by the log-linear model.
Fig. 9.2 DNN: a joint feature representation and classifier learning view. (The hidden layers act as a feature transform; the output layer acts as a log-linear classifier.)
In the DNN, the combination of all hidden layers can be considered as a feature learning module, as shown in Fig. 9.2. Manual feature construction works fine for tasks that people can easily inspect and know what features to use, but not for tasks whose raw features are highly variable.
Feature Hierarchy
DNNs not only learn feature representations that are suitable for the
classifier but also learn a feature hierarchy.
The hidden layers that are closer to the input layer represent low-level
features.
Those that are closer to the softmax layer represent higher-level features. The higher-level features are built upon the low-level features and are more abstract and invariant to the input feature variations.
This property can also be observed in Fig. 9.4, which shows the percentage of saturated neurons, defined as neurons whose activation value is either >0.99 or <0.01, at each layer. The lower layers typically have a smaller percentage of saturated neurons, while the higher layers, which are closer to the softmax layer, have a larger percentage of saturated neurons. Note that most of the saturated neurons are those whose activation value is <0.01, which indicates that the associated features are sparse. This is because the training label, which is 1 for the correct class and 0 for all other classes, is sparse.
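The saturation statistic used in Fig. 9.4 is easy to reproduce. The sketch below only illustrates the definition above (saturated means activation >0.99 or <0.01); the activations are random stand-ins, not outputs of a trained DNN.

```python
import numpy as np

def saturation_percentage(activations, low=0.01, high=0.99):
    """Percentage of sigmoid activations that are saturated,
    i.e., either > high or < low, as defined in the text."""
    saturated = (activations > high) | (activations < low)
    return 100.0 * saturated.mean()

# Stand-in activations for one hidden layer (256 frames, 2048 units).
rng = np.random.default_rng(0)
v = 1.0 / (1.0 + np.exp(-rng.normal(scale=4.0, size=(256, 2048))))
print(f"saturated neurons: {saturation_percentage(v):.1f}%")
```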
Fig. 9.3 Feature hierarchy learned by a network with many layers on ImageNet. (Figure extracted from Zeiler and Fergus [34], permitted to use by Zeiler.)
Fig. 9.4 Percentage of saturated neurons at each layer. (A similar figure has also been shown in Yu et al. [33])
δ^{ℓ+1} = σ(z^{ℓ+1}(v^ℓ + δ^ℓ)) − σ(z^{ℓ+1}(v^ℓ)) ≈ diag(σ′(z^{ℓ+1}(v^ℓ))) (W^{ℓ+1})^T δ^ℓ,   (9.1)

where

z^{ℓ+1}(v^ℓ) = W^{ℓ+1} v^ℓ + b^{ℓ+1}   (9.2)

is the excitation and σ(z) is the sigmoid activation function. The norm of the change δ^{ℓ+1} is

‖δ^{ℓ+1}‖ ≈ ‖diag(σ′(z^{ℓ+1}(v^ℓ))) (W^{ℓ+1})^T δ^ℓ‖
          ≤ ‖diag(σ′(z^{ℓ+1}(v^ℓ))) (W^{ℓ+1})^T‖ ‖δ^ℓ‖
          = ‖diag(v^{ℓ+1} • (1 − v^{ℓ+1})) (W^{ℓ+1})^T‖ ‖δ^ℓ‖,   (9.3)

where • refers to an element-wise product.
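The shrinking argument of Eqs. (9.1)–(9.3) can be checked numerically. The sketch below follows the forward computation of Eq. (9.2); the weights, activations, perturbation, and layer size are random stand-ins chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 512                                             # illustrative layer size
W = rng.normal(scale=0.05, size=(n, n))             # stand-in for W^{l+1}
b = np.zeros(n)                                     # stand-in for b^{l+1}

v = sigmoid(rng.normal(size=n))                     # hidden activation v^l
delta = 1e-3 * rng.normal(size=n)                   # small perturbation delta^l

v_next = sigmoid(W @ v + b)                         # v^{l+1}
delta_next = sigmoid(W @ (v + delta) + b) - v_next  # exact change of Eq. (9.1)

# First-order bound of Eq. (9.3): ||diag(v^{l+1}(1-v^{l+1})) W|| * ||delta^l||
J = np.diag(v_next * (1.0 - v_next)) @ W
bound = np.linalg.norm(J, 2) * np.linalg.norm(delta)

print(np.linalg.norm(delta_next), bound)            # exact change vs. its bound
```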
In the DNN, the magnitude of the majority of the weights is typically very small if the size of the hidden layer is large, as shown in Fig. 9.5.
Fig. 9.5 Weight magnitude distribution in a typical DNN
Fig. 9.6 Average and maximum ‖diag(v^{ℓ+1} • (1 − v^{ℓ+1})) (W^{ℓ+1})^T‖_2 across layers on a 6×2k DNN. (A similar figure has also been shown in Yu et al. [33])
As shown in Fig. 9.6, the average norm is smaller than one across layers, which means that small perturbations of the input features are shrunk as they pass through the network, more so by the higher layers than by the lower layers. Note that the maximum norm over the same development set is larger than one. This is necessary since the differences need to be large around the class boundaries to retain discrimination ability. These large-norm cases also introduce discontinuities in the objective function, as indicated by the fact that one can almost always find inputs for which a small deviation changes the label predicted by the DNN.
Fig. 9.7 A general framework of feature processing. In the figure, three stages are shown; each stage involves four optional steps.
The nonlinear processing step is critical in the deep model since the combination of linear transformations is just another linear transformation.
Flexibility in Using Arbitrary Input Features
Note that GMMs typically require each dimension in the input feature to be statistically independent so that a diagonal covariance matrix may be used to reduce the number of parameters in the model.
All the input features are mean normalized and augmented with dynamic features. The MFCC feature is used with up to third-order derivatives, while the log filter-bank features have up to second-order derivatives. The HLDA [13] transform is only applied to the MFCC feature in the CD-GMM-HMM system. From this table, we can observe that switching from the MFCC feature to the 24 Mel-scale log filter-bank feature leads to a 4.7% relative word error rate reduction.
Table 9.1 Comparison of different raw input features for DNN

  Combination of model and features     WER (rel. WERR)
  CD-GMM-HMM (MFCC, fMPE+BMMI)          34.66% (baseline)
  CD-DNN-HMM (MFCC)                     31.63% (−8.7%)
  CD-DNN-HMM (24 MS-LFB)                30.11% (−13.1%)
  CD-DNN-HMM (29 MS-LFB)                30.11% (−13.1%)
  CD-DNN-HMM (40 MS-LFB)                29.86% (−13.8%)

All the input features are mean normalized and augmented with dynamic features. Relative word error rate reduction (WERR) in parentheses. (Summarized from Li et al. [16])
Even the filter-banks can be automatically learned by the DNNs, as reported in [26], in which the FFT spectrum is used as the input to the DNNs. To suppress the dynamic range of the filter-bank output, the log function is used as the activation function in the filter-bank layer. It is reported in [26] that by learning the filter-bank parameters directly, a 5% relative WER reduction can be achieved over the baseline DNNs that use the manually designed Mel-scale filter-banks.
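The idea can be sketched as follows. This is not the implementation of [26], only a minimal PyTorch-style illustration: the module name, dimensions, and the positivity parameterization of the filters are assumptions; what it shares with [26] is a filter-bank layer on an FFT (power) spectrum whose weights are trained with the rest of the network, with a log activation to compress the dynamic range.

```python
import torch
import torch.nn as nn

class LearnableFilterBank(nn.Module):
    """Hypothetical sketch of a jointly learned filter-bank layer."""
    def __init__(self, n_fft_bins=257, n_filters=40):
        super().__init__()
        # exp() keeps the learned filter weights nonnegative (an assumption here).
        self.log_filters = nn.Parameter(0.01 * torch.randn(n_filters, n_fft_bins))

    def forward(self, power_spectrum):           # (batch, n_fft_bins)
        filters = torch.exp(self.log_filters)    # nonnegative filter shapes
        fbank = power_spectrum @ filters.t()     # filter-bank energies
        return torch.log(fbank + 1e-6)           # log activation, as in [26]

fbank_layer = LearnableFilterBank()
x = torch.rand(8, 257)                           # stand-in FFT power spectra
print(fbank_layer(x).shape)                      # torch.Size([8, 40])
```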
Robustness of Features
There are two main types of variations in speech signals: speaker variation and environment variation. In the conventional GMM-HMM systems, both types of variations need to be handled explicitly.
Robust to Speaker Variations
Robust to Environment Variations
In methods such as VTS adaptation, an estimated noise model is used to adapt the Gaussian parameters of the recognizer based on a physical model that defines how noise corrupts clean speech. The relationship between the clean speech x, corrupted (or noisy) speech y, and noise n in the log spectral domain can be approximated as

y = x + \log(1 + \exp(n − x)).   (9.4)
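Equation (9.4) is straightforward to evaluate; the short NumPy sketch below uses synthetic log-spectral values only to illustrate its two limiting behaviors.

```python
import numpy as np

def corrupt_log_spectrum(x, n):
    """Eq. (9.4): approximate noisy log-spectrum y from clean speech x and noise n."""
    return x + np.log1p(np.exp(n - x))

x = np.array([10.0, 8.0, 6.0])     # synthetic clean log-spectral features
n = np.array([4.0, 7.0, 9.0])      # synthetic noise log-spectral features
print(corrupt_log_spectrum(x, n))  # y -> x when x >> n, and y -> n when n >> x
```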
Since we are interested in the nonlinear mapping from the noisy speech y and the noise n to the clean speech x, we may augment each observation input (noisy speech) to the network with an estimate of the noise n̂_t present in the signal.
Both halves include a combination of clean speech and speech corrupted by one of six different types of noise (street traffic, train station, car, babble, restaurant, airport) at a range of signal-to-noise ratios (SNR) between 10 and 20 dB.
Table 9.3 A comparison of several GMM systems in the literature to a DNN system on the Aurora 4 task (WER in %)

  Systems                   None (clean)   Noise   Channel   Noise+channel   AVG
  GMM baseline              14.3           17.9    20.2      31.3            23.6
  MPE+NAT+VTS               7.2            12.8    11.5      19.7            15.3
  NAT+Derivative kernels    7.4            12.6    10.7      19.0            14.8
  NAT+Joint MLLR/VTS        5.6            11.0    8.8       17.8            13.4
  DNN (7×2,048)             5.6            8.8     8.9       20.0            13.4
  DNN+NaT+dropout           5.4            8.3     7.6       18.5            12.4

Summarized from [28, 33]
In fact, these results only indicate that the DNN systems are more robust than GMM systems to speaker and environmental distortions. The perturbation shrinking property in the higher layers, as we discussed in Sect. 9.2, applies equally to all conditions. In this section, we show, using results from [8], that DNNs provide similar gains over GMM systems across different noise levels and speaking rates.
In [8], Huang et al. conducted a series of studies comparing GMMs and DNNs on the mobile voice search (VS) and short message dictation (SMD) datasets collected through real-world applications used by millions of users with distinct speaking styles in diverse acoustic environments. These datasets were chosen because they cover almost all key LVCSR acoustic model challenges, each with enough data to ensure statistical significance. In the study, Huang et al. trained a pair of GMM and DNN models using 400 h of VS/SMD data. The GMM system, which used the 39-dimensional MFCC feature with up to third-order derivatives, is a state-of-the-art model trained with the feature-space minimum phone error rate (fMPE) [22] and boosted MMI (bMMI) [21] criteria. The DNN system, which used the 29-dimensional log filter-bank (LFB) feature (and up to second-order derivatives) with a context window of 11 frames, was trained using the cross-entropy (CE) criterion. The two models shared the same training data and decision tree. The same maximum likelihood estimation (MLE) GMM seed model was used for lattice generation in the GMM training and for the senone state alignment in the DNN training. The analytic study was conducted on 100 h of VS/SMD test data randomly sampled from the datasets, roughly following the same distribution as the training data.
Robustness Across Noise Levels
[Figures: WER (%) of the GMM-HMM and CD-DNN-HMM systems, together with the data distribution (%), plotted against SNR (dB); the second plot is for the SMD dataset.]
Robustness Across Speaking Rates
Speaking rate variation is another well-known factor that affects speech intelligibility and thus speech recognition accuracy. The speaking rate change can be due to different speakers, speaking modes, and speaking styles. There are several reasons that speaking rate changes may degrade speech recognition accuracy. First, they may change the acoustic score dynamic range, since the AM score of a phone is the sum over all the frames in the same phone segment. Second, the fixed frame rate, frame length, and context window size may be inadequate to capture the dynamics in transient speech events for fast or slow speech and therefore result in suboptimal modeling. Third, variable speaking rates may result in slight formant shifts due to limitations of the human vocal instrument. Last, extremely fast speech may cause formant target missing and phone deletion.

Figures 9.10 and 9.11, which originally appeared in [8], illustrate the WER difference across different speaking rates, measured as the number of phones per second, on the VS and SMD datasets, respectively. From these figures, we can notice that the CD-DNN-HMM system consistently outperforms the GMM-HMM system, with an almost uniform WER reduction across all speaking rates. Unlike in the noise robustness case, here we observe a U-shaped pattern on both the VS and SMD datasets. On the VS dataset, the best WER is achieved around 10–12 phones per second. When the speaking rate deviates, either speeding up or slowing down, 30% from this sweet spot, a 30% relative WER increase is observed. On the SMD dataset, a 15% relative WER increase can be observed when the speaking rate deviates 30% from the sweet spot.
Fig. 9.10 Performance comparison of the GMM-HMM and the CD-DNN-HMM at different speaking rates for the VS task. (Figure from Huang et al. [8], permitted to use by ISCA.)
Fig. 9.11 Performance comparison of the GMM-HMM and the CD-DNN-HMM at different speaking rates for the SMD task. (Figure from Huang et al. [8], permitted to use by ISCA.)
To compensate for the speaking rate difference, additional modeling techniques need to be developed.
Lack of Generalization Over Large Distortions
Fig. 9.12 Illustration of mixed-bandwidth speech recognition using a DNN. (Figure from Yu et al. [33], permitted to use by Yu.)
Table 9.4 Word error rate (WER) on wideband (16 kHz) and narrowband (8 kHz) test sets with and without narrowband training data

  Training data                   16 kHz VS-T (%)   8 kHz VS-T (%)
  16 kHz VS-1 + 16 kHz VS-2       27.5              53.5
  16 kHz VS-1 + 8 kHz VS-2        28.3              29.3

Table from Yu et al. [33], based on results extracted from Li et al. [16]
To analyze this behavior, we can measure the distance

d_l(x_{wb}, x_{nb}) = \frac{1}{N_l} \sum_{j=1}^{N_l} \big(v_j^l(x_{wb}) − v_j^l(x_{nb})\big)^2   (9.6)
between the activation vectors at each layer for the wideband and narrowband input feature pairs, v^l(x_{wb}) and v^l(x_{nb}), where the hidden units are shown as a function of the wideband features, x_{wb}, or the narrowband features, x_{nb}. For the top layer, whose output is the senone posterior probability, we calculate the KL divergence in nats between p(s_j|x_{wb}) and p(s_j|x_{nb}),
d_y(x_{wb}, x_{nb}) = \sum_{j=1}^{N_L} p(s_j|x_{wb}) \log \frac{p(s_j|x_{wb})}{p(s_j|x_{nb})},   (9.7)
Table 9.5 Euclidean distance for the activation vectors at each hidden layer (L1–L7) and the KL divergence (nats) for the posteriors at the softmax layer between the narrowband (8 kHz) and wideband (16 kHz) input features, measured using the wideband DNN or the mixed-bandwidth DNN
where N_L is the number of senones and s_j is the senone id. Table 9.5, extracted from [16], shows the statistics of d_l and d_y over 40,000 frames randomly sampled from the test set for the DNN trained using wideband speech only and for that trained using mixed-bandwidth speech.
From Table 9.5, we can observe that the average distances in the data-mixed DNN are consistently smaller than those in the DNN trained on wideband speech only. This indicates that by using mixed-bandwidth training data, the DNN learns to consider the differences between the wideband and narrowband input features as irrelevant variations. These variations are suppressed after many layers of nonlinear transformation. The final representation is thus more invariant to this variation and yet still has the ability to distinguish between different class labels. This behavior is even more obvious at the output layer, since the KL divergence between the paired outputs is only 0.22 nats in the mixed-bandwidth DNN, much smaller than the 2.03 nats observed in the wideband DNN.
Chapter 11
Adaptation of Deep Neural Networks
The Adaptation Problem for Deep Neural Networks
The inserted linear transformations are typically trained from an identity weight matrix and zero bias to optimize either the cross-entropy (CE) training criterion discussed in Chap. 4 (Eq. (4.1)) or the sequence-discriminative training criterion discussed in Chap. 8 (Eqs. (8.1) and (8.8)), keeping the weights of the original DNN fixed.
Linear Input Networks
In the linear input network (LIN) [2, 4, 21, 28, 33, 35] and the very similar feature discriminative linear regression (fDLR) [30] adaptation techniques, the linear transformation is applied to the input features, as shown in Fig. 11.1. The basic idea behind LIN is that the speaker-dependent (SD) feature can be linearly transformed to match that of the average speaker specified by the speaker-independent (SI) DNN model. In other words, we transform the SD feature v^0 ∈ R^{N_0×1} to v^0_{LIN} ∈ R^{N_0×1} by applying a linear transformation specified by the weight matrix W_{LIN} ∈ R^{N_0×N_0} and bias vector b_{LIN} ∈ R^{N_0×1} as

v^0_{LIN} = W_{LIN} v^0 + b_{LIN},   (11.1)

where N_0 is the size of the input layer.
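A minimal PyTorch-style sketch of LIN/fDLR adaptation is given below. The si_dnn module is a hypothetical, already-trained speaker-independent network; the function name and the example input dimension are assumptions, not part of the text.

```python
import torch
import torch.nn as nn

def make_lin_adapted_model(si_dnn: nn.Module, input_dim: int) -> nn.Sequential:
    """Prepend an identity-initialized linear input network (Eq. (11.1))
    to a frozen speaker-independent DNN."""
    lin = nn.Linear(input_dim, input_dim)
    with torch.no_grad():
        lin.weight.copy_(torch.eye(input_dim))   # W_LIN starts as the identity
        lin.bias.zero_()                         # b_LIN starts as zero
    for p in si_dnn.parameters():                # keep the original DNN fixed
        p.requires_grad = False
    return nn.Sequential(lin, si_dnn)

# During adaptation, only the LIN parameters are passed to the optimizer, e.g.:
# adapted = make_lin_adapted_model(si_dnn, input_dim=440)
# optimizer = torch.optim.SGD(adapted[0].parameters(), lr=1e-3)
```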
Fig. 11.1 Illustration of the linear input network (LIN) and feature discriminative linear regression (fDLR) adaptation techniques. A linear layer is inserted right above and applied upon the input feature layer; the original hidden layers and the output layer (often senones) are kept unchanged.
In speech recognition, the input feature vector v^0(t) = o_t = [x_{max(0,t−o)} · · · x_t · · · x_{min(T,t+o)}] at time t for an utterance with T frames often covers 2o+1 frames. When the adaptation set is small, it is desirable to apply a smaller per-frame transformation W^f_{LIN} ∈ R^{D×D}, b^f_{LIN} ∈ R^{D×1} as

x_{LIN} = W^f_{LIN} x + b^f_{LIN},   (11.2)

and the transformed input feature vector can be constructed as v^0_{LIN}(t) = o^{LIN}_t = [x^{LIN}_{max(0,t−o)} · · · x^{LIN}_t · · · x^{LIN}_{min(T,t+o)}], where D is the dimension of each feature frame and N_0 = (2o+1)D. Since W^f_{LIN} has only 1/(2o+1)^2 of the parameters of W_{LIN}, it has less transformation power and is less effective than W_{LIN}. However, it can be more reliably estimated from a small adaptation set and thus may outperform W_{LIN} overall.
Linear Output Networks
In the linear output network (LON), the linear transformation is instead applied at the output layer. In Fig. 11.2a, it is applied after the original weight matrix W^L, or

z^L_{LON_a} = W_{LON_a} z^L + b_{LON_a},   (11.3)
Fig. 11.2 Illustration of the linear output network (LON). An added linear layer is applied after (a) or before (b) the original weight matrix W^L of the output layer; the original hidden layers are kept unchanged.
×
vector,respectively,intheLON(a),andzL LONa ∈ RNL 1istheexcitationafterthe
lineartransformation.
In Fig. 11.2b, the linear transformation is applied before the softmax layer weights, or

z^L_{LON_b} = W^L v^{L−1}_{LON_b} = W^L (W_{LON_b} v^{L−1} + b_{LON_b}),   (11.4)

where v^{L−1}_{LON_b} ∈ R^{N_{L−1}×1} is the transformed feature at the last hidden layer, and W_{LON_b} ∈ R^{N_{L−1}×N_{L−1}} and b_{LON_b} ∈ R^{N_{L−1}×1} are the transformation matrix and the bias vector, respectively, in the LON (b).
It is clear that these two approaches are equally powerful, since the linear transformation of a linear transformation is just another single linear transformation, as shown in Eqs. (11.3) and (11.4). However, the number of parameters in these two approaches can be significantly different. If the number of output neurons is smaller than that of the last hidden layer, as in the case of monophone systems, W_{LON_a} is smaller than W_{LON_b}. In the CD-DNN-HMM systems, however, the output layer size is significantly larger than that of the last hidden layer. As a result, W_{LON_b} is significantly smaller than W_{LON_a} and can be more reliably estimated from the adaptation data.
Linear Hidden Networks
In the linear hidden network (LHN) [12], the linear transformation is applied to the hidden layers. This is because, as discussed in Chap. 9, a DNN can be separated into two parts at any hidden layer. The part that contains the input layer can be considered as a feature transformation module, and the hidden layer that separates the DNN can be considered as a transformed feature. The part that contains the output layer can be considered as a classifier that operates on the hidden-layer feature.

Similar to LON, in LHN there are also two ways to apply the linear transformation, as shown in Fig. 11.3. For the same reason we just discussed, these two adaptation approaches have the same adaptation power. Unlike in LON, where the sizes of W_{LON_a} and W_{LON_b} can be significantly different, in LHN W_{LHN_a} and W_{LHN_b} often have the same size because in many systems the hidden layer sizes are the same.
The effectiveness of LIN, LON, and LHN is task dependent. Although they are very similar, there are subtle differences with regard to the number of parameters and the variability of the feature adapted, as we just discussed. These factors, together with the size of the adaptation set, determine which technique is best for a specific task.
Fig. 11.3 Illustration of the linear hidden network (LHN). An added linear layer is applied after (a) or before (b) the original weight matrix W^ℓ at hidden layer ℓ; the original hidden layers and the output layer (often senones) are kept unchanged.
Conservative Training
L2 Regularization
The basic idea of the L2 regularized CT is to add the L2 norm of the model parameter difference

R_2(W_{SI} − W) = ‖vec(W_{SI} − W)‖_2^2 = \sum_{ℓ=1}^{L} ‖vec(W^ℓ_{SI} − W^ℓ)‖_2^2   (11.5)

between the speaker-independent model W_{SI} and the adapted model W to the adaptation criterion J(W, b; S), where vec(W^ℓ) ∈ R^{[N_ℓ × N_{ℓ−1}]×1} is the vector generated by concatenating all the columns in the matrix W^ℓ, and ‖vec(W^ℓ)‖_2 equals ‖W^ℓ‖_F, the Frobenius norm of the matrix W^ℓ.
When this L2 regularization term is included, the adaptation criterion becomes

J_{L2}(W, b; S) = J(W, b; S) + λ R_2(W_{SI}, W),   (11.6)

where λ is a regularization weight that controls the relative contribution of the two terms in the adaptation criterion. The L2 regularized CT aims to constrain the change of the parameters of the adapted model with regard to the speaker-independent model. Since the training criterion Eq. (11.6) is very similar to the weight decay we discussed in Sect. 4.3.3 of Chap. 4, the same training algorithm can be used directly in the L2 regularized CT adaptation.
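A hedged PyTorch-style sketch of Eqs. (11.5) and (11.6): the adaptation loss J, the model, and the regularization weight below are placeholders; the only point is that the penalty is the squared distance between the adapted parameters and the stored SI copy.

```python
import torch

def l2_ct_penalty(model, si_params, lam):
    """R_2 of Eq. (11.5), scaled by the regularization weight lambda (Eq. (11.6)):
    squared distance between adapted parameters and the speaker-independent copy."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + torch.sum((p - si_params[name]) ** 2)
    return lam * penalty

# Usage during adaptation (J is the CE or sequence loss on the adaptation set):
# si_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# loss = J(model, batch) + l2_ct_penalty(model, si_params, lam=0.01)
```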
KL-Divergence Regularization
J_{KLD}(W, b; S) = (1 − λ) J(W, b; S) + λ R_{KLD}(W_{SI}, b_{SI}; W, b; S)   (11.7)

is used, where

J_{CE}(W, b; o, y) = − \sum_{i=1}^{C} P_{emp}(i|o_m) \log P(i|o_m),   (11.10)

and where we defined

\ddot{P}(i|o_m) = (1 − λ) P_{emp}(i|o_m) + λ P_{SI}(i|o_m).   (11.12)
Note that Eq. (11.11) has the same form as the CE criterion except that the target distribution is now an interpolated value between the empirical probability P_{emp}(i|o_m) and the probability P_{SI}(i|o_m) estimated from the SI model. This interpolation prevents overtraining by keeping the adapted model from straying too far away from the SI model. It also indicates that the normal backpropagation (BP) algorithm can be directly used to adapt the DNN. The only thing that needs to be changed is the error signal at the output layer, which is now defined based on \ddot{P}(i|o_m) instead of P_{emp}(i|o_m).
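Concretely, the only change to standard BP is the target distribution. The sketch below computes the interpolated target of Eq. (11.12) and the corresponding CE loss; the SI posteriors p_si are assumed to come from a frozen copy of the SI model, and the tensor names are made up.

```python
import torch
import torch.nn.functional as F

def kld_regularized_ce(logits, hard_targets, p_si, lam):
    """Cross-entropy against the interpolated target of Eq. (11.12):
    (1 - lam) * empirical (one-hot) distribution + lam * SI-model posterior."""
    p_emp = F.one_hot(hard_targets, num_classes=logits.shape[-1]).float()
    target = (1.0 - lam) * p_emp + lam * p_si      # the interpolated target
    log_post = F.log_softmax(logits, dim=-1)
    return -(target * log_post).sum(dim=-1).mean()

# logits: adapted-model outputs; hard_targets: senone labels (LongTensor);
# p_si: softmax outputs of the frozen SI model for the same frames.
```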
Note that the KLD regularization differs from the L2 regularization, which constrains the model parameters themselves rather than the output probabilities. Since what we care about is the output probability instead of the model parameters themselves, KLD regularization is more attractive and often performs better than the L2 regularization.
The interpolation weight, which is directly derived from the regularization weight λ, can be adjusted, typically using a development set, based on the size of the adaptation set, the learning rate used, and whether the adaptation is supervised or unsupervised. When λ = 1, we trust the SI model completely and ignore all new information from the adaptation data. When λ = 0, we adapt the model solely on the adaptation set, ignoring information from the SI model except using it as the starting point. Intuitively, we should use a large λ for a small adaptation set and a small λ for a large adaptation set.
The KLD regularized adaptation technique can be easily extended to sequence-discriminative training. As discussed in Sect. 8.2.3, to prevent overfitting, frame smoothing is often used in sequence-discriminative training, which leads to the interpolated training criterion

J_{FS-SEQ}(W, b; S) = (1 − H) J_{CE}(W, b; S) + H J_{SEQ}(W, b; S),   (11.13)

J_{KLD-FS-SEQ}(W, b; S) = J_{FS-SEQ}(W, b; S) + λ_s R_{KLD}(W_{SI}, b_{SI}; W, b; S),   (11.14)

where λ_s is the regularization weight for the sequence-discriminative training. J_{KLD-FS-SEQ} can be converted to

J_{KLD-FS-SEQ}(W, b; S) = H J_{SEQ}(W, b; S) + (1 − H + λ_s) J_{KLD-CE}(W, b; S)   (11.15)

following a similar derivation by defining λ = λ_s / (1 − H + λ_s).
Reducing Per-Speaker Footprint
ΔW_{m×n} = U_{m×n} Σ_{n×n} V^T_{n×n} ≈ Ũ_{m×k} Σ_{k×k} Ṽ^T_{k×n} = Ũ_{m×k} W̃_{k×n},   (11.16)

where ΔW_{m×n} ∈ R^{m×n}, Σ_{n×n} is a diagonal matrix that contains all the singular values, k < n is the number of singular values kept, U and V^T are unitary matrices whose columns form sets of orthonormal vectors, which can be regarded as basis vectors, and W̃_{k×n} = Σ_{k×k} Ṽ^T_{k×n}. Here, we only need to store Ũ_{m×k} and W̃_{k×n}, which contain (m+n)k parameters instead of m×n parameters. Experiments in [36] have shown that little or no accuracy loss is observed even if less than 10% of the delta parameters are stored.
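A minimal NumPy sketch of Eq. (11.16): the speaker-specific delta weight matrix is approximated by its top-k singular components, so only (m+n)k values need to be stored per speaker. The matrix itself, its size, and the choice of k below are illustrative stand-ins.

```python
import numpy as np

def compress_delta(delta_w, k):
    """Keep the top-k singular components of a delta weight matrix (Eq. (11.16)).
    Returns U_tilde (m x k) and W_tilde = Sigma_k V_k^T (k x n)."""
    U, s, Vt = np.linalg.svd(delta_w, full_matrices=False)
    U_k = U[:, :k]
    W_k = np.diag(s[:k]) @ Vt[:k, :]
    return U_k, W_k

rng = np.random.default_rng(0)
delta = rng.normal(scale=0.01, size=(2048, 2048))   # stand-in per-speaker delta
U_k, W_k = compress_delta(delta, k=128)
stored = U_k.size + W_k.size                        # (m + n) * k values
print(stored, delta.size, np.linalg.norm(delta - U_k @ W_k))
```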
The second approach is applied on top of the low-rank model approximation technique we discussed in Chap. 7, as shown in Fig. 11.4.

Fig. 11.4 Illustration of the SVD bottleneck adaptation technique

In this technique, each weight matrix is first decomposed as

W_{m×n} = U_{m×n} Σ_{n×n} V^T_{n×n},   (11.17)

where Σ is a diagonal matrix consisting of nonnegative singular values in decreasing order, and U and V^T are unitary matrices whose columns form sets of orthonormal vectors. Here, Σ plays an important role, and since it is a diagonal matrix, the number of parameters in Σ is very small and equals n. Instead of adapting the whole weight matrix W, we can adapt only this diagonal matrix. If the adaptation set is small, we may even adapt only the top k% of the singular values.
Subspace Methods
Subspace methods aim to find a speaker subspace and then construct the adapted neural network weights or transformations as a point in the subspace. Promising techniques in this category include the principal component analysis (PCA) based approach [9], noise-aware [31] (which we have discussed in Chap. 9) and speaker-aware training [29], and tensor-based adaptation [41, 42]. Techniques in this category can adapt the SI model to a specific speaker very quickly.
Subspace Construction Through Principal Component Analysis
In [9], a fast adaptation technique was proposed. In this technique, a subspace is used to estimate an affine transformation matrix by considering it as a random variable. Principal component analysis (PCA) is performed on a set of adaptation matrices to obtain the principal directions (i.e., eigenvectors) in the speaker space. Each new speaker adaptation model is then approximated by a linear combination of the retained eigenvectors.
The above-mentioned technique can be extended to a more general setting. Given a set of S speakers, we can estimate a speaker-specific matrix W_{ADP} ∈ R^{m×n} for each speaker. Here, the speaker-specific matrix can be the linear transformation in LIN, LHN, or LON, or the adapted weights or delta weights in conservative training. Denoting a = vec(W_{ADP}), the vectorization of the matrix, each matrix can be considered as an observation of a random variable in a speaker space of dimension m×n. PCA can then be performed on the set of S vectors in the speaker space. The eigenvectors obtained from the PCA define principal adaptation matrices.
This approach assumes that new speakers can be represented as a point in the space spanned by the S speakers. In other words, S is big enough to cover the speaker space. Since each new speaker is represented by a linear combination of the eigenvectors, when S is large, the total number of linear interpolation weights can also be large. Fortunately, the dimensionality of the speaker space can be reduced by discarding the eigenvectors that correspond to small variances. In this way, each speaker-specific matrix can be efficiently represented by a reduced number of parameters.
For each new speaker,

a = ā + U g_a ≈ ā + Ũ g̃_a.   (11.18)
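A hedged NumPy sketch of the whole procedure: vectorized speaker adaptation matrices are collected, PCA yields the principal adaptation directions, and a new speaker is represented by a small number of interpolation weights as in Eq. (11.18). All data, dimensions, and the projection used to obtain the weights below are stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
S, m, n, k = 100, 64, 64, 10            # speakers, matrix size, kept eigenvectors

A = rng.normal(size=(S, m * n))         # a_s = vec(W_ADP) for each training speaker
a_bar = A.mean(axis=0)                  # mean adaptation vector
_, _, Vt = np.linalg.svd(A - a_bar, full_matrices=False)
U_tilde = Vt[:k].T                      # principal adaptation directions (m*n x k)

# New speaker: estimate k interpolation weights g (here simply by projecting an
# adaptation vector estimated for that speaker), then reconstruct via Eq. (11.18).
a_new = rng.normal(size=m * n)
g = U_tilde.T @ (a_new - a_bar)
a_hat = a_bar + U_tilde @ g             # a ~= a_bar + U_tilde g
W_hat = a_hat.reshape(m, n)             # adapted transformation for the new speaker
print(W_hat.shape)
```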
Fig. 11.5 Illustration of speaker-aware training: the input to the DNN consists of the acoustic feature and the speaker information.
Noise-Aware, Speaker-Aware, and Device-Aware Training
Another set of subspace approaches explicitly estimates the noise or speaker information from the utterance and provides this information to the network, in the hope that the DNN training algorithm can automatically figure out how to adjust the model parameters to exploit the noise, speaker, or device information. We call such approaches noise-aware training (NaT) when the noise information is used, speaker-aware training (SaT) when the speaker information is exploited, and device-aware training (DaT) when the device information is used. Since NaT, SaT, and DaT are very similar and we have already discussed NaT in Chap. 9, we focus on SaT in this subsection, noting that NaT and DaT can be carried out similarly. Figure 11.5 illustrates the architecture of SaT, in which the input to the DNN has two parts: the acoustic feature and the speaker information (or noise information if NaT is used).
The reason SaT helps to boost the performance of DNNs can be easily understood with the following analysis. Without the speaker information, the activation of the first hidden layer is

v^1 = f(z^1) = f(W^1 v^0 + b^1).   (11.19)

With the speaker information s appended to the input, it becomes

v^1 = f(W^1_v v^0 + W^1_s s + b^1_{SaT}),   (11.20)

where s is the vector that characterizes the speaker, and W^1_v and W^1_s are the weight matrices associated with the acoustic feature and the speaker information, respectively.
Compared to the normal DNN, which uses a fixed bias vector b^1, SaT uses a speaker-dependent bias b^1_s = W^1_s s + b^1_{SaT}. One benefit of SaT is that the adaptation is implicit and efficient and does not require a separate adaptation step. If the speaker information can be reliably estimated, SaT is a great candidate for speaker adaptation in the DNN framework.
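In practice, SaT amounts to concatenating a per-speaker vector (a speaker code or i-vector) to the acoustic feature of every frame. A minimal PyTorch-style sketch of the first hidden layer of Eq. (11.20) follows; the class name and dimensions are made up.

```python
import torch
import torch.nn as nn

class SaTFirstLayer(nn.Module):
    """First hidden layer of a speaker-aware DNN (Eq. (11.20)): separate weights
    act on the acoustic feature and on the speaker vector, which is equivalent
    to using a speaker-dependent bias W_s^1 s + b^1."""
    def __init__(self, feat_dim=440, spk_dim=100, hidden_dim=2048):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, hidden_dim)              # W_v^1 and b^1
        self.w_s = nn.Linear(spk_dim, hidden_dim, bias=False)   # W_s^1

    def forward(self, feat, spk_vec):
        return torch.sigmoid(self.w_v(feat) + self.w_s(spk_vec))

layer = SaTFirstLayer()
feat = torch.rand(8, 440)        # acoustic features (e.g., an 11-frame context)
spk = torch.rand(8, 100)         # speaker code or i-vector, repeated per frame
print(layer(feat, spk).shape)    # torch.Size([8, 2048])
```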
The speaker information can be derived in many different ways. For example, in [1, 37], a speaker code is used as the speaker information. During training, this code is jointly learned with the rest of the model parameters for each speaker. During the decoding process, we first learn a single speaker code for the new speaker using all the utterances spoken by the speaker. This can be done by treating the speaker code as part of the model parameters and estimating it using the backpropagation algorithm while keeping the rest of the parameters fixed. The speaker code is then used as part of the input to the DNN to compute the state likelihood.
The speaker information can also be estimated completely independently of the DNN training. For example, it may be learned from a separate DNN, from which either the output node or the last hidden layer can be used to represent the speaker. In [29], i-vectors [8, 13] are used instead. The i-vector is a popular technique for speaker verification and recognition. It encapsulates the most important information about a speaker's identity in a low-dimensional fixed-length representation and thus is an attractive tool for speaker adaptation in ASR. I-vectors have been used not only in DNN adaptation, but also in GMM adaptation. For example, they have been used in [16] for discriminative speaker adaptation with region-dependent linear transforms and in [5, 39] for clustering speakers or utterances for more efficient adaptation. Since a single low-dimensional i-vector is estimated from all the utterances spoken by the same speaker, the i-vector can be reliably estimated from less data than other approaches. Since the i-vector is an important technique, we summarize its computation steps here.
Computation of I-Vector
We denote by x_t ∈ R^{D×1} the acoustic feature vectors generated from a universal background model (UBM) represented as a GMM with K diagonal-covariance Gaussians,

x_t ∼ \sum_{k=1}^{K} c_k N(x; μ_k, Σ_k),   (11.21)

where c_k, μ_k, and Σ_k are the mixture weight, Gaussian mean, and diagonal covariance of the kth Gaussian component. We assume the acoustic features x_t(s) specific to speaker s are drawn from the distribution

x_t(s) ∼ \sum_{k=1}^{K} c_k N(x; μ_k(s), Σ_k),   (11.22)

with speaker-dependent means

μ_k(s) = μ_k + T_k w(s), 1 ≤ k ≤ K,   (11.23)

where T_k ∈ R^{D×M} is the factor-loading matrix for the kth Gaussian and w(s) ∈ R^{M×1} is the speaker identity vector (i-vector) to be estimated. The posterior distribution of w given the speaker's acoustic features is Gaussian,

p(w | {x_t(s)}) = N\Big(w; L^{-1}(s) \sum_{k=1}^{K} T_k^T Σ_k^{-1} θ_k(s), L^{-1}(s)\Big),   (11.24)
where the precision matrix L(s) ∈ R^{M×M} is

L(s) = I + \sum_{k=1}^{K} γ_k(s) T_k^T Σ_k^{-1} T_k,   (11.25)

and the zeroth-order and first-order statistics are

γ_k(s) = \sum_{t=1}^{T} γ_{tk}(s),   (11.26)

θ_k(s) = \sum_{t=1}^{T} γ_{tk}(s) (x_t(s) − μ_k(s)),   (11.27)
with γ_{tk}(s) being the posterior probability of mixture component k given x_t(s). The i-vector is simply the MAP point estimate of the variable w,

w(s) = L^{-1}(s) \sum_{k=1}^{K} T_k^T Σ_k^{-1} θ_k(s),   (11.28)

which is just the mean of the posterior distribution (11.24).
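Given the UBM and the matrices T_k, extracting an i-vector is a closed-form computation from the per-Gaussian statistics. The NumPy sketch below implements Eqs. (11.25) and (11.28); the statistics and matrices are random stand-ins, and the dimensions are only illustrative.

```python
import numpy as np

def extract_ivector(gamma, theta, T, sigma_inv):
    """MAP i-vector estimate of Eq. (11.28).
    gamma:     (K,)       zeroth-order statistics, Eq. (11.26)
    theta:     (K, D)     first-order statistics, Eq. (11.27)
    T:         (K, D, M)  per-Gaussian factor-loading matrices T_k
    sigma_inv: (K, D)     inverse diagonal covariances of the UBM
    """
    K, D, M = T.shape
    L = np.eye(M)                                # precision matrix, Eq. (11.25)
    rhs = np.zeros(M)
    for k in range(K):
        TkS = T[k].T * sigma_inv[k]              # T_k^T Sigma_k^{-1}, shape (M, D)
        L += gamma[k] * (TkS @ T[k])
        rhs += TkS @ theta[k]
    return np.linalg.solve(L, rhs)               # w(s) = L^{-1}(s) sum_k ...

rng = np.random.default_rng(0)
K, D, M = 256, 39, 100                           # Gaussians, feature dim, i-vector dim
w = extract_ivector(rng.random(K), rng.normal(size=(K, D)),
                    0.1 * rng.normal(size=(K, D, M)), rng.random((K, D)) + 0.5)
print(w.shape)                                   # (100,)
```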
Note that {T_k | 1 ≤ k ≤ K} are unknown and need to be estimated from the speaker-specific acoustic features {x_t(s)} using the expectation–maximization (EM) algorithm to maximize the maximum likelihood (ML) training criterion [6]

Q(T_1, ..., T_K) = −\frac{1}{2} \sum_{s,t,k} γ_{tk}(s) \big[ \log|L(s)| + (x_t(s) − μ_k(s))^T Σ_k^{-1} (x_t(s) − μ_k(s)) \big],   (11.29)

or equivalently

Q(T_1, ..., T_K) = −\frac{1}{2} \sum_{s,k} \big[ γ_k(s) \log|L(s)| + γ_k(s) \mathrm{Tr}\big(Σ_k^{-1} T_k w(s) w^T(s) T_k^T\big) − 2 \mathrm{Tr}\big(Σ_k^{-1} T_k w(s) θ_k^T(s)\big) \big] + C.   (11.30)
Taking the derivative of Eq. (11.30) with respect to T_k and setting it to 0, we get the M-step

T_k = C_k A_k^{-1}, 1 ≤ k ≤ K,   (11.31)

where

C_k = \sum_{s} θ_k(s) w^T(s)   (11.32)

and

A_k = \sum_{s} γ_k(s) \big(L^{-1}(s) + w(s) w^T(s)\big)   (11.33)

are computed in the E-step.
Although we discussed SaT, NaT, and DaT separately, these techniques can be combined into a single network, in which the input has four segments: one for the speech feature and the rest for the speaker, noise, and device codes, respectively. The speaker, noise, and device codes can be trained jointly, and the learned codes can be carried over to form different combinations of conditions.
Tensor
The speaker and speech subspaces can also be estimated and combined using three-way connections (or tensors). In [41], several such architectures are proposed. Figure 11.6 shows one such architecture, called the disjoint factorized DNN. In this architecture, the speaker posterior probability p(s|x_t) is estimated from the acoustic feature x_t using a DNN. The class posterior probability p(y_t = i|x_t) is estimated as

p(y_t = i|x_t) = \sum_{s} p(y_t = i|s, x_t) p(s|x_t) = \sum_{s} \frac{\exp(s^T W_i v_t^{L−1})}{\sum_{j} \exp(s^T W_j v_t^{L−1})} p(s|x_t),   (11.34)

where W ∈ R^{N_L × S × N_{L−1}} is a tensor, S is the number of neurons in the speaker identification DNN, N_{L−1} and N_L are the numbers of neurons in the last hidden layer and the output layer, respectively, of the senone classification DNN, and W_i ∈ R^{S × N_{L−1}} is the slice of the tensor associated with output class i.
[Fig. 11.6: the disjoint factorized DNN, combining a senone classification network (output layer of senones) with a speaker estimation network.]
Effectiveness of DNN Speaker Adaptation
As we have discussed in Chap. 9, DNNs can extract features that are less sensitive to perturbations in the acoustic features than GMMs and other shallow models. In fact, it has been shown that much less improvement was observed with fDLR when the network goes from shallow to deep [30]. This raises the question of how much additional gain we can get from speaker adaptation in CD-DNN-HMM systems. In this section, we show, using experimental results from [29, 36, 43], that speaker adaptation techniques are very important even for CD-DNN-HMM systems.
KL-Divergence Regularization Approach
Speaker-Aware Training