Dynamic Programming and Single Word Recognizers (Part 1)
Outline:
Comparing Complete Utterances
Endpoint Detection
Approaches to Sequence Alignment
Alignment of Speech Vectors May Be Non-Bijective
Time Warping
Distance Measure between two Utterances
The Minimal Editing Distance Problem
Dynamic Programming
The Dynamic Programming Matrix
Computing the Minimal Editing Distance
Utterance Comparison by Dynamic Time Warping
DTW-Steps
DTW-Applet
Constraints for the DTW-Path
Global Constraints for the DTW-Path
Java Source-Code for DTW
The DTW Searchspace
To compare two complete utterances, we have to find a way to decide which vector to compare with which, and we have to impose some constraints (not every vector can be compared with every other vector).
Endpoint Detection
When comparing two recorded utterances, two problems can arise: the utterances may be of different length, and one or both utterances may be preceded or followed by a period of (possibly involuntarily recorded) silence. Also, we might not have any mechanism to signal to the recognizer when it should listen.
Typical solution: compute the signal power $p[i..j] = \sum_{k=i}^{j} s[k]^2$, then apply a threshold to detect speech.
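A minimal sketch of power-based endpoint detection, assuming non-overlapping frames and a fixed threshold. The function names, the frame length, and the threshold value are illustrative assumptions and do not come from the lecture.

```python
def frame_power(signal, frame_len):
    """Sum of squared samples for each non-overlapping frame of the signal."""
    return [
        sum(s * s for s in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def detect_endpoints(signal, frame_len=4, threshold=1.0):
    """Return (start, end) sample indices of the speech region, or None.

    frame_len and threshold are illustrative values, not tuned ones.
    """
    powers = frame_power(signal, frame_len)
    # Frames whose power exceeds the threshold are treated as speech.
    active = [i for i, p in enumerate(powers) if p > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Toy signal: near-silence, a louder burst, near-silence again.
signal = [0.01] * 8 + [1.0, -1.0, 0.8, -0.9] * 2 + [0.02] * 8
endpoints = detect_endpoints(signal)
```

In practice the threshold would probably have to be adapted to the background noise level; a fixed value is used here only to keep the sketch short.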
Linear alignment can handle the problem of different speaking rates, but it cannot handle speaking rates that vary within the same utterance.
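The limitation can be illustrated with a small sketch. It assumes one-dimensional "feature vectors" (plain numbers), the absolute difference as local distance, and a simple proportional index mapping; all names are hypothetical.

```python
def linear_alignment(n, m):
    """Pair index i of a length-n sequence with the proportionally scaled
    index of a length-m sequence (0-based indices)."""
    return [(i, i * m // n) for i in range(n)]


def linear_distance(x, y):
    """Sum of local distances |x[i] - y[j]| along the linear alignment."""
    return sum(abs(x[i] - y[j]) for i, j in linear_alignment(len(x), len(y)))


template = [1, 2, 3]

# Same "word" spoken uniformly at half the speed: linear alignment fits.
uniform_slow = [1, 1, 2, 2, 3, 3]

# Same "word", but only the first sound is stretched: linear alignment fails.
varying_rate = [1, 1, 1, 1, 2, 3]

d_uniform = linear_distance(uniform_slow, template)
d_varying = linear_distance(varying_rate, template)
```

Here `d_uniform` is zero while `d_varying` is not, although both sequences contain the same sounds in the same order.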
It is possible that more than one x is aligned to the same y (or vice versa). It is also possible that an x or a y has no alignment partner at all.
Time Warping
Task:
Given: two sequences x1,x2,...,xn and y1,y2,...,ym.
Wanted: an alignment relation R (not a function), where (i,j) is in R iff xi is aligned with yj.
We are looking for a common time axis.
For a given path R, the distance between x and y is the sum of all local distances d(xi,yj) with (i,j) in R.
In our example: d(x1,y1) + d(x2,y2) + d(x3,y3) + d(x3,y3) + d(x5,y4) + d(x6,y5) + d(x7,y7) + ...
Question: How can we find a path that gives the minimal overall distance?
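As a sketch, the distance along a given path can be computed as follows. The example assumes scalar features, the absolute difference as local distance d, and 0-based indices (the slides count from 1); the function name and the two example paths are made up for illustration.

```python
def path_distance(x, y, path, d=lambda a, b: abs(a - b)):
    """Sum of the local distances d(x[i], y[j]) over all pairs (i, j) in path."""
    return sum(d(x[i], y[j]) for i, j in path)


x = [1, 1, 1, 1, 2, 3]
y = [1, 2, 3]

# Two candidate alignment paths for the same pair of sequences.
linear_path = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2)]
warped_path = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 2)]

cost_linear = path_distance(x, y, linear_path)
cost_warped = path_distance(x, y, warped_path)
```

The two paths give different overall distances for the same utterances, which is why the choice of path matters.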
Dynamic Programming
How can we find the minimal editing distance?
Greedy algorithm? Always perform the step that is currently the cheapest. If there is more than one cheapest step, take any one of them. Obviously, this cannot guarantee an optimal solution.
Solution: Dynamic Programming (DP). DP is frequently used in operations research, where consecutive decisions depend on each other and the sequence of decisions must lead to an optimal result.
The key idea of DP is: if we would like to take our system into a state s, and we know the costs c1,...,ck of the optimal ways to get from the start to all states q1,...,qk from which we can go to s, then the optimal way to s goes via the state ql, where l = argminj cj (if the steps from the states qj to s differ in cost, the step cost is added to cj before taking the minimum).
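A minimal sketch of the DP idea applied to the minimal editing distance, assuming unit costs: insertion and deletion cost 1, substitution costs 0 for equal characters and 1 otherwise. The function name and the matrix layout are illustrative choices.

```python
def edit_distance(a, b):
    """Minimal editing distance between sequences a and b via dynamic programming."""
    n, m = len(a), len(b)
    # D[i][j] = minimal cost of turning the first i items of a
    #           into the first j items of b.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # i deletions
    for j in range(1, m + 1):
        D[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            # The optimal way into cell (i, j) comes from the cheapest
            # of its three predecessor cells.
            D[i][j] = min(
                D[i - 1][j] + 1,                 # deletion
                D[i][j - 1] + 1,                 # insertion
                D[i - 1][j - 1] + substitution,  # substitution or match
            )
    return D[n][m]


distance = edit_distance("kitten", "sitting")
```

Each cell is computed once from already known cells, so the effort grows with the product of the two sequence lengths instead of with the number of possible paths.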
DTW-Steps
Many different warping steps are possible and have been used. Examples:
symmetric (editing distance)
Bakis
Itakura
weighted
The general rule is: cumulative cost of destination = best-of, over all allowed steps into the destination, (cumulative cost of source + cost of step + distance in destination)
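The general rule can be sketched as a small DTW routine. The sketch assumes scalar features, the absolute difference as local distance, non-empty sequences, and a path that starts in the first cell and ends in the last one. The step table format `(di, dj, step_cost)` and the weights in `PENALIZED_STEPS` are hypothetical and are not the weights shown on the slide.

```python
import math

# Symmetric steps: horizontal, vertical, and diagonal, each without extra cost.
SYMMETRIC_STEPS = [(1, 0, 0.0), (0, 1, 0.0), (1, 1, 0.0)]


def dtw(x, y, steps=SYMMETRIC_STEPS, d=lambda a, b: abs(a - b)):
    """Cumulative DTW distance between non-empty sequences x and y."""
    n, m = len(x), len(y)
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = d(x[0], y[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # best-of(cumulative cost of source + cost of step) ...
            best = math.inf
            for di, dj, step_cost in steps:
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0:
                    best = min(best, D[pi][pj] + step_cost)
            # ... + distance in destination
            D[i][j] = best + d(x[i], y[j])
    return D[n - 1][m - 1]


x = [1, 1, 1, 1, 2, 3]
y = [1, 2, 3]
plain = dtw(x, y)

# Hypothetical weighted variant: non-diagonal steps cost an extra 0.5 each.
PENALIZED_STEPS = [(1, 0, 0.5), (0, 1, 0.5), (1, 1, 0.0)]
penalized = dtw(x, y, steps=PENALIZED_STEPS)
```

Changing the step table changes which warpings are allowed and how expensive they are, without touching the rest of the algorithm.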
Another reason (besides global path constraints) for restricting the search space is to save time: a window of constant width reduces the search effort from $O(n^2)$ to $O(n)$. To overcome the caveats of the "diagonal window" restriction, use beam search.
Approaches:
"expand" only a fixed number of states per column of the DTW matrix
expand only states that have a cumulative distance less than a factor (the beam) times the best distance so far
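Both pruning approaches can be sketched in one column-wise DTW routine. This is only an illustration under several assumptions: symmetric steps without step costs, scalar features, non-empty sequences, "best distance so far" interpreted as the best cumulative distance in the current column, and a cutoff that keeps states equal to the limit (so the best state survives even when its cost is 0). The names `dtw_beam`, `beam`, and `max_states` are hypothetical.

```python
import math


def dtw_beam(x, y, beam=None, max_states=None, d=lambda a, b: abs(a - b)):
    """DTW distance with optional pruning after each column.

    beam:       keep only states with cost <= beam * best cost in the column
    max_states: keep only the max_states cheapest states (ties are kept)
    Pruned states are set to infinity and are not used as sources later.
    """
    m = len(y)
    prev = None
    for i, xi in enumerate(x):
        cur = [math.inf] * m
        for j in range(m):
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = math.inf
                if prev is not None:
                    best = min(best, prev[j])
                    if j > 0:
                        best = min(best, prev[j - 1])
                if j > 0:
                    best = min(best, cur[j - 1])
            if best < math.inf:
                cur[j] = best + d(xi, y[j])
        finite = sorted(c for c in cur if c < math.inf)
        if finite:
            limit = math.inf
            if beam is not None:
                limit = min(limit, beam * finite[0])
            if max_states is not None and len(finite) > max_states:
                limit = min(limit, finite[max_states - 1])
            cur = [c if c <= limit else math.inf for c in cur]
        prev = cur
    return prev[m - 1]


x = [1, 1, 1, 1, 2, 3]
y = [1, 2, 3]
full = dtw_beam(x, y)                   # no pruning
pruned = dtw_beam(x, y, max_states=1)   # very narrow search
```

Pruning trades accuracy for speed: the pruned result can never be better than the full search, and it may be worse (or infinite) if the optimal path is pruned away early.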
build a recognizer that can recognize two words w1 and w2
collect training examples (one per word in the demo; in real life, many more)
skip the optimization phase (no development set needed)
collect evaluation data (a few examples per word)
run tests on the evaluation data and report results
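These steps can be sketched as a toy template-matching recognizer. Everything concrete here is invented for illustration: the "utterances" are short sequences of numbers instead of real speech vectors, each word has exactly one template, and the local distance is the absolute difference.

```python
import math


def dtw(x, y):
    """DTW distance with symmetric steps and |a - b| as local distance."""
    n, m = len(x), len(y)
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = abs(x[0] - y[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i > 0:
                best = min(best, D[i - 1][j])
            if j > 0:
                best = min(best, D[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, D[i - 1][j - 1])
            D[i][j] = best + abs(x[i] - y[j])
    return D[n - 1][m - 1]


def recognize(utterance, templates):
    """Return the word whose template has the smallest DTW distance."""
    return min(templates, key=lambda word: dtw(utterance, templates[word]))


# Training data: one (made-up) example per word.
templates = {"w1": [1, 2, 3, 2, 1], "w2": [3, 1, 3, 1, 3]}

# Evaluation data: a few (made-up) time-warped examples with their labels.
evaluation = [
    ([1, 1, 2, 3, 3, 2, 1], "w1"),
    ([3, 3, 1, 3, 1, 1, 3], "w2"),
    ([1, 2, 2, 3, 2, 1, 1], "w1"),
]

correct = sum(recognize(u, templates) == label for u, label in evaluation)
print(f"{correct} of {len(evaluation)} evaluation utterances recognized correctly")
```

With more training examples per word, one simple option would be to keep several templates per word and pick the word of the closest template.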
In the applet you can see which editing steps were made (yellow lines). The number in a cell shows the cumulative cost of reaching that cell (insertion and deletion have cost 1; substitution has cost 0 if the two characters are equal, otherwise 1). At the end, a red line shows the optimal path.