"/api2/partners/videos/5GK8pTpHQVzc/languages/en/subtitles/", "site_url": "http://www.amara.org/videos/5GK8pTpHQVzc/en/476068/", "sub_format": "json", "subtitles": [{"end": 4205, "meta": {"new_paragraph": true}, "position": 1, "start": 1204, "text": "Let\u2019s introduce the noisy channel model of spelling."}, {"end": 7379, "meta": {"new_paragraph": false}, "position": 2, "start": 4943, "text": "The intuition of the noisy channel\u2014"}, {"end": 10315, "meta": {"new_paragraph": false}, "position": 3, "start": 7379, "text": "and it comes up throughout natural language processing\u2014"}, {"end": 11923, "meta": {"new_paragraph": false}, "position": 4, "start": 10315, "text": "is that we have some original signal\u2014"}, {"end": 13639, "meta": {"new_paragraph": false}, "position": 5, "start": 11923, "text": "let\u2019s say it\u2019s a word\u2014"}, {"end": 17190, "meta": {"new_paragraph": false}, "position": 6, "start": 13639, "text": "and we imagine that it goes through some channel."}, {"end": 19132, "meta": {"new_paragraph": false}, "position": 7, "start": 17190, "text": "And the idea was originally invented for speech,"}, {"end": 20943, "meta": {"new_paragraph": false}, "position": 8, "start": 19132, "text": "where if you talk into a tube"}, {"end": 23543, "meta": {"new_paragraph": false}, "position": 9, "start": 20943, "text": "or you go over some kind of telecommunications line,"}, {"end": 25187, "meta": {"new_paragraph": false}, "position": 10, "start": 23543, "text": "and the word is distorted."}, {"end": 28711, "meta": {"new_paragraph": false}, "position": 11, "start": 25187, "text": "And so what comes out from the original word is some noisy word."}, {"end": 30518, "meta": {"new_paragraph": false}, "position": 12, "start": 28711, "text": "And we\u2019ve represented that here with a weird font."}, {"end": 33782, "meta": {"new_paragraph": false}, "position": 13, "start": 30518, "text": "But, in the spelling case we 
imagine that,"}, {"end": 35776, "meta": {"new_paragraph": false}, "position": 14, "start": 33782, "text": "\u201cOh, somebody mistyped the word!\u201d"}, {"end": 39324, "meta": {"new_paragraph": false}, "position": 15, "start": 35776, "text": "So the channel is the typewriter or the person typing or the keyboard,"}, {"end": 42070, "meta": {"new_paragraph": false}, "position": 16, "start": 39324, "text": "and at the end, you\u2019ve got a misspelled version of the word."}, {"end": 44418, "meta": {"new_paragraph": false}, "position": 17, "start": 42070, "text": "And our goal in the noisy channel model"}, {"end": 47504, "meta": {"new_paragraph": false}, "position": 18, "start": 44418, "text": "is to take that output of that noisy process,"}, {"end": 51540, "meta": {"new_paragraph": false}, "position": 19, "start": 47504, "text": "and by modeling how this channel works,"}, {"end": 53711, "meta": {"new_paragraph": false}, "position": 20, "start": 51540, "text": "we build a model\u2014[a] probabilistic model\u2014of the channel."}, {"end": 57368, "meta": {"new_paragraph": false}, "position": 21, "start": 53711, "text": "We can run all possible original words through that channel"}, {"end": 61054, "meta": {"new_paragraph": false}, "position": 22, "start": 57368, "text": "and see which one looks the most like the noisy word."}, {"end": 65756, "meta": {"new_paragraph": false}, "position": 23, "start": 61054, "text": "So the decoder will take a bunch of hypotheses for each one,"}, {"end": 67878, "meta": {"new_paragraph": false}, "position": 24, "start": 65756, "text": "run it through the channel,"}, {"end": 69605, "meta": {"new_paragraph": false}, "position": 25, "start": 67878, "text": "(Just running hypothesis two through the channel,"}, {"end": 71218, "meta": {"new_paragraph": false}, "position": 26, "start": 69605, "text": "run hypothesis three through the channel)"}, {"end": 73935, "meta": {"new_paragraph": false}, "position": 27, "start": 71218, "text": "and we see 
which word looks the most like this noisy word,"}, {"end": 75738, "meta": {"new_paragraph": false}, "position": 28, "start": 73950, "text": "and we pick that"}, {"end": 79079, "meta": {"new_paragraph": false}, "position": 29, "start": 75738, "text": "as the original hypothesis for the word that started out."}, {"end": 80988, "meta": {"new_paragraph": false}, "position": 30, "start": 79479, "text": "So let\u2019s look at that."}, {"end": 82828, "meta": {"new_paragraph": false}, "position": 31, "start": 80988, "text": "First we\u2019ll introduce some probability"}, {"end": 84161, "meta": {"new_paragraph": false}, "position": 32, "start": 82828, "text": "and then we\u2019ll look at some examples."}, {"end": 88898, "meta": {"new_paragraph": false}, "position": 33, "start": 86084, "text": "The noisy channel is a probabilistic model."}, {"end": 94606, "meta": {"new_paragraph": false}, "position": 34, "start": 89805, "text": "Our goal: given an observation x of some misspelling\u2014"}, {"end": 95895, "meta": {"new_paragraph": false}, "position": 35, "start": 94606, "text": "some word we\u2019ve seen,"}, {"end": 97093, "meta": {"new_paragraph": false}, "position": 36, "start": 95895, "text": "some surface thing we\u2019ve seen,"}, {"end": 98291, "meta": {"new_paragraph": false}, "position": 37, "start": 97093, "text": "some observation x\u2014"}, {"end": 100876, "meta": {"new_paragraph": false}, "position": 38, "start": 98291, "text": "we\u2019d like to find w, the correct word."}, {"end": 105412, "meta": {"new_paragraph": false}, "position": 39, "start": 100876, "text": "And we\u2019re going to model that probabilistically"}, {"end": 108868, "meta": {"new_paragraph": false}, "position": 40, "start": 105412, "text": "by saying we\u2019re looking [for] the best word:"}, {"end": 110755, "meta": {"new_paragraph": false}, "position": 41, "start": 108868, "text": "The word that we\u2019d like to replace our misspelling with"}, {"end": 113889, "meta": {"new_paragraph": false}, 
"position": 42, "start": 110755, "text": "is that word out of the vocabulary that maximizes a probability."}, {"end": 115086, "meta": {"new_paragraph": false}, "position": 43, "start": 113889, "text": "What probability?"}, {"end": 117899, "meta": {"new_paragraph": false}, "position": 44, "start": 115086, "text": "The probability of the word given the misspelling."}, {"end": 120507, "meta": {"new_paragraph": false}, "position": 45, "start": 117899, "text": "So what word, given that we\u2019ve seen some misspelling?"}, {"end": 122625, "meta": {"new_paragraph": false}, "position": 46, "start": 120507, "text": "What\u2019s the most likely word?"}, {"end": 126054, "meta": {"new_paragraph": false}, "position": 47, "start": 122625, "text": "The most probable posterior probable word, given that misspelling."}, {"end": 131620, "meta": {"new_paragraph": false}, "position": 48, "start": 126054, "text": "And we\u2019re going to use Bayes rule to replace that probability."}, {"end": 134348, "meta": {"new_paragraph": false}, "position": 49, "start": 131620, "text": "So, the probability of w given x,"}, {"end": 138193, "meta": {"new_paragraph": false}, "position": 50, "start": 134348, "text": "we\u2019re going to replace that with P(x|w)P(w)/P(x)."}, {"end": 145847, "meta": {"new_paragraph": false}, "position": 51, "start": 139870, "text": "And so we can also eliminate the denominator."}, {"end": 149608, "meta": {"new_paragraph": false}, "position": 52, "start": 145847, "text": "So whatever word maximizes this equation"}, {"end": 152514, "meta": {"new_paragraph": false}, "position": 53, "start": 150547, "text": "will also maximize this equation."}, {"end": 154507, "meta": {"new_paragraph": false}, "position": 54, "start": 152514, "text": "We\u2019re asking, given a misspelling x,"}, {"end": 156500, "meta": {"new_paragraph": false}, "position": 55, "start": 154507, "text": "what\u2019s the most likely word?"}, {"end": 159374, "meta": {"new_paragraph": false}, "position": 56, 
"start": 156500, "text": "And since the formula for that probability"}, {"end": 163479, "meta": {"new_paragraph": false}, "position": 57, "start": 159374, "text": "includes the probability of the word, the misspelling x."}, {"end": 165985, "meta": {"new_paragraph": false}, "position": 58, "start": 163479, "text": "We\u2019re including that probability"}, {"end": 169169, "meta": {"new_paragraph": false}, "position": 59, "start": 165985, "text": "in every w that we\u2019re considering."}, {"end": 174127, "meta": {"new_paragraph": false}, "position": 60, "start": 169169, "text": "So if some w, say w hypothesis one,"}, {"end": 178792, "meta": {"new_paragraph": false}, "position": 61, "start": 174127, "text": "has a greater probability than hypothesis two by this equation,"}, {"end": 181990, "meta": {"new_paragraph": false}, "position": 62, "start": 179208, "text": "it\u2019ll also have a greater probability by this equation,"}, {"end": 183370, "meta": {"new_paragraph": false}, "position": 63, "start": 181990, "text": "because w is a constant."}, {"end": 186777, "meta": {"new_paragraph": false}, "position": 64, "start": 183554, "text": "x is the misspelling that we\u2019re trying to decide"}, {"end": 190000, "meta": {"new_paragraph": false}, "position": 65, "start": 186777, "text": "if w1 or w2 is a better hypothesis for it."}, {"end": 194412, "meta": {"new_paragraph": false}, "position": 66, "start": 190000, "text": "So that means that the noisy channel model"}, {"end": 200777, "meta": {"new_paragraph": false}, "position": 67, "start": 194412, "text": "comes down to maximizing the product of two factors:"}, {"end": 202478, "meta": {"new_paragraph": false}, "position": 68, "start": 200777, "text": "The likelihood"}, {"end": 206938, "meta": {"new_paragraph": false}, "position": 69, "start": 205078, "text": "and the prior."}, {"end": 212425, "meta": {"new_paragraph": false}, "position": 70, "start": 209107, "text": "And we generally call this term the language model."}, 
{"end": 215876, "meta": {"new_paragraph": false}, "position": 71, "start": 213410, "text": "And you\u2019ve seen language models before:"}, {"end": 222380, "meta": {"new_paragraph": false}, "position": 72, "start": 215876, "text": "That\u2019s the probability of the correct word, w."}, {"end": 224481, "meta": {"new_paragraph": false}, "position": 73, "start": 222380, "text": "And this likelihood term,"}, {"end": 227350, "meta": {"new_paragraph": false}, "position": 74, "start": 224481, "text": "we often call this the channel model,"}, {"end": 232367, "meta": {"new_paragraph": false}, "position": 75, "start": 229273, "text": "or sometimes the error model."}, {"end": 235260, "meta": {"new_paragraph": false}, "position": 76, "start": 232828, "text": "So we\u2019ve got two factors:"}, {"end": 237691, "meta": {"new_paragraph": false}, "position": 77, "start": 235260, "text": "the language model and the channel model."}, {"end": 241420, "meta": {"new_paragraph": false}, "position": 78, "start": 237691, "text": "And the intuition is that the language model tells us"}, {"end": 244711, "meta": {"new_paragraph": false}, "position": 79, "start": 241420, "text": "how likely would this word be to be a word,"}, {"end": 246571, "meta": {"new_paragraph": false}, "position": 80, "start": 244711, "text": "perhaps in this context,"}, {"end": 247941, "meta": {"new_paragraph": false}, "position": 81, "start": 246571, "text": "perhaps by itself."}, {"end": 249283, "meta": {"new_paragraph": false}, "position": 82, "start": 247941, "text": "The channel model says,"}, {"end": 251086, "meta": {"new_paragraph": false}, "position": 83, "start": 249283, "text": "well, if it was that word,"}, {"end": 254036, "meta": {"new_paragraph": false}, "position": 84, "start": 251086, "text": "how likely would it be to generate this exact error?"}, {"end": 256737, "meta": {"new_paragraph": false}, "position": 85, "start": 254036, "text": "So the channel model was sort of modeling that noisy channel"}, 
{"end": 260970, "meta": {"new_paragraph": false}, "position": 86, "start": 256737, "text": "that turns the correct word into the misspelling."}, {"end": 264990, "meta": {"new_paragraph": false}, "position": 87, "start": 260970, "text": "Now this noisy channel model for spelling was proposed around 1990,"}, {"end": 267231, "meta": {"new_paragraph": false}, "position": 88, "start": 264990, "text": "independently at two separate laboratories."}, {"end": 271613, "meta": {"new_paragraph": false}, "position": 89, "start": 267231, "text": "And the use of speech recognition models like noisy channel"}, {"end": 273968, "meta": {"new_paragraph": false}, "position": 90, "start": 271613, "text": "came into natural language processing right around then"}, {"end": 277854, "meta": {"new_paragraph": false}, "position": 91, "start": 273968, "text": "mainly, although not exclusively, because of the work at these two labs:"}, {"end": 280599, "meta": {"new_paragraph": false}, "position": 92, "start": 277854, "text": "at IBM and at AT&T Bell Labs."}, {"end": 283913, "meta": {"new_paragraph": false}, "position": 93, "start": 281076, "text": "And so the examples we\u2019re going to take for the rest of this example"}, {"end": 287894, "meta": {"new_paragraph": false}, "position": 94, "start": 283913, "text": "come from these two important early papers"}, {"end": 290480, "meta": {"new_paragraph": false}, "position": 95, "start": 287894, "text": "by Mays et al. 
and by Kernighan et al."}, {"end": 294065, "meta": {"new_paragraph": false}, "position": 96, "start": 291649, "text": "So let\u2019s look at an example."}, {"end": 298752, "meta": {"new_paragraph": false}, "position": 97, "start": 294065, "text": "Here\u2019s a misspelling: The word \u201cacress\u201d."}, {"end": 301830, "meta": {"new_paragraph": false}, "position": 98, "start": 298752, "text": "So think for yourself for a second what this could mean."}, {"end": 308127, "meta": {"new_paragraph": false}, "position": 99, "start": 304815, "text": "First, we\u2019re going to start with generating candidates."}, {"end": 311322, "meta": {"new_paragraph": false}, "position": 100, "start": 308127, "text": "What are the possible candidate words to replace this word?"}, {"end": 315130, "meta": {"new_paragraph": false}, "position": 101, "start": 311322, "text": "And we can think of at least a couple of obvious ways to do this:"}, {"end": 318499, "meta": {"new_paragraph": false}, "position": 102, "start": 315130, "text": "One is, we\u2019re going to pick words that have similar spelling."}, {"end": 321371, "meta": {"new_paragraph": false}, "position": 103, "start": 318499, "text": "So words that have similar spelling"}, {"end": 324164, "meta": {"new_paragraph": false}, "position": 104, "start": 321371, "text": "might naturally be mistaken for the correct word."}, {"end": 328010, "meta": {"new_paragraph": false}, "position": 105, "start": 324164, "text": "And we\u2019re going to operationalize similar spelling"}, {"end": 330785, "meta": {"new_paragraph": false}, "position": 106, "start": 328010, "text": "as having a small edit distance to the error."}, {"end": 333430, "meta": {"new_paragraph": false}, "position": 107, "start": 330785, "text": "Or we could pick words with similar pronunciation"}, {"end": 335301, "meta": {"new_paragraph": false}, "position": 108, "start": 333430, "text": "and there we\u2019re going to pick a word"}, {"end": 337461, "meta": {"new_paragraph": 
false}, "position": 109, "start": 335301, "text": "with a small edit distance of the pronunciation to the error."}, {"end": 339210, "meta": {"new_paragraph": false}, "position": 110, "start": 337461, "text": "And we\u2019re going to, for the rest of this example,"}, {"end": 341046, "meta": {"new_paragraph": false}, "position": 111, "start": 339210, "text": "I\u2019m going to pick the first approach."}, {"end": 342989, "meta": {"new_paragraph": false}, "position": 112, "start": 341046, "text": "So, we\u2019re going to pick words that have similar spelling"}, {"end": 344753, "meta": {"new_paragraph": false}, "position": 113, "start": 342989, "text": "as our possible candidates."}, {"end": 348250, "meta": {"new_paragraph": false}, "position": 114, "start": 345338, "text": "How do I operationalize similar spelling?"}, {"end": 350328, "meta": {"new_paragraph": false}, "position": 115, "start": 348250, "text": "Well, we\u2019ve seen edit distance before."}, {"end": 351936, "meta": {"new_paragraph": false}, "position": 116, "start": 350328, "text": "And remember, with edit distance,"}, {"end": 354978, "meta": {"new_paragraph": false}, "position": 117, "start": 351936, "text": "we talked about the distance between two strings,"}, {"end": 358676, "meta": {"new_paragraph": false}, "position": 118, "start": 354978, "text": "the minimal number of edits that turns one string into another,"}, {"end": 362447, "meta": {"new_paragraph": false}, "position": 119, "start": 358676, "text": "where we define an edit as an insertion, a deletion, or a substitution,"}, {"end": 364371, "meta": {"new_paragraph": false}, "position": 120, "start": 362447, "text": "so any of these three."}, {"end": 366986, "meta": {"new_paragraph": false}, "position": 121, "start": 364371, "text": "For spelling correction, we\u2019re going to want to add"}, {"end": 370565, "meta": {"new_paragraph": false}, "position": 122, "start": 366986, "text": "a fourth possible edit operation, transposition,"}, {"end": 
372192, "meta": {"new_paragraph": false}, "position": 123, "start": 370565, "text": "because in practice for spelling errors,"}, {"end": 373925, "meta": {"new_paragraph": false}, "position": 124, "start": 372192, "text": "we often transpose two letters."}, {"end": 375258, "meta": {"new_paragraph": false}, "position": 125, "start": 373925, "text": "And that version of edit distance"}, {"end": 379011, "meta": {"new_paragraph": false}, "position": 126, "start": 375258, "text": "is now called Damerau-Levenshtein edit distance."}, {"end": 380943, "meta": {"new_paragraph": false}, "position": 127, "start": 379011, "text": "And it can be computed,"}, {"end": 385168, "meta": {"new_paragraph": false}, "position": 128, "start": 380943, "text": "again, by various dynamic programming approaches."}, {"end": 388665, "meta": {"new_paragraph": false}, "position": 129, "start": 386783, "text": "So let\u2019s look at the candidates"}, {"end": 391449, "meta": {"new_paragraph": false}, "position": 130, "start": 388665, "text": "that are words within an edit distance of one"}, {"end": 395986, "meta": {"new_paragraph": false}, "position": 131, "start": 391449, "text": "of our misspelling \u201cacress\u201d."}, {"end": 400151, "meta": {"new_paragraph": false}, "position": 132, "start": 395986, "text": "So here\u2019s our error, \u201cacress\u201d,"}, {"end": 402511, "meta": {"new_paragraph": false}, "position": 133, "start": 400151, "text": "and here is different possible candidates:"}, {"end": 404226, "meta": {"new_paragraph": false}, "position": 134, "start": 402511, "text": "So here\u2019s a candidate, \u201cactress\u201d."}, {"end": 407285, "meta": {"new_paragraph": false}, "position": 135, "start": 404226, "text": "How is \u201cactress\u201d turned into \u201cacress\u201d?"}, {"end": 411626, "meta": {"new_paragraph": false}, "position": 136, "start": 407285, "text": "Well, the \u201ct\u201d turns into nothing, so a \u201ct\u201d was deleted."}, {"end": 413149, "meta": 
{"new_paragraph": false}, "position": 137, "start": 411626, "text": "So we have a deletion of a \u201ct\u201d."}, {"end": 415790, "meta": {"new_paragraph": false}, "position": 138, "start": 413149, "text": "So a deletion of a \u201ct\u201d turns \u201cactress\u201d into \u201cacress\u201d."}, {"end": 419634, "meta": {"new_paragraph": false}, "position": 139, "start": 415790, "text": "Here, the proposed candidate is the word \u201ccress\u201d,"}, {"end": 421048, "meta": {"new_paragraph": false}, "position": 140, "start": 419634, "text": "the kind of vegetable."}, {"end": 422678, "meta": {"new_paragraph": false}, "position": 141, "start": 421048, "text": "So, here \u201ccress\u201d:"}, {"end": 426001, "meta": {"new_paragraph": false}, "position": 142, "start": 422678, "text": "To turn \u201ccress\u201d into \u201cacress\u201d we have to add, insert an \u201ca\u201d."}, {"end": 427977, "meta": {"new_paragraph": false}, "position": 143, "start": 426001, "text": "So, here we had a deletion, here we had an insertion."}, {"end": 429907, "meta": {"new_paragraph": false}, "position": 144, "start": 427977, "text": "How about \u201ccaress\u201d?"}, {"end": 435418, "meta": {"new_paragraph": false}, "position": 145, "start": 429907, "text": "To turn \u201ccaress\u201d into \u201cacress\u201d we turn a \u201cca\u201d into the \u201cac\u201d,"}, {"end": 438127, "meta": {"new_paragraph": false}, "position": 146, "start": 435418, "text": "so we have a transposition of \u201cca\u201d and \u201cac\u201d."}, {"end": 441814, "meta": {"new_paragraph": false}, "position": 147, "start": 439143, "text": "The word could\u2019ve been \u201caccess\u201d."}, {"end": 443454, "meta": {"new_paragraph": false}, "position": 148, "start": 441814, "text": "Here we have a substitution,"}, {"end": 445835, "meta": {"new_paragraph": false}, "position": 149, "start": 443454, "text": "the \u201cc\u201d turned into an \u201cr\u201d."}, {"end": 447781, "meta": {"new_paragraph": false}, "position": 150, 
"start": 445835, "text": "Or another substitution:"}, {"end": 449253, "meta": {"new_paragraph": false}, "position": 151, "start": 447781, "text": "The word could\u2019ve been \u201cacross\u201d,"}, {"end": 450725, "meta": {"new_paragraph": false}, "position": 152, "start": 449253, "text": "and the \u201co\u201d turned into an \u201ce\u201d."}, {"end": 454326, "meta": {"new_paragraph": false}, "position": 153, "start": 450725, "text": "Or an \u201cs\u201d could\u2019ve been inserted,"}, {"end": 457927, "meta": {"new_paragraph": false}, "position": 154, "start": 454326, "text": "to turn \u201cacres\u201d into, into \u201cacress\u201d;"}, {"end": 459951, "meta": {"new_paragraph": false}, "position": 155, "start": 457927, "text": "but the \u201cs\u201d could\u2019ve been inserted either here"}, {"end": 462636, "meta": {"new_paragraph": false}, "position": 156, "start": 461135, "text": "or here."}, {"end": 464199, "meta": {"new_paragraph": false}, "position": 157, "start": 462636, "text": "So there\u2019s two different ways"}, {"end": 468279, "meta": {"new_paragraph": false}, "position": 158, "start": 464199, "text": "where this source word could have turned into this error form."}, {"end": 470322, "meta": {"new_paragraph": false}, "position": 159, "start": 468295, "text": "So we\u2019ll put two rows down"}, {"end": 473964, "meta": {"new_paragraph": false}, "position": 160, "start": 470322, "text": "for both of these possible insertion locations, positions."}, {"end": 477862, "meta": {"new_paragraph": false}, "position": 161, "start": 474887, "text": "So I\u2019ve just shown you candidates that are within edit distance of one."}, {"end": 480567, "meta": {"new_paragraph": false}, "position": 162, "start": 477862, "text": "It turns out that 80 percent of spelling errors"}, {"end": 482325, "meta": {"new_paragraph": false}, "position": 163, "start": 480567, "text": "are within edit distance of one."}, {"end": 484453, "meta": {"new_paragraph": false}, "position": 164, 
"start": 482325, "text": "And almost all errors are within edit distance of two."}, {"end": 486516, "meta": {"new_paragraph": false}, "position": 165, "start": 484453, "text": "So most algorithms either consider"}, {"end": 491061, "meta": {"new_paragraph": false}, "position": 166, "start": 486516, "text": "just edit distance one or edit distance two possible candidates."}, {"end": 493296, "meta": {"new_paragraph": false}, "position": 167, "start": 491061, "text": "In practice, we also want to allow"}, {"end": 497115, "meta": {"new_paragraph": false}, "position": 168, "start": 493296, "text": "not just insertion and substitution of letters,"}, {"end": 498655, "meta": {"new_paragraph": false}, "position": 169, "start": 497115, "text": "but also of spaces or hyphens."}, {"end": 501674, "meta": {"new_paragraph": false}, "position": 170, "start": 498655, "text": "So for example, if the user types \u201cthisidea\u201d,"}, {"end": 505195, "meta": {"new_paragraph": false}, "position": 171, "start": 501674, "text": "we\u2019d like to realize that there should be insertion of a space,"}, {"end": 510516, "meta": {"new_paragraph": false}, "position": 172, "start": 505195, "text": "or that the original space was in fact deleted to produce this error form."}, {"end": 514203, "meta": {"new_paragraph": false}, "position": 173, "start": 510516, "text": "Or here, the original dash in the word \u201cin-law\u201d was deleted"}, {"end": 516782, "meta": {"new_paragraph": false}, "position": 174, "start": 514203, "text": "to produce this error form, \u201cinlaw.\u201d"}, {"end": 520496, "meta": {"new_paragraph": false}, "position": 175, "start": 518259, "text": "We\u2019ve seen candidate generation."}, {"end": 523348, "meta": {"new_paragraph": false}, "position": 176, "start": 520496, "text": "Now we\u2019re ready to talk about how to rank the candidates."}, {"end": 524923, "meta": {"new_paragraph": false}, "position": 177, "start": 523348, "text": "And remember, there are two 
factors:"}, {"end": 528343, "meta": {"new_paragraph": false}, "position": 178, "start": 524923, "text": "We have the language model and the channel model."}, {"end": 529709, "meta": {"new_paragraph": false}, "position": 179, "start": 528343, "text": "Now [for] the language model,"}, {"end": 531856, "meta": {"new_paragraph": false}, "position": 180, "start": 529709, "text": "we can use any of the language modeling algorithms we\u2019ve already learned."}, {"end": 534206, "meta": {"new_paragraph": false}, "position": 181, "start": 531856, "text": "We can use unigrams and bigrams and trigrams."}, {"end": 537793, "meta": {"new_paragraph": false}, "position": 182, "start": 534206, "text": "We can use any kind of back-off algorithm we want to use,"}, {"end": 539681, "meta": {"new_paragraph": false}, "position": 183, "start": 537793, "text": "or smoothing algorithm we want to use."}, {"end": 543156, "meta": {"new_paragraph": false}, "position": 184, "start": 539681, "text": "In practice for very, very large-scale, web-scale correction,"}, {"end": 545868, "meta": {"new_paragraph": false}, "position": 185, "start": 543156, "text": "we\u2019re going to use, as usual, for web-scale things,"}, {"end": 547365, "meta": {"new_paragraph": false}, "position": 186, "start": 545868, "text": "we\u2019re going to use stupid back-off."}, {"end": 554299, "meta": {"new_paragraph": false}, "position": 187, "start": 547365, "text": "But we might want to use smarter algorithms for smaller kinds of tasks."}, {"end": 561353, "meta": {"new_paragraph": false}, "position": 188, "start": 556683, "text": "So let\u2019s look at an example of a language model."}, {"end": 563421, "meta": {"new_paragraph": false}, "position": 189, "start": 561353, "text": "Here I picked just a very simple unigram."}, {"end": 566998, "meta": {"new_paragraph": false}, "position": 190, "start": 563421, "text": "And in this case we\u2019ve computed the unigram"}, {"end": 569117, "meta": {"new_paragraph": false}, 
"position": 191, "start": 566998, "text": "from the Corpus of Contemporary English,"}, {"end": 570524, "meta": {"new_paragraph": false}, "position": 192, "start": 569117, "text": "one of the many possible corpora."}, {"end": 571961, "meta": {"new_paragraph": false}, "position": 193, "start": 570524, "text": "And here\u2019s some counts."}, {"end": 574493, "meta": {"new_paragraph": false}, "position": 194, "start": 571961, "text": "Here\u2019s counts of the different possible candidates:"}, {"end": 577298, "meta": {"new_paragraph": false}, "position": 195, "start": 574493, "text": "\u201cactress\u201d, \u201ccress\u201d, \u201ccaress\u201d, and so on."}, {"end": 578724, "meta": {"new_paragraph": false}, "position": 196, "start": 577298, "text": "Here\u2019s their frequency."}, {"end": 582958, "meta": {"new_paragraph": false}, "position": 197, "start": 578724, "text": "And by normaliz[ing] by the total number of words we get a probability."}, {"end": 584666, "meta": {"new_paragraph": false}, "position": 198, "start": 582958, "text": "(Here\u2019s the total number of words.)"}, {"end": 588801, "meta": {"new_paragraph": false}, "position": 199, "start": 584666, "text": "We get by normalizing this count by the total count, we get probabilities."}, {"end": 594796, "meta": {"new_paragraph": false}, "position": 200, "start": 588801, "text": "So here\u2019s the probabilities of words assigned by unigram language model."}, {"end": 598411, "meta": {"new_paragraph": false}, "position": 201, "start": 595519, "text": "How about computing the channel model probability?"}, {"end": 599893, "meta": {"new_paragraph": false}, "position": 202, "start": 598411, "text": "Remember, the channel model\u2019s also called"}, {"end": 603727, "meta": {"new_paragraph": false}, "position": 203, "start": 599893, "text": "the error model or the edit probability."}, {"end": 607824, "meta": {"new_paragraph": false}, "position": 204, "start": 603727, "text": "And we\u2019re going to take a simplifying 
assumption"}, {"end": 611915, "meta": {"new_paragraph": false}, "position": 205, "start": 607824, "text": "made by Kernighan, Church, and Gale in 1990,"}, {"end": 615054, "meta": {"new_paragraph": false}, "position": 206, "start": 611915, "text": "when they first proposed the use of the noisy channel model."}, {"end": 616893, "meta": {"new_paragraph": false}, "position": 207, "start": 615054, "text": "So let\u2019s first see how to do that."}, {"end": 621800, "meta": {"new_paragraph": false}, "position": 208, "start": 616893, "text": "Let\u2019s assume that a misspelled word X has a set of letters, X1 through XM."}, {"end": 627612, "meta": {"new_paragraph": false}, "position": 209, "start": 621800, "text": "And the correct word, W, has a set of letters, let\u2019s call them W1 through WN."}, {"end": 632120, "meta": {"new_paragraph": false}, "position": 210, "start": 627612, "text": "Now the probability of the edit X given W"}, {"end": 639058, "meta": {"new_paragraph": false}, "position": 211, "start": 632120, "text": "is going to be some set of deletions or insertions or substitutions or transpositions\u2014some set of edits."}, {"end": 642359, "meta": {"new_paragraph": false}, "position": 212, "start": 639627, "text": "The way we\u2019re going to model that"}, {"end": 644899, "meta": {"new_paragraph": false}, "position": 213, "start": 642359, "text": "is we\u2019re going to create a confusion matrix."}, {"end": 654333, "meta": {"new_paragraph": false}, "position": 214, "start": 644899, "text": "And a confusion matrix says for any given pair of letters,"}, {"end": 656599, "meta": {"new_paragraph": false}, "position": 215, "start": 654333, "text": "how likely is a particular edit to happen."}, {"end": 660740, "meta": {"new_paragraph": false}, "position": 216, "start": 656599, "text": "So for example, for the pair of letters XY,"}, {"end": 665695, "meta": {"new_paragraph": false}, "position": 217, "start": 660740, "text": "we want to know how often XY is typed as 
X,"}, {"end": 669376, "meta": {"new_paragraph": false}, "position": 218, "start": 665695, "text": "meaning: how often is a Y deleted when there\u2019s a X right before it."}, {"end": 672037, "meta": {"new_paragraph": false}, "position": 219, "start": 669376, "text": "We\u2019re going to also keep a count of,"}, {"end": 676381, "meta": {"new_paragraph": false}, "position": 220, "start": 672037, "text": "for insertion probabilities, how often was an X typed as XY."}, {"end": 679760, "meta": {"new_paragraph": false}, "position": 221, "start": 676381, "text": "So how often is Y inserted after X."}, {"end": 682333, "meta": {"new_paragraph": false}, "position": 222, "start": 679760, "text": "So, Y deleted after X; Y inserted after X."}, {"end": 684296, "meta": {"new_paragraph": false}, "position": 223, "start": 682333, "text": "Or we\u2019ll keep a count for substitutions."}, {"end": 686762, "meta": {"new_paragraph": false}, "position": 224, "start": 684296, "text": "How often is X typed as Y?"}, {"end": 688532, "meta": {"new_paragraph": false}, "position": 225, "start": 686762, "text": "So we meant to type X, we typed Y."}, {"end": 690809, "meta": {"new_paragraph": false}, "position": 226, "start": 688532, "text": "That\u2019s an X\u2013Y substitution."}, {"end": 694473, "meta": {"new_paragraph": false}, "position": 227, "start": 690809, "text": "Or a transposition, how often was XY typed as YX?"}, {"end": 695735, "meta": {"new_paragraph": false}, "position": 228, "start": 694473, "text": "So these are just counts."}, {"end": 700144, "meta": {"new_paragraph": false}, "position": 229, "start": 695735, "text": "We\u2019ll keep a matrix of these counts for every X and for every Y."}, {"end": 703092, "meta": {"new_paragraph": false}, "position": 230, "start": 700144, "text": "I noticed that what we\u2019ve done implicitly"}, {"end": 710962, "meta": {"new_paragraph": false}, "position": 231, "start": 703092, "text": "is we\u2019ve conditioned our insertion and our deletion 
on the previous character."}, {"end": 714794, "meta": {"new_paragraph": false}, "position": 232, "start": 710962, "text": "So whether Y is deleted is conditioned on X."}, {"end": 716840, "meta": {"new_paragraph": false}, "position": 233, "start": 714794, "text": "We could have conditioned\u2014chosen the condition\u2014"}, {"end": 719938, "meta": {"new_paragraph": false}, "position": 234, "start": 716840, "text": "of the next character or the character five to the left or some other thing,"}, {"end": 722610, "meta": {"new_paragraph": false}, "position": 235, "start": 719938, "text": "but we generally condition on the previous character."}, {"end": 726401, "meta": {"new_paragraph": false}, "position": 236, "start": 722610, "text": "So here\u2019s an example of a confusion matrix for spelling errors."}, {"end": 730585, "meta": {"new_paragraph": false}, "position": 237, "start": 726401, "text": "The font is a little small, but just to give you a basic idea,"}, {"end": 736371, "meta": {"new_paragraph": false}, "position": 238, "start": 730585, "text": "this is a substitution matrix that I took from Kernighan et al."}, {"end": 738459, "meta": {"new_paragraph": false}, "position": 239, "start": 736371, "text": "So here\u2019s the letter e,"}, {"end": 745356, "meta": {"new_paragraph": false}, "position": 240, "start": 738459, "text": "and it\u2019s very likely\u2014in their data, 388 times\u2014to be substituted with an a."}, {"end": 748240, "meta": {"new_paragraph": false}, "position": 241, "start": 745356, "text": "So, you meant to type e, you incorrectly typed an a."}, {"end": 750471, "meta": {"new_paragraph": false}, "position": 242, "start": 748240, "text": "Or you might have typed an i, or you might have typed an o."}, {"end": 753880, "meta": {"new_paragraph": false}, "position": 243, "start": 750471, "text": "So vowels are very likely to be mistaken for each other."}, {"end": 760254, "meta": {"new_paragraph": false}, "position": 244, "start": 753880, 
"text": "Or similarly, the letter m very often gets mistyped as an n."}, {"end": 764560, "meta": {"new_paragraph": false}, "position": 245, "start": 760254, "text": "So, a very high probability of m and n being substituted for each other."}, {"end": 766099, "meta": {"new_paragraph": false}, "position": 246, "start": 764560, "text": "They\u2019re next to each other on the keyboard."}, {"end": 767114, "meta": {"new_paragraph": false}, "position": 247, "start": 766099, "text": "They sound alike."}, {"end": 769030, "meta": {"new_paragraph": false}, "position": 248, "start": 767114, "text": "Lots of reasons for them to be substituted."}, {"end": 770897, "meta": {"new_paragraph": false}, "position": 249, "start": 769030, "text": "So, here\u2019s our set of confusion matrices,"}, {"end": 774201, "meta": {"new_paragraph": false}, "position": 250, "start": 770897, "text": "and we just compute four of them:"}, {"end": 775698, "meta": {"new_paragraph": false}, "position": 251, "start": 774201, "text": "one for substitution,"}, {"end": 777818, "meta": {"new_paragraph": false}, "position": 252, "start": 775698, "text": "one for insertion,"}, {"end": 780032, "meta": {"new_paragraph": false}, "position": 253, "start": 777818, "text": "one for deletion,"}, {"end": 781467, "meta": {"new_paragraph": false}, "position": 254, "start": 780032, "text": "and one for transposition."}, {"end": 785009, "meta": {"new_paragraph": false}, "position": 255, "start": 781851, "text": "Now I\u2019ve shown you this table comes from Kernighan et al.,"}, {"end": 788203, "meta": {"new_paragraph": false}, "position": 256, "start": 785378, "text": "but you could also generate the table yourself."}, {"end": 794201, "meta": {"new_paragraph": false}, "position": 257, "start": 788203, "text": "So for example Peter Norvig post on his website a lovely list of errors."}, {"end": 800613, "meta": {"new_paragraph": false}, "position": 258, "start": 797278, "text": "So these are errors taken from Wikipedia"}, 
{"end": 803981, "meta": {"new_paragraph": false}, "position": 259, "start": 800613, "text": "and other places that he talks about on his website."}, {"end": 806257, "meta": {"new_paragraph": false}, "position": 260, "start": 803981, "text": "And from a set of errors like this."}, {"end": 814871, "meta": {"new_paragraph": false}, "position": 261, "start": 806257, "text": "So, here, misspellings of \u201cadaptable\u201d as \u201cadabtable\u201d,"}, {"end": 817482, "meta": {"new_paragraph": false}, "position": 262, "start": 814871, "text": "or \u201cimmature\u201d with only one \u201cm\u201d, and so on."}, {"end": 821642, "meta": {"new_paragraph": false}, "position": 263, "start": 817482, "text": "So various kinds of likely misspellings."}, {"end": 823225, "meta": {"new_paragraph": false}, "position": 264, "start": 821642, "text": "And from this list of errors"}, {"end": 827751, "meta": {"new_paragraph": false}, "position": 265, "start": 823225, "text": "we can get a list of counts for every possible single error,"}, {"end": 830344, "meta": {"new_paragraph": false}, "position": 266, "start": 827751, "text": "single edit error of how often it happens"}, {"end": 834844, "meta": {"new_paragraph": false}, "position": 267, "start": 830344, "text": "and from that we can build our little confusion matrix"}, {"end": 838366, "meta": {"new_paragraph": false}, "position": 268, "start": 834844, "text": "and then from the confusion matrix we can generate probabilities."}, {"end": 845123, "meta": {"new_paragraph": false}, "position": 269, "start": 838366, "text": "So, every time a particular previous letter happens,"}, {"end": 849996, "meta": {"new_paragraph": false}, "position": 270, "start": 845123, "text": "we look up in our confusion matrix"}, {"end": 852859, "meta": {"new_paragraph": false}, "position": 271, "start": 849996, "text": "and we say how often was xi inserted"}, {"end": 855282, "meta": {"new_paragraph": false}, "position": 272, "start": 852859, "text": "after a 
particular letter w sub i minus one"}, {"end": 858321, "meta": {"new_paragraph": false}, "position": 273, "start": 855282, "text": "and we divide by the number of times w sub i minus one occurred"}, {"end": 860642, "meta": {"new_paragraph": false}, "position": 274, "start": 858321, "text": "and that\u2019s going to be the probability"}, {"end": 865964, "meta": {"new_paragraph": false}, "position": 275, "start": 860642, "text": "of a particular insertion happening in a word."}, {"end": 869256, "meta": {"new_paragraph": false}, "position": 276, "start": 865964, "text": "So we can generate our probability of our surface form"}, {"end": 872225, "meta": {"new_paragraph": false}, "position": 277, "start": 869256, "text": "for each possible single edit error\u2014"}, {"end": 874217, "meta": {"new_paragraph": false}, "position": 278, "start": 872225, "text": "again we\u2019re assuming a single edit now,"}, {"end": 875857, "meta": {"new_paragraph": false}, "position": 279, "start": 874217, "text": "so one, only one of these happens\u2014"}, {"end": 878015, "meta": {"new_paragraph": false}, "position": 280, "start": 875857, "text": "to generate our candidate."}, {"end": 879635, "meta": {"new_paragraph": false}, "position": 281, "start": 878015, "text": "Whichever one it is,"}, {"end": 880794, "meta": {"new_paragraph": false}, "position": 282, "start": 879635, "text": "we compute our probability"}, {"end": 885437, "meta": {"new_paragraph": false}, "position": 283, "start": 880794, "text": "by just normalizing the count of the deletion or insertion or substitution or transposition,"}, {"end": 887437, "meta": {"new_paragraph": false}, "position": 284, "start": 885437, "text": "by the appropriate count,"}, {"end": 888931, "meta": {"new_paragraph": false}, "position": 285, "start": 887437, "text": "and generate a probability."}, {"end": 892147, "meta": {"new_paragraph": false}, "position": 286, "start": 889593, "text": "So, this channel model."}, {"end": 896845, "meta": 
{"new_paragraph": false}, "position": 287, "start": 892147, "text": "For example for a word like \u201cactress\u201d,"}, {"end": 910082, "meta": {"new_paragraph": false}, "position": 288, "start": 896845, "text": "where we generated \u201cacress\u201d when we should have typed a \u201cct\u201d,"}, {"end": 914798, "meta": {"new_paragraph": false}, "position": 289, "start": 910082, "text": "we typed a \u201cc\u201d, so the word had a \u201cct\u201d in it but the error had only a \u201cc\u201d."}, {"end": 919462, "meta": {"new_paragraph": false}, "position": 290, "start": 914798, "text": "So what\u2019s the probability of deleting a \u201ct\u201d following a \u201cc\u201d?"}, {"end": 923378, "meta": {"new_paragraph": false}, "position": 291, "start": 919462, "text": "And if we\u2019d normalized the probabilities in our confusion matrix,"}, {"end": 927567, "meta": {"new_paragraph": false}, "position": 292, "start": 923378, "text": "here\u2019s the likelihood of this word \u201cactress\u201d"}, {"end": 930409, "meta": {"new_paragraph": false}, "position": 293, "start": 927567, "text": "being realized as the misspelling \u201cacress\u201d,"}, {"end": 932858, "meta": {"new_paragraph": false}, "position": 294, "start": 930409, "text": "it\u2019s .000117."}, {"end": 939352, "meta": {"new_paragraph": false}, "position": 295, "start": 932858, "text": "So, here\u2019s the error model, or the channel model."}, {"end": 942422, "meta": {"new_paragraph": false}, "position": 296, "start": 939352, "text": "And now we can add in the language model, I\u2019ll write LM."}, {"end": 943790, "meta": {"new_paragraph": false}, "position": 297, "start": 942422, "text": "So we have the channel model."}, {"end": 949084, "meta": {"new_paragraph": false}, "position": 298, "start": 943790, "text": "How likely was \u201cct\u201d to be erroneously turned into \u201cc\u201d?"}, {"end": 950387, "meta": {"new_paragraph": false}, "position": 299, "start": 949084, "text": "So 
\u201ct\u201d to be deleted."}, {"end": 952195, "meta": {"new_paragraph": false}, "position": 300, "start": 950387, "text": "And how likely is the word \u201cactress\u201d, anyway?"}, {"end": 954611, "meta": {"new_paragraph": false}, "position": 301, "start": 952195, "text": "And we can just multiply these together."}, {"end": 955880, "meta": {"new_paragraph": false}, "position": 302, "start": 954611, "text": "And what we\u2019ll do is,"}, {"end": 957200, "meta": {"new_paragraph": false}, "position": 303, "start": 955880, "text": "because these are very small numbers,"}, {"end": 961182, "meta": {"new_paragraph": false}, "position": 304, "start": 957200, "text": "we\u2019ll just multiply everything by ten to the ninth to make it readable."}, {"end": 965898, "meta": {"new_paragraph": false}, "position": 305, "start": 961182, "text": "So, this would be 2.7 times ten to the minus ninth."}, {"end": 967820, "meta": {"new_paragraph": false}, "position": 306, "start": 965898, "text": "But we\u2019ve multiplied everything by ten to the ninth here."}, {"end": 974372, "meta": {"new_paragraph": false}, "position": 307, "start": 967820, "text": "So you can see that the most likely word here is \u201cacross\u201d."}, {"end": 980472, "meta": {"new_paragraph": false}, "position": 308, "start": 974372, "text": "With this particular channel model,"}, {"end": 985802, "meta": {"new_paragraph": false}, "position": 309, "start": 980472, "text": "and this particular language model, the most likely word is \u201cacross\u201d."}, {"end": 988399, "meta": {"new_paragraph": false}, "position": 310, "start": 985802, "text": "But, \u201cactress\u201d is also quite likely."}, {"end": 991900, "meta": {"new_paragraph": false}, "position": 311, "start": 988399, "text": "And \u201cacres\u201d seems reasonably likely."}, {"end": 992994, "meta": {"new_paragraph": false}, "position": 312, "start": 991900, "text": "And the word \u201ccress\u201d,"}, {"end": 994131, "meta": 
{"new_paragraph": false}, "position": 313, "start": 992994, "text": "which is just a very rare word,"}, {"end": 995352, "meta": {"new_paragraph": false}, "position": 314, "start": 994131, "text": "you can see has a very low probability,"}, {"end": 999052, "meta": {"new_paragraph": false}, "position": 315, "start": 995352, "text": "and has an unusual error of inserting an \u201ca\u201d at the beginning,"}, {"end": 1001506, "meta": {"new_paragraph": false}, "position": 316, "start": 999052, "text": "makes it a very low probability correction."}, {"end": 1006587, "meta": {"new_paragraph": false}, "position": 317, "start": 1001506, "text": "So the noisy channel model likes the word \u201cacross\u201d as the possible replacement."}, {"end": 1010082, "meta": {"new_paragraph": false}, "position": 318, "start": 1006587, "text": "Unfortunately, we can see from the original sentence,"}, {"end": 1012493, "meta": {"new_paragraph": false}, "position": 319, "start": 1010082, "text": "taken from Kernighan et al.\u2019s paper,"}, {"end": 1016824, "meta": {"new_paragraph": false}, "position": 320, "start": 1012493, "text": "that [in] the original sentence \u201cacross\u201d is the wrong word."}, {"end": 1018905, "meta": {"new_paragraph": false}, "position": 321, "start": 1016824, "text": "The original sentence is"}, {"end": 1023806, "meta": {"new_paragraph": false}, "position": 322, "start": 1018905, "text": "\u201ca stellar and versatile acress whose combination of sass and glamour\u2026\u201d"}, {"end": 1026929, "meta": {"new_paragraph": false}, "position": 323, "start": 1023806, "text": "And it should be clear that this word should have been \u201cactress\u201d."}, {"end": 1028027, "meta": {"new_paragraph": false}, "position": 324, "start": 1026929, "text": "So \u201cacross\u201d is the wrong word."}, {"end": 1031385, "meta": {"new_paragraph": false}, "position": 325, "start": 1028027, "text": "So, just using a unigram model, the noisy channel makes a mistake."}, {"end": 
1033813, "meta": {"new_paragraph": false}, "position": 326, "start": 1031385, "text": "So let\u2019s look at a bigram model."}, {"end": 1035326, "meta": {"new_paragraph": false}, "position": 327, "start": 1033813, "text": "How well can we do with a bigram model?"}, {"end": 1037676, "meta": {"new_paragraph": false}, "position": 328, "start": 1035726, "text": "So we computed a very simple bigram model,"}, {"end": 1039994, "meta": {"new_paragraph": false}, "position": 329, "start": 1037676, "text": "just using \u201cadd-one smoothing\u201d,"}, {"end": 1042116, "meta": {"new_paragraph": false}, "position": 330, "start": 1039994, "text": "from the Corpus of Contemporary American English."}, {"end": 1045617, "meta": {"new_paragraph": false}, "position": 331, "start": 1042116, "text": "So now, the probability of \u201cactress\u201d given \u201cversatile\u201d."}, {"end": 1048179, "meta": {"new_paragraph": false}, "position": 332, "start": 1045617, "text": "Just look at these three words, and ignore the rest for now."}, {"end": 1050352, "meta": {"new_paragraph": false}, "position": 333, "start": 1048179, "text": "\u201cActress\u201d given \u201cversatile\u201d,"}, {"end": 1053765, "meta": {"new_paragraph": false}, "position": 334, "start": 1050352, "text": "that probability is .00021."}, {"end": 1058320, "meta": {"new_paragraph": false}, "position": 335, "start": 1053765, "text": "And \u201cwhose\u201d given \u201cactress\u201d is .00010 so we\u2019ll compute those."}, {"end": 1060980, "meta": {"new_paragraph": false}, "position": 336, "start": 1058320, "text": "And now let\u2019s do the same thing for another candidate,"}, {"end": 1063666, "meta": {"new_paragraph": false}, "position": 337, "start": 1060980, "text": "the original candidate that was preferred by the unigram model,"}, {"end": 1066089, "meta": {"new_paragraph": false}, "position": 338, "start": 1063666, "text": "the word \u201cacross\u201d."}, {"end": 1072057, "meta": {"new_paragraph": false}, "position": 
339, "start": 1068212, "text": "We\u2019ll put \u201cacross\u201d here, instead of our hypothesis,"}, {"end": 1075519, "meta": {"new_paragraph": false}, "position": 340, "start": 1072057, "text": "and we\u2019ll again compute the probability of \u201cacross\u201d given \u201cversatile\u201d"}, {"end": 1077496, "meta": {"new_paragraph": false}, "position": 341, "start": 1075519, "text": "times the probability of \u201cwhose\u201d given \u201cacross\u201d."}, {"end": 1079944, "meta": {"new_paragraph": false}, "position": 342, "start": 1077496, "text": "So here\u2019s those probabilities,"}, {"end": 1082896, "meta": {"new_paragraph": false}, "position": 343, "start": 1079944, "text": "and you can see that the probability of \u201cwhose\u201d given \u201cactress\u201d"}, {"end": 1086573, "meta": {"new_paragraph": false}, "position": 344, "start": 1082896, "text": "is much higher than the probability of \u201cwhose\u201d given \u201cacross\u201d."}, {"end": 1088932, "meta": {"new_paragraph": false}, "position": 345, "start": 1086573, "text": "\u201cActress whose\u201d is just a likely sequence."}, {"end": 1092845, "meta": {"new_paragraph": false}, "position": 346, "start": 1088932, "text": "And sure enough, if we multiply these things out,"}, {"end": 1098674, "meta": {"new_paragraph": false}, "position": 347, "start": 1092845, "text": "the probability of \u201cversatile actress whose\u201d is a much higher probability"}, {"end": 1101418, "meta": {"new_paragraph": false}, "position": 348, "start": 1098674, "text": "than the probability of the sequence \u201cversatile across whose\u201d."}, {"end": 1102788, "meta": {"new_paragraph": false}, "position": 349, "start": 1101418, "text": "So a much higher probability."}, {"end": 1105131, "meta": {"new_paragraph": false}, "position": 350, "start": 1102788, "text": "So the noisy channel model with a bigram language model"}, {"end": 1109084, "meta": {"new_paragraph": false}, "position": 351, "start": 1105131, "text": 
"correctly picks the correction \u201cactress\u201d."}, {"end": 1115726, "meta": {"new_paragraph": false}, "position": 352, "start": 1110191, "text": "How are we going to evaluate these noisy channel and other kinds of models?"}, {"end": 1119772, "meta": {"new_paragraph": false}, "position": 353, "start": 1115726, "text": "There are lots of good spelling error test sets."}, {"end": 1124302, "meta": {"new_paragraph": false}, "position": 354, "start": 1119772, "text": "Wikipedia has a list of common English misspellings."}, {"end": 1127321, "meta": {"new_paragraph": false}, "position": 355, "start": 1124302, "text": "There\u2019s a filtered version of that at Aspell."}, {"end": 1130134, "meta": {"new_paragraph": false}, "position": 356, "start": 1127321, "text": "There\u2019s a spelling error corpus at Birkbeck."}, {"end": 1131918, "meta": {"new_paragraph": false}, "position": 357, "start": 1130134, "text": "Let\u2019s look at the Wikipedia list."}, {"end": 1142191, "meta": {"new_paragraph": false}, "position": 358, "start": 1137010, "text": "So there\u2019s Wikipedia\u2019s list of common English misspellings."}, {"end": 1146388, "meta": {"new_paragraph": false}, "position": 359, "start": 1142191, "text": "And I\u2019ve shown you here on this slide"}, {"end": 1148820, "meta": {"new_paragraph": false}, "position": 360, "start": 1146388, "text": "some various other possible lists that you can go look at on your own."}, {"end": 1152745, "meta": {"new_paragraph": false}, "position": 361, "start": 1148820, "text": "So from these lists of misspellings"}, {"end": 1156728, "meta": {"new_paragraph": false}, "position": 362, "start": 1152745, "text": "you would generate a training set to train your channel model,"}, {"end": 1158940, "meta": {"new_paragraph": false}, "position": 363, "start": 1156728, "text": "a development set to test out your model"}, {"end": 1161794, "meta": {"new_paragraph": false}, "position": 364, "start": 1158940, "text": "and then a final test set to 
see how well your model works."}, {"end": null, "meta": {"new_paragraph": false}, "position": 365, "start": 1162579, "text": "So that\u2019s the noisy channel model of spelling applied to non-real words."}], "title": "The Noisy Channel Model of Spelling", "version_no": 9, "version_number": 9, "video": "The Noisy Channel Model of Spelling", "video_description": "", "video_title": "The Noisy Channel Model of Spelling"}