02 The Noisy Channel Model of Spelling 19-30

Uploaded by

idhitappu
Copyright
© © All Rights Reserved

{"description": "", "language": {"code": "en", "dir": "ltr", "name": "English"},
"metadata": {}, "note": "", "resource_uri":
"/api2/partners/videos/5GK8pTpHQVzc/languages/en/subtitles/", "site_url":
"http://www.amara.org/videos/5GK8pTpHQVzc/en/476068/", "sub_format": "json",
"subtitles": [{"end": 4205, "meta": {"new_paragraph": true}, "position": 1,
"start": 1204, "text": "Let\u2019s introduce the noisy channel model of
spelling."}, {"end": 7379, "meta": {"new_paragraph": false}, "position": 2,
"start": 4943, "text": "The intuition of the noisy channel\u2014"}, {"end": 10315,
"meta": {"new_paragraph": false}, "position": 3, "start": 7379, "text": "and it
comes up throughout natural language processing\u2014"}, {"end": 11923, "meta":
{"new_paragraph": false}, "position": 4, "start": 10315, "text": "is that we have
some original signal\u2014"}, {"end": 13639, "meta": {"new_paragraph": false},
"position": 5, "start": 11923, "text": "let\u2019s say it\u2019s a word\u2014"},
{"end": 17190, "meta": {"new_paragraph": false}, "position": 6, "start": 13639,
"text": "and we imagine that it goes through some channel."}, {"end": 19132,
"meta": {"new_paragraph": false}, "position": 7, "start": 17190, "text": "And the
idea was originally invented for speech,"}, {"end": 20943, "meta":
{"new_paragraph": false}, "position": 8, "start": 19132, "text": "where if you talk
into a tube"}, {"end": 23543, "meta": {"new_paragraph": false}, "position": 9,
"start": 20943, "text": "or you go over some kind of telecommunications line,"},
{"end": 25187, "meta": {"new_paragraph": false}, "position": 10, "start": 23543,
"text": "and the word is distorted."}, {"end": 28711, "meta": {"new_paragraph":
false}, "position": 11, "start": 25187, "text": "And so what comes out from the
original word is some noisy word."}, {"end": 30518, "meta": {"new_paragraph":
false}, "position": 12, "start": 28711, "text": "And we\u2019ve represented that
here with a weird font."}, {"end": 33782, "meta": {"new_paragraph": false},
"position": 13, "start": 30518, "text": "But, in the spelling case we imagine
that,"}, {"end": 35776, "meta": {"new_paragraph": false}, "position": 14, "start":
33782, "text": "\u201cOh, somebody mistyped the word!\u201d"}, {"end": 39324,
"meta": {"new_paragraph": false}, "position": 15, "start": 35776, "text": "So the
channel is the typewriter or the person typing or the keyboard,"}, {"end": 42070,
"meta": {"new_paragraph": false}, "position": 16, "start": 39324, "text": "and at
the end, you\u2019ve got a misspelled version of the word."}, {"end": 44418,
"meta": {"new_paragraph": false}, "position": 17, "start": 42070, "text": "And our
goal in the noisy channel model"}, {"end": 47504, "meta": {"new_paragraph": false},
"position": 18, "start": 44418, "text": "is to take that output of that noisy
process,"}, {"end": 51540, "meta": {"new_paragraph": false}, "position": 19,
"start": 47504, "text": "and by modeling how this channel works,"}, {"end": 53711,
"meta": {"new_paragraph": false}, "position": 20, "start": 51540, "text": "we build
a model\u2014[a] probabilistic model\u2014of the channel."}, {"end": 57368, "meta":
{"new_paragraph": false}, "position": 21, "start": 53711, "text": "We can run all
possible original words through that channel"}, {"end": 61054, "meta":
{"new_paragraph": false}, "position": 22, "start": 57368, "text": "and see which
one looks the most like the noisy word."}, {"end": 65756, "meta": {"new_paragraph":
false}, "position": 23, "start": 61054, "text": "So the decoder will take a bunch
of hypotheses for each one,"}, {"end": 67878, "meta": {"new_paragraph": false},
"position": 24, "start": 65756, "text": "run it through the channel,"}, {"end":
69605, "meta": {"new_paragraph": false}, "position": 25, "start": 67878, "text":
"(Just running hypothesis two through the channel,"}, {"end": 71218, "meta":
{"new_paragraph": false}, "position": 26, "start": 69605, "text": "run hypothesis
three through the channel)"}, {"end": 73935, "meta": {"new_paragraph": false},
"position": 27, "start": 71218, "text": "and we see which word looks the most like
this noisy word,"}, {"end": 75738, "meta": {"new_paragraph": false}, "position":
28, "start": 73950, "text": "and we pick that"}, {"end": 79079, "meta":
{"new_paragraph": false}, "position": 29, "start": 75738, "text": "as the original
hypothesis for the word that started out."}, {"end": 80988, "meta":
{"new_paragraph": false}, "position": 30, "start": 79479, "text": "So let\u2019s
look at that."}, {"end": 82828, "meta": {"new_paragraph": false}, "position": 31,
"start": 80988, "text": "First we\u2019ll introduce some probability"}, {"end":
84161, "meta": {"new_paragraph": false}, "position": 32, "start": 82828, "text":
"and then we\u2019ll look at some examples."}, {"end": 88898, "meta":
{"new_paragraph": false}, "position": 33, "start": 86084, "text": "The noisy
channel is a probabilistic model."}, {"end": 94606, "meta": {"new_paragraph":
false}, "position": 34, "start": 89805, "text": "Our goal: given an observation x
of some misspelling\u2014"}, {"end": 95895, "meta": {"new_paragraph": false},
"position": 35, "start": 94606, "text": "some word we\u2019ve seen,"}, {"end":
97093, "meta": {"new_paragraph": false}, "position": 36, "start": 95895, "text":
"some surface thing we\u2019ve seen,"}, {"end": 98291, "meta": {"new_paragraph":
false}, "position": 37, "start": 97093, "text": "some observation x\u2014"},
{"end": 100876, "meta": {"new_paragraph": false}, "position": 38, "start": 98291,
"text": "we\u2019d like to find w, the correct word."}, {"end": 105412, "meta":
{"new_paragraph": false}, "position": 39, "start": 100876, "text": "And we\u2019re
going to model that probabilistically"}, {"end": 108868, "meta": {"new_paragraph":
false}, "position": 40, "start": 105412, "text": "by saying we\u2019re looking
[for] the best word:"}, {"end": 110755, "meta": {"new_paragraph": false},
"position": 41, "start": 108868, "text": "The word that we\u2019d like to replace
our misspelling with"}, {"end": 113889, "meta": {"new_paragraph": false},
"position": 42, "start": 110755, "text": "is that word out of the vocabulary that
maximizes a probability."}, {"end": 115086, "meta": {"new_paragraph": false},
"position": 43, "start": 113889, "text": "What probability?"}, {"end": 117899,
"meta": {"new_paragraph": false}, "position": 44, "start": 115086, "text": "The
probability of the word given the misspelling."}, {"end": 120507, "meta":
{"new_paragraph": false}, "position": 45, "start": 117899, "text": "So what word,
given that we\u2019ve seen some misspelling?"}, {"end": 122625, "meta":
{"new_paragraph": false}, "position": 46, "start": 120507, "text": "What\u2019s the
most likely word?"}, {"end": 126054, "meta": {"new_paragraph": false}, "position":
47, "start": 122625, "text": "The most probable posterior probable word, given that
misspelling."}, {"end": 131620, "meta": {"new_paragraph": false}, "position": 48,
"start": 126054, "text": "And we\u2019re going to use Bayes rule to replace that
probability."}, {"end": 134348, "meta": {"new_paragraph": false}, "position": 49,
"start": 131620, "text": "So, the probability of w given x,"}, {"end": 138193,
"meta": {"new_paragraph": false}, "position": 50, "start": 134348, "text":
"we\u2019re going to replace that with P(x|w)P(w)/P(x)."}, {"end": 145847, "meta":
{"new_paragraph": false}, "position": 51, "start": 139870, "text": "And so we can
also eliminate the denominator."}, {"end": 149608, "meta": {"new_paragraph":
false}, "position": 52, "start": 145847, "text": "So whatever word maximizes this
equation"}, {"end": 152514, "meta": {"new_paragraph": false}, "position": 53,
"start": 150547, "text": "will also maximize this equation."}, {"end": 154507,
"meta": {"new_paragraph": false}, "position": 54, "start": 152514, "text":
"We\u2019re asking, given a misspelling x,"}, {"end": 156500, "meta": {"new_paragraph":
false}, "position": 55, "start": 154507, "text": "what\u2019s the most likely
word?"}, {"end": 159374, "meta": {"new_paragraph": false}, "position": 56, "start":
156500, "text": "And since the formula for that probability"}, {"end": 163479,
"meta": {"new_paragraph": false}, "position": 57, "start": 159374, "text":
"includes the probability of the word, the misspelling x."}, {"end": 165985,
"meta": {"new_paragraph": false}, "position": 58, "start": 163479, "text":
"We\u2019re including that probability"}, {"end": 169169, "meta": {"new_paragraph":
false}, "position": 59, "start": 165985, "text": "in every w that we\u2019re
considering."}, {"end": 174127, "meta": {"new_paragraph": false}, "position": 60,
"start": 169169, "text": "So if some w, say w hypothesis one,"}, {"end": 178792,
"meta": {"new_paragraph": false}, "position": 61, "start": 174127, "text": "has a
greater probability than hypothesis two by this equation,"}, {"end": 181990,
"meta": {"new_paragraph": false}, "position": 62, "start": 179208, "text":
"it\u2019ll also have a greater probability by this equation,"}, {"end": 183370,
"meta": {"new_paragraph": false}, "position": 63, "start": 181990, "text": "because
[P(x)] is a constant."}, {"end": 186777, "meta": {"new_paragraph": false}, "position":
64, "start": 183554, "text": "x is the misspelling that we\u2019re trying to
decide"}, {"end": 190000, "meta": {"new_paragraph": false}, "position": 65,
"start": 186777, "text": "if w1 or w2 is a better hypothesis for it."}, {"end":
194412, "meta": {"new_paragraph": false}, "position": 66, "start": 190000, "text":
"So that means that the noisy channel model"}, {"end": 200777, "meta":
{"new_paragraph": false}, "position": 67, "start": 194412, "text": "comes down to
maximizing the product of two factors:"}, {"end": 202478, "meta": {"new_paragraph":
false}, "position": 68, "start": 200777, "text": "The likelihood"}, {"end": 206938,
"meta": {"new_paragraph": false}, "position": 69, "start": 205078, "text":
"and the prior."}, {"end": 212425, "meta": {"new_paragraph": false}, "position":
70, "start": 209107, "text": "And we generally call this term the language
model."}, {"end": 215876, "meta": {"new_paragraph": false}, "position": 71,
"start": 213410, "text": "And you\u2019ve seen language models before:"}, {"end":
222380, "meta": {"new_paragraph": false}, "position": 72, "start": 215876, "text":
"That\u2019s the probability of the correct word, w."}, {"end": 224481, "meta":
{"new_paragraph": false}, "position": 73, "start": 222380, "text": "And this
likelihood term,"}, {"end": 227350, "meta": {"new_paragraph": false}, "position":
74, "start": 224481, "text": "we often call this the channel model,"}, {"end":
232367, "meta": {"new_paragraph": false}, "position": 75, "start": 229273, "text":
"or sometimes the error model."}, {"end": 235260, "meta": {"new_paragraph": false},
"position": 76, "start": 232828, "text": "So we\u2019ve got two factors:"}, {"end":
237691, "meta": {"new_paragraph": false}, "position": 77, "start": 235260, "text":
"the language model and the channel model."}, {"end": 241420, "meta":
{"new_paragraph": false}, "position": 78, "start": 237691, "text": "And the
intuition is that the language model tells us"}, {"end": 244711, "meta":
{"new_paragraph": false}, "position": 79, "start": 241420, "text": "how likely
would this word be to be a word,"}, {"end": 246571, "meta": {"new_paragraph":
false}, "position": 80, "start": 244711, "text": "perhaps in this context,"},
{"end": 247941, "meta": {"new_paragraph": false}, "position": 81, "start": 246571,
"text": "perhaps by itself."}, {"end": 249283, "meta": {"new_paragraph": false},
"position": 82, "start": 247941, "text": "The channel model says,"}, {"end":
251086, "meta": {"new_paragraph": false}, "position": 83, "start": 249283, "text":
"well, if it was that word,"}, {"end": 254036, "meta": {"new_paragraph": false},
"position": 84, "start": 251086, "text": "how likely would it be to generate this
exact error?"}, {"end": 256737, "meta": {"new_paragraph": false}, "position": 85,
"start": 254036, "text": "So the channel model was sort of modeling that noisy
channel"}, {"end": 260970, "meta": {"new_paragraph": false}, "position": 86,
"start": 256737, "text": "that turns the correct word into the misspelling."},
{"end": 264990, "meta": {"new_paragraph": false}, "position": 87, "start": 260970,
"text": "Now this noisy channel model for spelling was proposed around 1990,"},
{"end": 267231, "meta": {"new_paragraph": false}, "position": 88, "start": 264990,
"text": "independently at two separate laboratories."}, {"end": 271613, "meta":
{"new_paragraph": false}, "position": 89, "start": 267231, "text": "And the use of
speech recognition models like noisy channel"}, {"end": 273968, "meta":
{"new_paragraph": false}, "position": 90, "start": 271613, "text": "came into
natural language processing right around then"}, {"end": 277854, "meta":
{"new_paragraph": false}, "position": 91, "start": 273968, "text": "mainly,
although not exclusively, because of the work at these two labs:"}, {"end": 280599,
"meta": {"new_paragraph": false}, "position": 92, "start": 277854, "text": "at IBM
and at AT&T Bell Labs."}, {"end": 283913, "meta": {"new_paragraph": false},
"position": 93, "start": 281076, "text": "And so the examples we\u2019re going to
take for the rest of this example"}, {"end": 287894, "meta": {"new_paragraph":
false}, "position": 94, "start": 283913, "text": "come from these two important
early papers"}, {"end": 290480, "meta": {"new_paragraph": false}, "position": 95,
"start": 287894, "text": "by Mays et al. and by Kernighan et al."}, {"end": 294065,
"meta": {"new_paragraph": false}, "position": 96, "start": 291649, "text":
"So let\u2019s look at an example."}, {"end": 298752, "meta": {"new_paragraph": false},
"position": 97, "start": 294065, "text": "Here\u2019s a misspelling: The word
\u201cacress\u201d."}, {"end": 301830, "meta": {"new_paragraph": false}, "position":
98, "start": 298752, "text": "So think for yourself for a second what this could
mean."}, {"end": 308127, "meta": {"new_paragraph": false}, "position": 99, "start":
304815, "text": "First, we\u2019re going to start with generating candidates."},
{"end": 311322, "meta": {"new_paragraph": false}, "position": 100, "start": 308127,
"text": "What are the possible candidate words to replace this word?"}, {"end":
315130, "meta": {"new_paragraph": false}, "position": 101, "start": 311322, "text":
"And we can think of at least a couple of obvious ways to do this:"}, {"end":
318499, "meta": {"new_paragraph": false}, "position": 102, "start": 315130, "text":
"One is, we\u2019re going to pick words that have similar spelling."}, {"end":
321371, "meta": {"new_paragraph": false}, "position": 103, "start": 318499, "text":
"So words that have similar spelling"}, {"end": 324164, "meta": {"new_paragraph":
false}, "position": 104, "start": 321371, "text": "might naturally be mistaken for
the correct word."}, {"end": 328010, "meta": {"new_paragraph": false}, "position":
105, "start": 324164, "text": "And we\u2019re going to operationalize similar
spelling"}, {"end": 330785, "meta": {"new_paragraph": false}, "position": 106,
"start": 328010, "text": "as having a small edit distance to the error."}, {"end":
333430, "meta": {"new_paragraph": false}, "position": 107, "start": 330785, "text":
"Or we could pick words with similar pronunciation"}, {"end": 335301, "meta":
{"new_paragraph": false}, "position": 108, "start": 333430, "text": "and there
we\u2019re going to pick a word"}, {"end": 337461, "meta": {"new_paragraph": false},
"position": 109, "start": 335301, "text": "with a small edit distance of the
pronunciation to the error."}, {"end": 339210, "meta": {"new_paragraph": false},
"position": 110, "start": 337461, "text": "And we\u2019re going to, for the rest of
this example,"}, {"end": 341046, "meta": {"new_paragraph": false}, "position": 111,
"start": 339210, "text": "I\u2019m going to pick the first approach."}, {"end":
342989, "meta": {"new_paragraph": false}, "position": 112, "start": 341046, "text":
"So, we\u2019re going to pick words that have similar spelling"}, {"end": 344753,
"meta": {"new_paragraph": false}, "position": 113, "start": 342989, "text": "as our
possible candidates."}, {"end": 348250, "meta": {"new_paragraph": false},
"position": 114, "start": 345338, "text": "How do I operationalize similar
spelling?"}, {"end": 350328, "meta": {"new_paragraph": false}, "position": 115,
"start": 348250, "text": "Well, we\u2019ve seen edit distance before."}, {"end":
351936, "meta": {"new_paragraph": false}, "position": 116, "start": 350328, "text":
"And remember, with edit distance,"}, {"end": 354978, "meta": {"new_paragraph":
false}, "position": 117, "start": 351936, "text": "we talked about the distance
between two strings,"}, {"end": 358676, "meta": {"new_paragraph": false},
"position": 118, "start": 354978, "text": "the minimal number of edits that turns
one string into another,"}, {"end": 362447, "meta": {"new_paragraph": false},
"position": 119, "start": 358676, "text": "where we define an edit as an insertion,
a deletion, or a substitution,"}, {"end": 364371, "meta": {"new_paragraph": false},
"position": 120, "start": 362447, "text": "so any of these three."}, {"end":
366986, "meta": {"new_paragraph": false}, "position": 121, "start": 364371, "text":
"For spelling correction, we\u2019re going to want to add"}, {"end": 370565,
"meta": {"new_paragraph": false}, "position": 122, "start": 366986, "text": "a
fourth possible edit operation, transposition,"}, {"end": 372192, "meta":
{"new_paragraph": false}, "position": 123, "start": 370565, "text": "because in
practice for spelling errors,"}, {"end": 373925, "meta": {"new_paragraph": false},
"position": 124, "start": 372192, "text": "we often transpose two letters."},
{"end": 375258, "meta": {"new_paragraph": false}, "position": 125, "start": 373925,
"text": "And that version of edit distance"}, {"end": 379011, "meta":
{"new_paragraph": false}, "position": 126, "start": 375258, "text": "is now called
Damerau-Levenshtein edit distance."}, {"end": 380943, "meta": {"new_paragraph":
false}, "position": 127, "start": 379011, "text": "And it can be computed,"},
{"end": 385168, "meta": {"new_paragraph": false}, "position": 128, "start": 380943,
"text": "again, by various dynamic programming approaches."}, {"end": 388665,
"meta": {"new_paragraph": false}, "position": 129, "start": 386783, "text": "So
let\u2019s look at the candidates"}, {"end": 391449, "meta": {"new_paragraph":
false}, "position": 130, "start": 388665, "text": "that are words within an edit
distance of one"}, {"end": 395986, "meta": {"new_paragraph": false}, "position":
131, "start": 391449, "text": "of our misspelling \u201cacress\u201d."}, {"end":
400151, "meta": {"new_paragraph": false}, "position": 132, "start": 395986, "text":
"So here\u2019s our error, \u201cacress\u201d,"}, {"end": 402511, "meta":
{"new_paragraph": false}, "position": 133, "start": 400151, "text": "and here is
different possible candidates:"}, {"end": 404226, "meta": {"new_paragraph": false},
"position": 134, "start": 402511, "text": "So here\u2019s a candidate,
\u201cactress\u201d."}, {"end": 407285, "meta": {"new_paragraph": false},
"position": 135, "start": 404226, "text": "How is \u201cactress\u201d turned into
\u201cacress\u201d?"}, {"end": 411626, "meta": {"new_paragraph": false}, "position":
136, "start": 407285, "text": "Well, the \u201ct\u201d turns into nothing, so a
\u201ct\u201d was deleted."}, {"end": 413149, "meta": {"new_paragraph": false},
"position": 137, "start": 411626, "text": "So we have a deletion of a
\u201ct\u201d."}, {"end": 415790, "meta": {"new_paragraph": false}, "position": 138,
"start": 413149, "text": "So a deletion of a \u201ct\u201d turns
\u201cactress\u201d into \u201cacress\u201d."}, {"end": 419634,
"meta": {"new_paragraph": false}, "position": 139, "start": 415790, "text": "Here,
the proposed candidate is the word \u201ccress\u201d,"}, {"end": 421048, "meta":
{"new_paragraph": false}, "position": 140, "start": 419634, "text": "the kind of
vegetable."}, {"end": 422678, "meta": {"new_paragraph": false}, "position": 141,
"start": 421048, "text": "So, here \u201ccress\u201d:"}, {"end": 426001, "meta":
{"new_paragraph": false}, "position": 142, "start": 422678, "text": "To turn
\u201ccress\u201d into \u201cacress\u201d we have to add, insert an
\u201ca\u201d."}, {"end": 427977, "meta": {"new_paragraph": false}, "position": 143,
"start": 426001, "text": "So, here we had a deletion, here we had an insertion."},
{"end": 429907, "meta": {"new_paragraph": false}, "position": 144, "start": 427977,
"text": "How about \u201ccaress\u201d?"}, {"end": 435418, "meta": {"new_paragraph":
false}, "position": 145, "start": 429907, "text": "To turn \u201ccaress\u201d
into \u201cacress\u201d we turn a \u201cca\u201d into the \u201cac\u201d,"},
{"end": 438127, "meta": {"new_paragraph": false}, "position": 146, "start": 435418,
"text": "so we have a transposition of \u201cca\u201d and \u201cac\u201d."},
{"end": 441814, "meta": {"new_paragraph": false}, "position": 147, "start": 439143,
"text": "The word could\u2019ve been \u201caccess\u201d."}, {"end": 443454, "meta":
{"new_paragraph": false}, "position": 148, "start": 441814, "text": "Here we have a
substitution,"}, {"end": 445835, "meta": {"new_paragraph": false}, "position": 149,
"start": 443454, "text": "the \u201cc\u201d turned into an \u201cr\u201d."},
{"end": 447781, "meta": {"new_paragraph": false}, "position": 150, "start": 445835,
"text": "Or another substitution:"}, {"end": 449253, "meta": {"new_paragraph":
false}, "position": 151, "start": 447781, "text": "The word could\u2019ve been
\u201cacross\u201d,"}, {"end": 450725, "meta": {"new_paragraph": false}, "position":
152, "start": 449253, "text": "and the \u201co\u201d turned into an
\u201ce\u201d."}, {"end": 454326, "meta": {"new_paragraph": false}, "position": 153,
"start": 450725, "text": "Or an \u201cs\u201d could\u2019ve been inserted,"},
{"end": 457927, "meta": {"new_paragraph": false}, "position": 154, "start": 454326,
"text": "to turn \u201cacres\u201d into, into \u201cacress\u201d;"}, {"end":
459951, "meta": {"new_paragraph": false}, "position": 155, "start": 457927, "text":
"but the \u201cs\u201d could\u2019ve been inserted either here"}, {"end": 462636,
"meta": {"new_paragraph": false}, "position": 156, "start": 461135, "text": "or
here."}, {"end": 464199, "meta": {"new_paragraph": false}, "position": 157,
"start": 462636, "text": "So there\u2019s two different ways"}, {"end": 468279,
"meta": {"new_paragraph": false}, "position": 158, "start": 464199, "text": "where
this source word could have turned into this error form."}, {"end": 470322, "meta":
{"new_paragraph": false}, "position": 159, "start": 468295, "text": "So we\u2019ll
put two rows down"}, {"end": 473964, "meta": {"new_paragraph": false}, "position":
160, "start": 470322, "text": "for both of these possible insertion locations,
positions."}, {"end": 477862, "meta": {"new_paragraph": false}, "position": 161,
"start": 474887, "text": "So I\u2019ve just shown you candidates that are within
edit distance of one."}, {"end": 480567, "meta": {"new_paragraph": false},
"position": 162, "start": 477862, "text": "It turns out that 80 percent of spelling
errors"}, {"end": 482325, "meta": {"new_paragraph": false}, "position": 163,
"start": 480567, "text": "are within edit distance of one."}, {"end": 484453,
"meta": {"new_paragraph": false}, "position": 164, "start": 482325, "text": "And
almost all errors are within edit distance of two."}, {"end": 486516, "meta":
{"new_paragraph": false}, "position": 165, "start": 484453, "text": "So most
algorithms either consider"}, {"end": 491061, "meta": {"new_paragraph": false},
"position": 166, "start": 486516, "text": "just edit distance one or edit distance
two possible candidates."}, {"end": 493296, "meta": {"new_paragraph": false},
"position": 167, "start": 491061, "text": "In practice, we also want to allow"},
{"end": 497115, "meta": {"new_paragraph": false}, "position": 168, "start": 493296,
"text": "not just insertion and substitution of letters,"}, {"end": 498655, "meta":
{"new_paragraph": false}, "position": 169, "start": 497115, "text": "but also of
spaces or hyphens."}, {"end": 501674, "meta": {"new_paragraph": false}, "position":
170, "start": 498655, "text": "So for example, if the user types
\u201cthisidea\u201d,"}, {"end": 505195, "meta": {"new_paragraph": false}, "position": 171,
"start": 501674, "text": "we\u2019d like to realize that there should be insertion
of a space,"}, {"end": 510516, "meta": {"new_paragraph": false}, "position": 172,
"start": 505195, "text": "or that the original space was in fact deleted to produce
this error form."}, {"end": 514203, "meta": {"new_paragraph": false}, "position":
173, "start": 510516, "text": "Or here, the original dash in the word
\u201cin-law\u201d was deleted"}, {"end": 516782, "meta": {"new_paragraph": false}, "position":
174, "start": 514203, "text": "to produce this error form, \u201cinlaw.\u201d"},
{"end": 520496, "meta": {"new_paragraph": false}, "position": 175, "start": 518259,
"text": "We\u2019ve seen candidate generation."}, {"end": 523348, "meta":
{"new_paragraph": false}, "position": 176, "start": 520496, "text": "Now we\u2019re
ready to talk about how to rank the candidates."}, {"end": 524923, "meta":
{"new_paragraph": false}, "position": 177, "start": 523348, "text": "And remember,
there are two factors:"}, {"end": 528343, "meta": {"new_paragraph": false},
"position": 178, "start": 524923, "text": "We have the language model and the
channel model."}, {"end": 529709, "meta": {"new_paragraph": false}, "position":
179, "start": 528343, "text": "Now [for] the language model,"}, {"end": 531856,
"meta": {"new_paragraph": false}, "position": 180, "start": 529709, "text": "we can
use any of the language modeling algorithms we\u2019ve already learned."}, {"end":
534206, "meta": {"new_paragraph": false}, "position": 181, "start": 531856, "text":
"We can use unigrams and bigrams and trigrams."}, {"end": 537793, "meta":
{"new_paragraph": false}, "position": 182, "start": 534206, "text": "We can use any
kind of back-off algorithm we want to use,"}, {"end": 539681, "meta":
{"new_paragraph": false}, "position": 183, "start": 537793, "text": "or smoothing
algorithm we want to use."}, {"end": 543156, "meta": {"new_paragraph": false},
"position": 184, "start": 539681, "text": "In practice for very, very large-scale,
web-scale correction,"}, {"end": 545868, "meta": {"new_paragraph": false},
"position": 185, "start": 543156, "text": "we\u2019re going to use, as usual, for
web-scale things,"}, {"end": 547365, "meta": {"new_paragraph": false}, "position":
186, "start": 545868, "text": "we\u2019re going to use stupid back-off."}, {"end":
554299, "meta": {"new_paragraph": false}, "position": 187, "start": 547365, "text":
"But we might want to use smarter algorithms for smaller kinds of tasks."}, {"end":
561353, "meta": {"new_paragraph": false}, "position": 188, "start": 556683, "text":
"So let\u2019s look at an example of a language model."}, {"end": 563421, "meta":
{"new_paragraph": false}, "position": 189, "start": 561353, "text": "Here I picked
just a very simple unigram."}, {"end": 566998, "meta": {"new_paragraph": false},
"position": 190, "start": 563421, "text": "And in this case we\u2019ve computed the
unigram"}, {"end": 569117, "meta": {"new_paragraph": false}, "position": 191,
"start": 566998, "text": "from the Corpus of Contemporary [American] English,"}, {"end":
570524, "meta": {"new_paragraph": false}, "position": 192, "start": 569117, "text":
"one of the many possible corpora."}, {"end": 571961, "meta": {"new_paragraph":
false}, "position": 193, "start": 570524, "text": "And here\u2019s some counts."},
{"end": 574493, "meta": {"new_paragraph": false}, "position": 194, "start": 571961,
"text": "Here\u2019s counts of the different possible candidates:"}, {"end":
577298, "meta": {"new_paragraph": false}, "position": 195, "start": 574493, "text":
"\u201cactress\u201d, \u201ccress\u201d, \u201ccaress\u201d, and so on."}, {"end":
578724, "meta": {"new_paragraph": false}, "position": 196, "start": 577298, "text":
"Here\u2019s their frequency."}, {"end": 582958, "meta": {"new_paragraph": false},
"position": 197, "start": 578724, "text": "And by normaliz[ing] by the total number
of words we get a probability."}, {"end": 584666, "meta": {"new_paragraph": false},
"position": 198, "start": 582958, "text": "(Here\u2019s the total number of
words.)"}, {"end": 588801, "meta": {"new_paragraph": false}, "position": 199,
"start": 584666, "text": "We get by normalizing this count by the total count, we
get probabilities."}, {"end": 594796, "meta": {"new_paragraph": false}, "position":
200, "start": 588801, "text": "So here\u2019s the probabilities of words assigned
by unigram language model."}, {"end": 598411, "meta": {"new_paragraph": false},
"position": 201, "start": 595519, "text": "How about computing the channel model
probability?"}, {"end": 599893, "meta": {"new_paragraph": false}, "position": 202,
"start": 598411, "text": "Remember, the channel model\u2019s also called"}, {"end":
603727, "meta": {"new_paragraph": false}, "position": 203, "start": 599893, "text":
"the error model or the edit probability."}, {"end": 607824, "meta":
{"new_paragraph": false}, "position": 204, "start": 603727, "text": "And we\u2019re
going to take a simplifying assumption"}, {"end": 611915, "meta": {"new_paragraph":
false}, "position": 205, "start": 607824, "text": "made by Kernighan, Church, and
Gale in 1990,"}, {"end": 615054, "meta": {"new_paragraph": false}, "position": 206,
"start": 611915, "text":
"when they first proposed the use of the noisy channel model."}, {"end": 616893,
"meta": {"new_paragraph": false}, "position": 207, "start": 615054, "text": "So
let\u2019s first see how to do that."}, {"end": 621800, "meta": {"new_paragraph":
false}, "position": 208, "start": 616893, "text": "Let\u2019s assume that a
misspelled word X has a set of letters, X1 through XM."}, {"end": 627612, "meta":
{"new_paragraph": false}, "position": 209, "start": 621800, "text": "And the
correct word, W, has a set of letters, let\u2019s call them W1 through WN."},
{"end": 632120, "meta": {"new_paragraph": false}, "position": 210, "start": 627612,
"text": "Now the probability of the edit X given W"}, {"end": 639058, "meta":
{"new_paragraph": false}, "position": 211, "start": 632120, "text": "is going to be
some set of deletions or insertions or substitutions or transpositions\u2014some
set of edits."}, {"end": 642359, "meta": {"new_paragraph": false}, "position": 212,
"start": 639627, "text": "The way we\u2019re going to model that"}, {"end": 644899,
"meta": {"new_paragraph": false}, "position": 213, "start": 642359, "text":
"is we\u2019re going to create a confusion matrix."}, {"end": 654333, "meta":
{"new_paragraph": false}, "position": 214, "start": 644899, "text": "And a
confusion matrix says for any given pair of letters,"}, {"end": 656599, "meta":
{"new_paragraph": false}, "position": 215, "start": 654333, "text": "how likely is
a particular edit to happen."}, {"end": 660740, "meta": {"new_paragraph": false},
"position": 216, "start": 656599, "text": "So for example, for the pair of letters
XY,"}, {"end": 665695, "meta": {"new_paragraph": false}, "position": 217, "start":
660740, "text": "we want to know how often XY is typed as X,"}, {"end": 669376,
"meta": {"new_paragraph": false}, "position": 218, "start": 665695, "text":
"meaning: how often is a Y deleted when there\u2019s a X right before it."},
{"end": 672037, "meta": {"new_paragraph": false}, "position": 219, "start": 669376,
"text": "We\u2019re going to also keep a count of,"}, {"end": 676381, "meta":
{"new_paragraph": false}, "position": 220, "start": 672037, "text": "for insertion
probabilities, how often was an X typed as XY."}, {"end": 679760, "meta":
{"new_paragraph": false}, "position": 221, "start": 676381, "text": "So how often
is Y inserted after X."}, {"end": 682333, "meta": {"new_paragraph": false},
"position": 222, "start": 679760, "text": "So, Y deleted after X; Y inserted after
X."}, {"end": 684296, "meta": {"new_paragraph": false}, "position": 223, "start":
682333, "text": "Or we\u2019ll keep a count for substitutions."}, {"end": 686762,
"meta": {"new_paragraph": false}, "position": 224, "start": 684296, "text": "How
often is X typed as Y?"}, {"end": 688532, "meta": {"new_paragraph": false},
"position": 225, "start": 686762, "text": "So we meant to type X, we typed Y."},
{"end": 690809, "meta": {"new_paragraph": false}, "position": 226, "start": 688532,
"text": "That\u2019s an X\u2013Y substitution."}, {"end": 694473, "meta":
{"new_paragraph": false}, "position": 227, "start": 690809, "text": "Or a
transposition, how often was XY typed as YX?"}, {"end": 695735, "meta":
{"new_paragraph": false}, "position": 228, "start": 694473, "text": "So these are
just counts."}, {"end": 700144, "meta": {"new_paragraph": false}, "position": 229,
"start": 695735, "text": "We\u2019ll keep a matrix of these counts for every X and
for every Y."}, {"end": 703092, "meta": {"new_paragraph": false}, "position": 230,
"start": 700144, "text": "I noticed that what we\u2019ve done implicitly"}, {"end":
710962, "meta": {"new_paragraph": false}, "position": 231, "start": 703092, "text":
"is we\u2019ve conditioned our insertion and our deletion on the previous
character."}, {"end": 714794, "meta": {"new_paragraph": false}, "position": 232,
"start": 710962, "text": "So whether Y is deleted is conditioned on X."}, {"end":
716840, "meta": {"new_paragraph": false}, "position": 233, "start": 714794, "text":
"We could have conditioned\u2014chosen the condition\u2014"}, {"end": 719938,
"meta": {"new_paragraph": false}, "position": 234, "start": 716840, "text": "of the
next character or the character five to the left or some other thing,"}, {"end":
722610, "meta": {"new_paragraph": false}, "position": 235, "start": 719938, "text":
"but we generally condition on the previous character."}, {"end": 726401, "meta":
{"new_paragraph": false}, "position": 236, "start": 722610, "text": "So here\u2019s
an example of a confusion matrix for spelling errors."}, {"end": 730585, "meta":
{"new_paragraph": false}, "position": 237, "start": 726401, "text": "The font is a
little small, but just to give you a basic idea,"}, {"end": 736371, "meta":
{"new_paragraph": false}, "position": 238, "start": 730585, "text": "here\u2019s
a substitution matrix that I took from Kernighan et al."}, {"end": 738459,
"meta": {"new_paragraph": false}, "position": 239, "start": 736371, "text": "So
here\u2019s the letter e,"}, {"end": 745356, "meta": {"new_paragraph": false},
"position": 240, "start": 738459, "text": "and it\u2019s very likely\u2014in their
data, 388 times\u2014to be substituted with an a."}, {"end": 748240, "meta":
{"new_paragraph": false}, "position": 241, "start": 745356, "text": "So, you meant
to type e, you incorrectly typed an a."}, {"end": 750471, "meta": {"new_paragraph":
false}, "position": 242, "start": 748240, "text": "Or you might have typed an i, or
you might have typed an o."}, {"end": 753880, "meta": {"new_paragraph": false},
"position": 243, "start": 750471, "text": "So vowels are very likely to be mistaken
for each other."}, {"end": 760254, "meta": {"new_paragraph": false}, "position":
244, "start": 753880, "text": "Or similarly, the letter m very often gets mistyped
as an n."}, {"end": 764560, "meta": {"new_paragraph": false}, "position": 245,
"start": 760254, "text": "So, a very high probability of m and n being substituted
for each other."}, {"end": 766099, "meta": {"new_paragraph": false}, "position":
246, "start": 764560, "text": "They\u2019re next to each other on the keyboard."},
{"end": 767114, "meta": {"new_paragraph": false}, "position": 247, "start": 766099,
"text": "They sound alike."}, {"end": 769030, "meta": {"new_paragraph": false},
"position": 248, "start": 767114, "text": "Lots of reasons for them to be
substituted."}, {"end": 770897, "meta": {"new_paragraph": false}, "position": 249,
"start": 769030, "text": "So, here\u2019s our set of confusion matrices,"}, {"end":
774201, "meta": {"new_paragraph": false}, "position": 250, "start": 770897, "text":
"and we just compute four of them:"}, {"end": 775698, "meta": {"new_paragraph":
false}, "position": 251, "start": 774201, "text": "one for substitution,"}, {"end":
777818, "meta": {"new_paragraph": false}, "position": 252, "start": 775698, "text":
"one for insertion,"}, {"end": 780032, "meta": {"new_paragraph": false},
"position": 253, "start": 777818, "text": "one for deletion,"}, {"end": 781467,
"meta": {"new_paragraph": false}, "position": 254, "start": 780032, "text": "and
one for transposition."}, {"end": 785009, "meta": {"new_paragraph": false},
"position": 255, "start": 781851, "text": "Now I\u2019ve shown you this table comes
from Kernighan et al.,"}, {"end": 788203, "meta": {"new_paragraph": false},
"position": 256, "start": 785378, "text": "but you could also generate the table
yourself."}, {"end": 794201, "meta": {"new_paragraph": false}, "position": 257,
"start": 788203, "text": "So for example Peter Norvig post on his website a lovely
list of errors."}, {"end": 800613, "meta": {"new_paragraph": false}, "position":
258, "start": 797278, "text": "So these are errors taken from Wikipedia"}, {"end":
803981, "meta": {"new_paragraph": false}, "position": 259, "start": 800613, "text":
"and other places that he talks about on his website."}, {"end": 806257, "meta":
{"new_paragraph": false}, "position": 260, "start": 803981, "text": "And from a set
of errors like this."}, {"end": 814871, "meta": {"new_paragraph": false},
"position": 261, "start": 806257, "text": "So, here, misspellings of \
u201cadaptable\u201d as \u201cadabtable\u201d,"}, {"end": 817482, "meta":
{"new_paragraph": false}, "position": 262, "start": 814871, "text": "or \
u201cimmature\u201d with only one \u201cm\u201d, and so on."}, {"end": 821642,
"meta": {"new_paragraph": false}, "position": 263, "start": 817482, "text": "So
various kinds of likely misspellings."}, {"end": 823225, "meta": {"new_paragraph":
false}, "position": 264, "start": 821642, "text": "And from this list of errors"},
{"end": 827751, "meta": {"new_paragraph": false}, "position": 265, "start": 823225,
"text": "we can get a list of counts for every possible single error,"}, {"end":
830344, "meta": {"new_paragraph": false}, "position": 266, "start": 827751, "text":
"single edit error of how often it happens"}, {"end": 834844, "meta":
{"new_paragraph": false}, "position": 267, "start": 830344, "text": "and from that
we can build our little confusion matrix"}, {"end": 838366, "meta":
{"new_paragraph": false}, "position": 268, "start": 834844, "text": "and then from
the confusion matrix we can generate probabilities."}, {"end": 845123, "meta":
{"new_paragraph": false}, "position": 269, "start": 838366, "text": "So, every time
a particular previous letter happens,"}, {"end": 849996, "meta": {"new_paragraph":
false}, "position": 270, "start": 845123, "text": "we look up in our confusion
matrix"}, {"end": 852859, "meta": {"new_paragraph": false}, "position": 271,
"start": 849996, "text": "and we say how often was xi inserted"}, {"end": 855282,
"meta": {"new_paragraph": false}, "position": 272, "start": 852859, "text": "after
a particular letter w sub I minus one"}, {"end": 858321, "meta": {"new_paragraph":
false}, "position": 273, "start": 855282, "text": "and we divide by the number of
times
w i minus one occurred"}, {"end": 860642, "meta": {"new_paragraph": false},
"position": 274, "start": 858321, "text": "and that\u2019s going to be the
probability"}, {"end": 865964, "meta": {"new_paragraph": false}, "position": 275,
"start": 860642, "text": "of a particular insertion happening in a word."}, {"end":
869256, "meta": {"new_paragraph": false}, "position": 276, "start": 865964, "text":
"So we can generate our probability of our surface form"}, {"end": 872225, "meta":
{"new_paragraph": false}, "position": 277, "start": 869256, "text": "for each
possible single edit error\u2014"}, {"end": 874217, "meta": {"new_paragraph":
false}, "position": 278, "start": 872225, "text": "again we\u2019re assuming a
single edit now,"}, {"end": 875857, "meta": {"new_paragraph": false}, "position":
279, "start": 874217, "text": "so one, only one of these happens\u2014"}, {"end":
878015, "meta": {"new_paragraph": false}, "position": 280, "start": 875857, "text":
"to generate our candidate."}, {"end": 879635, "meta": {"new_paragraph": false},
"position": 281, "start": 878015, "text": "Whichever one it is,"}, {"end": 880794,
"meta": {"new_paragraph": false}, "position": 282, "start": 879635, "text": "we
compute our probability"}, {"end": 885437, "meta": {"new_paragraph": false},
"position": 283, "start": 880794, "text": "by just normalizing the count of the
deletion or insertion or substitution or transposition,"}, {"end": 887437, "meta":
{"new_paragraph": false}, "position": 284, "start": 885437, "text": "by the
appropriate count,"}, {"end": 888931, "meta": {"new_paragraph": false}, "position":
285, "start": 887437, "text": "and generate a probability."}, {"end": 892147,
"meta": {"new_paragraph": false}, "position": 286, "start": 889593, "text": "So,
this channel model."}, {"end": 896845, "meta": {"new_paragraph": false},
"position": 287, "start": 892147, "text": "For example for a word like \
u201cactress\u201d,"}, {"end": 910082, "meta": {"new_paragraph": false},
"position": 288, "start": 896845, "text": "where we generated \u201cacress\u201d by
when we should have typed a \u201cct\u201d,"}, {"end": 914798, "meta":
{"new_paragraph": false}, "position": 289, "start": 910082, "text": "we typed a \
u201cc\u201d so the word had a \u201cct\u201d in it but the error had only a \
u201cc\u201d."}, {"end": 919462, "meta": {"new_paragraph": false}, "position": 290,
"start": 914798, "text": "So what\u2019s the probability of deleting a \u201ct\
u201d following a \u201cc\u201d?"}, {"end": 923378, "meta": {"new_paragraph":
false}, "position": 291, "start": 919462, "text": "And if we\u2019d normalized the
probabilities in our confusion matrix,"}, {"end": 927567, "meta": {"new_paragraph":
false}, "position": 292, "start": 923378, "text": "here\u2019s the likelihood of
this word \u201cactress\u201d"}, {"end": 930409, "meta": {"new_paragraph": false},
"position": 293, "start": 927567, "text": "being realized as the misspelling \
u201cacress\u201d,"}, {"end": 932858, "meta": {"new_paragraph": false}, "position":
294, "start": 930409, "text": "it\u2019s .000117."}, {"end": 939352, "meta":
{"new_paragraph": false}, "position": 295, "start": 932858, "text": "The language
model so, here\u2019s the error model or the channel model."}, {"end": 942422,
"meta": {"new_paragraph": false}, "position": 296, "start": 939352, "text": "And
now we can add in the language model, I\u2019ll write LM."}, {"end": 943790,
"meta": {"new_paragraph": false}, "position": 297, "start": 942422, "text": "So we
have the channel model."}, {"end": 949084, "meta": {"new_paragraph": false},
"position": 298, "start": 943790, "text": "How likely was \u201cct\u201d to be,
errorfully turned into \u201cc\u201d?"}, {"end": 950387, "meta": {"new_paragraph":
false}, "position": 299, "start": 949084, "text": "So \u201ct\u201d to be
deleted."}, {"end": 952195, "meta": {"new_paragraph": false}, "position": 300,
"start": 950387, "text": "And how likely is the word \u201cactress\u201d,
anyway?"}, {"end": 954611, "meta": {"new_paragraph": false}, "position": 301,
"start": 952195, "text": "And we can just multiply these together."}, {"end":
955880, "meta": {"new_paragraph": false}, "position": 302, "start": 954611, "text":
"And what we\u2019ll do is,"}, {"end": 957200, "meta": {"new_paragraph": false},
"position": 303, "start": 955880, "text": "because these are very small numbers,"},
{"end": 961182, "meta": {"new_paragraph": false}, "position": 304, "start": 957200,
"text": "we\u2019ll just multiply everything by ten to the ninth to make it
readable."}, {"end": 965898, "meta": {"new_paragraph": false}, "position": 305,
"start": 961182, "text": "So, this would be 2.7 times ten to the minus ninth."},
{"end": 967820, "meta": {"new_paragraph": false}, "position": 306, "start": 965898,
"text": "But we\u2019d multiplied everything by ten to the ninth here."}, {"end":
974372, "meta": {"new_paragraph": false}, "position": 307, "start": 967820, "text":
"So you can see that the most likely word here is \u201cacross\u201d."}, {"end":
980472, "meta": {"new_paragraph": false}, "position": 308, "start": 974372, "text":
"I, with this particular this particular channel model,"}, {"end": 985802, "meta":
{"new_paragraph": false}, "position": 309, "start": 980472, "text": "and this
particular language model the most likely word is \u201cacross\u201d."}, {"end":
988399, "meta": {"new_paragraph": false}, "position": 310, "start": 985802, "text":
"But, \u201cactress\u201d is also quite likely."}, {"end": 991900, "meta":
{"new_paragraph": false}, "position": 311, "start": 988399, "text": "And, and \
u201cacres\u201d seems a reasonably likelihood."}, {"end": 992994, "meta":
{"new_paragraph": false}, "position": 312, "start": 991900, "text": "And the word \
u201ccress\u201d,"}, {"end": 994131, "meta": {"new_paragraph": false}, "position":
313, "start": 992994, "text": "which is just a very rare word,"}, {"end": 995352,
"meta": {"new_paragraph": false}, "position": 314, "start": 994131, "text": "you
can see has a very low probability,"}, {"end": 999052, "meta": {"new_paragraph":
false}, "position": 315, "start": 995352, "text": "and the unusual error of
inserting an \u201ca\u201d at the beginning"}, {"end": 1001506, "meta":
{"new_paragraph": false}, "position": 316, "start": 999052, "text": "makes it a
very low probability correction."}, {"end": 1006587, "meta": {"new_paragraph":
false}, "position": 317, "start": 1001506, "text": "So the noisy channel model
likes the word \u201cacross\u201d as the possible replacement."}, {"end": 1010082,
"meta": {"new_paragraph": false}, "position": 318, "start": 1006587, "text":
"Unfortunately, we can see from the original sentence,"}, {"end": 1012493, "meta":
{"new_paragraph": false}, "position": 319, "start": 1010082, "text": "taken from
Kernighan et al\u2019s paper,"}, {"end": 1016824, "meta": {"new_paragraph": false},
"position": 320, "start": 1012493, "text": "that [in] the original sentence \
u201cacross\u201d is the wrong word."}, {"end": 1018905, "meta": {"new_paragraph":
false}, "position": 321, "start": 1016824, "text": "The original sentence is"},
{"end": 1023806, "meta": {"new_paragraph": false}, "position": 322, "start":
1018905, "text": "\u201ca stellar and versatile acress whose combination of sass
and glamour\u2026\u201d"}, {"end": 1026929, "meta": {"new_paragraph": false},
"position": 323, "start": 1023806, "text": "And it should be clear that this word
should have been \u201cactress\u201d."}, {"end": 1028027, "meta": {"new_paragraph":
false}, "position": 324, "start": 1026929, "text": "So \u201cacross\u201d is the
wrong word."}, {"end": 1031385, "meta": {"new_paragraph": false}, "position": 325,
"start": 1028027, "text": "So, just using a unigram model, the noisy channel makes
a mistake."}, {"end": 1033813, "meta": {"new_paragraph": false}, "position": 326,
"start": 1031385, "text": "So let\u2019s look at a bigram model."}, {"end":
1035326, "meta": {"new_paragraph": false}, "position": 327, "start": 1033813,
"text": "How well can we do with a bigram model?"}, {"end": 1037676, "meta":
{"new_paragraph": false}, "position": 328, "start": 1035726, "text": "So we
computed a very simple bigram model,"}, {"end": 1039994, "meta": {"new_paragraph":
false}, "position": 329, "start": 1037676, "text": "just using \u201cadd-one
smoothing\u201d,"}, {"end": 1042116, "meta": {"new_paragraph": false}, "position":
330, "start": 1039994, "text": "from the Corpus of Contemporary American
English."}, {"end": 1045617, "meta": {"new_paragraph": false}, "position": 331,
"start": 1042116, "text": "So now, the probability of \u201cactress\u201d given \
u201cversatile\u201d."}, {"end": 1048179, "meta": {"new_paragraph": false},
"position": 332, "start": 1045617, "text": "Just look at these three words, and
ignore the rest for now."}, {"end": 1050352, "meta": {"new_paragraph": false},
"position": 333, "start": 1048179, "text": "\u201cActress\u201d given \
u201cversatile\u201d,"}, {"end": 1053765, "meta": {"new_paragraph": false},
"position": 334, "start": 1050352, "text": "that probability is .00021."}, {"end":
1058320, "meta": {"new_paragraph": false}, "position": 335, "start": 1053765,
"text": "And \u201cwhose\u201d given \u201cactress\u201d is .00010 so we\u2019ll
compute those."}, {"end": 1060980, "meta": {"new_paragraph": false}, "position":
336, "start": 1058320, "text": "And now let\u2019s do the same thing for another
candidate,"}, {"end": 1063666, "meta": {"new_paragraph": false}, "position": 337,
"start": 1060980, "text": "the original candidate that was preferred by the unigram
model,"}, {"end": 1066089, "meta": {"new_paragraph": false}, "position": 338,
"start": 1063666, "text": "the word \u201cacross\u201d."}, {"end": 1072057, "meta":
{"new_paragraph": false}, "position": 339, "start": 1068212, "text": "We\u2019ll
put \u201cacross\u201d here, instead of our hypothesis,"},
{"end": 1075519, "meta": {"new_paragraph": false}, "position": 340, "start":
1072057, "text": "and we\u2019ll again compute the probability of
\u201cacross\u201d given \u201cversatile\u201d"}, {"end": 1077496, "meta": {"new_paragraph":
false}, "position": 341, "start": 1075519, "text": "times the probability of
\u201cwhose\u201d given \u201cacross\u201d."}, {"end": 1079944, "meta":
{"new_paragraph": false}, "position": 342, "start": 1077496, "text": "So here are
those probabilities,"}, {"end": 1082896, "meta": {"new_paragraph": false},
"position": 343, "start": 1079944, "text": "and you can see that the probability of
\u201cwhose\u201d given \u201cactress\u201d"}, {"end": 1086573, "meta":
{"new_paragraph": false}, "position": 344, "start": 1082896, "text": "is much
higher than the probability of \u201cwhose\u201d given \u201cacross\u201d."},
{"end": 1088932, "meta": {"new_paragraph": false}, "position": 345, "start":
1086573, "text": "\u201cActress whose\u201d is just a likely sequence."}, {"end":
1092845, "meta": {"new_paragraph": false}, "position": 346, "start": 1088932,
"text": "And sure enough, if we multiply these things out,"}, {"end": 1098674,
"meta": {"new_paragraph": false}, "position": 347, "start": 1092845, "text": "the
probability of \u201cversatile actress whose\u201d is a much higher probability"},
{"end": 1101418, "meta": {"new_paragraph": false}, "position": 348, "start":
1098674, "text": "than the probability of the sequence \u201cversatile across
whose\u201d."}, {"end": 1102788, "meta": {"new_paragraph": false}, "position": 349,
"start": 1101418, "text": "So a much higher probability."}, {"end": 1105131,
"meta": {"new_paragraph": false}, "position": 350, "start": 1102788, "text": "So
the noisy channel model with a bigram language model"}, {"end": 1109084, "meta":
{"new_paragraph": false}, "position": 351, "start": 1105131, "text": "correctly
picks the correction \u201cactress\u201d."}, {"end": 1115726, "meta":
{"new_paragraph": false}, "position": 352, "start": 1110191, "text": "How are we
going to evaluate these noisy channel and other kinds of models?"}, {"end":
1119772, "meta": {"new_paragraph": false}, "position": 353, "start": 1115726,
"text": "There are lots of good spelling error test sets."}, {"end": 1124302,
"meta": {"new_paragraph": false}, "position": 354, "start": 1119772, "text":
"Wikipedia has a list of common English misspellings."}, {"end": 1127321, "meta":
{"new_paragraph": false}, "position": 355, "start": 1124302, "text": "There\u2019s
a filtered version of that at Aspell."}, {"end": 1130134, "meta": {"new_paragraph":
false}, "position": 356, "start": 1127321, "text": "There\u2019s a spelling error
corpus at Birkbeck."}, {"end": 1131918, "meta": {"new_paragraph": false},
"position": 357, "start": 1130134, "text": "Let\u2019s look at the Wikipedia
list."}, {"end": 1142191, "meta": {"new_paragraph": false}, "position": 358,
"start": 1137010, "text": "So there\u2019s Wikipedia\u2019s list of common English
misspellings."}, {"end": 1146388, "meta": {"new_paragraph": false}, "position":
359, "start": 1142191, "text": "And I\u2019ve shown you here on this slide"},
{"end": 1148820, "meta": {"new_paragraph": false}, "position": 360, "start":
1146388, "text": "various other lists that you can go look at on your
own."}, {"end": 1152745, "meta": {"new_paragraph": false}, "position": 361,
"start": 1148820, "text": "So from these lists of misspellings"}, {"end": 1156728,
"meta": {"new_paragraph": false}, "position": 362, "start": 1152745, "text": "you
would generate a training set to train your channel model,"}, {"end": 1158940,
"meta": {"new_paragraph": false}, "position": 363, "start": 1156728, "text": "a
development set to test out your model"}, {"end": 1161794, "meta":
{"new_paragraph": false}, "position": 364, "start": 1158940, "text": "and then a
final test set to see how well your model works."}, {"end": null, "meta":
{"new_paragraph": false}, "position": 365, "start": 1162579, "text": "So that\
u2019s the noisy channel model of spelling applied to non-real words."}], "title":
"The Noisy Channel Model of Spelling", "version_no": 9, "version_number": 9,
"video": "The Noisy Channel Model of Spelling", "video_description": "",
"video_title": "The Noisy Channel Model of Spelling"}

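
The bigram comparison in the transcript multiplies the probability of the candidate given the previous word by the probability of the next word given the candidate. A minimal sketch of that product, assuming a hypothetical helper `bigram_context_score` and a plain dictionary keyed by (previous word, word). The two values are the ones quoted in the transcript for "actress"; the corresponding values for "across" appear only on the slide, so they are left out here rather than guessed.

```python
def bigram_context_score(bigram_prob, prev_word, candidate, next_word):
    """Return P(candidate | prev_word) * P(next_word | candidate)."""
    return bigram_prob[(prev_word, candidate)] * bigram_prob[(candidate, next_word)]


# Keys are (previous word, word). Values are the figures quoted in the transcript.
bigram_prob = {
    ("versatile", "actress"): 0.00021,
    ("actress", "whose"): 0.00010,
}

actress_score = bigram_context_score(bigram_prob, "versatile", "actress", "whose")
print(f"{actress_score:.2e}")
```

To rank candidates, the same function would be called for each one (for example "across") and the score multiplied by that candidate's channel-model probability.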