Lecture 3: Tolerant Retrieval
Information Retrieval
Computer Science Tripos Part II
Ronan Cummins (adapted from Simone Teufel’s original slides)
2016
Overview
1 Recap
2 Dictionaries
3 Wildcard queries
4 Spelling correction
IR System components
[Figure: components of an IR system: document collection, document normalisation, indexer, indexes, query, UI, query normalisation, ranking/matching module, set of relevant documents]
Type/token distinction
12 word tokens
9 word types
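The distinction can be illustrated with a short counting sketch. The sentence below is an assumption chosen to reproduce the counts above (it is not necessarily the one on the slide), and the tokeniser is deliberately naive: tokens are runs of letters, and types are case-folded tokens.

```python
import re

def tokens_and_types(text):
    """Return (tokens, types) for a text, using a deliberately simple tokeniser."""
    # Assumption: a token is a maximal run of ASCII letters, and a type is a
    # case-folded token. Real normalisation is usually more involved.
    tokens = re.findall(r"[A-Za-z]+", text)
    types = {token.lower() for token in tokens}
    return tokens, types

# Illustrative sentence (an assumption, not taken from the slide).
sentence = "In June, the dog likes to chase the cat in the barn."
tokens, types = tokens_and_types(sentence)
print(len(tokens), "word tokens")
print(len(types), "word types")
```

Without case folding, "In" and "in" would count as different types, so the number of types depends on the normalisation that is applied.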
Problems with equivalence classing
Positional indexes
IR System components
[Figure: the IR system diagram again, reduced to document collection, IR system, query, and set of relevant documents]
Upcoming
Inverted Index
Calpurnia (document frequency 4) → 2 → 31 → 54 → 101
Dictionaries
Data structures for looking up terms
Hashes
Trees
Trees solve the prefix problem (find all terms starting with automat).
Simplest tree: binary tree.
Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.
O(log M) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate positive integers, e.g., [2, 4].
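The prefix lookup can be sketched without writing a full tree. The snippet below swaps in a sorted list with binary search as a stand-in for a balanced search tree: it is not a tree implementation, but it has the same ordered layout and the same O(log M) search cost, which is what makes the prefix problem easy. The vocabulary is made up for illustration.

```python
from bisect import bisect_left

def terms_with_prefix(sorted_vocab, prefix):
    """Return all terms in a sorted vocabulary that start with `prefix`."""
    # Binary search for the first term >= prefix: O(log M) comparisons.
    lo = bisect_left(sorted_vocab, prefix)
    # Terms sharing the prefix sort before prefix + the highest code point
    # (assumes no term itself contains that code point).
    hi = bisect_left(sorted_vocab, prefix + "\U0010FFFF")
    return sorted_vocab[lo:hi]

# Hypothetical vocabulary, sorted once up front.
vocab = sorted(["auto", "automat", "automata", "automatic",
                "automaton", "autumn", "zebra"])
print(terms_with_prefix(vocab, "automat"))
```

A hash table cannot answer this query without scanning every key, because hashing destroys the ordering of the terms.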
Binary tree
B-tree
Trie
[Figure: a trie for the keys "A", "to", "tea", "ted", "ten", "in", and "inn"]
Trie with postings
[Figure: the trie above, with a postings list of ascending document IDs attached to every node that ends a term]
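A trie with postings can be sketched as follows. This is a minimal illustration rather than the lecture's implementation: the class names are hypothetical, and the document IDs are invented because the figure's postings cannot be matched to terms reliably.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.postings = None  # sorted doc IDs if a term ends here, else None

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, postings):
        """Store a term and attach its postings list to the final node."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.postings = sorted(postings)

    def lookup(self, term):
        """Return the postings list for an exact term, or None if absent."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.postings

trie = Trie()
# Hypothetical postings; the document IDs are made up for illustration.
toy_postings = {"A": [1, 3, 4], "to": [1, 2, 3], "tea": [7, 9],
                "ted": [302], "ten": [5, 8], "in": [2, 4], "inn": [6]}
for term, docs in toy_postings.items():
    trie.insert(term, docs)

print(trie.lookup("tea"))  # postings for a stored term
print(trie.lookup("te"))   # "te" is only a prefix, so there are no postings
```

Lookup time depends on the length of the term rather than on the size of the vocabulary.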
Wildcard queries
hel*
Find all docs containing any term beginning with “hel”.
Easy with a trie: follow the letters h-e-l, then collect every term in the subtree below that node.
*hel
Find all docs containing any term ending with “hel”.
Maintain an additional trie for the terms written backwards.
Then retrieve all terms t in the subtree rooted at l-e-h.
In both cases:
This procedure gives us the set of terms that match the wildcard query.
Then retrieve the documents that contain any of these terms.
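The two cases can be sketched with a pair of tries, one over the terms and one over the reversed terms. The nested-dictionary representation, the function names, and the vocabulary are assumptions made for this illustration.

```python
END = None  # marker key: a complete term ends at this node

def build_trie(terms):
    """Build a trie as nested dictionaries keyed by character."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def terms_below(node, prefix):
    """Yield every complete term in the subtree rooted at `node`."""
    for key, child in node.items():
        if key is END:
            yield prefix
        else:
            yield from terms_below(child, prefix + key)

def prefix_matches(trie, prefix):
    """Follow the letters of `prefix`, then enumerate the subtree."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    return sorted(terms_below(node, prefix))

# Hypothetical vocabulary.
vocab = ["hello", "help", "helm", "heel", "shell", "michel", "rachel"]
forward = build_trie(vocab)
backward = build_trie(term[::-1] for term in vocab)

def trailing_wildcard(prefix):
    """Answer hel*-style queries with the forward trie."""
    return prefix_matches(forward, prefix)

def leading_wildcard(suffix):
    """Answer *hel-style queries with the trie over reversed terms."""
    reversed_hits = prefix_matches(backward, suffix[::-1])
    return sorted(term[::-1] for term in reversed_hits)

print(trailing_wildcard("hel"))
print(leading_wildcard("hel"))
```

The matching terms would then be looked up in the inverted index, and the union of their postings lists gives the documents to return.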
How to handle * in the middle of a term
hel*o
Permuterm index
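A minimal sketch of the permuterm idea, assuming the usual construction: every rotation of term + $ is indexed, and a query X*Y is rotated to Y$X so that the wildcard ends up at the end and the query becomes a prefix lookup. The function names and vocabulary are hypothetical, and the sketch handles queries with exactly one *.

```python
def rotations(term):
    """All rotations of term + '$', where '$' marks the end of the term."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm(vocab):
    """Map every rotation to its original term."""
    index = {}
    for term in vocab:
        for rot in rotations(term):
            index[rot] = term
    return index

def rotate_query(query):
    """Rotate a query with exactly one '*' so that the '*' comes last."""
    # For X*Y, look up the prefix Y$X (e.g. hel*o -> o$hel).
    head, tail = query.split("*")
    return tail + "$" + head

def wildcard_lookup(index, query):
    prefix = rotate_query(query)
    # A linear scan keeps the sketch short; a real index would store the
    # rotations in a tree or trie so that the prefix lookup is efficient.
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})

# Hypothetical vocabulary.
vocab = ["hello", "help", "hero", "halo", "heel"]
index = build_permuterm(vocab)
print(rotate_query("hel*o"))
print(wildcard_lookup(index, "hel*o"))
```

The cost of this convenience is space: a term of length n contributes n + 1 rotations to the index.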
k-gram indexes
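Processing wildcard terms in a bigram index
The query hel* becomes the Boolean query $h AND he AND el over the bigram index, where $ marks a word boundary.
... but this will return many false positives such as heel.
Postfilter, then look up the surviving terms in the term–document inverted index.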
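The bigram lookup and the postfiltering step can be sketched as follows. The function names and the vocabulary are assumptions; the sketch covers trailing-wildcard queries such as hel* with a non-empty prefix.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """k-grams of a term padded with the boundary marker '$'."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(vocab, k=2):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocab:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def query_kgrams(prefix, k=2):
    """k-grams of a trailing-wildcard query such as hel* (prefix = 'hel')."""
    padded = "$" + prefix  # no closing '$': the term may continue after the prefix
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def prefix_wildcard(index, prefix, k=2):
    grams = query_kgrams(prefix, k)
    # Boolean AND over the k-gram postings.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Postfilter: drop terms that contain the k-grams but not the prefix.
    matches = sorted(t for t in candidates if t.startswith(prefix))
    return sorted(candidates), matches

# Hypothetical vocabulary.
vocab = ["hello", "help", "heel", "shell", "hotel"]
index = build_kgram_index(vocab)
candidates, matches = prefix_wildcard(index, "hel")
print(query_kgrams("hel"))
print(candidates)  # still includes the false positive
print(matches)     # after postfiltering
```

The term heel survives the Boolean AND because it contains $h, he, and el, just not consecutively; only the postfilter removes it.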
k-gram vs. permuterm index
k-gram index is more space-efficient
permuterm index does not require postfiltering.
Spelling correction
Isolated word spelling correction
informaton → information
Edit distance
Levenshtein distance: Distance matrix
       s  n  o  w
    0  1  2  3  4
 o  1  1  2  2  3
 s  2  1  2  3  3
 l  3  2  2  3  4
 o  4  3  3  2  3
Edit Distance: Four cells
            s       n       o       w
   0       1 1     2 2     3 3     4 4

   1       1 2     2 3     2 4     4 5
o  1       2 1     2 2     3 2     3 3

   2       1 2     2 3     3 3     3 4
s  2       3 1     2 2     3 3     4 3

   3       3 2     2 3     3 4     4 4
l  3       4 2     3 2     3 3     4 4

   4       4 3     3 3     2 4     4 5
o  4       5 3     4 3     4 2     3 3
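Each cell of Levenshtein matrix
Each cell holds four values:
top left: cost of getting here from the upper-left neighbour (copy or replace)
top right: cost of getting here from the upper neighbour (delete)
bottom left: cost of getting here from the left neighbour (insert)
bottom right: the minimum of the three, i.e. the cheapest way of getting here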
Dynamic Programming
Cormen et al.:
Optimal substructure: the optimal solution contains within it subsolutions, i.e., optimal solutions to subproblems.
Overlapping subsolutions: the subsolutions overlap and would be computed over and over again by a brute-force algorithm.
For edit distance:
Subproblem: edit distance of two prefixes.
Overlap: most distances of prefixes are needed 3 times (when moving right, diagonally, and down in the matrix).
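The table-filling computation can be sketched as below, assuming unit costs for insert, delete, and replace. Each entry d[i][j] is the edit distance between the first i characters of the source and the first j characters of the target, so every prefix distance is computed once and then reused by its three neighbours. The function names are illustrative.

```python
def levenshtein_matrix(source, target):
    """Fill the full distance matrix for two strings (unit edit costs)."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        d[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + substitution,  # diagonal: copy or replace
                d[i - 1][j] + 1,                 # from above: delete
                d[i][j - 1] + 1,                 # from the left: insert
            )
    return d

def levenshtein(source, target):
    """Edit distance is the bottom-right entry of the matrix."""
    return levenshtein_matrix(source, target)[-1][-1]

for row in levenshtein_matrix("oslo", "snow"):
    print(row)
print(levenshtein("informaton", "information"))
```

The running time is proportional to the product of the two string lengths, whereas a naive recursion would recompute the same prefix distances many times.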
Example: Edit Distance oslo – snow
            s       n       o       w
   0       1 1     2 2     3 3     4 4

   1       1 2     2 3     2 4     4 5
o  1       2 1     2 2     3 2     3 3

   2       1 2     2 3     3 3     3 4
s  2       3 1     2 2     3 3     4 3

   3       3 2     2 3     3 4     4 4
l  3       4 2     3 2     3 3     4 4

   4       4 3     3 3     2 4     4 5
o  4       5 3     4 3     4 2     3 3
How do I read out the editing operations that transform oslo into snow?
Trace back from the bottom-right cell to the top-left cell, at each step moving to the neighbour that produced the minimum. Read forwards, the path gives:
cost  operation  input  output
1     delete     o      *
0     (copy)     s      s
1     replace    l      n
0     (copy)     o      o
1     insert     *      w
The costs sum to 3, the value in the bottom-right cell.
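Reading out the operations can be sketched as a traceback over the filled matrix. The tie-breaking order used here (diagonal first, then delete, then insert) is an assumption of this sketch; when several cheapest paths exist, other orders may return a different edit script of the same cost.

```python
def edit_operations(source, target):
    """Return one cheapest edit script as (cost, operation, input, output) tuples."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Walk from the bottom-right cell back to the top-left cell.
    ops = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and source[i - 1] == target[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub:
            name = "copy" if sub == 0 else "replace"
            ops.append((sub, name, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append((1, "delete", source[i - 1], "*"))
            i -= 1
        else:
            ops.append((1, "insert", "*", target[j - 1]))
            j -= 1
    ops.reverse()  # the traceback collects the operations last-to-first
    return ops

for cost, operation, inp, out in edit_operations("oslo", "snow"):
    print(cost, operation, inp, out)
```

For oslo and snow the cheapest path is unique under these rules, so the script matches the readout above.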
Using edit distance for spelling correction
k-gram indexes for spelling correction
Enumerate all k-grams in the query term
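One possible way to turn the enumerated k-grams into correction candidates is sketched below: look each query k-gram up in a k-gram index, count how many k-grams each vocabulary term shares with the query, and keep the terms whose overlap is high enough. The Jaccard coefficient and the 0.5 threshold are assumptions of this sketch, as are the function names and the vocabulary.

```python
from collections import Counter, defaultdict

def kgrams(term, k=2):
    """k-grams of a term padded with the boundary marker '$'."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(vocab, k=2):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocab:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def correction_candidates(index, query, min_jaccard=0.5, k=2):
    """Rank vocabulary terms by k-gram overlap with a (misspelled) query term."""
    query_grams = kgrams(query, k)
    shared = Counter()
    for gram in query_grams:
        for term in index.get(gram, ()):
            shared[term] += 1
    results = []
    for term, overlap in shared.items():
        union = len(query_grams) + len(kgrams(term, k)) - overlap
        score = overlap / union  # Jaccard coefficient of the two k-gram sets
        if score >= min_jaccard:
            results.append((round(score, 3), term))
    # Best score first; ties broken alphabetically.
    return sorted(results, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical vocabulary.
vocab = ["information", "informal", "inform", "formation", "nation"]
index = build_kgram_index(vocab)
results = correction_candidates(index, "informaton")
print(results)
```

The surviving candidates could then be re-ranked by edit distance, which is more expensive to compute and is therefore better applied to a short list than to the whole vocabulary.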
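Context-sensitive spelling correction
One idea: hit-based spelling correction
Example query: flew form munich
Retrieve correct terms close to each query term:
flew → flea
form → from
munich → munch
Holding all other terms fixed, try all possible phrase queries for each replacement candidate, and keep the phrase with the most hits.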
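The hit-based idea can be sketched as follows. The candidate replacements come from the slide, but the hit counts are invented for illustration, and hit_count is a hypothetical stand-in for running a phrase query against the index.

```python
def one_word_variants(query_terms, candidates):
    """Yield queries that differ from the original in exactly one term."""
    for i, term in enumerate(query_terms):
        for replacement in candidates.get(term, []):
            yield query_terms[:i] + [replacement] + query_terms[i + 1:]

def best_correction(query_terms, candidates, hit_count):
    """Return the original query or the one-word variant with the most hits."""
    variants = [query_terms] + list(one_word_variants(query_terms, candidates))
    return max(variants, key=lambda q: hit_count(" ".join(q)))

# Candidate replacements as on the slide.
candidates = {"flew": ["flea"], "form": ["from"], "munich": ["munch"]}

# Toy hit counts, invented for illustration only.
toy_hits = {
    "flew form munich": 2,
    "flea form munich": 0,
    "flew from munich": 950,
    "flew form munch": 1,
}

def hit_count(phrase):
    """Hypothetical stand-in for the number of documents matching a phrase query."""
    return toy_hits.get(phrase, 0)

best = best_correction(["flew", "form", "munich"], candidates, hit_count)
print(" ".join(best))
```

Each variant costs one phrase query, which is why the approach becomes expensive when every term has many candidates.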
User interface
automatic vs. suggested correction
“Did you mean” only works for one suggestion; what about
multiple possible corrections?
Tradeoff: Simple UI vs. powerful UI
Cost
Potentially very expensive
Avoid running on every query
Maybe just those that match few documents
Takeaway
Reading