
lecture3-tolerent

The document discusses indexing and retrieval techniques in information retrieval systems, focusing on dictionaries, wildcard queries, and spelling correction. It covers data structures like hashes, trees, tries, and k-gram indexes for efficient term lookup and handling of wildcard queries. Additionally, it explains methods for spelling correction, including isolated and context-sensitive approaches using edit distance calculations.

Uploaded by

olfa.gaddour

Lecture 3: Index Representation and Tolerant Retrieval
Information Retrieval
Computer Science Tripos Part II

Ronan Cummins1

Natural Language and Information Processing (NLIP) Group

[email protected]

2016

1 Adapted from Simone Teufel’s original slides
Overview

1 Recap

2 Dictionaries

3 Wildcard queries

4 Spelling correction
IR System components

(Figure: IR system diagram with the components Document Collection, Document Normalisation, Indexer, Indexes, Query, UI, Query Norm., Ranking/Matching Module, and Set of relevant documents.)

Last time: The indexer

Type/token distinction

Token: an instance of a word or term occurring in a document
Type: an equivalence class of tokens

In June, the dog likes to chase the cat in the barn.

12 word tokens
9 word types
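
The count above can be reproduced with a minimal Python sketch. The tokeniser used here (lower-casing, then splitting on runs of letters) is an assumption made for this illustration; a different normalisation would give different counts.

```python
import re

def tokens_and_types(text: str) -> tuple[list[str], set[str]]:
    # Case-fold first, so that "In" and "in" fall into the same type.
    tokens = re.findall(r"[a-z]+", text.lower())
    return tokens, set(tokens)

tokens, types = tokens_and_types(
    "In June, the dog likes to chase the cat in the barn."
)
print(len(tokens), len(types))  # 12 9
```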

Problems with equivalence classing

A term is an equivalence class of tokens.


How do we define equivalence classes?
Numbers (3/20/91 vs. 20/3/91)
Case folding
Stemming, Porter stemmer
Morphological analysis: inflectional vs. derivational
Equivalence classing problems in other languages

Positional indexes

Postings lists in a nonpositional index: each posting is just a docID
Postings lists in a positional index: each posting is a docID and a list of positions
Example query: “to₁ be₂ or₃ not₄ to₅ be₆”
With a positional index, we can answer
phrase queries
proximity queries
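
A small sketch of how a positional index can answer a phrase query. The function names and the three toy documents are made up for this illustration, and the nested dictionaries only stand in for real postings lists.

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> docID -> list of positions (1-based, as in the example query)
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase starts at position p if term number k occurs at p + k.
        starts = index[terms[0]][doc_id]
        if any(all(p + k in index[t][doc_id] for k, t in enumerate(terms))
               for p in starts):
            hits.append(doc_id)
    return hits

docs = {1: "to be or not to be", 2: "not to be or to be here", 3: "be to"}
index = build_positional_index(docs)
print(phrase_query(index, "to be or not"))  # [1]
print(phrase_query(index, "to be"))         # [1, 2]
```

A proximity query would relax the test `p + k` to a window of allowed positions.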

IR System components

(Figure: the IR system diagram again, showing Document Collection, IR System, Query, and Set of relevant documents.)

Today: more indexing, some query normalisation

Upcoming

Tolerant retrieval: What to do if there is no exact match between query term and document term
Data structures for dictionaries
Hashes
Trees
k-gram index
Permuterm index
Spelling correction

Inverted Index

Term (document frequency) → postings list

Brutus (8) → 1 2 4 11 31 45 173 174
Caesar (9) → 1 2 4 5 6 16 57 132 179
Calpurnia (4) → 2 31 54 101

Dictionaries

The dictionary is the data structure for storing the term vocabulary.
Term vocabulary: the data
Dictionary: the data structure for storing the term vocabulary

Dictionaries

For each term, we need to store a couple of items:


document frequency
pointer to postings list
How do we look up a query term qᵢ in the dictionary at query time?

Data structures for looking up terms

Two main classes of data structures: hashes and trees


Some IR systems use hashes, some use trees.
Criteria for when to use hashes vs. trees:
Is there a fixed number of terms or will it keep growing?
What are the relative frequencies with which various keys will
be accessed?
How many terms are we likely to have?

Hashes

Each vocabulary term is hashed into an integer, its row number in the array
At query time: hash query term, locate entry in fixed-width array
Pros: Lookup in a hash is faster than lookup in a tree.
(Lookup time is constant.)
Cons
no way to find minor variants (resume vs. résumé)
no prefix search (all terms starting with automat)
need to rehash everything periodically if vocabulary keeps
growing
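
A hedged sketch of a hash-based dictionary, using Python's built-in `dict` as the hash table. The `DictionaryEntry` class and the helper names are assumptions for this illustration; the three entries are the ones from the inverted index slide, with the postings list itself standing in for the pointer.

```python
from dataclasses import dataclass

@dataclass
class DictionaryEntry:
    document_frequency: int
    postings: list[int]  # stands in for the pointer to the postings list

dictionary: dict[str, DictionaryEntry] = {
    "brutus": DictionaryEntry(8, [1, 2, 4, 11, 31, 45, 173, 174]),
    "caesar": DictionaryEntry(9, [1, 2, 4, 5, 6, 16, 57, 132, 179]),
    "calpurnia": DictionaryEntry(4, [2, 31, 54, 101]),
}

def lookup(term):
    # Exact-match lookup: one hash computation, constant time on average.
    return dictionary.get(term)

def prefix_scan(prefix):
    # A hash keeps no ordering, so a prefix search has to visit every key.
    return sorted(t for t in dictionary if t.startswith(prefix))

print(lookup("caesar").document_frequency)  # 9
print(lookup("cæsar"))                      # None: minor variants are not found
print(prefix_scan("ca"))                    # ['caesar', 'calpurnia']
```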

Trees

Trees solve the prefix problem (find all terms starting with
automat).
Simplest tree: binary tree
Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.
O(log M) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of children
in the interval [a, b] where a, b are appropriate positive
integers, e.g., [2, 4].
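
To illustrate why an ordered structure solves the prefix problem, here is a minimal sketch that uses a sorted list with binary search as a stand-in for a balanced tree (an assumption of this example; a real system would use a B-tree or similar). All terms sharing a prefix are adjacent in sorted order, so one O(log M) search finds the start of the range.

```python
from bisect import bisect_left

def terms_with_prefix(sorted_terms, prefix):
    # Binary search for the first term >= prefix: O(log M).
    start = bisect_left(sorted_terms, prefix)
    # Matching terms are contiguous, so walk until the prefix stops matching.
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]

vocabulary = sorted(["binary", "autumn", "automation", "automatic",
                     "automate", "automat", "auto"])
print(terms_with_prefix(vocabulary, "automat"))
# ['automat', 'automate', 'automatic', 'automation']
```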

Binary tree

B-tree

Trie

An ordered tree data structure that is used to store an associative array
The keys are strings
The key associated with a node is inferred from the position of a node in the tree
Unlike in binary search trees, where keys are stored in nodes.
Values are associated only with leaves and some inner nodes that correspond to keys of interest (not all nodes).
All descendants of a node have a common prefix of the string associated with that node → tries can be searched by prefixes
The trie is sometimes called radix tree or prefix tree

Trie

(root)
  A
  t
    o
    e
      a
      d
      n
  i
    n
      n

A trie for keys “A”, “to”, “tea”, “ted”, “ten”, “in”, and “inn”.
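
A minimal trie sketch for the keys in the figure. The class and method names are assumptions of this example; `value` is where a postings list could be attached to a node that corresponds to a key.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.is_key = False  # True only for nodes that correspond to a key
        self.value = None    # e.g. a postings list

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value=None):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True
        node.value = value

    def keys_with_prefix(self, prefix):
        # Follow the letters of the prefix, then collect every key below.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, path = stack.pop()
            if current.is_key:
                found.append(path)
            for ch, child in current.children.items():
                stack.append((child, path + ch))
        return sorted(found)

trie = Trie()
for key in ["A", "to", "tea", "ted", "ten", "in", "inn"]:
    trie.insert(key)
print(trie.keys_with_prefix("te"))  # ['tea', 'ted', 'ten']
print(trie.keys_with_prefix("in"))  # ['in', 'inn']
```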

Trie with postings

(Figure: the same trie, with a postings list attached to each node that corresponds to a key, e.g. 15 28 29 100 103 298 ...)

Wildcard queries

hel*
Find all docs containing any term beginning with “hel”
Easy with trie: follow letters h-e-l and then look up every term you find there

*hel
Find all docs containing any term ending with “hel”
Maintain an additional trie for terms backwards
Then retrieve all terms t in subtree rooted at l-e-h
In both cases:
This procedure gives us a set of terms that are matches for
wildcard query
Then retrieve documents that contain any of these terms
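
The two-step procedure above (expand the wildcard to a set of terms, then retrieve the documents) can be sketched as follows. Sorted lists stand in for the forward and backward tries, and the vocabulary and docIDs are hypothetical values chosen for this illustration.

```python
from bisect import bisect_left

def with_prefix(sorted_terms, prefix):
    start = bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]

# Hypothetical term -> docID postings.
postings = {"held": [1, 4], "hello": [2], "help": [2, 7],
            "michel": [3], "rachel": [3, 5], "shell": [6]}

forward = sorted(postings)                       # stands in for the trie
backward = sorted(term[::-1] for term in postings)  # trie of reversed terms

def trailing_wildcard(prefix):   # hel*
    return with_prefix(forward, prefix)

def leading_wildcard(suffix):    # *hel: search for "leh" among reversed terms
    return sorted(t[::-1] for t in with_prefix(backward, suffix[::-1]))

def documents_for(terms):
    # Documents that contain any of the matching terms.
    return sorted(set().union(*(postings[t] for t in terms)))

print(trailing_wildcard("hel"))                  # ['held', 'hello', 'help']
print(leading_wildcard("hel"))                   # ['michel', 'rachel']
print(documents_for(trailing_wildcard("hel")))   # [1, 2, 4, 7]
```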
How to handle * in the middle of a term

hel*o

We could look up “hel*” and “*o” in the tries as before and intersect the two term sets.
Expensive
Alternative: permuterm index
Basic idea: Rotate every wildcard query, so that the * occurs at the end.
Store each of these rotations in the dictionary (trie)

Permuterm index

For term hello: add

hello$, ello$h, llo$he, lo$hel, o$hell, $hello

to the trie, where $ is a special symbol

For hel*o, look up o$hel*

Problem: Permuterm more than quadruples the size of the dictionary compared to a normal trie (empirical number).
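
A minimal permuterm sketch, assuming a query with exactly one `*`. The function names and the toy vocabulary are assumptions of this example, and a plain dictionary scanned linearly stands in for the trie's prefix lookup.

```python
def rotations(term):
    # All rotations of term + "$", e.g. hello$, ello$h, ..., $hello.
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_permuterm_index(vocabulary):
    # rotation -> original term
    return {rot: term for term in vocabulary for rot in rotations(term)}

def rotate_query(query):
    # X*Y becomes Y$X, so that the * ends up at the end: hel*o -> o$hel(*)
    head, _, tail = query.partition("*")
    return tail + "$" + head

def permuterm_lookup(index, query):
    prefix = rotate_query(query)
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})

index = build_permuterm_index(["hello", "help", "halo", "helicopter", "hero"])
print(rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(permuterm_lookup(index, "hel*o"))  # ['hello']
print(permuterm_lookup(index, "h*o"))    # ['halo', 'hello', 'hero']
```

Each term of length n contributes n + 1 rotations, which is where the growth in dictionary size comes from.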

k-gram indexes

More space-efficient than permuterm index


Enumerate all character k-grams (sequence of k characters) occurring in a term
Bi-grams from “April is the cruelest month”:
$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$
Maintain an inverted index from k-grams to the terms that contain the k-gram
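
A short sketch of building such a k-gram index, with `$` marking the start and end of each term. The function names are assumptions of this example, and Python sets stand in for the postings lists of terms.

```python
from collections import defaultdict

def kgrams(term, k=2):
    padded = f"${term}$"  # $ marks the term boundaries
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(vocabulary, k=2):
    index = defaultdict(set)  # k-gram -> set of terms containing it
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

index = build_kgram_index(["april", "is", "the", "cruelest", "month"])
print(kgrams("april"))      # ['$a', 'ap', 'pr', 'ri', 'il', 'l$']
print(sorted(index["th"]))  # ['month', 'the']
```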

k-gram indexes

Note that we have two different kinds of inverted indexes:


The term-document inverted index for finding documents
based on a query consisting of terms
The k-gram index for finding terms based on a query
consisting of k-grams

Processing wildcard terms in a bigram index

Query hel* can now be run as:

$h AND he AND el

... but this will show up many false positives like heel.
Postfilter, then look up surviving terms in term–document
inverted index.
k-gram vs. permuterm index
k-gram index is more space-efficient
permuterm index does not require postfiltering.
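
The bigram lookup and the postfiltering step can be sketched as below. The vocabulary is a toy example, the function names are assumptions, and the regular expression used for postfiltering is just one possible way to check the candidates against the original wildcard pattern.

```python
import re
from collections import defaultdict

def kgrams(text, k=2):
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def build_kgram_index(vocabulary, k=2):
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(f"${term}$", k):
            index[gram].add(term)
    return index

def wildcard_lookup(index, query, k=2):
    # k-grams of the literal pieces: hel* -> "$hel" and "$" -> $h, he, el
    grams = [g for piece in f"${query}$".split("*") for g in kgrams(piece, k)]
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set().union(*index.values())
    # Postfilter: keep only the terms that really match the wildcard pattern.
    pattern = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    matches = sorted(t for t in candidates if pattern.fullmatch(t))
    return sorted(candidates), matches

index = build_kgram_index(["heel", "held", "hello", "help", "shell", "wheel"])
candidates, matches = wildcard_lookup(index, "hel*")
print(candidates)  # ['heel', 'held', 'hello', 'help']  (heel is a false positive)
print(matches)     # ['held', 'hello', 'help']
```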

Spelling correction

an asterorid that fell form the sky

In an IR system, spelling correction is only ever run on queries.


The general philosophy in IR is: don’t change the documents
(exception: OCR’ed documents)
Two different methods for spelling correction:
Isolated word spelling correction
Check each word on its own for misspelling
Will only attempt to catch first typo above
Context-sensitive spelling correction
Look at surrounding words
Should correct both typos above

Isolated word spelling correction

There is a list of “correct” words – for instance a standard dictionary (Webster’s, OED, ...)
Then we need a way of computing the distance between a misspelled word and a correct word
for instance Edit/Levenshtein distance
k-gram overlap
Return the “correct” word that has the smallest distance to the misspelled word.

informaton → information

Edit distance

Edit distance between two strings s₁ and s₂ is the minimum number of basic operations that transform s₁ into s₂.
Levenshtein distance: Admissible operations are insert, delete and replace
Levenshtein distance
dog – do: 1 (delete)
cat – cart: 1 (insert)
cat – cut: 1 (replace)
cat – act: 2 (delete+insert)
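
A compact sketch of Levenshtein distance with unit costs for insert, delete and replace. It keeps only two rows of the distance matrix at a time, which is a design choice of this example.

```python
def levenshtein(s1, s2):
    # previous[j] = distance between the current prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string: i deletes
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # copy (cost 0) or replace (cost 1)
            ))
        previous = current
    return previous[-1]

for a, b in [("dog", "do"), ("cat", "cart"), ("cat", "cut"), ("cat", "act")]:
    print(a, b, levenshtein(a, b))  # distances 1, 1, 1, 2
```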

Levenshtein distance: Distance matrix

      s  n  o  w
   0  1  2  3  4
o  1  1  2  2  3
s  2  1  2  3  3
l  3  2  2  3  4
o  4  3  3  2  3

Edit Distance: Four cells

           s      n      o      w
     0    1 1    2 2    3 3    4 4

o    1    1 2    2 3    2 4    4 5
     1    2 1    2 2    3 2    3 3

s    2    1 2    2 3    3 3    3 4
     2    3 1    2 2    3 3    4 3

l    3    3 2    2 3    3 4    4 4
     3    4 2    3 2    3 3    4 4

o    4    4 3    3 3    2 4    4 5
     4    5 3    4 3    4 2    3 3

Each cell of Levenshtein matrix

Upper left entry: Cost of getting here from my upper left neighbour (by copy or replace)
Upper right entry: Cost of getting here from my upper neighbour (by delete)
Lower left entry: Cost of getting here from my left neighbour (by insert)
Lower right entry: Minimum cost out of these

Dynamic Programming

Cormen et al.:
Optimal substructure: The optimal solution contains within it subsolutions, i.e., optimal solutions to subproblems
Overlapping subsolutions: The subsolutions overlap and would be computed over and over again by a brute-force algorithm.
For edit distance:
Subproblem: edit distance of two prefixes
Overlap: most distances of prefixes are needed 3 times (when moving right, diagonally, down in the matrix)
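
One way to see both properties is a top-down sketch in which each subproblem `d(i, j)` is the edit distance of the prefixes `s1[:i]` and `s2[:j]`. Memoisation with `functools.lru_cache` is the choice made for this illustration (the slides fill a matrix bottom-up instead); it ensures that every prefix pair is computed only once, even though it is requested by up to three neighbours.

```python
from functools import lru_cache

def edit_distance_memo(s1, s2):
    @lru_cache(maxsize=None)
    def d(i, j):
        # Subproblem: edit distance between the prefixes s1[:i] and s2[:j].
        if i == 0:
            return j
        if j == 0:
            return i
        return min(
            d(i - 1, j) + 1,                           # delete
            d(i, j - 1) + 1,                           # insert
            d(i - 1, j - 1) + (s1[i - 1] != s2[j - 1]),  # copy or replace
        )

    distance = d(len(s1), len(s2))
    # Cache misses = number of distinct subproblems actually computed.
    return distance, d.cache_info().misses

distance, subproblems = edit_distance_memo("oslo", "snow")
print(distance)     # 3
print(subproblems)  # at most (4 + 1) * (4 + 1) prefix pairs
```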

Example: Edit Distance oslo – snow

The matrix is filled in row by row, one cell at a time:

           s      n      o      w
     0    1 1    2 2    3 3    4 4

o    1    1 2    2 3    2 4    4 5
     1    2 1    2 2    3 2    3 3

s    2    1 2    2 3    3 3    3 4
     2    3 1    2 2    3 3    4 3

l    3    3 2    2 3    3 4    4 4
     3    4 2    3 2    3 3    4 4

o    4    4 3    3 3    2 4    4 5
     4    5 3    4 3    4 2    3 3

Edit distance oslo–snow is 3!

How do I read out the editing operations that transform oslo into snow?

cost  operation  input  output
1     delete     o      *
0     (copy)     s      s
1     replace    l      n
0     (copy)     o      o
1     insert     *      w
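
Reading out the operations amounts to walking back from the bottom right cell to the top left one. The sketch below is one possible implementation; the function names and the tie-breaking order (diagonal first, then insert, then delete) are assumptions of this example, and other orders can yield a different but equally cheap sequence.

```python
def edit_matrix(s1, s2):
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                          d[i - 1][j] + 1,
                          d[i][j - 1] + 1)
    return d

def read_out_operations(s1, s2):
    d = edit_matrix(s1, s2)
    i, j = len(s1), len(s2)
    ops = []
    while i > 0 or j > 0:
        cost = int(i > 0 and j > 0 and s1[i - 1] != s2[j - 1])
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            ops.append((cost, "replace" if cost else "(copy)", s1[i - 1], s2[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append((1, "insert", "*", s2[j - 1]))
            j -= 1
        else:
            ops.append((1, "delete", s1[i - 1], "*"))
            i -= 1
    return ops[::-1]  # the walk runs backwards, so reverse it

for op in read_out_operations("oslo", "snow"):
    print(*op)
```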

Using edit distance for spelling correction

Given a query, enumerate all character sequences within a preset edit distance
Intersect this list with our list of “correct” words
Suggest terms in the intersection to user.
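
A sketch of this enumerate-and-intersect approach, assuming a lower-case ASCII alphabet and the three Levenshtein operations. The function names and the toy word list are assumptions of this example. The number of generated strings grows quickly with the preset distance, so small values are advisable.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    # All strings exactly one insert, delete or replace away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | replaces | inserts) - {word}

def suggest(word, correct_words, max_distance=1):
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        # Extend every newly found string by one more edit.
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    # Intersect the enumerated strings with the list of "correct" words.
    return sorted(seen & set(correct_words))

correct_words = {"information", "informal", "formation"}
print(suggest("informaton", correct_words))  # ['information']
```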

k-gram indexes for spelling correction
Enumerate all k-grams in the query term
Misspelled word bordroom
bo – or – rd – dr – ro – oo – om
Use k-gram index to retrieve “correct” words that match query term k-grams
Threshold by number of matching k-grams
E.g., only vocabulary terms that differ by at most 3 k-grams

bo → aboard → about → boardroom → border
or → border → lord → morbid → sordid
rd → aboard → ardent → boardroom → border
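
A sketch of k-gram overlap for the bordroom example. It thresholds on the number of shared bigrams (a variant of the "differ by at most" rule above, chosen to keep the example short), uses bigrams without `$` padding as on the slide, and takes its vocabulary from the three postings lists shown. The function names are assumptions.

```python
from collections import Counter, defaultdict

def bigrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)}

def build_bigram_index(vocabulary):
    index = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return index

def candidates(query, index, min_shared):
    # Count, for every vocabulary term, how many query bigrams it shares.
    counts = Counter()
    for gram in bigrams(query):
        for term in index.get(gram, ()):
            counts[term] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(term, n) for term, n in ranked if n >= min_shared]

index = build_bigram_index(["aboard", "about", "boardroom", "border",
                            "lord", "morbid", "sordid", "ardent"])
print(candidates("bordroom", index, min_shared=3))
# [('boardroom', 6), ('border', 3)]
```

The surviving candidates could then be ranked by edit distance to the query term.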

Context-sensitive Spelling correction
One idea: hit-based spelling correction

flew form munich


Retrieve correct terms close to each query term

flew → flea
form → from
munich → munch
Holding all other terms fixed, try all possible phrase queries
for each replacement candidate

flea form munich – 62 results
flew from munich – 78900 results
flew form munch – 66 results
Not efficient. Better source of information: large corpus of queries, not documents
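
The hit-based idea can be sketched as follows. The `hit_count` function is a hypothetical stand-in for running a phrase query against the index; here it is backed by a toy table holding only the three result counts from the example above, with every other query defaulting to 0.

```python
def hit_based_correction(query_terms, alternatives, hit_count):
    # Replace one term at a time, holding all other terms fixed.
    candidates = []
    for i, term in enumerate(query_terms):
        for alt in alternatives.get(term, []):
            candidates.append(query_terms[:i] + [alt] + query_terms[i + 1:])
    # Keep the candidate phrase with the most hits.
    return max(candidates,
               key=lambda terms: hit_count(" ".join(terms)),
               default=list(query_terms))

# Toy stand-in for the phrase-query hit counts (values from the example).
toy_hits = {"flea form munich": 62,
            "flew from munich": 78900,
            "flew form munch": 66}

def hit_count(phrase):
    return toy_hits.get(phrase, 0)

alternatives = {"flew": ["flea"], "form": ["from"], "munich": ["munch"]}
print(hit_based_correction(["flew", "form", "munich"], alternatives, hit_count))
# ['flew', 'from', 'munich']
```

Every candidate costs one phrase query, which is why the approach is not efficient.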
General issues in spelling correction

User interface
automatic vs. suggested correction
“Did you mean” only works for one suggestion; what about
multiple possible corrections?
Tradeoff: Simple UI vs. powerful UI
Cost
Potentially very expensive
Avoid running on every query
Maybe just those that match few documents

Takeaway

What to do if there is no exact match between query term and document term
Data structures for tolerant retrieval:
Dictionary as hash, B-tree or trie
k-gram index and permuterm for wildcards
k-gram index and edit-distance for spelling correction

Reading

Wikipedia article “trie”
MRS chapter 3.1, 3.2, 3.3
