Lect 3 Inverted Index
Dictionary and postings
Postings are sorted by docID (more later on why).
Inverted Index (Construction)
Documents to be indexed: Friends, Romans, countrymen.
Stages, in order: Tokenizer, Linguistic modules, Indexer.
Resulting inverted index:
friend 2 4
roman 1 2
countryman 13 16
Initial stages of text processing
Tokenization
Cut character sequence into word tokens
Deal with cases such as “John’s” and “a state-of-the-art solution”
Normalization
Map text and query terms to the same form
You want U.S.A. and USA to match
Stemming
We may wish different forms of a root to match
authorize, authorization
Stop words
We may omit very common words (or not)
the, a, to, of
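The four stages above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a production analyzer: the function names (tokenize, normalize, stem, preprocess) are hypothetical, the stop list is just the four words listed above, and the stemmer is a toy suffix stripper, not the Porter algorithm.

```python
import re

# Stop list taken from the examples above; real systems use longer lists (or none).
STOP_WORDS = {"the", "a", "to", "of"}


def tokenize(text):
    """Cut a character sequence into word tokens.

    Internal periods, apostrophes and hyphens are kept, so "John's",
    "U.S.A" and "state-of-the-art" each come out as a single token.
    """
    return re.findall(r"[A-Za-z0-9]+(?:[.'’-][A-Za-z0-9]+)*", text)


def normalize(token):
    """Map a token to a canonical form so that U.S.A. and USA match."""
    token = token.lower().replace(".", "").replace("’", "'")
    if token.endswith("'s"):  # drop the possessive marker
        token = token[:-2]
    return token


def stem(token):
    """Toy stemmer (assumption): strips a few suffixes, keeps at least 3 letters."""
    for suffix in ("ization", "ize", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, drop_stop_words=True):
    """Run tokenization, normalization, stop-word removal and stemming."""
    terms = []
    for token in tokenize(text):
        term = normalize(token)
        if drop_stop_words and term in STOP_WORDS:
            continue
        terms.append(stem(term))
    return terms


print(preprocess("U.S.A."), preprocess("USA"))
print(preprocess("authorize authorization"))
```

The same preprocess function has to be applied to both documents and queries, otherwise the terms would not match. Note that the toy stemmer leaves "countrymen" unchanged; mapping it to "countryman" needs a real linguistic module.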
Indexer steps: Token sequence
Sequence of (term, docID) pairs, here taken from Doc 1 and Doc 2
Sort by term, and then by docID (at least conceptually)
Each term's document frequency is stored alongside it.
Why frequency? Will discuss later.
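The indexer steps above can be sketched as follows: collect (term, docID) pairs, sort them by term and then docID, and merge duplicates into postings lists with a document frequency. This is a rough in-memory sketch; the name build_index and the two toy documents are assumptions made for illustration, not the contents of Doc 1 and Doc 2 from the slides.

```python
from itertools import groupby


def build_index(docs):
    """Build an inverted index from {docID: [term, ...]}.

    Returns {term: (document_frequency, [docID, ...])} with terms in
    sorted order and every postings list sorted by docID.
    """
    # Step 1: the token sequence as (term, docID) pairs.
    pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]

    # Step 2: sort by term, then by docID (tuples compare element by element).
    pairs.sort()

    # Step 3: merge repeated entries and record the document frequency.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:  # skip duplicates within a doc
                postings.append(doc_id)
        index[term] = (len(postings), postings)
    return index


# Hypothetical toy documents, already preprocessed into terms.
toy_docs = {
    1: ["caesar", "brutus", "caesar"],
    2: ["brutus", "calpurnia"],
}
for term, (df, postings) in build_index(toy_docs).items():
    print(term, df, postings)
```

Sorting all pairs in memory is only the conceptual version; a large collection would not fit, which is why the slides say "at least conceptually".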
Where do we pay in storage?
Terms and counts
Pointers
Lists of docIDs
IR system implementation
• How do we index efficiently?
• How much storage do we need?
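To make the storage question concrete, the sketch below counts the items that have to be stored for a small index: one term, one count and one pointer per dictionary entry, plus one docID per postings entry. The function name storage_summary is hypothetical, and the postings are the Brutus, Caesar and Calpurnia lists used as examples in these notes. It counts items only; the number of bytes depends on the encoding, which is not assumed here.

```python
# Postings lists from the Brutus / Caesar / Calpurnia example in these notes.
example_index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16],
}


def storage_summary(index):
    """Count the stored items in an index given as {term: [docID, ...]}."""
    num_terms = len(index)
    num_postings = sum(len(postings) for postings in index.values())
    return {
        "terms": num_terms,      # dictionary: one string per term
        "counts": num_terms,     # dictionary: one frequency per term
        "pointers": num_terms,   # dictionary: one pointer to each postings list
        "docIDs": num_postings,  # postings: one docID per entry
    }


print(storage_summary(example_index))
```

In a real collection the docID lists dominate, since each term typically occurs in many documents.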
Inverted Index
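For Boolean retrieval, an inverted index works much better than scanning the documents or a term-document incidence matrix at query time, because it stores only the documents in which each term actually occurs.
Sort-based index construction is efficient because, once the (term, docID) pairs are sorted, each postings list can be built in a single pass over the sorted pairs.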
Query processing with an inverted index
How do we process a query? This is our focus.
Example query: Brutus AND Caesar
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 13 21 34
The merge
Walk through the two postings simultaneously, in time linear
in the total number of postings entries
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 13 21 34
Intersection 2 8
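The merge can be sketched as a two-pointer walk over the two sorted postings lists. The function name intersect is a choice made here for illustration; the sketch assumes both lists are sorted by docID and contain no duplicates, which is why the postings are kept in docID order.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID.

    Each step advances at least one pointer, so the running time is
    linear in len(p1) + len(p2).
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind, advance it
        else:
            j += 1  # p2 is behind, advance it
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # prints [2, 8]
```

If the lists were not sorted, each docID in one list would have to be searched for in the other, and the linear bound would be lost.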
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 13 21 34
Calpurnia 13 16