2. Boolean Retrieval Model
Boolean Models
Information Retrieval and Search Engines
Cam-Tu Nguyen, Ph.D
Email: [email protected]
Term-document incidence matrix (Shakespeare's collected works): entry 1 means the term occurs in the play.

            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony          1         1        0         0        0         1
Brutus          1         1        0         1        0         0
Caesar          1         1        0         1        1         1
Calpurnia       0         1        0         0        0         0
Cleopatra       1         0        0         0        0         0
mercy           1         0        1         1        1         1
worser          1         0        1         1        1         0
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors for Brutus, Caesar, and Calpurnia (complemented) → bitwise AND.
• 110100 AND 110111 AND 101111 = 100100
• Result 100100 → the plays Antony and Cleopatra and Hamlet (the 1st and 4th columns of the matrix).
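A minimal Python sketch of this bitwise evaluation, with the term vectors hard-coded from the incidence matrix above (the variable names are mine, not part of the original slides):

```python
# 0/1 incidence vectors, one bit per play, in the column order of the matrix:
# Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia
mask = (1 << 6) - 1                            # 6 plays -> 6-bit complement mask
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))                   # 100100

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
print([plays[i] for i in range(6) if result & (1 << (5 - i))])
# ['Antony and Cleopatra', 'Hamlet']
```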
Can’t build the matrix
• Consider N = 1 million documents, each with about 1,000 words.
• Average 6 bytes/word including spaces/punctuation → 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.
• A 500K × 1M term-document matrix has half a trillion 0's and 1's, but at most one billion 1's (one per token): the matrix is extremely sparse.
• Better: record only the positions of the 1's, which is exactly what the inverted index does.
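A quick back-of-the-envelope check of these numbers (plain arithmetic, nothing beyond the figures on this slide):

```python
N = 1_000_000          # documents
L = 1_000              # words per document (average)
B = 6                  # bytes per word, including spaces/punctuation
M = 500_000            # distinct terms

print(N * L * B)       # 6_000_000_000 bytes -> about 6 GB of raw text
print(M * N)           # 500_000_000_000 cells in the term-document matrix
print(N * L)           # at most 1_000_000_000 ones (one per token) -> very sparse
```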
Inverted index construction (pipeline figure): documents → Tokenizer (token stream) → Linguistic modules (modified tokens) → Indexer → Inverted index, e.g. friend → 2, 4; roman → 1, 2; countryman → 13, 16.
Questions
• Text processing includes tokenization, normalization, stemming and lemmatization, and removal of stop words.
• Can you explain each stage of text processing?
• In what cases can omitting stop words from the index cause harm?
Initial stages of text processing
• Tokenization
• Cut character sequence into word tokens
• Special tokens: C++, C#, M*A*S*H (T.V. show name)
• Normalization
• Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming and Lemmatization
• We may wish different forms of a root to match
• authorize, authorization
• Stop words
• We may omit very common words (or not)
• the, a, to, of
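A toy sketch of these stages in Python. Everything here is a simplification for illustration: the tokenizer, the dot-stripping normalizer, the crude suffix "stemmer", and the four-word stop list are all made up, not real linguistic modules.

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}          # tiny illustrative stop list

def tokenize(text):
    # Cut the character sequence into word tokens; the pattern keeps in-token
    # ., +, #, * so that special tokens like C++, C# or M*A*S*H survive.
    return re.findall(r"[A-Za-z0-9]+(?:[.+#*][A-Za-z0-9]*)*", text)

def normalize(token):
    # Map text and query terms to the same form, e.g. U.S.A. -> usa.
    return token.lower().replace(".", "")

def stem(token):
    # Crude suffix stripping so that different forms of a root match,
    # e.g. authorize -> author, authorization -> author.
    for suffix in ("ization", "ation", "ize", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = [normalize(t) for t in tokenize(text)]
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The U.S.A. authorization of friends"))
# ['usa', 'author', 'friend']
```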
Inverted index
• For each term t, we must store a list of all documents that contain t.
• We need variable-size postings lists
• On disk, a continuous run of postings is normal and best
• In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion
• Terminology (from the figure): the dictionary holds the terms; each term points to its postings list, and each docID in that list is a posting.
• Postings lists are sorted by docID (more later on why).
Questions
• How to build the inverted index for a collection of documents?
• How to store the dictionary? How to store the postings?
• How much space do we need for an inverted index?
Indexer steps (example)
• Two example documents, Doc 1 and Doc 2, are tokenized into (term, docID) pairs (figures omitted).
• The pairs are sorted alphabetically by term, then by docID.
• Multiple entries of a term from the same document are merged, and the document frequency of each term is recorded. Why frequency? Will discuss later.
• The result is split into a dictionary (terms and counts) and the postings (lists of docIDs), with a pointer from each dictionary entry to its postings list.

IR system implementation
• How do we index efficiently?
• How much storage do we need?
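A minimal sketch of this construction in Python over a toy two-document collection (the texts and the simple tokenizer below are placeholders, not the exact Doc 1/Doc 2 of the slides): generate (term, docID) pairs, sort them, then group them into postings lists and a dictionary of document frequencies.

```python
from collections import defaultdict

docs = {   # toy collection: docID -> text
    1: "I did enact Julius Caesar: I was killed in the Capitol",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def build_index(docs):
    # 1. Tokenize each document into (term, docID) pairs (simplified tokenizer).
    pairs = [(token.lower().strip(".,:;"), doc_id)
             for doc_id, text in docs.items()
             for token in text.split()]
    # 2. Sort by term, then by docID.
    pairs.sort()
    # 3. Merge duplicates into docID-sorted postings lists.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    # 4. The dictionary records each term's document frequency.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)

dictionary, postings = build_index(docs)
print(postings["caesar"])      # [1, 2]
print(dictionary["caesar"])    # 2
```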
Questions
• How do we process a query based on the inverted index?
Query processing: AND
• Consider the query Brutus AND Caesar.
• Retrieve each term's postings list from the dictionary:
  Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
• "Merge" (intersect) the two postings lists.
Intersecting two postings lists
(a “merge” algorithm)
The merge
• Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:
  Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
  Result → 2 → 8
• If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: the postings are sorted by docID.
• Exercise: adapt the merge to queries such as Brutus AND NOT Caesar. Can we still run through the merge in time O(x+y)? What can we achieve?
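A runnable Python version of this merge, intersecting two docID-sorted postings lists in one simultaneous walk (the function name is mine):

```python
def intersect(p1, p2):
    """Intersect two postings lists, each a list of docIDs sorted in
    increasing order. Runs in O(x + y) time for lists of length x and y."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```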
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
• Can we always merge in “linear” time?
• Linear in what?
• Can we do better?
Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.
• Example: the query Brutus AND Caesar AND Calpurnia, with postings lists
  Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar → 1 → 2 → 3 → 5 → 8 → 16 → 21 → 34
  Calpurnia → 13 → 16
• Heuristic: process the terms in order of increasing document frequency: start with the smallest postings list (here Calpurnia → 13 → 16), then intersect with the next smallest, and so on.
• This is why we keep document frequency in the dictionary: the terms can be ordered without reading any postings.
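A sketch of this ordering heuristic in Python, reusing the intersect() function from the merge sketch above; the small in-memory index dictionary here is hypothetical, for illustration only.

```python
def intersect_query(index, query_terms):
    """AND query: intersect all terms' postings, shortest (lowest df) first.
    `index` maps term -> docID-sorted postings list."""
    terms = sorted(query_terms, key=lambda t: len(index.get(t, [])))
    if not terms:
        return []
    result = index.get(terms[0], [])
    for term in terms[1:]:
        result = intersect(result, index.get(term, []))   # from the sketch above
        if not result:            # early exit once the intersection is empty
            break
    return result

index = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "calpurnia": [13, 16],
}
print(intersect_query(index, ["brutus", "caesar", "calpurnia"]))  # [16]
```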
Phrase queries
• We want to be able to answer queries such as “stanford university” –
as a phrase
• Thus the sentence “I went to university at Stanford” is not a match.
• The concept of phrase queries has proven easily understood by users; one of
the few “advanced search” ideas that works
• Many more queries are implicit phrase queries
• For this, it no longer suffices to store only
<term : docs> entries
A first attempt: biword indexes
• Index every consecutive pair of terms in the text as a phrase.
• Example: the text "Friends, Romans, Countrymen" generates the biwords friends romans and romans countrymen; each biword becomes a term in the dictionary.
• Two-word phrase queries can now be processed immediately.
• Longer phrases are broken into biwords: "stanford university palo alto" becomes the Boolean query stanford university AND university palo AND palo alto.
• Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.
Extended biwords
• Parse the indexed text and perform part-of-speech-tagging (POST).
• Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
• Call any string of terms of the form NX*N an extended biword.
• Each such extended biword is now made a term in the dictionary.
• Example: catcher in the rye → N X X N → extended biword catcher rye
• Query processing: parse it into N’s and X’s
• Segment query into enhanced biwords
• Look up in index: catcher rye
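A small sketch of this query segmentation in Python; the binary N/X tagging is supplied by hand here, since a real part-of-speech tagger is outside the scope of the sketch.

```python
def extended_biwords(tagged_terms):
    """Segment a tagged term sequence into extended biwords: every pair of
    nouns (N) separated only by articles/prepositions (X) becomes one term.
    `tagged_terms` is a list of (term, tag) pairs with tag in {"N", "X"}."""
    nouns = [(i, term) for i, (term, tag) in enumerate(tagged_terms) if tag == "N"]
    biwords = []
    for (i, first), (j, second) in zip(nouns, nouns[1:]):
        # Accept the pair only if everything between the two nouns is tagged X.
        if all(tag == "X" for _, tag in tagged_terms[i + 1:j]):
            biwords.append(f"{first} {second}")
    return biwords

print(extended_biwords([("catcher", "N"), ("in", "X"), ("the", "X"), ("rye", "N")]))
# ['catcher rye']
```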
• Biword indexes are not the standard solution (for all biwords) but can
be part of a compound strategy
Solution 2: positional indexes
• In the postings, store for each term the position(s) at which it appears in each document:
  <term, document frequency; doc1: position1, position2, … ; doc2: position1, position2, … ; …>
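As a concrete picture of this layout, here is a positional index fragment written as a Python dictionary; the terms, docIDs, and positions are invented for illustration.

```python
# term -> (document frequency, {docID: sorted positions of the term in that doc})
positional_index = {
    "to": (2, {1: [4, 9, 20], 7: [3, 14]}),
    "be": (2, {1: [5, 21],    7: [15, 40]}),
}
```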
Proximity queries
• employment /3 place
• Again, /k here means "within k words (on either side)".
• Clearly, positional indexes can be used for such queries; biword
indexes cannot.
• Exercise: Adapt the linear merge of postings to handle proximity
queries. Can you make it work for any value of k?
• This is a little tricky to do correctly and efficiently
• See Figure 2.12 of IIR
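As a starting point for the exercise, here is a simplified Python sketch of a positional merge for the "within k words" test. It is not the optimized algorithm of Figure 2.12 (it compares all position pairs per document instead of keeping a sliding window), and the index fragment is invented.

```python
def proximity_intersect(pos1, pos2, k):
    """Docs where the two terms occur within k words of each other.
    pos1, pos2 map docID -> sorted list of token positions for one term."""
    answer = {}
    for doc in sorted(pos1.keys() & pos2.keys()):   # docs containing both terms
        hits = [(a, b) for a in pos1[doc] for b in pos2[doc] if abs(a - b) <= k]
        if hits:
            answer[doc] = hits
    return answer

index = {   # hypothetical positional postings: term -> {docID: positions}
    "employment": {1: [3, 40], 2: [12]},
    "place":      {1: [5, 90], 2: [60]},
}
# employment /3 place
print(proximity_intersect(index["employment"], index["place"], k=3))
# {1: [(3, 5)]}
```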
Rules of thumb
• A positional index is 2–4 times as large as a non-positional index
Combination schemes
• These two approaches can be profitably combined
• For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to
keep on merging positional postings lists
• Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more sophisticated mixed indexing
scheme
• A typical web query mixture was executed in about ¼ of the time needed with just a positional index
• It required 26% more space than having a positional index alone
Read More
• Chapters 1 and 2, IIR (Introduction to Information Retrieval)
• Chapter 1, SE
Acknowledgements
Many slides in this section are adapted from the slides of Prof. Christopher Manning (Stanford).