2-Boolean IR and Indexing
M. Soleymani
Spring 2024
Most slides have been adapted from the lectures of Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Boolean retrieval model
} Query: Boolean expressions
} Boolean queries use AND, OR and NOT to join query terms
The classic search model
[Figure: a user task (“get rid of mice in a politically correct way”) gives rise to an info need (“info about removing mice without killing them”); a misconception can arise between task and info need, and a misformulation between info need and query; the query is sent to the search engine, which searches the corpus and returns results, which may prompt query refinement.]
Example: Plays of Shakespeare
Term-document incidence matrix
              Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                 1                  1             0          0       0        1
Brutus                 1                  1             0          1       0        0
Caesar                 1                  1             0          1       1        1
Calpurnia              0                  1             0          0       0        0
Cleopatra              1                  0             0          0       0        0
mercy                  1                  0             1          1       1        1
worser                 1                  0             1          1       1        0

Entry is 1 if the play contains the word, 0 otherwise.
Incidence vectors
} So we have a 0/1 vector for each term.
} Brutus AND Caesar but NOT Calpurnia
} Using the incidence matrix above (columns in the order Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth):
} Take the vectors for Brutus (110100) and Caesar (110111), complement the vector for Calpurnia (010000 → 101111), and bitwise-AND them:
} 110100 AND 110111 AND 101111 = 100100
} Answer: Antony and Cleopatra, and Hamlet
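A minimal sketch of answering this query with bitwise operations on the incidence vectors above (variable names are mine, for illustration):

```python
# Incidence vectors from the Shakespeare matrix above, one bit per play:
# (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Brutus AND Caesar AND NOT Calpurnia, as bitwise operations.
mask = (1 << len(plays)) - 1          # keep only the 6 play bits after NOT
answer = vectors["Brutus"] & vectors["Caesar"] & (~vectors["Calpurnia"] & mask)

# Read the answer bits back as play titles (leftmost bit = first play).
hits = [play for i, play in enumerate(plays)
        if answer & (1 << (len(plays) - 1 - i))]
print(hits)   # ['Antony and Cleopatra', 'Hamlet']
```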
Bigger collections
} Number of docs: N = 10^6 (one million)
} Average length of a doc ≈ 1000 words
} No. of distinct terms: M = 500,000
} Average length of a word ≈ 6 bytes (including spaces/punctuation)
} So the raw collection is about 10^6 × 1000 × 6 bytes = 6 GB of data
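A quick sanity check of the back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope size of the raw text collection.
num_docs = 10**6          # N
words_per_doc = 1000      # average document length in words
bytes_per_word = 6        # average, including spaces/punctuation

total_bytes = num_docs * words_per_doc * bytes_per_word
print(total_bytes, "bytes =", total_bytes / 10**9, "GB")   # 6000000000 bytes = 6.0 GB
```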
Inverted index
} For each term t, store a list of all docs that contain t.
} Identify each by a docID, a document serial number
Inverted index
} We need variable-size postings lists
} On disk, a continuous run of postings is normal and best
} In memory, can use linked lists or variable length arrays
} Some tradeoffs in size/ease of insertion
[Figure: the dictionary of terms on the left, each entry pointing to its postings list; each element of a postings list is a posting; postings are sorted by docID.]
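A minimal in-memory sketch, assuming a plain Python dict of variable-length lists (names and example docs are mine, not the lecture's):

```python
from collections import defaultdict

# term -> sorted list of docIDs (the postings list)
index = defaultdict(list)

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

for doc_id in sorted(docs):                 # visit docs in docID order
    seen = set()
    for term in docs[doc_id].split():       # trivial whitespace tokenizer
        if term not in seen:                # record each (term, doc) pair once
            index[term].append(doc_id)      # appends keep the list sorted by docID
            seen.add(term)

print(index["home"])    # [1, 2, 3]
print(index["july"])    # [2, 3]
```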
Indexer steps
} Tokenizer: turn each document (Doc 1, Doc 2, ...) into a stream of (term, docID) pairs
} Sort by terms
} And then docID
} Add frequency counts (why frequency? will discuss later)
} Split the result into a dictionary (terms and counts, with pointers into the postings) and postings (lists of docIDs)
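A compact sketch of these steps (the function name `build_index` and the two abbreviated example docs are mine, for illustration):

```python
from itertools import groupby

def build_index(docs):
    """docs: dict mapping docID -> text. Returns (dictionary, postings)."""
    # Step 1: token sequence, a list of (term, docID) pairs
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.lower().split()]

    # Step 2: sort by term, then by docID
    pairs.sort()

    # Step 3: merge into a dictionary (term -> doc frequency)
    #         and postings (term -> sorted list of docIDs)
    dictionary, postings = {}, {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        doc_ids = sorted({doc_id for _, doc_id in group})
        dictionary[term] = len(doc_ids)
        postings[term] = doc_ids
    return dictionary, postings

dictionary, postings = build_index({
    1: "I did enact Julius Caesar",
    2: "So let it be with Caesar",
})
print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]
```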
A naïve dictionary
} An array of struct:
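The struct itself is not reproduced on the slide; a hedged sketch of what “an array of struct” could look like (field names are mine): each entry holds the term, its document frequency, and a reference to its postings list.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DictEntry:
    term: str                                           # the term itself
    doc_freq: int                                       # number of docs containing the term
    postings: List[int] = field(default_factory=list)   # docID-sorted postings list

# The naive dictionary is then just an array of such entries.
dictionary = [
    DictEntry("brutus",    2, [1, 2]),
    DictEntry("caesar",    2, [1, 2]),
    DictEntry("calpurnia", 1, [2]),
]
```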
} Consider processing the query Brutus AND Caesar: locate each term in the dictionary and retrieve its postings
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
The merge
} Walk through the two postings lists simultaneously, in time linear in the total number of postings entries
} This works because the postings are sorted by docID
Brutus: 2 4 8 41 48 64 128
Caesar: 1 2 3 8 11 17 21 31
Intersection: 2 8
Intersecting two postings lists
(a “merge” algorithm)
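The slide's pseudocode is not reproduced above; here is a Python sketch of the same two-pointer merge (the function name is mine):

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))   # [2, 8]
```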
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
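One straightforward (not necessarily optimal) way to evaluate such a formula is with set operations over the postings lists; the postings below are hypothetical:

```python
# Hypothetical postings lists, each a set of docIDs.
postings = {
    "brutus":    {1, 2, 4, 11, 31},
    "caesar":    {1, 2, 3, 5, 8},
    "antony":    {2, 31},
    "cleopatra": {1, 173},
}
all_docs = set().union(*postings.values())   # stand-in for the full collection

# (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
left = postings["brutus"] | postings["caesar"]
right = postings["antony"] | postings["cleopatra"]
result = left & (all_docs - right)

print(sorted(result))   # docIDs matching the formula
```

In practice, AND NOT is usually handled by a variant of the merge that skips docIDs present in the negated list, rather than materializing the complement of the whole collection.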
Query optimization
} What is the best order for query processing?
} Consider a query that is an AND of 𝑛 terms.
} For each of the 𝑛 terms, get its postings, then AND
them together.
Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 16 21 34
Calpurnia: 13 16
} Query: Brutus AND Calpurnia AND Caesar
} Process terms in order of increasing document frequency: start with the shortest postings list and keep cutting the intermediate result further
} Execute the query as (Calpurnia AND Brutus) AND Caesar
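A sketch of this heuristic, reusing the `intersect` merge from the earlier sketch (names are mine):

```python
def intersect_query(terms, postings):
    """AND together the postings of all terms, processing the rarest term first."""
    # Order terms by (estimated) document frequency = length of postings list.
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    result = postings[ordered[0]]
    for term in ordered[1:]:
        result = intersect(result, postings[term])  # two-pointer merge from above
        if not result:                              # intermediate result is empty:
            break                                   # no doc can match the full AND
    return result

postings = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "calpurnia": [13, 16],
}
print(intersect_query(["brutus", "calpurnia", "caesar"], postings))   # [16]
```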
Summary of Boolean IR:
Advantages of exact match
} It can be implemented very efficiently
} Works well when you know exactly (or roughly) what the collection contains and what you’re looking for
Summary of Boolean IR:
Disadvantages of the Boolean Model
} Query formulation (as a Boolean expression) is difficult for most users
} Most users write overly simplistic Boolean queries
} AND and OR sit at opposite extremes of the precision/recall tradeoff
} The result is usually either too few or too many docs in response to a query
Ranking results in advanced IR models
} Boolean queries give only inclusion or exclusion of docs
} The result of a query in the Boolean model is an unordered set of docs; advanced IR models instead rank the matching docs
Phrase and proximity queries:
positional indexes
Phrase queries
} Example: “stanford university”
} “I went to university at Stanford” is not a match.
Approaches for phrase queries
} Biword indexes
} Positional indexes
Biword indexes
} Index every consecutive pair of terms in the text as a
phrase
} E.g., the doc “Friends, Romans, Countrymen” would generate these biwords:
} “friends romans”, “romans countrymen”
} Each biword is then a term in the dictionary, so two-word phrase queries can be answered directly (see the sketch below)
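A tiny sketch of biword generation (the helper name is mine):

```python
def biwords(text):
    """Return the consecutive term pairs ("biwords") of a text."""
    terms = [t.strip(",.").lower() for t in text.split()]   # crude normalization
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```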
Positional index
} In the postings, store for each term the position(s) at which its tokens appear:

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 …>

} Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
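As a sketch, such an entry could be held in memory as a nested mapping (the entry for “to” below is invented for illustration; real indexes use compressed on-disk layouts):

```python
# term -> {docID: [positions of that term's tokens within the doc]}
positional_index = {
    "be": {
        1: [7, 18, 33, 72, 86, 231],
        2: [3, 149],
        4: [17, 191, 291, 430, 434],
    },
    "to": {                      # positions invented for illustration
        1: [5, 17, 70],
        4: [16, 190, 429],
    },
}
```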
Positional index
} For phrase queries, we use a merge algorithm recursively at the doc level: intersect the doc lists first, then check within each candidate doc that the terms’ positions line up (e.g., differ by exactly 1 for adjacent words)
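A simplified sketch for a two-word phrase such as “stanford university” (the helper name `phrase_match` is mine; a full positional intersect also supports proximity queries):

```python
def phrase_match(index, term1, term2):
    """Return docIDs where term2 appears immediately after term1."""
    docs1, docs2 = index.get(term1, {}), index.get(term2, {})
    answer = []
    for doc_id in sorted(set(docs1) & set(docs2)):   # doc-level intersection
        positions2 = set(docs2[doc_id])
        # position-level check: term1 at position p, term2 at position p + 1
        if any(p + 1 in positions2 for p in docs1[doc_id]):
            answer.append(doc_id)
    return answer

index = {   # hypothetical positional postings: term -> {docID: [positions]}
    "stanford":   {1: [3, 40], 2: [7]},
    "university": {1: [4, 12], 2: [2]},
}
print(phrase_match(index, "stanford", "university"))   # [1]
```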
Phrase queries: Combination schemes
} Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
} it needs, on average, ¼ of the time of using just a positional index
} but 26% more space than a positional index alone
Resources
} IIR, Chapter 1
} IIR, Chapter 2