Lecture 3 Phrase Queries and Positional Index
Phrase queries
• We want to be able to answer queries such as
“stanford university” – as a phrase
• Thus the sentence “The inventor Stanford
never went to university” is not a match.
– Many more queries are implicit phrase queries
• For this, it no longer suffices to store only
<term : docs> entries
Solution 1: Biword indexes
• Index every consecutive pair of terms in the text
as a phrase
• For example the text “Friends, Romans,
Countrymen” would generate the biwords
– friends romans
– romans countrymen
• Each of these biwords is now a dictionary term
• Two-word phrase query-processing is now
immediate.
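A minimal sketch of how biwords might be generated and indexed. The tokenizer (lowercasing, splitting on non-alphanumerics) and the names biwords and build_biword_index are assumptions made for illustration, not part of the lecture.

```python
import re
from collections import defaultdict

def biwords(text):
    """Return every consecutive pair of terms in `text` as a biword string."""
    # Assumed tokenizer: lowercase, keep runs of letters and digits.
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def build_biword_index(docs):
    """Map each biword (now a dictionary term) to the sorted doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return {bw: sorted(ids) for bw, ids in index.items()}

# Toy collection (doc 2 is a made-up example).
docs = {1: "Friends, Romans, Countrymen", 2: "Romans and countrymen"}
index = build_biword_index(docs)

print(biwords(docs[1]))           # ['friends romans', 'romans countrymen']
print(index["romans countrymen"])  # [1]
```

A two-word phrase query then becomes a single dictionary lookup, for example index.get("friends romans", []).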
Longer phrase queries
• Longer phrases can be processed by breaking
them down
• stanford university palo alto can be broken into
the Boolean query on biwords:
stanford university AND university palo AND palo
alto
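The decomposition above can be sketched as follows. The biword postings in biword_index are invented for illustration, and the function names are hypothetical. Note that the AND of biwords only guarantees that each pair occurs somewhere in the document, so the result is best treated as a candidate set that may contain false positives.

```python
def phrase_to_biwords(phrase):
    """Split a phrase into its consecutive biwords."""
    terms = phrase.lower().split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def biword_and_query(phrase, biword_index):
    """AND together the postings of every biword in the phrase."""
    result = None
    for bw in phrase_to_biwords(phrase):
        postings = set(biword_index.get(bw, ()))
        result = postings if result is None else result & postings
    return sorted(result or [])

# Hypothetical biword postings (doc IDs only).
biword_index = {
    "stanford university": [1, 2, 3],
    "university palo": [1, 3],
    "palo alto": [1, 3, 4],
}

print(phrase_to_biwords("stanford university palo alto"))
# ['stanford university', 'university palo', 'palo alto']
print(biword_and_query("stanford university palo alto", biword_index))
# [1, 3]
```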
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
• Which of docs 1, 2, 4, 5 could contain “to be
or not to be”?
Processing a phrase query
• Answer: Option B: documents 4 and 5.
Processing a phrase query
• Extract inverted index entries for each distinct
term: to, be, or, not.
• Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
– to:
• 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
– be:
• 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
• Same general method for proximity searches
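A minimal sketch of phrase matching over a positional index, using the partial postings for to and be shown above. Since postings for or and not are not listed, the example runs the two-word phrase “to be”. The dict-of-dicts layout and the name phrase_positions are assumptions, and this version uses set lookups for simplicity rather than a linear merge of the sorted position lists.

```python
def phrase_positions(terms, index):
    """Return {doc_id: [start positions]} where `terms` occur consecutively."""
    if not terms:
        return {}
    postings = [index.get(t, {}) for t in terms]
    # Only documents containing every term can match.
    common = set(postings[0]).intersection(*postings[1:])
    hits = {}
    for doc in sorted(common):
        later = [set(p[doc]) for p in postings[1:]]
        # A start s matches if term i+1 of the phrase sits at position s+i+1.
        starts = [s for s in postings[0][doc]
                  if all(s + i + 1 in pos for i, pos in enumerate(later))]
        if starts:
            hits[doc] = starts
    return hits

# Partial postings copied from the slide: term -> {doc_id: [positions]}.
index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

print(phrase_positions(["to", "be"], index))
# {4: [16, 190, 429, 433]}
```

For the full phrase, the same check would require be at offsets +1 and +5 from the first to; in doc 4 the pair 430 and 434 is the only one in these postings with that spacing.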
Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
– Again, here, /k means “within k words of”.
• Clearly, positional indexes can be used for
such queries; biword indexes cannot.
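A sketch of a single /k check on a positional index. The postings are invented, the name within_k is hypothetical, and the reading of /k as "at most k positions apart, in either order" is an assumption. The nested loop is quadratic in the position-list lengths; it is meant to show the idea, not an efficient merge.

```python
def within_k(term_a, term_b, k, index):
    """Return {doc_id: [(pos_a, pos_b), ...]} for occurrences at most k apart."""
    a_post = index.get(term_a, {})
    b_post = index.get(term_b, {})
    hits = {}
    for doc in sorted(set(a_post) & set(b_post)):
        pairs = [(pa, pb)
                 for pa in a_post[doc]
                 for pb in b_post[doc]
                 if abs(pa - pb) <= k]
        if pairs:
            hits[doc] = pairs
    return hits

# Hypothetical positional postings: term -> {doc_id: [positions]}.
index = {
    "statute": {1: [4, 40], 2: [10]},
    "federal": {1: [6], 2: [30]},
}

print(within_k("statute", "federal", 3, index))
# {1: [(4, 6)]}
```

A biword index cannot answer this query because it records only adjacent pairs, not positions.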
Positional index size
• A positional index expands postings storage
substantially
– Even though indexes can be compressed
• Nevertheless, a positional index is now
standardly used because of the power and
usefulness of phrase and proximity queries …
whether used explicitly or implicitly in a
ranking retrieval system.
Positional index size
• Need an entry for each occurrence, not just
once per document
• Index size depends on average document size
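A small counting sketch of why document length matters. A non-positional index stores one entry per distinct term per document, while a positional index stores one entry per occurrence. The toy documents and the name posting_counts are assumptions for illustration.

```python
def posting_counts(docs):
    """Return (non-positional entries, positional entries) for a collection."""
    nonpositional = 0
    positional = 0
    for text in docs.values():
        terms = text.lower().split()
        nonpositional += len(set(terms))  # one entry per distinct term per doc
        positional += len(terms)          # one entry per occurrence
    return nonpositional, positional

# Made-up toy collection.
docs = {1: "to be or not to be", 2: "to do is to be"}

print(posting_counts(docs))  # (8, 11)
```

Longer documents repeat terms more often, so the gap between the two counts tends to grow with average document size.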
Rules of thumb
• A positional index is 2–4 times as large as a
non-positional index
• Document structure
– Title, abstract, body, bullets, anchor
• Entity annotation
– Being part of a person’s name, location’s name