Information Retrieval
Map-Reduce Implementations
Adapted from Jimmy Lin's slides, which are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
Roadmap
Introduction to information retrieval
Basics of indexing and retrieval
Inverted indexing in MapReduce
Retrieval at scale
First, nomenclature…
Information retrieval (IR)
Focus on textual information (= text/document retrieval)
Other possibilities include image, video, music, …
What do we search?
Generically, “collections”
Less-frequently used, “corpora”
What do we find?
Generically, “documents”
Even though we may be referring to web pages, PDFs,
PowerPoint slides, paragraphs, etc.
Information Retrieval Cycle
[Diagram: the information retrieval cycle. Resource → Source Selection → Query Formulation → Query → Search → Results → Selection → Documents → Examination → Information → Delivery, with feedback loops for source reselection and for system, vocabulary, concept, and document discovery.]
The Central Problem in Search
[Diagram: the author encodes concepts as documents, acquired offline (e.g., by web crawling); the searcher encodes concepts as a query, online. Each side passes through a representation function, and a comparison function matches the query representation against the index to produce hits. The central problem: do the searcher's and the author's representations capture the same concepts?]
How do we represent text?
Remember: computers don’t “understand” anything!
“Bag of words”
Treat all the words in a document as index terms
Assign a “weight” to each term based on “importance”
(or, in the simplest case, presence/absence of the word)
Disregard order, structure, meaning, etc. of the words
Simple, yet effective!
Assumptions
Term occurrence is independent
Document relevance is independent
“Words” are well-defined
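As a concrete sketch of the representation just described, here is a minimal bag-of-words function (Python; the regex tokenizer and raw-count weighting are illustrative choices, not part of the original slides):

```python
from collections import Counter
import re

def bag_of_words(text):
    # Crude tokenizer: lowercase, keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Term -> raw count; order, structure, and meaning are discarded.
    return Counter(tokens)

print(bag_of_words("One fish, two fish."))
# Counter({'fish': 2, 'one': 1, 'two': 1})
```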
What’s a word?
天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
(Chinese: "Pope John Paul II was hospitalized again because of the flu; this is his second hospitalization this year for the same illness." Where are the word boundaries?)
وقال مارك ريجيف الناطق باسم الخارجية الإسرائيلية إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
(Arabic: "Mark Regev, the spokesman for the Israeli Foreign Ministry, said that Sharon had accepted the invitation and would make his first visit to Tunisia, for a long time the official headquarters of the Palestine Liberation Organization after its departure from Lebanon in 1982.")
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
(Hindi: "In its economic survey, the Indian government estimates a growth rate of seven percent in fiscal year 2005-06 and stresses tax reform.")
日米連合で台頭中国に対処…アーミテージ前副長官提言
(Japanese: "Deal with a rising China through a Japan-US alliance… a proposal by former Deputy Secretary of State Armitage.")
Boolean model
Queries are Boolean expressions over index terms (AND, OR, NOT); a document either satisfies the expression or it does not, so results are an unordered set of exact matches.
Strengths and Weaknesses
Strengths
Precise, if you know the right strategies
Precise, if you have an idea of what you’re looking for
Implementations are fast and efficient
Weaknesses
Users must learn Boolean logic
Boolean logic insufficient to capture the richness of language
No control over size of result set: either too many hits or none
When do you stop reading? All documents in the result set are
considered “equally good”
What about partial matches? Documents that “don’t quite match”
the query may be useful also
Vector Space Model
Documents and queries are vectors in a common term space; relevance is estimated from the angle (cosine) between the query vector and each document vector.
[Figure: documents d1-d5 plotted as vectors over terms t1, t2, t3, with angles θ and φ between selected vectors.]
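A minimal sketch of the comparison the figure implies: rank by the cosine of the angle between vectors (Python; the example vectors are made up):

```python
import math

def cosine(u, v):
    # cos(theta) = (u . v) / (|u| |v|); higher means more similar.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [1, 0, 1]   # hypothetical document vector over (t1, t2, t3)
q  = [1, 1, 0]   # hypothetical query vector
print(round(cosine(d1, q), 3))  # 0.5
```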
Term Weighting
Term weights consist of two components
Local: how important is the term in this document?
Global: how important is the term in the collection?
Here’s the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights
How do we capture this mathematically?
Term frequency (local)
Inverse document frequency (global)
TF.IDF Term Weighting
$w_{i,j} = \mathrm{tf}_{i,j} \cdot \log(N / n_i)$
where $w_{i,j}$ is the weight assigned to term $i$ in document $j$, $\mathrm{tf}_{i,j}$ is the number of occurrences of term $i$ in document $j$, $N$ is the number of documents in the collection, and $n_i$ is the number of documents containing term $i$.
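A direct transcription of the formula (Python; the example numbers reuse the toy fish collection that appears later in these slides):

```python
import math

def tfidf_weight(tf, df, num_docs):
    # w_{i,j} = tf_{i,j} * log(N / n_i)
    return tf * math.log(num_docs / df)

# 'fish' occurs twice in a document and appears in 2 of 4 documents:
print(round(tfidf_weight(2, 2, 4), 3))  # 1.386
```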
An Example
A document space is defined by three terms:
hardware, software, users
the vocabulary
A set of documents is defined as:
A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
If the query is "hardware and software",
what documents should be retrieved?
An Example (cont.)
Treating the query as the vector q = (1, 1, 0) and scoring by inner product: A4 and A7 score 2, documents A1, A2, A5, A6, A8, and A9 score 1, and A3 scores 0. A strict Boolean reading of "hardware AND software" retrieves only A4 and A7.
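A sketch of the computation behind this example (Python; scoring by inner product with the binary query vector, one simple choice among several):

```python
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # "hardware and software" over (hardware, software, users)

# Inner-product scores; ties broken by document id.
scores = {name: sum(a * b for a, b in zip(vec, q)) for name, vec in docs.items()}
for name, s in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(name, s)   # A4 and A7 score 2; A3 scores 0
```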
Constructing the Inverted Index (Word Counting)
[Diagram: documents are reduced to bags of words, discarding syntax, semantics, word knowledge, etc.; the inverted index is then built from the bags of words.]
Stopword removal
• Many of the most frequently used words in English are useless in IR and text mining; these are called stopwords.
– the, of, and, to, …
– Typically about 400 to 500 such words
– For a given application, an additional domain-specific stopword list may be constructed
• Why remove stopwords?
– Reduce the size of the indexing (or data) files
• stopwords account for 20-30% of total word counts
– Improve efficiency and effectiveness
• stopwords are not useful for searching or text mining
• they may also confuse the retrieval system
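A minimal sketch of stopword filtering (Python; the stopword set is a tiny stand-in for a real 400-500 word list):

```python
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}  # stand-in list

def remove_stopwords(tokens):
    # Drop high-frequency function words before indexing.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["the", "cat", "in", "the", "hat"]))  # ['cat', 'hat']
```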
Stemming
• Techniques used to find the root/stem of a word. E.g.,
– user, users, used, using → stem: use
– engineering, engineered, engineer → stem: engineer
Usefulness:
• improves effectiveness of IR and text mining
– matches similar words
– mainly improves recall
• reduces indexing size
– combining words with the same root may reduce indexing size by as much as 40-50%
Basic stemming methods
Use a set of rules (sketched in code below). E.g.,
• remove endings
– if a word ends with a consonant other than s, followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of th.
– if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– …
• transform words
– if a word ends with "ies" but not "eies" or "aies", then change "ies" to "y".
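A sketch of the rules above as code (Python; the rule ordering and length checks are interpretation choices, and real stemmers such as Porter's use many more rules):

```python
VOWELS = set("aeiou")

def crude_stem(word):
    # Rule: "ies" -> "y", unless the word ends in "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule: drop a trailing "s" after a consonant other than "s".
    if word.endswith("s") and len(word) >= 2 and word[-2] not in VOWELS | {"s"}:
        return word[:-1]
    # Rule: a word ending in "es" drops the "s".
    if word.endswith("es"):
        return word[:-1]
    # Rule: drop "ing" unless one letter or "th" would remain.
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]
    # Rule: drop "ed" after a consonant unless a single letter would remain.
    if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]
    return word

for w in ["ponies", "users", "talking", "engineered"]:
    print(w, "->", crude_stem(w))  # pony, user, talk, engineer
```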
Inverted Index: Boolean Retrieval
Doc 1: "one fish, two fish"  Doc 2: "red fish, blue fish"  Doc 3: "cat in the hat"  Doc 4: "green eggs and ham"

term  → postings (docids)
blue  → 2
cat   → 3
egg   → 4
fish  → 1, 2
green → 4
ham   → 4
hat   → 3
one   → 1
red   → 2
two   → 1
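A sketch of how such an index might be built (Python; tokenization is crude and there is no stemming or stopword removal here, so the entries differ slightly from the slide's index):

```python
from collections import defaultdict

docs = {
    1: "one fish, two fish",
    2: "red fish, blue fish",
    3: "cat in the hat",
    4: "green eggs and ham",
}

index = defaultdict(list)  # term -> sorted list of docids (postings)
for docid in sorted(docs):
    for term in set(docs[docid].replace(",", "").split()):
        index[term].append(docid)

print(index["fish"])  # [1, 2]
print(index["ham"])   # [4]
```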
Boolean Retrieval
To execute a Boolean query:
Build query syntax tree
The query ( blue AND fish ) OR ham parses to OR( ham, AND( blue, fish ) ).
Fetch postings for the leaves: blue → 2; fish → 1, 2; then evaluate the tree bottom-up.
The merge
• Walk through the two postings
simultaneously, in time linear in the total
number of postings entries
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
Intersection → 2, 8
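The merge as code, a sketch of the linear-time intersection of two sorted postings lists (Python):

```python
def intersect(p1, p2):
    # Walk both sorted lists in lockstep: O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```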
Inverted Index: TF.IDF
Doc 1 Doc 2 Doc 3 Doc 4
one fish, two fish red fish, blue fish cat in the hat green eggs and ham
term  | df | postings (docid, tf)
blue  | 1  | (2, 1)
cat   | 1  | (3, 1)
egg   | 1  | (4, 1)
fish  | 2  | (1, 2), (2, 2)
green | 1  | (4, 1)
ham   | 1  | (4, 1)
hat   | 1  | (3, 1)
one   | 1  | (1, 1)
red   | 1  | (2, 1)
two   | 1  | (1, 1)
Positional Indexes
Store term position in postings
Supports richer queries (e.g., proximity)
Naturally, leads to larger indexes…
Inverted Index: Positional Information
Doc 1: "one fish, two fish"  Doc 2: "red fish, blue fish"  Doc 3: "cat in the hat"  Doc 4: "green eggs and ham"

term | df | postings (docid, tf, [positions])
blue | 1  | (2, 1, [3])
fish | 2  | (1, 2, [2, 4]), (2, 2, [2, 4])
…
Retrieval: Document-at-a-Time
Evaluate one document at a time, considering all query terms together
blue → (9, 2), (21, 1), (35, 1), …
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
Tradeoffs
Small memory footprint (good)
Must read through all postings (bad), but skipping possible
More disk seeks (bad), but blocking possible
Retrieval: Query-At-A-Time
Evaluate documents one query term at a time
Usually starting from the rarest term (often with tf-sorted postings)
blue → (9, 2), (21, 1), (35, 1), …
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
Partial document scores are held in accumulators (e.g., a hash table).
Tradeoffs
Early termination heuristics (good)
Large memory footprint (bad), but filtering heuristics possible
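A sketch of query-at-a-time evaluation with accumulators (Python; the postings and the raw-tf scoring function are illustrative stand-ins for tf.idf weighting):

```python
from collections import defaultdict

# Hypothetical postings: term -> list of (docid, tf) pairs.
postings = {
    "blue": [(9, 2), (21, 1), (35, 1)],
    "fish": [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)],
}

def query_at_a_time(terms, postings, score=lambda tf: tf):
    acc = defaultdict(float)  # accumulators: docid -> partial score
    # Process one query term at a time, rarest term first.
    for term in sorted(terms, key=lambda t: len(postings.get(t, []))):
        for docid, tf in postings.get(term, []):
            acc[docid] += score(tf)
    return sorted(acc.items(), key=lambda kv: -kv[1])

print(query_at_a_time(["blue", "fish"], postings))  # doc 21 ranks first
```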
MapReduce it?
The indexing problem: perfect for MapReduce!
Scalability is critical
Must be relatively fast, but need not be real time
Fundamentally a batch operation
Incremental updates may or may not be important
For the web, crawling is a challenge in itself
The retrieval problem: uh… not so good…
Must have sub-second response time
For the web, only need relatively few results
Indexing: Performance Analysis
Fundamentally, a large sorting problem
Terms usually fit in memory
Postings usually don’t
How is it done on a single machine?
How can it be done with MapReduce?
First, let’s characterize the problem size:
Size of vocabulary
Size of postings
Vocabulary Size: Heaps’ Law
$M = kT^b$, where $M$ is the vocabulary size and $T$ is the number of tokens in the collection; typical values are $k = 44$ and $b = 0.49$
Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 - Aug 19, 1997)
Postings Size: Zipf's Law
$\mathrm{cf}_i = c / i$, where $\mathrm{cf}_i$ is the collection frequency of the $i$-th most common term and $c$ is a constant
Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 - Aug 19, 1997)
Inverted Indexing with MapReduce
[Diagram: mappers emit one (term, (docid, tf)) pair per distinct term per document, e.g., fish → (1, 2) and fish → (2, 2); after the shuffle and sort, reducers merge each term's pairs into its postings list: fish → (1, 2), (2, 2); blue → (2, 1); cat → (3, 1); hat → (3, 1); one → (1, 1); red → (2, 1); two → (1, 1).]
Inverted Indexing: Pseudo-Code
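As a sketch of the baseline algorithm (Python stand-ins for the map and reduce primitives, with an in-process shuffle for illustration; not Lin's original pseudo-code). Note that the reducer buffers and sorts each term's postings in memory, the bottleneck discussed next:

```python
from collections import Counter, defaultdict

def map_fn(docid, text):
    # Emit one (term, (docid, tf)) pair per distinct term in the document.
    for term, tf in Counter(text.split()).items():
        yield term, (docid, tf)

def reduce_fn(term, values):
    # Buffer all (docid, tf) pairs for the term and sort by docid.
    return term, sorted(values)

# Simulate the shuffle: group map output by key, then reduce each group.
groups = defaultdict(list)
for docid, text in [(1, "one fish two fish"), (2, "red fish blue fish")]:
    for term, posting in map_fn(docid, text):
        groups[term].append(posting)
print([reduce_fn(t, vs) for t, vs in sorted(groups.items())])
# [('blue', [(2, 1)]), ('fish', [(1, 2), (2, 2)]), ...]
```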
Positional Indexes
Doc 1: "one fish, two fish"  Doc 2: "red fish, blue fish"  Doc 3: "cat in the hat"
[Diagram: the same MapReduce flow with positions added to each posting, e.g., cat → (3, 1, [1]), blue → (2, 1, [3]).]
What's the problem?
Scalability Bottleneck
Initial implementation: terms as keys, postings as values
Reducers must buffer all postings associated with key (to sort)
What if we run out of memory to buffer postings?
Uh oh!
Another Try…
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
In practice:
• Don't encode docnos, encode gaps (or d-gaps):
fish → (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …
• But it's not obvious that this saves space…
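A sketch of gap encoding and decoding (Python; it reproduces the d-gaps shown above):

```python
def to_gaps(postings):
    # Replace each docid with its distance from the previous docid.
    gaps, prev = [], 0
    for docid, tf in postings:
        gaps.append((docid - prev, tf))
        prev = docid
    return gaps

def from_gaps(gaps):
    # Invert the transformation by accumulating the gaps.
    out, docid = [], 0
    for gap, tf in gaps:
        docid += gap
        out.append((docid, tf))
    return out

fish = [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)]
print(to_gaps(fish))  # [(1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3)]
assert from_gaps(to_gaps(fish)) == fish
```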
Overview of Index Compression
Byte-aligned vs. bit-aligned
VarInt
Group VarInt
Simple-9
Non-parameterized bit-aligned
Unary codes
γ codes
δ codes
Parameterized bit-aligned
Golomb codes (local Bernoulli model)
Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
Unary Codes
x ≥ 1 is coded as x-1 one bits, followed by 1 zero bit
3 = 110
4 = 1110
Great for small numbers… horrible for large numbers
Overly-biased for very small gaps
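A sketch of the unary encoder (Python):

```python
def unary_encode(x):
    # x >= 1 is coded as (x - 1) one bits followed by a single zero bit.
    assert x >= 1
    return "1" * (x - 1) + "0"

for x in (1, 3, 4):
    print(x, unary_encode(x))  # 1 -> 0, 3 -> 110, 4 -> 1110
```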
x  | Unary      | γ        | δ         | Golomb b=3 | Golomb b=6
1  | 0          | 0        | 0         | 0:0        | 0:00
2  | 10         | 10:0     | 100:0     | 0:10       | 0:01
3  | 110        | 10:1     | 100:1     | 0:11       | 0:100
4  | 1110       | 110:00   | 101:00    | 10:0       | 0:101
5  | 11110      | 110:01   | 101:01    | 10:10      | 0:110
6  | 111110     | 110:10   | 101:10    | 10:11      | 0:111
7  | 1111110    | 110:11   | 101:11    | 110:0      | 10:00
8  | 11111110   | 1110:000 | 11000:000 | 110:10     | 10:01
9  | 111111110  | 1110:001 | 11000:001 | 110:11     | 10:100
10 | 1111111110 | 1110:010 | 11000:010 | 1110:0     | 10:101
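A sketch of the Golomb encoder behind the last two columns (Python; the ":" separator mirrors the table and would not appear in a real bitstream, and the truncated-binary remainder is the standard construction):

```python
import math

def truncated_binary(r, b):
    # Remainders 0..b-1: the first u = 2^k - b values get k-1 bits,
    # the rest get k bits (offset by u), where k = ceil(log2 b).
    k = math.ceil(math.log2(b))
    u = (1 << k) - b
    if r < u:
        return format(r, "b").zfill(k - 1)
    return format(r + u, "b").zfill(k)

def golomb_encode(x, b):
    # x >= 1: quotient in unary (q ones then a zero), then the remainder.
    q, r = divmod(x - 1, b)
    return "1" * q + "0" + ":" + truncated_binary(r, b)

for x in (1, 2, 3, 4, 7):
    print(x, golomb_encode(x, 3), golomb_encode(x, 6))
# e.g., 4 -> 10:0 (b=3) and 0:101 (b=6), matching the table
```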
[Table: index compression results on two collections, Bible and TREC. Bible: King James version of the Bible; 31,101 verses (4.3 MB). TREC: TREC disks 1+2; 741,856 docs (2070 MB).]
[Figure: a directed graph on nodes 1-14 coalesced into a DAG by merging its strongly connected components; reachability queries such as Query(1,11) → Yes and Query(3,9) → No.]
Chicken and Egg?
(key)    (value)
fish 1   [2,4]
fish 9   [9]
fish 21  [1,8,22]
fish 34  [23]
fish 35  [8,41]
fish 80  [2,9,76]
But wait! How do we set the Golomb parameter b?
Recall: optimal $b \approx 0.69 \cdot (N/\mathrm{df})$
We need the df to set b…
But we don't know the df until we've seen all postings!
Sound familiar?
Getting the df
In the mapper:
Emit “special” key-value pairs to keep track of df
In the reducer:
Make sure “special” key-value pairs come first: process them to
determine df
Remember: proper partitioning!
Getting the df: Modified Mapper
Doc 1: "one fish, two fish" — input document…
(key)    (value)
fish 1   [2,4]
one 1    [1]
two 1    [3]
fish ★   [1]
one ★    [1]
two ★    [1]
Alongside each positional posting, emit a "special" (term, ★) pair that contributes 1 to the term's df.
Getting the df: Modified Reducer
(key)    (value)
fish ★   [1], [1], …  ← special pairs arrive first: sum them to get the df, then set b
fish 1   [2,4]
fish 9   [9]
fish 21  [1,8,22]
fish 34  [23]
fish 35  [8,41]
fish 80  [2,9,76]
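A sketch of the whole trick in miniature (Python; the STAR marker and the in-process shuffle are illustrative stand-ins for the special key and MapReduce's sort, and a real job must also partition on the term alone so a term's special pairs and postings reach the same reducer):

```python
from collections import defaultdict

STAR = -1  # special "docid": sorts before every real docid

def map_fn(docid, text):
    # Collect term positions within the document.
    positions = defaultdict(list)
    for pos, term in enumerate(text.split(), start=1):
        positions[term].append(pos)
    for term, plist in positions.items():
        yield (term, STAR), 1       # special pair: this doc adds 1 to df
        yield (term, docid), plist  # normal positional posting

# Simulate the shuffle: sort by (term, docid) so each term's special
# pairs reach the reducer before any of its postings.
pairs = sorted(map_fn(1, "one fish two fish"), key=lambda kv: kv[0])

df = defaultdict(int)
for (term, docid), value in pairs:
    if docid == STAR:
        df[term] += value   # df is known here, so b can be set
    else:
        print(term, docid, value, "df =", df[term])
```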
Term vs. Document Partitioning
[Diagram: two ways to split a large index. Term partitioning: each node holds the full postings for a slice of the vocabulary (T1, T2, T3) across all documents D. Document partitioning: each node holds a complete index over a slice of the documents (D1, D2, D3) for the full vocabulary T.]
Katta Architecture
(Distributed Lucene)
http://katta.sourceforge.net/