Information Retrieval
Map-Reduce Implementations
Adapted from Jimmy Lin's slides, which are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
Roadmap
Introduction to information retrieval
Basics of indexing and retrieval
Inverted indexing in MapReduce
Retrieval at scale
First, nomenclature…
Information retrieval (IR)
Focus on textual information (= text/document retrieval)
Other possibilities include image, video, music, …
What do we search?
Generically, “collections”
Less-frequently used, “corpora”
What do we find?
Generically, “documents”
Even though we may be referring to web pages, PDFs,
PowerPoint slides, paragraphs, etc.
Information Retrieval Cycle
[Diagram: the information retrieval cycle. Resource → Source Selection → Query Formulation → Query → Search → Results → Selection → Documents → Examination → Information → Delivery, with feedback loops for source reselection and for system, vocabulary, concept, and document discovery.]
The Central Problem in Search
[Diagram: the author encodes concepts as documents, acquired offline (e.g., by web crawling); the searcher encodes concepts as a query, online. Each side passes through a representation function, and a comparison function matches the query representation against the index to produce hits. The central problem: do the searcher's and the author's representations capture the same concepts?]
How do we represent text?
Remember: computers don’t “understand” anything!
“Bag of words”
Treat all the words in a document as index terms
Assign a “weight” to each term based on “importance”
(or, in the simplest case, presence/absence of the word)
Disregard order, structure, meaning, etc. of the words
Simple, yet effective!
Assumptions
Term occurrence is independent
Document relevance is independent
“Words” are well-defined
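As a concrete sketch of the representation just described, here is a minimal bag-of-words function (Python; the regex tokenizer and raw-count weighting are illustrative choices, not part of the original slides):

```python
from collections import Counter
import re

def bag_of_words(text):
    # Crude tokenizer: lowercase, keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Term -> raw count; order, structure, and meaning are discarded.
    return Counter(tokens)

print(bag_of_words("One fish, two fish."))
# Counter({'fish': 2, 'one': 1, 'two': 1})
```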
What’s a word?
天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
(Chinese: "Pope John Paul II was hospitalized again because of the flu; this is his second hospitalization this year for the same illness." Where are the word boundaries?)
وقال مارك ريجيف الناطق باسم الخارجية الإسرائيلية إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
(Arabic: "Mark Regev, the spokesman for the Israeli Foreign Ministry, said that Sharon had accepted the invitation and would make his first visit to Tunisia, for a long time the official headquarters of the Palestine Liberation Organization after its departure from Lebanon in 1982.")
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
(Hindi: "In its economic survey, the Indian government estimates a growth rate of seven percent in fiscal year 2005-06 and stresses tax reform.")
日米連合で台頭中国に対処…アーミテージ前副長官提言
(Japanese: "Deal with a rising China through a Japan-US alliance… a proposal by former Deputy Secretary of State Armitage.")
Boolean model
Queries are Boolean expressions over index terms (AND, OR, NOT); a document either satisfies the expression or it does not, so results are an unordered set of exact matches.
Strengths and Weaknesses
Strengths
Precise, if you know the right strategies
Precise, if you have an idea of what you’re looking for
Implementations are fast and efficient
Weaknesses
Users must learn Boolean logic
Boolean logic insufficient to capture the richness of language
No control over size of result set: either too many hits or none
When do you stop reading? All documents in the result set are
considered “equally good”
What about partial matches? Documents that “don’t quite match”
the query may be useful also
Vector Space Model
Documents and queries are vectors in a common term space; relevance is estimated from the angle (cosine) between the query vector and each document vector.
[Figure: documents d1-d5 plotted as vectors over terms t1, t2, t3, with angles θ and φ between selected vectors.]
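A minimal sketch of the comparison the figure implies: rank by the cosine of the angle between vectors (Python; the example vectors are made up):

```python
import math

def cosine(u, v):
    # cos(theta) = (u . v) / (|u| |v|); higher means more similar.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [1, 0, 1]   # hypothetical document vector over (t1, t2, t3)
q  = [1, 1, 0]   # hypothetical query vector
print(round(cosine(d1, q), 3))  # 0.5
```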
Term Weighting
Term weights consist of two components
Local: how important is the term in this document?
Global: how important is the term in the collection?
Here’s the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights
How do we capture this mathematically?
Term frequency (local)
Inverse document frequency (global)
TF.IDF Term Weighting
$w_{i,j} = \mathrm{tf}_{i,j} \cdot \log(N / n_i)$
where $w_{i,j}$ is the weight assigned to term $i$ in document $j$, $\mathrm{tf}_{i,j}$ is the number of occurrences of term $i$ in document $j$, $N$ is the number of documents in the collection, and $n_i$ is the number of documents containing term $i$.
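A direct transcription of the formula (Python; the example numbers reuse the toy fish collection that appears later in these slides):

```python
import math

def tfidf_weight(tf, df, num_docs):
    # w_{i,j} = tf_{i,j} * log(N / n_i)
    return tf * math.log(num_docs / df)

# 'fish' occurs twice in a document and appears in 2 of 4 documents:
print(round(tfidf_weight(2, 2, 4), 3))  # 1.386
```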
An Example
A document space is defined by three terms:
hardware, software, users
the vocabulary
A set of documents is defined as:
A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
If the query is "hardware and software",
what documents should be retrieved?
An Example (cont.)
Treating the query as the vector q = (1, 1, 0) and scoring by inner product: A4 and A7 score 2, documents A1, A2, A5, A6, A8, and A9 score 1, and A3 scores 0. A strict Boolean reading of "hardware AND software" retrieves only A4 and A7.
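A sketch of the computation behind this example (Python; scoring by inner product with the binary query vector, one simple choice among several):

```python
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # "hardware and software" over (hardware, software, users)

# Inner-product scores; ties broken by document id.
scores = {name: sum(a * b for a, b in zip(vec, q)) for name, vec in docs.items()}
for name, s in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(name, s)   # A4 and A7 score 2; A3 scores 0
```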
Constructing the Inverted Index (Word Counting)
[Diagram: documents are reduced to bags of words, discarding syntax, semantics, word knowledge, etc.; the inverted index is then built from the bags of words.]
Stopword removal
• Many of the most frequently used words in English are useless in IR and text mining; these are called stopwords.
– the, of, and, to, …
– Typically about 400 to 500 such words
– For a given application, an additional domain-specific stopword list may be constructed
• Why remove stopwords?
– Reduce the size of the indexing (or data) files
• stopwords account for 20-30% of total word counts
– Improve efficiency and effectiveness
• stopwords are not useful for searching or text mining
• they may also confuse the retrieval system
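A minimal sketch of stopword filtering (Python; the stopword set is a tiny stand-in for a real 400-500 word list):

```python
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}  # stand-in list

def remove_stopwords(tokens):
    # Drop high-frequency function words before indexing.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["the", "cat", "in", "the", "hat"]))  # ['cat', 'hat']
```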
Stemming
• Techniques used to find the root/stem of a word. E.g.,
– user, users, used, using → stem: use
– engineering, engineered, engineer → stem: engineer
Usefulness:
• improves effectiveness of IR and text mining
– matches similar words
– mainly improves recall
• reduces indexing size
– combining words with the same root may reduce indexing size by as much as 40-50%
Basic stemming methods
Use a set of rules (sketched in code below). E.g.,
• remove endings
– if a word ends with a consonant other than s, followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of th.
– if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– …
• transform words
– if a word ends with "ies" but not "eies" or "aies", then change "ies" to "y".
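A sketch of the rules above as code (Python; the rule ordering and length checks are interpretation choices, and real stemmers such as Porter's use many more rules):

```python
VOWELS = set("aeiou")

def crude_stem(word):
    # Rule: "ies" -> "y", unless the word ends in "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule: drop a trailing "s" after a consonant other than "s".
    if word.endswith("s") and len(word) >= 2 and word[-2] not in VOWELS | {"s"}:
        return word[:-1]
    # Rule: a word ending in "es" drops the "s".
    if word.endswith("es"):
        return word[:-1]
    # Rule: drop "ing" unless one letter or "th" would remain.
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]
    # Rule: drop "ed" after a consonant unless a single letter would remain.
    if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]
    return word

for w in ["ponies", "users", "talking", "engineered"]:
    print(w, "->", crude_stem(w))  # pony, user, talk, engineer
```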
Inverted Index: Boolean Retrieval
Doc 1: "one fish, two fish"  Doc 2: "red fish, blue fish"  Doc 3: "cat in the hat"  Doc 4: "green eggs and ham"

term  → postings (docids)
blue  → 2
cat   → 3
egg   → 4
fish  → 1, 2
green → 4
ham   → 4
hat   → 3
one   → 1
red   → 2
two   → 1
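A sketch of how such an index might be built (Python; tokenization is crude and there is no stemming or stopword removal here, so the entries differ slightly from the slide's index):

```python
from collections import defaultdict

docs = {
    1: "one fish, two fish",
    2: "red fish, blue fish",
    3: "cat in the hat",
    4: "green eggs and ham",
}

index = defaultdict(list)  # term -> sorted list of docids (postings)
for docid in sorted(docs):
    for term in set(docs[docid].replace(",", "").split()):
        index[term].append(docid)

print(index["fish"])  # [1, 2]
print(index["ham"])   # [4]
```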
Boolean Retrieval
To execute a Boolean query:
Build query syntax tree
The query ( blue AND fish ) OR ham parses to OR( ham, AND( blue, fish ) ).
Fetch postings for the leaves: blue → 2; fish → 1, 2; then evaluate the tree bottom-up.
The merge
• Walk through the two postings
simultaneously, in time linear in the total
number of postings entries
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
Intersection → 2, 8
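The merge as code, a sketch of the linear-time intersection of two sorted postings lists (Python):

```python
def intersect(p1, p2):
    # Walk both sorted lists in lockstep: O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```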
Inverted Index: TF.IDF
Doc 1 Doc 2 Doc 3 Doc 4
one fish, two fish red fish, blue fish cat in the hat green eggs and ham
term  | df | postings (docid, tf)
blue  | 1  | (2, 1)
cat   | 1  | (3, 1)
egg   | 1  | (4, 1)
fish  | 2  | (1, 2), (2, 2)
green | 1  | (4, 1)
ham   | 1  | (4, 1)
hat   | 1  | (3, 1)
one   | 1  | (1, 1)
red   | 1  | (2, 1)
two   | 1  | (1, 1)
Positional Indexes
Store term position in postings
Supports richer queries (e.g., proximity)
Naturally, leads to larger indexes…
Inverted Index: Positional Information
Doc 1: "one fish, two fish"  Doc 2: "red fish, blue fish"  Doc 3: "cat in the hat"  Doc 4: "green eggs and ham"

term | df | postings (docid, tf, [positions])
blue | 1  | (2, 1, [3])
fish | 2  | (1, 2, [2, 4]), (2, 2, [2, 4])
…
Retrieval: Document-at-a-Time
Evaluate one document at a time, considering all query terms together
blue → (9, 2), (21, 1), (35, 1), …
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
Tradeoffs
Small memory footprint (good)
Must read through all postings (bad), but skipping possible
More disk seeks (bad), but blocking possible
Retrieval: Query-At-A-Time
Evaluate documents one query term at a time
Usually starting from the rarest term (often with tf-sorted postings)
blue → (9, 2), (21, 1), (35, 1), …
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
Partial document scores are held in accumulators (e.g., a hash table).
Tradeoffs
Early termination heuristics (good)
Large memory footprint (bad), but filtering heuristics possible
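A sketch of query-at-a-time evaluation with accumulators (Python; the postings and the raw-tf scoring function are illustrative stand-ins for tf.idf weighting):

```python
from collections import defaultdict

# Hypothetical postings: term -> list of (docid, tf) pairs.
postings = {
    "blue": [(9, 2), (21, 1), (35, 1)],
    "fish": [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)],
}

def query_at_a_time(terms, postings, score=lambda tf: tf):
    acc = defaultdict(float)  # accumulators: docid -> partial score
    # Process one query term at a time, rarest term first.
    for term in sorted(terms, key=lambda t: len(postings.get(t, []))):
        for docid, tf in postings.get(term, []):
            acc[docid] += score(tf)
    return sorted(acc.items(), key=lambda kv: -kv[1])

print(query_at_a_time(["blue", "fish"], postings))  # doc 21 ranks first
```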
MapReduce it?
The indexing problem: perfect for MapReduce!
Scalability is critical
Must be relatively fast, but need not be real time
Fundamentally a batch operation
Incremental updates may or may not be important
For the web, crawling is a challenge in itself
The retrieval problem: uh… not so good…
Must have sub-second response time
For the web, only need relatively few results
Indexing: Performance Analysis
Fundamentally, a large sorting problem
Terms usually fit in memory
Postings usually don’t
How is it done on a single machine?
How can it be done with MapReduce?
First, let’s characterize the problem size:
Size of vocabulary
Size of postings
Vocabulary Size: Heaps’ Law
$M = kT^b$, where $M$ is the vocabulary size and $T$ is the number of tokens in the collection; typical values are $k = 44$ and $b = 0.49$
Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 - Aug 19, 1997)
Postings Size: Zipf's Law
$\mathrm{cf}_i = c / i$, where $\mathrm{cf}_i$ is the collection frequency of the $i$-th most common term and $c$ is a constant
Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 - Aug 19, 1997)
Inverted Indexing with MapReduce
[Diagram: mappers emit one (term, (docid, tf)) pair per distinct term per document, e.g., fish → (1, 2) and fish → (2, 2); after the shuffle and sort, reducers merge each term's pairs into its postings list: fish → (1, 2), (2, 2); blue → (2, 1); cat → (3, 1); hat → (3, 1); one → (1, 1); red → (2, 1); two → (1, 1).]
Inverted Indexing: Pseudo-Code
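As a sketch of the baseline algorithm (Python stand-ins for the map and reduce primitives, with an in-process shuffle for illustration; not Lin's original pseudo-code). Note that the reducer buffers and sorts each term's postings in memory, the bottleneck discussed next:

```python
from collections import Counter, defaultdict

def map_fn(docid, text):
    # Emit one (term, (docid, tf)) pair per distinct term in the document.
    for term, tf in Counter(text.split()).items():
        yield term, (docid, tf)

def reduce_fn(term, values):
    # Buffer all (docid, tf) pairs for the term and sort by docid.
    return term, sorted(values)

# Simulate the shuffle: group map output by key, then reduce each group.
groups = defaultdict(list)
for docid, text in [(1, "one fish two fish"), (2, "red fish blue fish")]:
    for term, posting in map_fn(docid, text):
        groups[term].append(posting)
print([reduce_fn(t, vs) for t, vs in sorted(groups.items())])
# [('blue', [(2, 1)]), ('fish', [(1, 2), (2, 2)]), ...]
```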
Positional Indexes
Doc 1: "one fish, two fish"  Doc 2: "red fish, blue fish"  Doc 3: "cat in the hat"
[Diagram: the same MapReduce flow with positions added to each posting, e.g., cat → (3, 1, [1]), blue → (2, 1, [3]).]
What's the problem?
Scalability Bottleneck
Initial implementation: terms as keys, postings as values
Reducers must buffer all postings associated with key (to sort)
What if we run out of memory to buffer postings?
Uh oh!
Another Try…
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
In practice:
• Don't encode docnos, encode gaps (or d-gaps):
fish → (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …
• But it's not obvious that this saves space…
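A sketch of gap encoding and decoding (Python; it reproduces the d-gaps shown above):

```python
def to_gaps(postings):
    # Replace each docid with its distance from the previous docid.
    gaps, prev = [], 0
    for docid, tf in postings:
        gaps.append((docid - prev, tf))
        prev = docid
    return gaps

def from_gaps(gaps):
    # Invert the transformation by accumulating the gaps.
    out, docid = [], 0
    for gap, tf in gaps:
        docid += gap
        out.append((docid, tf))
    return out

fish = [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)]
print(to_gaps(fish))  # [(1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3)]
assert from_gaps(to_gaps(fish)) == fish
```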
Overview of Index Compression
Byte-aligned vs. bit-aligned
VarInt
Group VarInt
Simple-9
Non-parameterized bit-aligned
Unary codes
γ codes
δ codes
Parameterized bit-aligned
Golomb codes (local Bernoulli model)
Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
Unary Codes
x ≥ 1 is coded as x-1 one bits, followed by 1 zero bit
3 = 110
4 = 1110
Great for small numbers… horrible for large numbers
Overly-biased for very small gaps
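A sketch of the unary encoder (Python):

```python
def unary_encode(x):
    # x >= 1 is coded as (x - 1) one bits followed by a single zero bit.
    assert x >= 1
    return "1" * (x - 1) + "0"

for x in (1, 3, 4):
    print(x, unary_encode(x))  # 1 -> 0, 3 -> 110, 4 -> 1110
```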
x  | Unary      | γ        | δ         | Golomb b=3 | Golomb b=6
1  | 0          | 0        | 0         | 0:0        | 0:00
2  | 10         | 10:0     | 100:0     | 0:10       | 0:01
3  | 110        | 10:1     | 100:1     | 0:11       | 0:100
4  | 1110       | 110:00   | 101:00    | 10:0       | 0:101
5  | 11110      | 110:01   | 101:01    | 10:10      | 0:110
6  | 111110     | 110:10   | 101:10    | 10:11      | 0:111
7  | 1111110    | 110:11   | 101:11    | 110:0      | 10:00
8  | 11111110   | 1110:000 | 11000:000 | 110:10     | 10:01
9  | 111111110  | 1110:001 | 11000:001 | 110:11     | 10:100
10 | 1111111110 | 1110:010 | 11000:010 | 1110:0     | 10:101
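A sketch of the Golomb encoder behind the last two columns (Python; the ":" separator mirrors the table and would not appear in a real bitstream, and the truncated-binary remainder is the standard construction):

```python
import math

def truncated_binary(r, b):
    # Remainders 0..b-1: the first u = 2^k - b values get k-1 bits,
    # the rest get k bits (offset by u), where k = ceil(log2 b).
    k = math.ceil(math.log2(b))
    u = (1 << k) - b
    if r < u:
        return format(r, "b").zfill(k - 1)
    return format(r + u, "b").zfill(k)

def golomb_encode(x, b):
    # x >= 1: quotient in unary (q ones then a zero), then the remainder.
    q, r = divmod(x - 1, b)
    return "1" * q + "0" + ":" + truncated_binary(r, b)

for x in (1, 2, 3, 4, 7):
    print(x, golomb_encode(x, 3), golomb_encode(x, 6))
# e.g., 4 -> 10:0 (b=3) and 0:101 (b=6), matching the table
```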
[Table: index compression results on two collections, Bible and TREC. Bible: King James version of the Bible; 31,101 verses (4.3 MB). TREC: TREC disks 1+2; 741,856 docs (2070 MB).]
[Figure: a directed graph on nodes 1-14 coalesced into a DAG by merging its strongly connected components; reachability queries such as Query(1,11) → Yes and Query(3,9) → No.]
Chicken and Egg?
(key)    (value)
fish 1   [2,4]
fish 9   [9]
fish 21  [1,8,22]
fish 34  [23]
fish 35  [8,41]
fish 80  [2,9,76]
But wait! How do we set the Golomb parameter b?
Recall: optimal $b \approx 0.69 \cdot (N/\mathrm{df})$
We need the df to set b…
But we don't know the df until we've seen all postings!
Sound familiar?
Getting the df
In the mapper:
Emit “special” key-value pairs to keep track of df
In the reducer:
Make sure “special” key-value pairs come first: process them to
determine df
Remember: proper partitioning!
Getting the df: Modified Mapper
Doc 1: "one fish, two fish" — input document…
(key)    (value)
fish 1   [2,4]
one 1    [1]
two 1    [3]
fish ★   [1]
one ★    [1]
two ★    [1]
Alongside each positional posting, emit a "special" (term, ★) pair that contributes 1 to the term's df.
Getting the df: Modified Reducer
(key)    (value)
fish ★   [1], [1], …  ← special pairs arrive first: sum them to get the df, then set b
fish 1   [2,4]
fish 9   [9]
fish 21  [1,8,22]
fish 34  [23]
fish 35  [8,41]
fish 80  [2,9,76]
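A sketch of the whole trick in miniature (Python; the STAR marker and the in-process shuffle are illustrative stand-ins for the special key and MapReduce's sort, and a real job must also partition on the term alone so a term's special pairs and postings reach the same reducer):

```python
from collections import defaultdict

STAR = -1  # special "docid": sorts before every real docid

def map_fn(docid, text):
    # Collect term positions within the document.
    positions = defaultdict(list)
    for pos, term in enumerate(text.split(), start=1):
        positions[term].append(pos)
    for term, plist in positions.items():
        yield (term, STAR), 1       # special pair: this doc adds 1 to df
        yield (term, docid), plist  # normal positional posting

# Simulate the shuffle: sort by (term, docid) so each term's special
# pairs reach the reducer before any of its postings.
pairs = sorted(map_fn(1, "one fish two fish"), key=lambda kv: kv[0])

df = defaultdict(int)
for (term, docid), value in pairs:
    if docid == STAR:
        df[term] += value   # df is known here, so b can be set
    else:
        print(term, docid, value, "df =", df[term])
```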
Term vs. Document Partitioning
[Diagram: two ways to split a large index. Term partitioning: each node holds the full postings for a slice of the vocabulary (T1, T2, T3) across all documents D. Document partitioning: each node holds a complete index over a slice of the documents (D1, D2, D3) for the full vocabulary T.]
Katta Architecture
(Distributed Lucene)
http://katta.sourceforge.net/