L02-IR Models MMN
Retrieval Models
• A retrieval model specifies the details of:
– Document representation (how the set of stored documents is represented)
– Query representation (how the user's request is expressed through the search interface)
– Retrieval function (how a document's relevance to a query is computed; see the sketch below)
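These three components can be pictured as a minimal interface. A sketch in Python; all class and method names here are hypothetical, not from the lecture:

    from abc import ABC, abstractmethod

    class RetrievalModel(ABC):
        """Skeleton of the three components a retrieval model must define."""

        @abstractmethod
        def represent_document(self, text: str):
            """Turn a stored document into its internal representation."""

        @abstractmethod
        def represent_query(self, text: str):
            """Turn the user's request into a query representation."""

        @abstractmethod
        def score(self, query_repr, doc_repr) -> float:
            """Retrieval function: how well the document matches the query."""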
• 1. Boolean models (set theoretic)
• 2. Vector-space models (algebraic)
• 3. Probabilistic models
Other Model Dimensions
• Logical View of Documents
– Index terms
– Full text
– Full text + Structure (e.g. hypertext)
• User Task
– Retrieval
– Browsing
• Text preprocessing: strip unwanted characters/markup.
• A document is represented as a set of keywords.
• Queries are Boolean expressions over keywords, e.g.:
– [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
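A minimal sketch of evaluating that query with Python set operations; the document keyword sets below are invented for illustration:

    # Hypothetical documents, each represented as a set of keywords.
    docs = {
        "d1": {"rio", "brazil", "hotel", "hilton"},
        "d2": {"rio", "brazil", "hotel"},
        "d3": {"hilo", "hawaii", "hotel"},
        "d4": {"hilo", "hawaii", "hostel"},
    }

    def matches(kw):
        # [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
        return (({"rio", "brazil"} <= kw or {"hilo", "hawaii"} <= kw)
                and "hotel" in kw
                and "hilton" not in kw)

    print([d for d, kw in docs.items() if matches(kw)])   # ['d2', 'd3']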
• Consider 5 documents with a vocabulary of 6 terms
• Consider the query:
Find the documents that contain term1 and term3 but not term2, i.e.
(term1 ∧ term3 ∧ ¬term2)
            term1  term2  term3  term4  term5  term6
document1     1      0      1      0      0      0
document2     0      1      0      1      0      1
document3     1      1      1      1      1      0
document4     1      0      1      0      0      1
document5     0      0      1      1      0      0
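Evaluating the query against this incidence matrix amounts to combining term columns with Boolean operations; the matching documents are document1 and document4. A sketch using the rows copied from the table above:

    # Incidence-matrix rows from the table above: one binary vector per document.
    matrix = {
        "document1": [1, 0, 1, 0, 0, 0],
        "document2": [0, 1, 0, 1, 0, 1],
        "document3": [1, 1, 1, 1, 1, 0],
        "document4": [1, 0, 1, 0, 0, 1],
        "document5": [0, 0, 1, 1, 0, 0],
    }

    # term1 AND term3 AND NOT term2 (term indices are 0-based here)
    hits = [doc for doc, row in matrix.items()
            if row[0] and row[2] and not row[1]]
    print(hits)   # ['document1', 'document4']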
• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback.
– If a document is identified by the user as relevant or irrelevant, how should the query be modified?
• The Boolean model is a simple retrieval model based on set theory and Boolean algebra.
• Queries are specified as Boolean expressions, which have precise semantics.
• The retrieval strategy is based on a binary decision criterion: a document either matches the query or it does not.
• The Boolean model considers only whether index terms are present or absent in a document.
• Consider the following four documents and the index terms representing each:
• Document1: football, Cairo, player, sport
• Document2: basketball, Cairo, player, football
• Document3: Cairo, airport, travel, player
• Document4: travel, Cairo, player, football
• A document is typically represented by a bag of words (unordered
words with frequencies).
• Bag = set that allows multiple occurrences of the same element.
• User specifies a set of desired terms with optional weights:
– Weighted query terms:
Q = < database 0.5; text 0.8; information 0.2 >
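A sketch of a bag-of-words representation together with the weighted query from the slide; the document text is invented for illustration:

    from collections import Counter

    # Bag of words: term -> frequency (word order is discarded, counts are kept).
    doc_bag = Counter("information retrieval uses a text database "
                      "and a text index".split())

    # Weighted query terms: Q = <database 0.5; text 0.8; information 0.2>
    query = {"database": 0.5, "text": 0.8, "information": 0.2}

    # A simple score: sum over query terms of (query weight x term frequency).
    score = sum(w * doc_bag[t] for t, w in query.items())
    print(doc_bag["text"], score)   # 2   0.5*1 + 0.8*2 + 0.2*1 = 2.3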
• Retrieval based on similarity between query and documents.
The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimensionality = t = |vocabulary|
• Each term i in a document or query j is given a real-valued weight wij.
• Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
Graphic Representation
Example (documents and query as vectors over terms T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors in the three-dimensional term space spanned by T1, T2, T3.]
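The similarity measure itself is not shown on this slide; cosine similarity is the standard choice in the vector-space model. A minimal sketch applied to the vectors above:

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    D1 = (2, 3, 5)   # D1 = 2T1 + 3T2 + 5T3
    D2 = (3, 7, 1)   # D2 = 3T1 + 7T2 + T3
    Q  = (0, 0, 2)   # Q  = 0T1 + 0T2 + 2T3

    print(round(cosine(D1, Q), 2), round(cosine(D2, Q), 2))   # 0.81 0.13 -> D1 ranks above D2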
• A collection of n documents can be represented in the vector-space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document.

        T1    T2   …    Tt
D1     w11   w21   …   wt1
D2     w12   w22   …   wt2
:       :     :         :
Dn     w1n   w2n   …   wtn
• More frequent terms in a document are more important, i.e. more indicative of the topic.
– fij = frequency of term i in document j
– Normalized term frequency: tfij = fij / max{fkj}, i.e. divide by the frequency of the most frequent term in document j (this is the tf used in the worked example below).
• Terms that appear in many different documents are less indicative of the overall topic.
– dfi = document frequency of term i
      = number of documents containing term i
– idfi = log2(N / dfi), the inverse document frequency, where N is the total number of documents in the collection.
– The tf-idf weight of term i in document j is then wij = tfij × idfi.
Given a document containing terms with the following frequencies:
term A (3), term B (2), term C (1)
Assume the collection contains 10,000 documents and the document frequencies of these terms are:
A (50), B (1300), C (250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
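A short sketch that reproduces these numbers (tf normalized by the most frequent term in the document, idf = log2(N/df)):

    import math

    N = 10_000                              # documents in the collection
    freq = {"A": 3, "B": 2, "C": 1}         # raw term frequencies in the document
    df   = {"A": 50, "B": 1300, "C": 250}   # document frequencies in the collection

    max_f = max(freq.values())
    for term in freq:
        tf  = freq[term] / max_f            # 3/3, 2/3, 1/3
        idf = math.log2(N / df[term])
        print(f"{term}: tf={tf:.2f}  idf={idf:.1f}  tf-idf={tf * idf:.1f}")
    # A: tf=1.00  idf=7.6  tf-idf=7.6
    # B: tf=0.67  idf=2.9  tf-idf=2.0
    # C: tf=0.33  idf=5.3  tf-idf=1.8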
Another normalization type: tf = (term count in the document) / (total number of words in the document)
• Suppose a term appears 20 times in a document that contains a total of 100 words, and the collection contains 10,000 documents, of which 100 contain the term. Calculate the weighted TF-IDF score of the term for this document.
Solution
TF = 20/100 = 0.2
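The solution above stops at the TF step. A sketch completing the calculation, assuming the base-2 logarithm used in the earlier worked example (with a base-10 log the IDF would be 2.0 and the TF-IDF 0.4):

    import math

    tf = 20 / 100                    # term count / total words in the document = 0.2
    idf = math.log2(10_000 / 100)    # log2(N / df) = log2(100) ≈ 6.64
    print(round(tf, 2), round(idf, 2), round(tf * idf, 2))   # 0.2 6.64 1.33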