L02-IR Models MMN

Boolean and Vector Space Retrieval Models
• A retrieval model specifies the details of:
– Document representation (for the set of stored documents)
– Query representation (the user's search interface)
– Retrieval function (how the matching computation is performed)

• Determines a notion of relevance.

• The notion of relevance can be binary or continuous (i.e. ranked retrieval).

• The core task: finding relevant documents with respect to a given query.
• 1. Boolean models (set theoretic)
• 2. Vector space models (statistical/algebraic)
– Latent Semantic Indexing
• 3. Probabilistic models
Other Model Dimensions
• Logical View of Documents
– Index terms
– Full text
– Full text + Structure (e.g. hypertext)
• User Task
– Retrieval
– Browsing

• Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
• Break into tokens (keywords).
• Remove common “stopwords” (e.g. a, the, it, etc.).
• Detect common phrases (possibly using a domain-specific dictionary).
• Build an inverted index (keyword → list of docs containing it).
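The steps above can be sketched in Python; the stopword list and sample documents below are illustrative, not from the slides:

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it", "in", "of", "and"}  # illustrative subset

def tokenize(text):
    # Strip markup, keep only alphabetic tokens, lowercase, drop stopwords.
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # drops punctuation and numbers
    return [t for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    # keyword -> sorted list of ids of the documents containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "<p>The hotel in Rio, Brazil.</p>",
        2: "A hotel in Hilo, Hawaii."}
index = build_inverted_index(docs)
print(index["hotel"])  # [1, 2]
```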

• A document is represented as a set of keywords.
• Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, with brackets to indicate scope.
– [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
• Output: a document is either relevant or not. No partial matches or ranking.
• Consider 5 documents with a vocabulary of 6 terms:
• document 1 = “term1 term3”
• document 2 = “term2 term4 term6”
• document 3 = “term1 term2 term3 term4 term5”
• document 4 = “term1 term3 term6”
• document 5 = “term3 term4”


term1 term2 term3 term4 term5 term6
document1 1 0 1 0 0 0
document2 0 1 0 1 0 1
document3 1 1 1 1 1 0
document4 1 0 1 0 0 1
document5 0 0 1 1 0 0
• Consider the query:
• Find the documents containing term1 and term3 but not term2
• (term1 ∧ term3 ∧ !term2)
term1 term2 term3 term4 term5 term6
document1 1 0 1 0 0 0
document2 0 1 0 1 0 1
document3 1 1 1 1 1 0
document4 1 0 1 0 0 1
document5 0 0 1 1 0 0

term1 ∧ term3 ∧ !term2


term1 !term2 term3 term4 term5 term6
document1 1 1 1 0 0 0
document2 0 0 0 1 0 1
document3 1 0 1 1 1 0
document4 1 1 1 0 0 1
document5 0 1 1 1 0 0
document 1 : 1 ∧ 1 ∧ 1 = 1
document 2 : 0 ∧ 0 ∧ 0 = 0
document 3 : 1 ∧ 1 ∧ 0 = 0
document 4 : 1 ∧ 1 ∧ 1 = 1
document 5 : 0 ∧ 1 ∧ 1 = 0

Based on the above computation, document1 and document4 are relevant to the given query.
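The computation above can be reproduced with a few lines of Python over the incidence matrix:

```python
# Incidence matrix from the example: rows = documents 1..5, columns = term1..term6.
matrix = [
    [1, 0, 1, 0, 0, 0],  # document 1
    [0, 1, 0, 1, 0, 1],  # document 2
    [1, 1, 1, 1, 1, 0],  # document 3
    [1, 0, 1, 0, 0, 1],  # document 4
    [0, 0, 1, 1, 0, 0],  # document 5
]

# Evaluate term1 AND term3 AND NOT term2 for every document.
relevant = [
    doc_id
    for doc_id, row in enumerate(matrix, start=1)
    if row[0] and row[2] and not row[1]
]
print(relevant)  # [1, 4]
```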
• Popular retrieval model because:
– Easy to understand for simple queries.
– Clean formalism.

• Reasonably efficient implementations possible for normal queries.

• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback.
– If a document is identified by the user as relevant or irrelevant, how should
the query be modified?

• It is a simple retrieval model based on set theory and Boolean
algebra.
• Queries are designed as Boolean expressions which have precise
semantics.
• The retrieval strategy is based on a binary decision criterion.
• The Boolean model considers that index terms are only present or
absent in a document.

• Consider the following four documents and their index terms:
• Document1: football, Cairo, player, sport
• Document2: basketball, Cairo, player, football
• Document3: Cairo, airport, travel, player
• Document4: travel, Cairo, player, football

• Write a program that returns the relevant documents using Boolean information retrieval for the following user query:
“the football player and in Cairo not basketball”
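A minimal sketch of such a program, interpreting the query as football AND player AND Cairo AND NOT basketball (terms lowercased for matching):

```python
# Each document is represented as a set of index terms, as in the Boolean model.
documents = {
    "Document1": {"football", "cairo", "player", "sport"},
    "Document2": {"basketball", "cairo", "player", "football"},
    "Document3": {"cairo", "airport", "travel", "player"},
    "Document4": {"travel", "cairo", "player", "football"},
}

def boolean_retrieve(docs, required, excluded):
    # A document is relevant iff it contains every required term
    # and none of the excluded terms.
    return [
        name
        for name, terms in docs.items()
        if required <= terms and not (excluded & terms)
    ]

# "the football player and in Cairo not basketball"
# -> football AND player AND cairo AND NOT basketball
print(boolean_retrieve(documents,
                       {"football", "player", "cairo"},
                       {"basketball"}))
# ['Document1', 'Document4']
```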

• A document is typically represented by a bag of words (unordered
words with frequencies).
• Bag = set that allows multiple occurrences of the same element.
• User specifies a set of desired terms with optional weights:
– Weighted query terms:
Q = < database 0.5; text 0.8; information 0.2 >

– Unweighted query terms:
Q = < database; text; information >

– No Boolean conditions specified in the query.
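A bag of words is easy to build with `collections.Counter`; the sample text below is illustrative, and the query dict mirrors the weighted query Q above:

```python
from collections import Counter

# Bag of words: unordered terms with their frequencies.
text = "text database text information text database"
bag = Counter(text.split())
print(bag["text"], bag["database"], bag["information"])  # 3 2 1

# A weighted query can be represented as a term -> weight mapping,
# as in Q = < database 0.5; text 0.8; information 0.2 >.
query = {"database": 0.5, "text": 0.8, "information": 0.2}
```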

• Retrieval is based on the similarity between the query and the documents.

• Output documents are ranked according to their similarity to the query.

• Similarity is based on the occurrence frequencies of keywords in the query and the document.

• Automatic relevance feedback can be supported:
– Relevant documents “added” to the query.
– Irrelevant documents “subtracted” from the query.
Issues for Vector Space Model
• How to determine important words in a document?
– Word sense?
– Word n-grams (and phrases, idioms, …) → terms
• How to determine the degree of importance of a term within a
document and within the entire collection?
• How to determine the degree of similarity between a document and
the query?
• In the case of the web, what is the collection and what are the effects
of links, formatting information, etc.?

The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index
terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimensionality = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight,
wij.
• Both documents and queries are expressed as t-dimensional
vectors:
dj = (w1j, w2j, …, wtj)

Graphic Representation

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

(3-D figure omitted: D1, D2, and Q plotted against the axes T1, T2, T3)

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
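One common answer to the “angle” option, sketched here as an assumption since these slides have not yet fixed a measure, is the cosine of the angle between the vectors:

```python
import math

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

D1 = (2, 3, 5)   # 2T1 + 3T2 + 5T3
D2 = (3, 7, 1)   # 3T1 + 7T2 + 1T3
Q  = (0, 0, 2)   # 0T1 + 0T2 + 2T3

print(round(cosine(D1, Q), 2))  # 0.81
print(round(cosine(D2, Q), 2))  # 0.13
```

By this measure D1 is much closer to Q than D2, because D1 has most of its weight on T3, the only term the query asks for.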

• A collection of n documents can be represented in the vector space model by a
term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or simply doesn't exist in it.

T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn

• More frequent terms in a document are more important, i.e. more indicative
of the topic.
fij = frequency of term i in document j

• Normalizing term frequency (tf) can be done by dividing by the frequency of the most frequent term in the document:
tfij = fij / maxi{fij}
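For example, with illustrative raw counts:

```python
# Normalize raw frequencies by the count of the most frequent term in the document.
freqs = {"termA": 3, "termB": 2, "termC": 1}  # illustrative counts
max_f = max(freqs.values())
tf = {term: f / max_f for term, f in freqs.items()}
print(tf["termA"])  # 1.0  (termB -> 2/3, termC -> 1/3)
```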

• Terms that appear in many different documents are less indicative of
overall topic.
df i = document frequency of term i
= number of documents containing term i

idfi = inverse document frequency of term i
= log2(N / dfi)
(N: total number of documents)
• An indication of a term's discrimination power.
• Log used to dampen the effect relative to tf.
• A typical combined term importance indicator is tf-idf weighting:
wij = tfij × idfi = tfij * log2(N / dfi)

• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.

• Many other ways of determining term weights have been proposed.

• Experimentally, tf-idf has been found to work well.
Given a document containing terms with given frequencies:
term A (3), term B(2), term C(1)
Assume the collection contains 10,000 documents and
the document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
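The numbers above can be checked directly:

```python
import math

N = 10_000                                             # documents in the collection
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # term -> (raw freq, df)
max_f = max(f for f, _ in terms.values())              # most frequent term: A, 3 times

for term, (f, df) in terms.items():
    tf = f / max_f                # normalized term frequency
    idf = math.log2(N / df)       # inverse document frequency
    print(f"{term}: tf={tf:.2f} idf={idf:.1f} tf-idf={tf * idf:.1f}")
```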

Another normalization type

• Term Frequency: the normalized TF of a term is the number of times the term appears in a document divided by the total number of words in the document.

• TF = (number of times the term appears in the document) / (total number of terms in the document)
• Suppose a term appears 20 times in a document containing 100 words in total. Assume the collection contains 10,000 documents, and 100 of them contain the term. Calculate the weighted TF-IDF score of the term for the document.

Solution
TF = 20 / 100 = 0.2
IDF = log10(10000 / 100) = 2
TF-IDF = 0.2 × 2 = 0.4
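In code (note that this slide uses a base-10 logarithm, unlike the log2 used earlier):

```python
import math

tf = 20 / 100                    # term occurrences / total words in the document
idf = math.log10(10_000 / 100)   # 100 of 10,000 documents contain the term
score = tf * idf
print(score)  # 0.4
```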

