L02-IR Models MMN

Boolean and Vector Space Retrieval Models
• A retrieval model specifies the details of:
– Document representation (for the set of stored documents)
– Query representation (the user's search interface)
– Retrieval function (how the matching computation is performed)

• Determines a notion of relevance.

• The notion of relevance can be binary or continuous (i.e. ranked retrieval).

• The core task: finding relevant documents with respect to a given query.
• 1. Boolean models (set theoretic)
• 2. Vector space models (statistical/algebraic)
– Latent Semantic Indexing
• 3. Probabilistic models
Other Model Dimensions
• Logical View of Documents
– Index terms
– Full text
– Full text + Structure (e.g. hypertext)
• User Task
– Retrieval
– Browsing

• Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
• Break into tokens (keywords).
• Remove common “stopwords” (e.g. a, the, it, etc.).
• Detect common phrases (possibly using a domain-specific dictionary).
• Build an inverted index (keyword → list of docs containing it).
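The steps above can be sketched in Python; the stopword list and sample documents below are illustrative, not from the slides:

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it", "in", "of", "and"}  # illustrative subset

def tokenize(text):
    # Strip markup, keep only alphabetic tokens, lowercase, drop stopwords.
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # drops punctuation and numbers
    return [t for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    # keyword -> sorted list of ids of the documents containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "<p>The hotel in Rio, Brazil.</p>",
        2: "A hotel in Hilo, Hawaii."}
index = build_inverted_index(docs)
print(index["hotel"])  # [1, 2]
```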

• A document is represented as a set of keywords.
• Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, with brackets to indicate scope.
– [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
• Output: a document is either relevant or not. No partial matches or ranking.
• Consider 5 documents with a vocabulary of 6 terms:
• document 1 = “term1 term3”
• document 2 = “term2 term4 term6”
• document 3 = “term1 term2 term3 term4 term5”
• document 4 = “term1 term3 term6”
• document 5 = “term3 term4”


term1 term2 term3 term4 term5 term6
document1 1 0 1 0 0 0
document2 0 1 0 1 0 1
document3 1 1 1 1 1 0
document4 1 0 1 0 0 1
document5 0 0 1 1 0 0
• Consider the query:
• Find the documents containing term1 and term3 but not term2
• (term1 ∧ term3 ∧ !term2)
term1 term2 term3 term4 term5 term6
document1 1 0 1 0 0 0
document2 0 1 0 1 0 1
document3 1 1 1 1 1 0
document4 1 0 1 0 0 1
document5 0 0 1 1 0 0

term1 ∧ term3 ∧ !term2


term1 !term2 term3 term4 term5 term6
document1 1 1 1 0 0 0
document2 0 0 0 1 0 1
document3 1 0 1 1 1 0
document4 1 1 1 0 0 1
document5 0 1 1 1 0 0
document 1 : 1 ∧ 1 ∧ 1 = 1
document 2 : 0 ∧ 0 ∧ 0 = 0
document 3 : 1 ∧ 1 ∧ 0 = 0
document 4 : 1 ∧ 1 ∧ 1 = 1
document 5 : 0 ∧ 1 ∧ 1 = 0

Based on the above computation, document1 and document4 are relevant to the given query.
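The computation above can be reproduced with a few lines of Python over the incidence matrix:

```python
# Incidence matrix from the example: rows = documents 1..5, columns = term1..term6.
matrix = [
    [1, 0, 1, 0, 0, 0],  # document 1
    [0, 1, 0, 1, 0, 1],  # document 2
    [1, 1, 1, 1, 1, 0],  # document 3
    [1, 0, 1, 0, 0, 1],  # document 4
    [0, 0, 1, 1, 0, 0],  # document 5
]

# Evaluate term1 AND term3 AND NOT term2 for every document.
relevant = [
    doc_id
    for doc_id, row in enumerate(matrix, start=1)
    if row[0] and row[2] and not row[1]
]
print(relevant)  # [1, 4]
```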
• Popular retrieval model because:
– Easy to understand for simple queries.
– Clean formalism.

• Reasonably efficient implementations possible for normal queries.

• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback.
– If a document is identified by the user as relevant or irrelevant, how should
the query be modified?

• It is a simple retrieval model based on set theory and Boolean
algebra.
• Queries are designed as Boolean expressions which have precise
semantics.
• The retrieval strategy is based on a binary decision criterion.
• The Boolean model considers that index terms are only present or
absent in a document.

• Consider the following four documents and their index terms:
• Document1: football, Cairo, player, sport
• Document2: basketball, Cairo, player, football
• Document3: Cairo, airport, travel, player
• Document4: travel, Cairo, player, football

• Write a program that returns the relevant documents using Boolean information retrieval for the following user query:
“the football player and in Cairo not basketball”
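A minimal sketch of such a program, interpreting the query as football AND player AND Cairo AND NOT basketball (terms lowercased for matching):

```python
# Each document is represented as a set of index terms, as in the Boolean model.
documents = {
    "Document1": {"football", "cairo", "player", "sport"},
    "Document2": {"basketball", "cairo", "player", "football"},
    "Document3": {"cairo", "airport", "travel", "player"},
    "Document4": {"travel", "cairo", "player", "football"},
}

def boolean_retrieve(docs, required, excluded):
    # A document is relevant iff it contains every required term
    # and none of the excluded terms.
    return [
        name
        for name, terms in docs.items()
        if required <= terms and not (excluded & terms)
    ]

# "the football player and in Cairo not basketball"
# -> football AND player AND cairo AND NOT basketball
print(boolean_retrieve(documents,
                       {"football", "player", "cairo"},
                       {"basketball"}))
# ['Document1', 'Document4']
```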

• A document is typically represented by a bag of words (unordered
words with frequencies).
• Bag = set that allows multiple occurrences of the same element.
• User specifies a set of desired terms with optional weights:
– Weighted query terms:
Q = < database 0.5; text 0.8; information 0.2 >

– Unweighted query terms:
Q = < database; text; information >

– No Boolean conditions specified in the query.
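A bag of words is easy to build with `collections.Counter`; the sample text below is illustrative, and the query dict mirrors the weighted query Q above:

```python
from collections import Counter

# Bag of words: unordered terms with their frequencies.
text = "text database text information text database"
bag = Counter(text.split())
print(bag["text"], bag["database"], bag["information"])  # 3 2 1

# A weighted query can be represented as a term -> weight mapping,
# as in Q = < database 0.5; text 0.8; information 0.2 >.
query = {"database": 0.5, "text": 0.8, "information": 0.2}
```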

• Retrieval is based on the similarity between the query and the documents.

• Output documents are ranked according to their similarity to the query.

• Similarity is based on the occurrence frequencies of keywords in the query and the document.

• Automatic relevance feedback can be supported:
– Relevant documents “added” to the query.
– Irrelevant documents “subtracted” from the query.
Issues for Vector Space Model
• How to determine important words in a document?
– Word sense?
– Word n-grams (and phrases, idioms, …) → terms
• How to determine the degree of importance of a term within a
document and within the entire collection?
• How to determine the degree of similarity between a document and
the query?
• In the case of the web, what is the collection and what are the effects
of links, formatting information, etc.?

The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index
terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimensionality = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight,
wij.
• Both documents and queries are expressed as t-dimensional
vectors:
dj = (w1j, w2j, …, wtj)

Graphic Representation

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

(3-D figure omitted: D1, D2, and Q plotted against the axes T1, T2, T3)

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
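One common answer to the “angle” option, sketched here as an assumption since these slides have not yet fixed a measure, is the cosine of the angle between the vectors:

```python
import math

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

D1 = (2, 3, 5)   # 2T1 + 3T2 + 5T3
D2 = (3, 7, 1)   # 3T1 + 7T2 + 1T3
Q  = (0, 0, 2)   # 0T1 + 0T2 + 2T3

print(round(cosine(D1, Q), 2))  # 0.81
print(round(cosine(D2, Q), 2))  # 0.13
```

By this measure D1 is much closer to Q than D2, because D1 has most of its weight on T3, the only term the query asks for.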

• A collection of n documents can be represented in the vector space model by a
term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or simply doesn't exist in it.

T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn

• More frequent terms in a document are more important, i.e. more indicative
of the topic.
fij = frequency of term i in document j

• Normalizing term frequency (tf) can be done by dividing by the frequency of the most frequent term in the document:
tfij = fij / maxi{fij}
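For example, with illustrative raw counts:

```python
# Normalize raw frequencies by the count of the most frequent term in the document.
freqs = {"termA": 3, "termB": 2, "termC": 1}  # illustrative counts
max_f = max(freqs.values())
tf = {term: f / max_f for term, f in freqs.items()}
print(tf["termA"])  # 1.0  (termB -> 2/3, termC -> 1/3)
```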

• Terms that appear in many different documents are less indicative of
overall topic.
df i = document frequency of term i
= number of documents containing term i

idfi = inverse document frequency of term i
= log2(N / dfi)
(N: total number of documents)
• An indication of a term's discrimination power.
• Log used to dampen the effect relative to tf.
• A typical combined term importance indicator is tf-idf weighting:
wij = tfij × idfi = tfij * log2(N / dfi)

• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.

• Many other ways of determining term weights have been proposed.

• Experimentally, tf-idf has been found to work well.
Given a document containing terms with given frequencies:
term A (3), term B(2), term C(1)
Assume the collection contains 10,000 documents and
the document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
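The numbers above can be checked directly:

```python
import math

N = 10_000                                             # documents in the collection
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # term -> (raw freq, df)
max_f = max(f for f, _ in terms.values())              # most frequent term: A, 3 times

for term, (f, df) in terms.items():
    tf = f / max_f                # normalized term frequency
    idf = math.log2(N / df)       # inverse document frequency
    print(f"{term}: tf={tf:.2f} idf={idf:.1f} tf-idf={tf * idf:.1f}")
```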

Another normalization type

• Term Frequency: the normalized TF of a term is the number of times the term appears in a document divided by the total number of words in the document.

• TF = (number of times the term appears in the document) / (total number of terms in the document)
• Suppose a term appears 20 times in a document containing 100 words in total. Assume the collection contains 10,000 documents, and 100 of them contain the term. Calculate the weighted TF-IDF score of the term for the document.

Solution
TF = 20 / 100 = 0.2
IDF = log10(10000 / 100) = 2
TF-IDF = 0.2 × 2 = 0.4
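In code (note that this slide uses a base-10 logarithm, unlike the log2 used earlier):

```python
import math

tf = 20 / 100                    # term occurrences / total words in the document
idf = math.log10(10_000 / 100)   # 100 of 10,000 documents contain the term
score = tf * idf
print(score)  # 0.4
```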

